AI Interests

Topics, papers, and tools in AI that I follow or find compelling.

Mechanistic Interpretability

Understanding what computations are actually happening inside neural networks: circuits, features, superposition. Anthropic's work here is the most compelling thread in AI safety research.
Constitutional AI

Anthropic's approach to training helpful, harmless, and honest models through self-critique. The paper is dense, but the core idea is elegant: the model evaluates its own outputs against a set of written principles and revises them accordingly.
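The critique-and-revise idea can be sketched in a few lines. This is my own sketch, not Anthropic's training pipeline: `call_model` is a stub standing in for any text-generation API, and the principle text is a paraphrase, not a quote from the paper.

```python
# Minimal sketch of a constitutional self-critique loop (all names hypothetical).

PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def call_model(prompt: str) -> str:
    # Stub: a real implementation would call an actual LLM API here.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(draft: str, principle: str) -> str:
    # First ask the model to critique its own draft against a principle...
    critique = call_model(
        f"Critique this response against the principle.\n"
        f"Principle: {principle}\nResponse: {draft}"
    )
    # ...then ask it to rewrite the draft to address that critique.
    return call_model(
        f"Rewrite the response to address the critique.\n"
        f"Critique: {critique}\nResponse: {draft}"
    )

def constitutional_pass(prompt: str) -> str:
    draft = call_model(prompt)
    for principle in PRINCIPLES:
        draft = critique_and_revise(draft, principle)
    return draft
```

In the paper, transcripts produced this way become training data (supervised fine-tuning, then RL against an AI preference model); the loop above only shows the inference-time shape of the idea.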
Language Model Evaluations
How do we know if a model is actually better? The gap between benchmark performance and real-world usefulness is wide and interesting. Many evals reward surface patterns, like exact string matches or multiple-choice formats, rather than the capabilities people actually rely on.
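A toy harness makes the brittleness concrete. Everything here is invented for illustration (the eval set, the model stub, the scoring rule); the point is only that exact-match scoring can mark a correct answer wrong.

```python
# Toy exact-match eval harness (hypothetical data and model stub).

EVAL_SET = [
    {"prompt": "What is 2 + 2?", "target": "4"},
    {"prompt": "Capital of France?", "target": "Paris"},
]

def model(prompt: str) -> str:
    # Stub: canned answers standing in for a real model.
    answers = {
        "What is 2 + 2?": "The answer is 4.",
        "Capital of France?": "Paris",
    }
    return answers.get(prompt, "")

def exact_match_accuracy(dataset) -> float:
    hits = sum(
        1 for ex in dataset
        if model(ex["prompt"]).strip() == ex["target"]
    )
    return hits / len(dataset)
```

Here `exact_match_accuracy(EVAL_SET)` scores 0.5: the first answer is correct but phrased as a sentence, so exact match counts it wrong. That mismatch, scaled up, is one small instance of the benchmark-vs-usefulness gap.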
Prompting as Programming
Prompt engineering is converging on something that looks like software engineering — composition, abstraction, debugging, testing. What does that mean for tooling and interfaces?
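One way to see the convergence: prompts compose like functions. The sketch below is my own illustration (all names hypothetical), treating a system preamble and few-shot examples as layers that wrap a task string, which makes the final prompt a plain value you can unit-test.

```python
# Sketch: prompts as composable functions (all names hypothetical).

def system(role: str):
    # Layer that prepends a system-style preamble.
    def wrap(task: str) -> str:
        return f"{role}\n\n{task}"
    return wrap

def few_shot(examples: list[tuple[str, str]]):
    # Layer that prepends worked Q/A examples.
    def wrap(task: str) -> str:
        shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
        return f"{shots}\n\n{task}"
    return wrap

def compose(*layers):
    # Apply layers outermost-first, like function composition.
    def build(task: str) -> str:
        for layer in reversed(layers):
            task = layer(task)
        return task
    return build

prompt = compose(
    system("You are a careful technical editor."),
    few_shot([("2+2?", "4")]),
)("Q: 3+3?\nA:")
```

Because `prompt` is just a string built by pure functions, debugging is printing it and testing is asserting on it, which is exactly the software-engineering loop.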
Local Models

Running small models locally via llama.cpp, Ollama, and MLX on Apple Silicon. The capability gap with frontier models is large but shrinking faster than expected.
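For Ollama specifically, talking to a local model is a single HTTP call against its default endpoint. A minimal sketch with the standard library, assuming `ollama serve` is running and the model has been pulled (endpoint and field names are from Ollama's REST API as I understand it):

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> bytes:
    # stream=False asks for one complete JSON response instead of chunks.
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The completed text comes back in the "response" field.
        return json.loads(resp.read())["response"]
```

Usage would be something like `generate("llama3", "Why is the sky blue?")` after `ollama pull llama3`; the same pattern works for any model tag Ollama knows about.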