Mechanistic Interpretability

concept · safety · circuits · research

Understanding what computations are actually happening inside neural networks — circuits, features, superposition. Anthropic's work here is the most compelling thread in AI safety research.
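Superposition is easy to demo numerically: far more "feature" directions than dimensions can coexist as nearly-orthogonal vectors. A toy sketch (dimensions and counts are arbitrary, not from any paper):

```python
import numpy as np

# Toy superposition demo: pack many more random "feature" directions than
# dimensions and check that pairwise interference stays small.
rng = np.random.default_rng(0)
d, n = 128, 2000  # 2000 feature directions in a 128-dim space
feats = rng.normal(size=(n, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# Cosine similarity between every pair of distinct features.
dots = feats @ feats.T
np.fill_diagonal(dots, 0.0)
max_interference = np.abs(dots).max()
print(f"{n} features in {d} dims, max |cos| = {max_interference:.2f}")
```

The max interference stays well below 1, which is why a network can represent more features than it has neurons, at the cost of noisy readouts.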

Constitutional AI

paper · safety · rlhf · anthropic

Anthropic's approach to training helpful, harmless, and honest models through self-critique. The paper is dense but the core idea is elegant — having models evaluate their own outputs against a set of principles.
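The critique-then-revise loop at the core of the paper fits in a few lines. A sketch, with `ask_model` as a hypothetical stand-in for any text-completion call (not Anthropic's actual implementation or API), and illustrative principles:

```python
# Sketch of the Constitutional AI self-critique loop. `ask_model` is a
# hypothetical prompt -> str callable; principles are illustrative only.
PRINCIPLES = [
    "Identify ways the response could be harmful or dishonest.",
    "Identify ways the response could be unhelpful or evasive.",
]

def critique_and_revise(ask_model, prompt, response, principles=PRINCIPLES):
    """One critique-then-revision pass per principle."""
    for principle in principles:
        critique = ask_model(
            f"Response to critique:\n{response}\n\nCritique request: {principle}"
        )
        response = ask_model(
            f"Original prompt:\n{prompt}\n\nResponse:\n{response}\n\n"
            f"Critique:\n{critique}\n\nRewrite the response to address the critique."
        )
    return response
```

The elegance is that the same model plays generator, critic, and reviser; the principles are just text.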

Language Model Evaluations

concept · evals · benchmarks · methodology

How do we know if a model is actually better? The gap between benchmark performance and real-world usefulness is wide and interesting. Most evals measure the wrong things.
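The benchmark/usefulness gap shows up even in a minimal harness. A sketch using exact-match grading, the crudest metric and a good miniature of "measuring the wrong thing" (all names here are illustrative):

```python
# Minimal eval harness: exact-match grading over (prompt, expected) pairs.
def exact_match_accuracy(model, cases):
    """model: prompt -> str; cases: list of (prompt, expected) pairs."""
    hits = sum(model(p).strip().lower() == e.strip().lower() for p, e in cases)
    return hits / len(cases)

# A correct-but-verbose model scores zero under exact match,
# while a terse one scores perfectly on the same knowledge.
terse = lambda p: "paris"
verbose = lambda p: "The capital of France is Paris."
cases = [("Capital of France?", "Paris")]
print(exact_match_accuracy(terse, cases))    # 1.0
print(exact_match_accuracy(verbose, cases))  # 0.0
```

Most real eval failures are fancier versions of this grading-function problem.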

Prompting as Programming

concept · prompting · tooling · ux

Prompt engineering is converging on something that looks like software engineering — composition, abstraction, debugging, testing. What does that mean for tooling and interfaces?
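The software-engineering analogy becomes concrete once prompt fragments are functions you can compose and unit-test instead of strings you hand-edit. A sketch; all names are illustrative, not a real library:

```python
# Prompt fragments as composable, testable functions.
def system(role):
    return lambda task: f"You are {role}.\n\n{task}"

def with_format(fmt):
    return lambda task: f"{task}\n\nRespond as {fmt}."

def compose(*wrappers):
    """Apply wrappers right-to-left, like function composition."""
    def apply(task):
        for w in reversed(wrappers):
            task = w(task)
        return task
    return apply

summarize = compose(system("a careful technical editor"),
                    with_format("three bullet points"))
prompt = summarize("Summarize the attached RFC.")
print(prompt)
```

Once prompts are values produced by functions, the rest of the toolchain (diffing, testing, versioning) follows naturally.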

Local Inference

tool · tooling · open-source · local

Running small models locally via llama.cpp, Ollama, or MLX on Apple Silicon. The capability gap with frontier models is large but shrinking faster than expected.
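As an example of how thin the local-inference interface is, Ollama exposes a local HTTP API (default port 11434) whose `/api/generate` endpoint takes a small JSON payload. A sketch that builds the request without sending it, so it stays self-contained; the model name is whatever you have pulled locally:

```python
import json

# Build a request for Ollama's local /api/generate endpoint.
# Host and model name are assumptions about a typical local setup.
def generate_request(model, prompt, host="http://localhost:11434"):
    url = f"{host}/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": False}
    return url, json.dumps(payload)

url, body = generate_request("llama3.2", "Why is the sky blue?")
# POST `body` to `url` with urllib or requests once an Ollama server
# is running; omitted here so the sketch has no network dependency.
print(url)
```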