AI, and large language models in particular, is evolving at an increasingly fast pace. Yet these models are still viewed as black boxes that are notoriously difficult to interpret. Mechanistic interpretability challenges this by aiming to make LLMs more understandable, a key step toward safe AI. Although the field is still in its early stages, this month's newsletter highlights two resources to give you an overview.
Meet the curator of the month
Michael is a lead developer at Futurice who has spent over a decade navigating the full software lifecycle, from frontend wizardry to backend, mobile, and full-stack adventures. Having closely followed the evolution of software engineering for more than 15 years, he's also passionate about deep reinforcement learning, the captivating world of 3D graphics, and neural networks. In his free time, he enjoys bouldering and balcony gardening.
The Transformer Circuits thread is a collection of detailed articles on reverse engineering transformer models, enriched with high-quality visualizations, interactive content, and research updates.
One standout resource, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" from last year, offers an in-depth discussion with clear, illustrative examples. The article shows how different concepts, called features, "light up" within the production model of Claude 3 Sonnet and explains how, with sparse autoencoders, features such as deception, bias, and dangerous content can be observed and even used to steer model behavior.
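The paper's pipeline is far larger, but the core mechanic fits in a few lines of PyTorch. Here is a minimal sketch, not Anthropic's actual implementation: it trains a toy sparse autoencoder on random stand-in data (in practice you would capture residual-stream activations from the model itself), then nudges activations along one learned feature direction to illustrate steering. All names, shapes, and hyperparameters below are hypothetical.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: an overcomplete dictionary of features,
    kept sparse via an L1 penalty on the feature activations."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(f)          # reconstruction of the input activations
        return x_hat, f

# Hypothetical setup: `acts` stands in for residual-stream activations
# collected from a transformer; random data keeps the sketch self-contained.
d_model, d_features = 512, 4096
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, d_model)

l1_coeff = 1e-3  # trades reconstruction quality against sparsity
for step in range(100):
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Steering" in miniature: add a multiple of one learned feature's decoder
# direction back into the activations, nudging the model toward that concept.
feature_idx = 123  # hypothetical feature of interest
direction = sae.decoder.weight[:, feature_idx].detach()  # shape (d_model,)
steered = acts + 5.0 * direction  # broadcasts over the batch
```

In the real setting, the steered activations would be patched back into the forward pass of the transformer, which is how the paper elicits behaviors like the famous "Golden Gate Bridge" obsession.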
Memorable posts don't need to be lengthy. "The Dark Matter of Neural Networks?" introduces a fascinating analogy: exploring the internals of large language models is much like observing the night sky with a telescope. For now we can map only the brightest stars, while rare features may resemble the elusive dark matter of the universe.
In episode 452 of the Lex Fridman podcast, Chris Olah, co-founder of Anthropic and AI researcher, discusses mechanistic interpretability with great passion. He delves into topics such as linear representations, superposition, polysemanticity, and sparse autoencoders, while also touching on caloric theory and Geoffrey Hinton. Complex concepts are made accessible through tangible analogies and examples, leaving listeners optimistic about our growing ability to understand neural networks. Don't miss this engaging conversation, which offers fresh insights into the safety and beauty of neural networks.
Tell us about a project you’ve worked on that you found particularly interesting or challenging.
One standout project was digitizing internal workflows for an electric utility company. Driven by continuous user interviews and close KPI tracking, we ran a fast build-measure-learn cycle within a corporate context. What made this project both challenging and rewarding was the smooth collaboration among talented teams, which enabled us to meet our deadlines while delivering great user value. In the end, the positive user feedback was well worth the effort.
Any tips for maintaining a healthy work-life balance?
A few simple habits have made a big difference for me:
Cycling to work. It clears my mind in the morning and helps me unwind in the evening.
Ritual coffee breaks. Whether I’m hand-grinding coffee or chatting with a colleague in the kitchen, a mindful five-minute break goes a long way.
Focus. Tackling one task at a time is both efficient and rewarding. Bullet journaling helps me stay on track.
Are there any emerging technologies or trends you’re particularly excited about?
I'm looking forward to the insights that research on AI safety and interpretability will bring. In mechanistic interpretability, we've only scratched the surface, and I expect more breakthroughs in 2025.
Join us!
Senior/Lead Full-stack Developer
As a Full-Stack Developer at Futurice, you'll work across industries to help clients find and create the best digital solutions for their needs, now and in the future. You'll design scalable solutions, guide cross-functional teams, and collaborate with clients while turning complex problems into impactful digital experiences.