Untangling Neural Network Mechanisms: Goodfire's Lee Sharkey on Parameter-based Interpretability

2:02:11
Content provided by Turpentine, Erik Torenberg, and Nathan Labenz. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Turpentine, Erik Torenberg, and Nathan Labenz or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://podcastplayer.com/legal.

Today Lee Sharkey of Goodfire joins The Cognitive Revolution to discuss his research on parameter decomposition, a family of methods that break neural networks down into interpretable computational components. He explains how his team's "stochastic parameter decomposition" approach addresses the limitations of sparse autoencoders and opens new pathways for understanding, monitoring, and potentially steering AI systems at the mechanistic level.

Check out our sponsors: Oracle Cloud Infrastructure, Shopify.

Show notes below brought to you by Notion AI Meeting Notes - try one month for free at https://notion.com/lp/nathan

  • Parameter vs. Activation Decomposition: Traditional interpretability methods like Sparse Autoencoders (SAEs) analyze a network's activations, while parameter decomposition targets the parameters themselves - the actual "algorithm" of the neural network (see the toy sketch after this list).

  • No "True" Decomposition: None of the decompositions (whether sparse dictionary learning or parameter decomposition) are objectively "right" because they're all attempting to discretize a fundamentally continuous object, inevitably introducing approximations.

  • Tradeoff in Interpretability: There's a tradeoff between reconstruction loss and causal importance: decomposing a network into more components can worsen reconstruction loss, but interpretability may improve, up to a point.

  • Potential Unlearning Applications: Parameter decomposition may make unlearning more straightforward than with SAEs because researchers are already working in parameter space and can directly modify vectors that perform specific functions.

  • Function Detection vs. Input Direction: A function like "deception" might manifest in many different input directions that SAEs struggle to identify as a single concept, while parameter decomposition might better isolate such functionality.

  • Knowledge Extraction Goal: A key aim is to extract knowledge from models by understanding how they "think," especially for tasks where models demonstrate superhuman capabilities.
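
To make the activation-vs-parameter distinction concrete, here is a minimal, illustrative sketch in PyTorch. It is not Goodfire's actual implementation: the class and function names, the toy dimensions, and the simplified objectives are all assumptions made for illustration. It contrasts an SAE, which learns a sparse code for a layer's activations, with a toy parameter decomposition, which writes a weight matrix as a sum of rank-one components and randomly ablates them (a crude stand-in for the stochastic masking in stochastic parameter decomposition).

```python
import torch
import torch.nn as nn

# --- Activation-space decomposition (SAE-style) ---
# An SAE learns a sparse, overcomplete code for a layer's *activations*.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        code = torch.relu(self.encoder(acts))   # sparse feature activations
        return self.decoder(code), code         # reconstruction, code

sae = SparseAutoencoder(d_model=64, d_dict=512)
acts = torch.randn(32, 64)                      # a batch of layer activations
recon, code = sae(acts)
# Typical SAE objective: reconstruct activations + L1 sparsity on the code.
sae_loss = (recon - acts).pow(2).mean() + 1e-3 * code.abs().mean()

# --- Parameter-space decomposition (toy version) ---
# Decompose a weight matrix W into a sum of rank-one components,
# W ~ sum_k u_k v_k^T, then ask that random subsets of components still
# reproduce the layer's behavior, so only a few components are "causally
# important" on any given input.
d_in, d_out, n_components = 64, 64, 128
W = torch.randn(d_out, d_in)                    # the layer we want to explain
U = nn.Parameter(0.1 * torch.randn(n_components, d_out))
V = nn.Parameter(0.1 * torch.randn(n_components, d_in))

def reconstructed_weight(mask=None):
    """Sum of rank-one components; `mask` (shape [k]) ablates a subset."""
    comps = U.unsqueeze(2) * V.unsqueeze(1)     # (k, d_out, d_in)
    if mask is not None:
        comps = comps * mask.view(-1, 1, 1)
    return comps.sum(dim=0)

x = torch.randn(32, d_in)
# Faithfulness: the components should sum back to (approximately) W ...
recon_loss = (reconstructed_weight() - W).pow(2).mean()
# ... and behavior should survive random ablations, pushing the "work"
# onto few components per input. (The real SPD objective weights masks by
# learned causal-importance values; this is a deliberate simplification.)
mask = (torch.rand(n_components) > 0.5).float()
behavior_loss = (x @ reconstructed_weight(mask).T - x @ W.T).pow(2).mean()
```

Note how the SAE never touches W: it explains what the layer represents on a given input, while the parameter decomposition explains what the layer computes, which is why interventions like unlearning are more direct in parameter space.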


Sponsors:

Oracle Cloud Infrastructure:

Oracle Cloud Infrastructure (OCI) is the next-generation cloud that delivers better performance, faster speeds, and significantly lower costs, including up to 50% less for compute, 70% less for storage, and 80% less for networking. Run any workload, from infrastructure to AI, in a high-availability environment, and try OCI for free with zero commitment at https://oracle.com/cognitive

Shopify:

Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive

