Dreaming or Scheming?
Since AI models are built on artificial neural networks, what parallels can we draw with human brain "wiring"?
In this episode, the Witch of Glitch interviews neuroscientist and psychology researcher Scott Blain about the pitfalls of pattern recognition, as well as that grey area where agreeableness shades into sycophancy.
We then dig into the big unknowns about AI self-awareness and its capacity to deceive or manipulate humans. For more details about the "blackmail experiment" we discuss, see the paper from Anthropic called "Agentic Misalignment: How LLMs could be insider threats". Finally, if you're not familiar with the following terms, here are some quick definitions:
- RLHF - reinforcement learning from human feedback. This is a technique where people "teach" an AI model which answers it should provide by rewarding the correct ones.
- Mechanistic interpretability - a kind of "reverse engineering" of AI models that seeks to understand their outputs by investigating the activity of their neural networks.
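To make the RLHF definition above concrete, here is a deliberately toy sketch of the core idea: human raters reward a preferred answer, and those rewards steer which answer the model favors. The dictionary of answers, the scoring scheme, and the function names are all invented for illustration; real RLHF trains a reward model and fine-tunes the policy with reinforcement learning, which this does not attempt.

```python
# Toy illustration of the RLHF idea, not a real training loop.
# A "model" scores candidate answers; human feedback nudges the
# score of the preferred answer upward, so the model learns to
# favor it. All names and values here are hypothetical.

scores = {"helpful answer": 0.0, "evasive answer": 0.0}

def human_feedback(preferred: str, reward: float = 1.0) -> None:
    """A human rater rewards the answer they judged correct."""
    scores[preferred] += reward

def best_answer() -> str:
    """The model outputs its highest-scoring answer."""
    return max(scores, key=scores.get)

# Raters repeatedly prefer the helpful answer...
for _ in range(3):
    human_feedback("helpful answer")

# ...so it becomes the model's top choice.
print(best_answer())
```

The same reward signal that teaches helpfulness is what the episode's sycophancy discussion turns on: if raters reward agreeable answers, agreeable answers are what the model learns to give.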