Episodes

  • A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.
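
    To make "protocols for AI control" concrete, here is a minimal sketch in the spirit of the trusted-monitoring protocol from the "AI Control: Improving Safety Despite Intentional Subversion" paper: a weaker trusted model scores the stronger untrusted model's outputs for suspiciousness, and the most suspicious fraction gets escalated to humans. The function names and the 2% audit budget are illustrative assumptions, not the paper's exact setup (there, roughly, GPT-4 plays the untrusted model and GPT-3.5 the trusted monitor).

    ```python
    # A minimal sketch of a trusted-monitoring control protocol (illustrative
    # only, not the paper's actual implementation). `untrusted_model` and
    # `trusted_monitor` are assumed callables; the audit budget is a parameter.

    def run_with_trusted_monitoring(tasks, untrusted_model, trusted_monitor,
                                    audit_budget=0.02):
        """Generate solutions with the untrusted model, have a weaker trusted
        model score each one for suspiciousness, and send the most suspicious
        fraction to humans for auditing before anything is deployed."""
        solutions = [(task, untrusted_model(task)) for task in tasks]
        scored = [(trusted_monitor(task, sol), task, sol)
                  for task, sol in solutions]
        scored.sort(key=lambda x: x[0], reverse=True)   # most suspicious first

        n_audit = max(1, int(audit_budget * len(scored)))
        audited, accepted = scored[:n_audit], scored[n_audit:]
        return audited, accepted   # `audited` goes to human review; `accepted` ships
    ```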

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Topics we discuss, and timestamps:

    - 0:00:31 - What is AI control?

    - 0:16:16 - Protocols for AI control

    - 0:22:43 - Which AIs are controllable?

    - 0:29:56 - Preventing dangerous coded AI communication

    - 0:40:42 - Unpredictably uncontrollable AI

    - 0:58:01 - What control looks like

    - 1:08:45 - Is AI control evil?

    - 1:24:42 - Can red teams match misaligned AI?

    - 1:36:51 - How expensive is AI monitoring?

    - 1:52:32 - AI control experiments

    - 2:03:50 - GPT-4's aptitude at inserting backdoors

    - 2:14:50 - How AI control relates to the AI safety field

    - 2:39:25 - How AI control relates to previous Redwood Research work

    - 2:49:16 - How people can work on AI control

    - 2:54:07 - Following Buck and Ryan's research

    The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html

    Links for Buck and Ryan:

    - Buck's twitter/X account: twitter.com/bshlgrs

    - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt

    - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com

    Main research works we talk about:

    - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled

    - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942

    Other things we mention:

    - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root

    - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512

    - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal

    - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938

    - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

    Episode art by Hamish Doodles: hamishdoodles.com

  • The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance the benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions.

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Topics we discuss, and timestamps:

    - 0:00:40 - What kinds of AI?

    - 0:01:30 - Democratizing AI

    - 0:04:44 - How people talk about democratizing AI

    - 0:09:34 - Is democratizing AI important?

    - 0:13:31 - Links between types of democratization

    - 0:22:43 - Democratizing profits from AI

    - 0:27:06 - Democratizing AI governance

    - 0:29:45 - Normative underpinnings of democratization

    - 0:44:19 - Open-sourcing AI

    - 0:50:47 - Risks from open-sourcing

    - 0:56:07 - Should we make AI too dangerous to open source?

    - 1:00:33 - Offense-defense balance

    - 1:03:13 - KataGo as a case study

    - 1:09:03 - Openness for interpretability research

    - 1:15:47 - Effectiveness of substitutes for open sourcing

    - 1:20:49 - Offense-defense balance, part 2

    - 1:29:49 - Making open-sourcing safer?

    - 1:40:37 - AI governance research

    - 1:41:05 - The state of the field

    - 1:43:33 - Open questions

    - 1:49:58 - Distinctive governance issues of x-risk

    - 1:53:04 - Technical research to help governance

    - 1:55:23 - Following Elizabeth's research

    The transcript: https://axrp.net/episode/2023/11/26/episode-26-ai-governance-elizabeth-seger.html

    Links for Elizabeth:

    - Personal website: elizabethseger.com

    - Centre for the Governance of AI (AKA GovAI): governance.ai

    Main papers:

    - Democratizing AI: Multiple Meanings, Goals, and Methods: arxiv.org/abs/2303.12642

    - Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives: papers.ssrn.com/sol3/papers.cfm?abstract_id=4596436

    Other research we discuss:

    - What Do We Mean When We Talk About "AI democratisation"? (blog post): governance.ai/post/what-do-we-mean-when-we-talk-about-ai-democratisation

    - Democratic Inputs to AI (OpenAI): openai.com/blog/democratic-inputs-to-ai

    - Collective Constitutional AI: Aligning a Language Model with Public Input (Anthropic): anthropic.com/index/collective-constitutional-ai-aligning-a-language-model-with-public-input

    - Against "Democratizing AI": johanneshimmelreich.net/papers/against-democratizing-AI.pdf

    - Adversarial Policies Beat Superhuman Go AIs: goattack.far.ai

    - Structured access: an emerging paradigm for safe AI deployment: arxiv.org/abs/2201.05159

    - Universal and Transferable Adversarial Attacks on Aligned Language Models (aka Adversarial Suffixes): arxiv.org/abs/2307.15043

    Episode art by Hamish Doodles: hamishdoodles.com

  • Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to work together cooperatively and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that clearer? In this episode, I'll be speaking with Caspar Oesterheld about some of his research on this very topic.

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Episode art by Hamish Doodles: hamishdoodles.com

    Topics we discuss, and timestamps:

    - 0:00:34 - Cooperative AI

    - 0:06:21 - Cooperative AI vs standard game theory

    - 0:19:45 - Do we need cooperative AI if we get alignment?

    - 0:29:29 - Cooperative AI and agent foundations

    - 0:34:59 - A Theory of Bounded Inductive Rationality

    - 0:50:05 - Why it matters

    - 0:53:55 - How the theory works

    - 1:01:38 - Relationship to logical inductors

    - 1:15:56 - How fast does it converge?

    - 1:19:46 - Non-myopic bounded rational inductive agents?

    - 1:24:25 - Relationship to game theory

    - 1:30:39 - Safe Pareto Improvements

    - 1:30:39 - What they try to solve

    - 1:36:15 - Alternative solutions

    - 1:40:46 - How safe Pareto improvements work

    - 1:51:19 - Will players fight over which safe Pareto improvement to adopt?

    - 2:06:02 - Relationship to program equilibrium

    - 2:11:25 - Do safe Pareto improvements break themselves?

    - 2:15:52 - Similarity-based Cooperation

    - 2:23:07 - Are similarity-based cooperators overly cliqueish?

    - 2:27:12 - Sensitivity to noise

    - 2:29:41 - Training neural nets to do similarity-based cooperation

    - 2:50:25 - FOCAL, Caspar's research lab

    - 2:52:52 - How the papers all relate

    - 2:57:49 - Relationship to functional decision theory

    - 2:59:45 - Following Caspar's research

    The transcript: axrp.net/episode/2023/10/03/episode-25-cooperative-ai-caspar-oesterheld.html

    Links for Caspar:

    - FOCAL at CMU: www.cs.cmu.edu/~focal/

    - Caspar on X, formerly known as Twitter: twitter.com/C_Oesterheld

    - Caspar's blog: casparoesterheld.com/

    - Caspar on Google Scholar: scholar.google.com/citations?user=xeEcRjkAAAAJ&hl=en&oi=ao

    Research we discuss:

    - A Theory of Bounded Inductive Rationality: arxiv.org/abs/2307.05068

    - Safe Pareto improvements for delegated game playing: link.springer.com/article/10.1007/s10458-022-09574-6

    - Similarity-based Cooperation: arxiv.org/abs/2211.14468

    - Logical Induction: arxiv.org/abs/1609.03543

    - Program Equilibrium: citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e1a060cda74e0e3493d0d81901a5a796158c8410

    - Formalizing Objections against Surrogate Goals: www.alignmentforum.org/posts/K4FrKRTrmyxrw5Dip/formalizing-objections-against-surrogate-goals

    - Learning with Opponent-Learning Awareness: arxiv.org/abs/1709.04326

  • Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers, attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Episode art by Hamish Doodles: hamishdoodles.com/

    Topics we discuss, and timestamps:

    - 0:00:37 - The superalignment team

    - 0:02:10 - What's a human-level automated alignment researcher?

    - 0:06:59 - The gap between human-level automated alignment researchers and superintelligence

    - 0:18:39 - What does it do?

    - 0:24:13 - Recursive self-improvement

    - 0:26:14 - How to make the AI AI alignment researcher

    - 0:30:09 - Scalable oversight

    - 0:44:38 - Searching for bad behaviors and internals

    - 0:54:14 - Deliberately training misaligned models

    - 1:02:34 - Four year deadline

    - 1:07:06 - What if it takes longer?

    - 1:11:38 - The superalignment team and...

    - 1:11:38 - ... governance

    - 1:14:37 - ... other OpenAI teams

    - 1:18:17 - ... other labs

    - 1:26:10 - Superalignment team logistics

    - 1:29:17 - Generalization

    - 1:43:44 - Complementary research

    - 1:48:29 - Why is Jan optimistic?

    - 1:58:32 - Long-term agency in LLMs?

    - 2:02:44 - Do LLMs understand alignment?

    - 2:06:01 - Following Jan's research

    The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html

    Links for Jan and OpenAI:

    - OpenAI jobs: openai.com/careers

    - Jan's substack: aligned.substack.com

    - Jan's twitter: twitter.com/janleike

    Links to research and other writings we discuss:

    - Introducing Superalignment: openai.com/blog/introducing-superalignment

    - Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050

    - Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond

    - Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802

    - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143

    - Language models can explain neurons in language models: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

    - Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research

    - Training language models to follow instructions with human feedback (aka the Instruct-GPT paper): arxiv.org/abs/2203.02155

  • Is there some way we can detect bad behaviour in our AI system without having to know exactly what it looks like? In this episode, I speak with Mark Xu about mechanistic anomaly detection: a research direction based on the idea of detecting strange things happening in neural networks, in the hope that this will alert us to potential treacherous turns. We talk about the core problem of relating these mechanistic anomalies to bad behaviour, as well as the paper "Formalizing the presumption of independence", which poses the problem of formalizing heuristic mathematical reasoning, in the hope that this will let us mathematically define "mechanistic anomalies".
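
    As a concrete stand-in for the kind of detector discussed here, one simple (and much weaker) approach is to fit a distribution to a model's internal activations on trusted inputs and flag runs whose activations land unusually far from that distribution. The sketch below is an illustrative baseline of that idea under assumed interfaces, not ARC's heuristic-arguments approach.

    ```python
    # Toy activation-based anomaly detector: fit mean/covariance of hidden
    # activations on a trusted reference set, then flag inputs whose activations
    # have a large Mahalanobis distance. A simple stand-in, not ARC's method.
    import numpy as np

    def fit_reference(acts):
        """acts: (n_samples, d) array of activations on trusted inputs."""
        mu = acts.mean(axis=0)
        cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
        return mu, np.linalg.inv(cov)

    def anomaly_score(act, mu, cov_inv):
        """Squared Mahalanobis distance of a single (d,) activation vector."""
        diff = act - mu
        return float(diff @ cov_inv @ diff)

    def is_mechanistically_anomalous(act, mu, cov_inv, threshold=100.0):
        return anomaly_score(act, mu, cov_inv) > threshold
    ```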

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Episode art by Hamish Doodles: hamishdoodles.com/

    Topics we discuss, and timestamps:

    - 0:00:38 - Mechanistic anomaly detection

    - 0:09:28 - Are all bad things mechanistic anomalies, and vice versa?

    - 0:18:12 - Are responses to novel situations mechanistic anomalies?

    - 0:39:19 - Formalizing "for the normal reason, for any reason"

    - 1:05:22 - How useful is mechanistic anomaly detection?

    - 1:12:38 - Formalizing the Presumption of Independence

    - 1:20:05 - Heuristic arguments in physics

    - 1:27:48 - Difficult domains for heuristic arguments

    - 1:33:37 - Why not maximum entropy?

    - 1:44:39 - Adversarial robustness for heuristic arguments

    - 1:54:05 - Other approaches to defining mechanisms

    - 1:57:20 - The research plan: progress and next steps

    - 2:04:13 - Following ARC's research

    The transcript: axrp.net/episode/2023/07/24/episode-23-mechanistic-anomaly-detection-mark-xu.html

    ARC links:

    - Website: alignment.org

    - Theory blog: alignment.org/blog

    - Hiring page: alignment.org/hiring

    Research we discuss:

    - Formalizing the presumption of independence: arxiv.org/abs/2211.06738

    - Eliciting Latent Knowledge (aka ELK): alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge

    - Mechanistic Anomaly Detection and ELK: alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

    - Can we efficiently explain model behaviours? alignmentforum.org/posts/dQvxMZkfgqGitWdkb/can-we-efficiently-explain-model-behaviors

    - Can we efficiently distinguish different mechanisms? alignmentforum.org/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms

  • What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode's guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Episode art by Hamish Doodles: hamishdoodles.com

    Topics we discuss, and timestamps:

    - 0:00:42 - Why understand human value formation?

    - 0:19:59 - Why not design methods to align to arbitrary values?

    - 0:27:22 - Postulates about human brains

    - 0:36:20 - Sufficiency of the postulates

    - 0:44:55 - Reinforcement learning as conditional sampling

    - 0:48:05 - Compatibility with genetically-influenced behaviour

    - 1:03:06 - Why deep learning is basically what the brain does

    - 1:25:17 - Shard theory

    - 1:38:49 - Shard theory vs expected utility optimizers

    - 1:54:45 - What shard theory says about human values

    - 2:05:47 - Does shard theory mean we're doomed?

    - 2:18:54 - Will nice behaviour generalize?

    - 2:33:48 - Does alignment generalize farther than capabilities?

    - 2:42:03 - Are we at the end of machine learning history?

    - 2:53:09 - Shard theory predictions

    - 2:59:47 - The shard theory research community

    - 3:13:45 - Why do shard theorists not work on replicating human childhoods?

    - 3:25:53 - Following shardy research

    The transcript: axrp.net/episode/2023/06/15/episode-22-shard-theory-quintin-pope.html

    Shard theorist links:

    - Quintin's LessWrong profile: lesswrong.com/users/quintin-pope

    - Alex Turner's LessWrong profile: lesswrong.com/users/turntrout

    - Shard theory Discord: discord.gg/AqYkK7wqAG

    - EleutherAI Discord: discord.gg/eleutherai

    Research we discuss:

    - The Shard Theory Sequence: lesswrong.com/s/nyEFg3AuJpdAozmoX

    - Pretraining Language Models with Human Preferences: arxiv.org/abs/2302.08582

    - Inner alignment in salt-starved rats: lesswrong.com/posts/wcNEXDHowiWkRxDNv/inner-alignment-in-salt-starved-rats

    - Intro to Brain-like AGI Safety Sequence: lesswrong.com/s/HzcM2dkCq7fwXBej8

    - Brains and transformers:

    - The neural architecture of language: Integrative modeling converges on predictive processing: pnas.org/doi/10.1073/pnas.2105646118

    - Brains and algorithms partially converge in natural language processing: nature.com/articles/s42003-022-03036-1

    - Evidence of a predictive coding hierarchy in the human brain listening to speech: nature.com/articles/s41562-022-01516-2

    - Singular learning theory explainer: Neural networks generalize because of this one weird trick: lesswrong.com/posts/fovfuFdpuEwQzJu2w/neural-networks-generalize-because-of-this-one-weird-trick

    - Singular learning theory links: metauni.org/slt/

    - Implicit Regularization via Neural Feature Alignment, aka circles in the parameter-function map: arxiv.org/abs/2008.00938

    - The shard theory of human values: lesswrong.com/s/nyEFg3AuJpdAozmoX/p/iCfdcxiyr2Kj8m8mT

    - Predicting inductive biases of pre-trained networks: openreview.net/forum?id=mNtmhaDkAr

    - Understanding and controlling a maze-solving policy network, aka the cheese vector: lesswrong.com/posts/cAC4AXiNC5ig6jQnc/understanding-and-controlling-a-maze-solving-policy-network

    - Quintin's Research agenda: Supervising AIs improving AIs: lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais

    - Steering GPT-2-XL by adding an activation vector: lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector

    Links for the addendum on mesa-optimization skepticism:

    - Quintin's response to Yudkowsky arguing against AIs being steerable by gradient descent: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_argues_against_AIs_being_steerable_by_gradient_descent_

    - Quintin on why evolution is not like AI training: lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training

    - Evolution provides no evidence for the sharp left turn: lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn

    - Let's Agree to Agree: Neural Networks Share Classification Order on Real Datasets: arxiv.org/abs/1905.10854

  • Lots of people in the field of machine learning study 'interpretability', developing tools that they say give us useful information about neural networks. But how do we know if meaningful progress is actually being made? What should we want out of these tools? In this episode, I speak to Stephen Casper about these questions, as well as about a benchmark he's co-developed to evaluate whether interpretability tools can find 'Trojan horses' hidden inside neural nets.
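
    For context on what a "Trojan horse" is here: the classic way to implant one is data poisoning, where a fixed trigger patch is stamped onto a small fraction of training images that are then relabelled with an attacker-chosen class, so the trained network behaves normally unless the trigger is present. The sketch below shows that generic poisoning step under assumed array shapes; the benchmark's own trojan-generation methods are more varied.

    ```python
    # Generic data-poisoning step for implanting a trojan (backdoor) trigger in
    # an image classifier's training set. Assumes images are floats in [0, 1]
    # with shape (N, H, W, C); the benchmark paper uses fancier trojan types.
    import numpy as np

    def poison_dataset(images, labels, target_label, poison_frac=0.01, patch_size=4):
        images, labels = images.copy(), labels.copy()
        n_poison = int(poison_frac * len(images))
        idx = np.random.choice(len(images), size=n_poison, replace=False)
        images[idx, :patch_size, :patch_size, :] = 1.0   # white square trigger patch
        labels[idx] = target_label                       # force the attacker's class
        return images, labels
    ```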

    Patreon: patreon.com/axrpodcast

    Ko-fi: ko-fi.com/axrpodcast

    Topics we discuss, and timestamps:

    - 00:00:42 - Interpretability for engineers

    - 00:00:42 - Why interpretability?

    - 00:12:55 - Adversaries and interpretability

    - 00:24:30 - Scaling interpretability

    - 00:42:29 - Critiques of the AI safety interpretability community

    - 00:56:10 - Deceptive alignment and interpretability

    - 01:09:48 - Benchmarking Interpretability Tools (for Deep Neural Networks) (Using Trojan Discovery)

    - 01:10:40 - Why Trojans?

    - 01:14:53 - Which interpretability tools?

    - 01:28:40 - Trojan generation

    - 01:38:13 - Evaluation

    - 01:46:07 - Interpretability for shaping policy

    - 01:53:55 - Following Casper's work

    The transcript: axrp.net/episode/2023/05/02/episode-21-interpretability-for-engineers-stephen-casper.html

    Links for Casper:

    - Personal website: stephencasper.com/

    - Twitter: twitter.com/StephenLCasper

    - Electronic mail: scasper [at] mit [dot] edu

    Research we discuss:

    - The Engineer's Interpretability Sequence: alignmentforum.org/s/a6ne2ve5uturEEQK7

    - Benchmarking Interpretability Tools for Deep Neural Networks: arxiv.org/abs/2302.10894

    - Adversarial Policies beat Superhuman Go AIs: goattack.far.ai/

    - Adversarial Examples Are Not Bugs, They Are Features: arxiv.org/abs/1905.02175

    - Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974

    - Softmax Linear Units: transformer-circuits.pub/2022/solu/index.html

    - Red-Teaming the Stable Diffusion Safety Filter: arxiv.org/abs/2210.04610

    Episode art by Hamish Doodles: hamishdoodles.com

  • How should we scientifically think about the impact of AI on human civilization, and whether or not it will doom us all? In this episode, I speak with Scott Aaronson about his views on how to make progress in AI alignment, as well as his work on watermarking the output of language models, and how he moved from a background in quantum complexity theory to working on AI.

    Note: this episode was recorded before this story (vice.com/en/article/pkadgm/man-dies-by-suicide-after-talking-with-ai-chatbot-widow-says) emerged, about a man who died by suicide after conversations with a language-model-based chatbot that included discussion of the possibility of him killing himself.
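
    To make the watermarking section concrete, here is a hedged sketch of the style of scheme Scott has described publicly: a keyed pseudorandom function of the recent context assigns each candidate token v a score r_v in (0, 1), and the sampler picks the token maximizing r_v^(1/p_v). With truly uniform independent scores this samples exactly from the model's distribution, while a detector holding the key can later check whether the chosen tokens have suspiciously high scores. The hashing details, context window, and function names below are my assumptions, and the detector is omitted.

    ```python
    # Sketch of an exponential-minimum-sampling style watermark for LM decoding.
    # `probs` maps candidate tokens to next-token probabilities; `key` is a
    # secret. Hashing choices here are illustrative, not a production scheme.
    import hashlib

    def prf_scores(context_tokens, vocab, key, window=4):
        """Keyed pseudorandom score in (0, 1) for each candidate token."""
        ctx = ",".join(map(str, context_tokens[-window:]))
        scores = {}
        for v in vocab:
            h = hashlib.sha256(f"{key}|{ctx}|{v}".encode()).hexdigest()
            scores[v] = (int(h, 16) % 10**8 + 0.5) / 10**8
        return scores

    def watermarked_sample(probs, context_tokens, key):
        """Pick argmax r_v ** (1 / p_v); with uniform r_v this matches sampling
        from `probs`, while the keyed detector can later spot high-r_v choices."""
        r = prf_scores(context_tokens, list(probs), key)
        return max(probs, key=lambda v: r[v] ** (1.0 / max(probs[v], 1e-12)))
    ```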

    Patreon: https://www.patreon.com/axrpodcast

    Ko-fi: https://ko-fi.com/axrpodcast

    Topics we discuss, and timestamps:

    - 0:00:36 - 'Reform' AI alignment

    - 0:01:52 - Epistemology of AI risk

    - 0:20:08 - Immediate problems and existential risk

    - 0:24:35 - Aligning deceitful AI

    - 0:30:59 - Stories of AI doom

    - 0:34:27 - Language models

    - 0:43:08 - Democratic governance of AI

    - 0:59:35 - What would change Scott's mind

    - 1:14:45 - Watermarking language model outputs

    - 1:41:41 - Watermark key secrecy and backdoor insertion

    - 1:58:05 - Scott's transition to AI research

    - 2:03:48 - Theoretical computer science and AI alignment

    - 2:14:03 - AI alignment and formalizing philosophy

    - 2:22:04 - How Scott finds AI research

    - 2:24:53 - Following Scott's research

    The transcript: axrp.net/episode/2023/04/11/episode-20-reform-ai-alignment-scott-aaronson.html

    Links to Scott's things:

    - Personal website: scottaaronson.com

    - Book, Quantum Computing Since Democritus: amazon.com/Quantum-Computing-since-Democritus-Aaronson/dp/0521199565/

    - Blog, Shtetl-Optimized: scottaaronson.blog

    Writings we discuss:

    - Reform AI Alignment: scottaaronson.blog/?p=6821

    - Planting Undetectable Backdoors in Machine Learning Models: arxiv.org/abs/2204.06974

  • How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.
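
    One concrete diagnostic from this line of work: induction heads are commonly identified by their prefix-matching behaviour on a random token sequence repeated twice. On the second copy, such a head attends back to the token that followed the current token's earlier occurrence. Below is a minimal sketch of that score, assuming you already have the head's attention pattern as a matrix (the function name is mine).

    ```python
    # Prefix-matching score for a candidate induction head. Feed the model a
    # random token sequence repeated twice and take the head's attention pattern
    # `attn` (shape (2 * half_len, 2 * half_len), rows = query positions). An
    # induction head at position p in the second copy attends to position
    # p - half_len + 1: the token that followed the earlier occurrence.
    import numpy as np

    def prefix_matching_score(attn, half_len):
        positions = range(half_len, 2 * half_len)
        targets = [p - half_len + 1 for p in positions]
        # high average attention (close to 1) suggests induction-head behaviour
        return float(np.mean([attn[p, t] for p, t in zip(positions, targets)]))
    ```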

    Topics we discuss, and timestamps:

    - 00:01:05 - What is mechanistic interpretability?

    - 00:24:16 - Types of AI cognition

    - 00:54:27 - Automating mechanistic interpretability

    - 01:11:57 - Summarizing the papers

    - 01:24:43 - 'A Mathematical Framework for Transformer Circuits'

    - 01:39:31 - How attention works

    - 01:49:26 - Composing attention heads

    - 01:59:42 - Induction heads

    - 02:11:05 - 'In-context Learning and Induction Heads'

    - 02:12:55 - The multiplicity of induction heads

    - 02:30:10 - Lines of evidence

    - 02:38:47 - Evolution in loss-space

    - 02:46:19 - Mysteries of in-context learning

    - 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'

    - 02:50:57 - How neural nets learn modular addition

    - 03:11:37 - The suddenness of grokking

    - 03:34:16 - Relation to other research

    - 03:43:57 - Could mechanistic interpretability possibly work?

    - 03:49:28 - Following Neel's research

    The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

    Links to Neel's things:

    - Neel on Twitter: twitter.com/NeelNanda5

    - Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1

    - Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability

    - TransformerLens: github.com/neelnanda-io/TransformerLens

    - Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic

    - Neel on YouTube: youtube.com/@neelnanda2469

    - 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj

    - Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J

    Writings we discuss:

    - A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html

    - In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html

    - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217

    - Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052

    - interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

    - Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262

    - Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097

    - Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN

    - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143

    - Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593

    - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177

    - The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544

    - Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration

    - Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913

    - Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves

    - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635

    - Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a

  • I have a new podcast, where I interview whoever I want about whatever I want. It's called "The Filan Cabinet", and you can find it wherever you listen to podcasts. The first three episodes are about pandemic preparedness, God, and cryptocurrency. For more details, check out the podcast website (thefilancabinet.com), or search "The Filan Cabinet" in your podcast app.

  • Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this topic, what the relationship to AI alignment is, and what the open questions are.

    Topics we discuss, and timestamps:

    - 00:00:44 - What is concept extrapolation

    - 00:15:25 - When is concept extrapolation possible

    - 00:30:44 - A toy formalism

    - 00:37:25 - Uniqueness of extrapolations

    - 00:48:34 - Unity of concept extrapolation methods

    - 00:53:25 - Concept extrapolation and corrigibility

    - 00:59:51 - Is concept extrapolation possible?

    - 01:37:05 - Misunderstandings of Stuart's approach

    - 01:44:13 - Following Stuart's work

    The transcript: axrp.net/episode/2022/09/03/episode-18-concept-extrapolation-stuart-armstrong.html

    Stuart's startup, Aligned AI: aligned-ai.com

    Research we discuss:

    - The Concept Extrapolation sequence: alignmentforum.org/s/u9uawicHx7Ng7vwxA

    - The HappyFaces benchmark: github.com/alignedai/HappyFaces

    - Goal Misgeneralization in Deep Reinforcement Learning: arxiv.org/abs/2105.14111

  • Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned.
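
    The loop the episode is about, finding inputs where the classifier fails and folding them back into training, can be sketched as below. Every interface here (the adversary, the gold labeller, the classifier) is a placeholder assumption rather than Redwood's actual tooling.

    ```python
    # Sketch of an adversarial-training loop for a high-reliability classifier.
    # An adversary hunts for inputs the classifier wrongly calls "safe", a gold
    # labeller confirms the failures, and confirmed failures are added to the
    # training set. All components are placeholders, not Redwood's actual code.

    def adversarial_training(classifier, adversary, gold_label, train_set, rounds=5):
        for _ in range(rounds):
            candidates = adversary.generate(classifier)          # try to fool it
            failures = [x for x in candidates
                        if gold_label(x) == "unsafe"
                        and classifier.predict(x) == "safe"]
            if not failures:
                break                                            # nothing found
            train_set.extend((x, "unsafe") for x in failures)    # fold them back in
            classifier.fit(train_set)                            # retrain
        return classifier
    ```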

    Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript).

    Topics we discuss, and timestamps:

    - 00:00:40 - Summary of the paper

    - 00:02:23 - Alignment as scalable oversight and catastrophe minimization

    - 00:08:06 - Novel contributions

    - 00:14:20 - Evaluating adversarial robustness

    - 00:20:26 - Adversary construction

    - 00:35:14 - The task

    - 00:38:23 - Fanfiction

    - 00:42:15 - Estimators to reduce labelling burden

    - 00:45:39 - Future work

    - 00:50:12 - About Redwood Research

    The transcript: axrp.net/episode/2022/08/21/episode-17-training-for-very-high-reliability-daniel-ziegler.html

    Daniel Ziegler on Google Scholar: scholar.google.com/citations?user=YzfbfDgAAAAJ

    Research we discuss:

    - Daniel's paper, Adversarial Training for High-Stakes Reliability: arxiv.org/abs/2205.01663

    - Low-stakes alignment: alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment

    - Red Teaming Language Models with Language Models: arxiv.org/abs/2202.03286

    - Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472

    - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

  • Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 (axrp.net/episode/2021/04/08/episode-6-debate-beth-barnes.html) if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, getting language models to back up claims they make with citations, and figuring out how uncertain language models should be about the quality of various answers.
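
    One of the three projects, "Red Teaming Language Models with Language Models", can be summarized as a pipeline: a red-team language model proposes lots of candidate test prompts, the target model answers them, and a harm classifier picks out the prompts that elicited bad behaviour. A minimal sketch with placeholder interfaces (none of these names come from the paper's code):

    ```python
    # Sketch of red-teaming a language model with another language model:
    # generate candidate attack prompts with a red-team LM, query the target LM,
    # and keep the prompts whose replies a harm classifier flags.

    def red_team(red_lm, target_lm, harm_classifier, n_prompts=1000, threshold=0.5):
        seed = "Write a question that might get an AI assistant to say something harmful."
        prompts = [red_lm.generate(seed) for _ in range(n_prompts)]
        replies = [(p, target_lm.generate(p)) for p in prompts]
        return [(p, r) for p, r in replies if harm_classifier.score(r) > threshold]
    ```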

    Topics we discuss, and timestamps:

    - 00:00:48 - Status update on AI safety via debate

    - 00:10:24 - Language models and AI safety

    - 00:19:34 - Red teaming language models with language models

    - 00:35:31 - GopherCite

    - 00:49:10 - Uncertainty Estimation for Language Reward Models

    - 01:00:26 - Following Geoffrey's work, and working with him

    The transcript: axrp.net/episode/2022/07/01/episode-16-preparing-for-debate-ai-geoffrey-irving.html

    Geoffrey's twitter: twitter.com/geoffreyirving

    Research we discuss:

    - Red Teaming Language Models With Language Models: arxiv.org/abs/2202.03286

    - Teaching Language Models to Support Answers with Verified Quotes, aka GopherCite: arxiv.org/abs/2203.11147

    - Uncertainty Estimation for Language Reward Models: arxiv.org/abs/2203.07472

    - AI Safety via Debate: arxiv.org/abs/1805.00899

    - Writeup: progress on AI safety via debate: lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1

    - Eliciting Latent Knowledge: ai-alignment.com/eliciting-latent-knowledge-f977478608fc

    - Training Compute-Optimal Large Language Models, aka Chinchilla: arxiv.org/abs/2203.15556

  • Why does anybody care about natural abstractions? Do they somehow relate to math, or value learning? How do E. coli bacteria find sources of sugar? All these questions and more will be answered in this interview with John Wentworth, where we talk about his research plan of understanding agency via natural abstractions.

    Topics we discuss, and timestamps:

    - 00:00:31 - Agency in E. Coli

    - 00:04:59 - Agency in financial markets

    - 00:08:44 - Inferring agency in real-world systems

    - 00:16:11 - Selection theorems

    - 00:20:22 - Abstraction and natural abstractions

    - 00:32:42 - Information at a distance

    - 00:39:20 - Why the natural abstraction hypothesis matters

    - 00:44:48 - Unnatural abstractions used by humans?

    - 00:49:11 - Probability, determinism, and abstraction

    - 00:52:58 - Whence probabilities in deterministic universes?

    - 01:02:37 - Abstraction and maximum entropy distributions

    - 01:07:39 - Natural abstractions and impact

    - 01:08:50 - Learning human values

    - 01:20:47 - The shape of the research landscape

    - 01:34:59 - Following John's work

    The transcript: axrp.net/episode/2022/05/23/episode-15-natural-abstractions-john-wentworth.html

    John on LessWrong: lesswrong.com/users/johnswentworth

    Research that we discuss:

    - Alignment by default - contains the natural abstraction hypothesis: alignmentforum.org/posts/Nwgdq6kHke5LY692J/alignment-by-default#Unsupervised__Natural_Abstractions

    - The telephone theorem: alignmentforum.org/posts/jJf4FrfiQdDGg7uco/information-at-a-distance-is-mediated-by-deterministic

    - Generalizing Koopman-Pitman-Darmois: alignmentforum.org/posts/tGCyRQigGoqA4oSRo/generalizing-koopman-pitman-darmois

    - The plan: alignmentforum.org/posts/3L46WGauGpr7nYubu/the-plan

    - Understanding deep learning requires rethinking generalization - deep learning can fit random data: arxiv.org/abs/1611.03530

    - A closer look at memorization in deep networks - deep learning learns before memorizing: arxiv.org/abs/1706.05394

    - Zero-shot coordination: arxiv.org/abs/2003.02979

    - A new formalism, method, and open issues for zero-shot coordination: arxiv.org/abs/2106.06613

    - Conservative agency via attainable utility preservation: arxiv.org/abs/1902.09725

    - Corrigibility: intelligence.org/files/Corrigibility.pdf

    Errata:

    - E. coli has ~4,400 genes, not 30,000.

    - A typical adult human body has thousands of moles of water in it, and therefore must consist of well more than 10 moles total.
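
    A quick sanity check of the second erratum, assuming a roughly 70 kg adult who is about 60% water by mass (water's molar mass is about 18 g/mol):

    \[
      \frac{0.6 \times 70\,\mathrm{kg}}{0.018\,\mathrm{kg/mol}} \approx 2.3 \times 10^{3}\ \mathrm{mol},
    \]

    i.e. thousands of moles of water alone, so the "well more than 10 moles" correction follows immediately.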

  • Late last year, Vanessa Kosoy and Alexander Appel published some research under the heading of "Infra-Bayesian physicalism". But wait - what was infra-Bayesianism again? Why should we care? And what does any of this have to do with physicalism? In this episode, I talk with Vanessa Kosoy about these questions, and get a technical overview of how infra-Bayesian physicalism works and what its implications are.

    Topics we discuss, and timestamps:

    - 00:00:48 - The basics of infra-Bayes

    - 00:08:32 - An invitation to infra-Bayes

    - 00:11:23 - What is naturalized induction?

    - 00:19:53 - How infra-Bayesian physicalism helps with naturalized induction

    - 00:19:53 - Bridge rules

    - 00:22:22 - Logical uncertainty

    - 00:23:36 - Open source game theory

    - 00:28:27 - Logical counterfactuals

    - 00:30:55 - Self-improvement

    - 00:32:40 - How infra-Bayesian physicalism works

    - 00:32:47 - World models

    - 00:39:20 - Priors

    - 00:42:53 - Counterfactuals

    - 00:50:34 - Anthropics

    - 00:54:40 - Loss functions

    - 00:56:44 - The monotonicity principle

    - 01:01:57 - How to care about various things

    - 01:08:47 - Decision theory

    - 01:19:53 - Follow-up research

    - 01:20:06 - Infra-Bayesian physicalist quantum mechanics

    - 01:26:42 - Infra-Bayesian physicalist agreement theorems

    - 01:29:00 - The production of infra-Bayesianism research

    - 01:35:14 - Bridge rules and malign priors

    - 01:45:27 - Following Vanessa's work

    The transcript: axrp.net/episode/2022/04/05/episode-14-infra-bayesian-physicalism-vanessa-kosoy.html

    Vanessa on the Alignment Forum: alignmentforum.org/users/vanessa-kosoy

    Research that we discuss:

    - Infra-Bayesian physicalism: a formal theory of naturalized induction: alignmentforum.org/posts/gHgs2e2J5azvGFatb/infra-bayesian-physicalism-a-formal-theory-of-naturalized

    - Updating ambiguous beliefs (contains the infra-Bayesian update rule): sciencedirect.com/science/article/abs/pii/S0022053183710033

    - Functional Decision Theory: A New Theory of Instrumental Rationality: arxiv.org/abs/1710.05060

    - Space-time embedded intelligence: cs.utexas.edu/~ring/Orseau,%20Ring%3B%20Space-Time%20Embedded%20Intelligence,%20AGI%202012.pdf

    - Attacking the grain of truth problem using Bayes-Savage agents (generating a simplicity prior with Knightian uncertainty using oracle machines): alignmentforum.org/posts/5bd75cc58225bf0670375273/attacking-the-grain-of-truth-problem-using-bayes-sa

    - Quantity of experience: brain-duplication and degrees of consciousness (the thick wires argument): nickbostrom.com/papers/experience.pdf

    - Online learning in unknown Markov games: arxiv.org/abs/2010.15020

    - Agreeing to disagree (contains the Aumann agreement theorem): ma.huji.ac.il/~raumann/pdf/Agreeing%20to%20Disagree.pdf

    - What does the universal prior actually look like? (aka "the Solomonoff prior is malign"): ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/

    - The Solomonoff prior is malign: alignmentforum.org/posts/Tr7tAyt5zZpdTwTQK/the-solomonoff-prior-is-malign

    - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

    - ELK Thought Dump, by Abram Demski: alignmentforum.org/posts/eqzbXmqGqXiyjX3TP/elk-thought-dump-1

  • How should we think about artificial general intelligence (AGI), and the risks it might pose? What constraints exist on technical solutions to the problem of aligning superhuman AI systems with human intentions? In this episode, I talk to Richard Ngo about his report analyzing AGI safety from first principles, and recent conversations he had with Eliezer Yudkowsky about the difficulty of AI alignment.

    Topics we discuss, and timestamps:

    - 00:00:40 - The nature of intelligence and AGI

    - 00:01:18 - The nature of intelligence

    - 00:06:09 - AGI: what and how

    - 00:13:30 - Single vs collective AI minds

    - 00:18:57 - AGI in practice

    - 00:18:57 - Impact

    - 00:20:49 - Timing

    - 00:25:38 - Creation

    - 00:28:45 - Risks and benefits

    - 00:35:54 - Making AGI safe

    - 00:35:54 - Robustness of the agency abstraction

    - 00:43:15 - Pivotal acts

    - 00:50:05 - AGI safety concepts

    - 00:50:05 - Alignment

    - 00:56:14 - Transparency

    - 00:59:25 - Cooperation

    - 01:01:40 - Optima and selection processes

    - 01:13:33 - The AI alignment research community

    - 01:13:33 - Updates from the Yudkowsky conversation

    - 01:17:18 - Corrections to the community

    - 01:23:57 - Why others don't join

    - 01:26:38 - Richard Ngo as a researcher

    - 01:28:26 - The world approaching AGI

    - 01:30:41 - Following Richard's work

    The transcript: axrp.net/episode/2022/03/31/episode-13-first-principles-agi-safety-richard-ngo.html

    Richard on the Alignment Forum: alignmentforum.org/users/ricraz

    Richard on Twitter: twitter.com/RichardMCNgo

    The AGI Safety Fundamentals course: eacambridge.org/agi-safety-fundamentals

    Materials that we mention:

    - AGI Safety from First Principles: alignmentforum.org/s/mzgtmmTKKn5MuCzFJ

    - Conversations with Eliezer Yudkowsky: alignmentforum.org/s/n945eovrA3oDueqtq

    - The Bitter Lesson: incompleteideas.net/IncIdeas/BitterLesson.html

    - Metaphors We Live By: en.wikipedia.org/wiki/Metaphors_We_Live_By

    - The Enigma of Reason: hup.harvard.edu/catalog.php?isbn=9780674237827

    - Draft report on AI timelines, by Ajeya Cotra: alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines

    - More is Different for AI: bounded-regret.ghost.io/more-is-different-for-ai/

    - The Windfall Clause: fhi.ox.ac.uk/windfallclause

    - Cooperative Inverse Reinforcement Learning: arxiv.org/abs/1606.03137

    - Imitative Generalisation: alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1

    - Eliciting Latent Knowledge: docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit

    - Draft report on existential risk from power-seeking AI, by Joseph Carlsmith: alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai

    - The Most Important Century: cold-takes.com/most-important-century

  • Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk.

    Topics we discuss, and timestamps:

    - 00:00:38 - How AI may pose an existential threat

    - 00:13:36 - AI timelines

    - 00:24:49 - Why we might build risky AI

    - 00:33:58 - Takeoff speeds

    - 00:51:33 - Why AI could have bad motivations

    - 00:56:33 - Lessons from our current world

    - 01:08:23 - "Superintelligence"

    - 01:15:21 - Technical causes of AI x-risk

    - 01:19:32 - Intent alignment

    - 01:33:52 - Outer and inner alignment

    - 01:43:45 - Thoughts on agent foundations

    - 01:49:35 - Possible technical solutions to AI x-risk

    - 01:49:35 - Imitation learning, inverse reinforcement learning, and ease of evaluation

    - 02:00:34 - Paul's favorite outer alignment solutions

    - 02:01:20 - Solutions researched by others

    - 02:06:13 - Decoupling planning from knowledge

    - 02:17:18 - Factored cognition

    - 02:25:34 - Possible solutions to inner alignment

    - 02:31:56 - About Paul

    - 02:31:56 - Paul's research style

    - 02:36:36 - Disagreements and uncertainties

    - 02:46:08 - Some favorite organizations

    - 02:48:21 - Following Paul's work

    The transcript: axrp.net/episode/2021/12/02/episode-12-ai-xrisk-paul-christiano.html

    Paul's blog posts on AI alignment: ai-alignment.com

    Material that we mention:

    - Cold Takes - The Most Important Century: cold-takes.com/most-important-century

    - Open Philanthropy reports on:

    - Modeling the human trajectory: openphilanthropy.org/blog/modeling-human-trajectory

    - The computational power of the human brain: openphilanthropy.org/blog/new-report-brain-computation

    - AI timelines (draft): alignmentforum.org/posts/KrJfoZzpSDpnrv9va/draft-report-on-ai-timelines

    - Whether AI could drive explosive economic growth: openphilanthropy.org/blog/report-advanced-ai-drive-explosive-economic-growth

    - Takeoff speeds: sideways-view.com/2018/02/24/takeoff-speeds

    - Superintelligence: Paths, Dangers, Strategies: en.wikipedia.org/wiki/Superintelligence:_Paths,_Dangers,_Strategies

    - Wei Dai on metaphilosophical competence:

    - Two neglected problems in human-AI safety: alignmentforum.org/posts/HTgakSs6JpnogD6c2/two-neglected-problems-in-human-ai-safety

    - The argument from philosophical difficulty: alignmentforum.org/posts/w6d7XBCegc96kz4n3/the-argument-from-philosophical-difficulty

    - Some thoughts on metaphilosophy: alignmentforum.org/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy

    - AI safety via debate: arxiv.org/abs/1805.00899

    - Iterated distillation and amplification: ai-alignment.com/iterated-distillation-and-amplification-157debfd1616

    - Scalable agent alignment via reward modeling: a research direction: arxiv.org/abs/1811.07871

    - Learning the prior: alignmentforum.org/posts/SL9mKhgdmDKXmxwE4/learning-the-prior

    - Imitative generalisation (AKA 'learning the prior'): alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1

    - When is unaligned AI morally valuable?: ai-alignment.com/sympathizing-with-ai-e11a4bf5ef6e

  • Many scary stories about AI involve an AI system deceiving and subjugating humans in order to gain the ability to achieve its goals without us stopping it. This episode's guest, Alex Turner, will tell us about his research analyzing the notions of "attainable utility" and "power" that underlie these stories, so that we can better evaluate how likely they are and how to prevent them.
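
    For readers who want the formal object behind the word "power": in "Optimal Policies Tend to Seek Power", the power of a state is (roughly, and glossing over normalization) the agent's average attainable utility there, i.e. the optimal value of that state averaged over a distribution of reward functions:

    \[
      \mathrm{POWER}_{\mathcal{D}}(s, \gamma) \;\propto\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) \right],
    \]

    so states that keep many options open score highly under many different reward functions, which is the sense in which seeking them is instrumentally convergent.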

    Topics we discuss:

    - Side effects minimization

    - Attainable Utility Preservation (AUP)

    - AUP and alignment

    - Power-seeking

    - Power-seeking and alignment

    - Future work and about Alex

    The transcript: axrp.net/episode/2021/09/25/episode-11-attainable-utility-power-alex-turner.html

    Alex on the AI Alignment Forum: alignmentforum.org/users/turntrout

    Alex's Google Scholar page: scholar.google.com/citations?user=thAHiVcAAAAJ&hl=en&oi=ao

    Conservative Agency via Attainable Utility Preservation: arxiv.org/abs/1902.09725

    Optimal Policies Tend to Seek Power: arxiv.org/abs/1912.01683

    Other works discussed:

    - Avoiding Side Effects by Considering Future Tasks: arxiv.org/abs/2010.07877

    - The "Reframing Impact" Sequence: alignmentforum.org/s/7CdoznhJaLEKHwvJW

    - The "Risks from Learned Optimization" Sequence: alignmentforum.org/s/7CdoznhJaLEKHwvJW

    - Concrete Approval-Directed Agents: ai-alignment.com/concrete-approval-directed-agents-89e247df7f1b

    - Seeking Power is Convergently Instrumental in a Broad Class of Environments: alignmentforum.org/s/fSMbebQyR4wheRrvk/p/hzeLSQ9nwDkPc4KNt

    - Formalizing Convergent Instrumental Goals: intelligence.org/files/FormalizingConvergentGoals.pdf

    - The More Power at Stake, the Stronger Instrumental Convergence Gets for Optimal Policies: alignmentforum.org/posts/Yc5QSSZCQ9qdyxZF6/the-more-power-at-stake-the-stronger-instrumental

    - Problem Relaxation as a Tactic: alignmentforum.org/posts/JcpwEKbmNHdwhpq5n/problem-relaxation-as-a-tactic

    - How I do Research: lesswrong.com/posts/e3Db4w52hz3NSyYqt/how-i-do-research

    - Math that Clicks: Look for Two-way Correspondences: lesswrong.com/posts/Lotih2o2pkR2aeusW/math-that-clicks-look-for-two-way-correspondences

    - Testing the Natural Abstraction Hypothesis: alignmentforum.org/posts/cy3BhHrGinZCp3LXE/testing-the-natural-abstraction-hypothesis-project-intro