Episodes
-
This post introduces the concept of Pareto frontiers. The top comment, by Rob Miles, also ties it to comparative advantage.
While reading, consider what Pareto frontiers your project could place you on.
Original text:
https://www.lesswrong.com/posts/XvN2QQpKTuEzgkZHY/being-the-pareto-best-in-the-world
Author:
John Wentworth -
I am approaching the end of my AI governance PhD, and I’ve spent about 2.5 years as a researcher at FHI. During that time, I’ve learnt a lot about the formula for successful early-career research.
This post summarises my advice for people in the first couple of years of their research careers. Research is really hard, and I want people to avoid the mistakes I’ve made.
Original text:
https://forum.effectivealtruism.org/posts/jfHPBbYFzCrbdEXXd/how-to-succeed-as-an-early-stage-researcher-the-lean-startup#Conclusion
Author:
Toby Shevlane -
The next four weeks of the course are an opportunity for you to actually build a thing that moves you closer to contributing to AI Alignment, and we're really excited to see what you do!
A common failure mode is to think "Oh, I can't actually do X" or to say "Someone else is probably doing Y."
You probably can do X, and it's unlikely anyone is doing Y! It could be you!
Original text:
https://www.neelnanda.io/blog/become-a-person-who-actually-does-things
Author:
Neel Nanda -
We took 10 years of research and what we’ve learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-week course to create your career plan, and then compressed that into this three-page summary of the main points.
(It’s especially aimed at people who want a career that’s both satisfying and has a significant positive impact, but much of the advice applies to all career decisions.)
Original article:
https://80000hours.org/career-planning/summary/
Author:
Benjamin Todd -
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren’t familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course.
by Charlie Rogers-Smith, with minor updates by Adam Jones
Source:
https://aisafetyfundamentals.com/blog/alignment-careers-guide
Narrated for AI Safety Fundamentals by Perrin Walker -
This post summarises a new report, “Computing Power and the Governance of Artificial Intelligence.” The full report is a collaboration between nineteen researchers from academia, civil society, and industry. It can be read here.
GovAI research blog posts represent the views of their authors, rather than the views of the organisation.
Source:
https://www.governance.ai/post/computing-power-and-the-governance-of-ai
Narrated for AI Safety Fundamentals by Perrin Walker -
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
We summarize the paper;
We compare our methodology to the methodology of other safety papers.
Source:
https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
Narrated for AI Safety Fundamentals by Perrin Walker -
Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truthfulness, fairness, potential for misuse, and so on. We are able to talk about these characteristics because we can technically evaluate models for their performance in these areas. But what many people working inside and outside of AI don’t fully appreciate is how difficult it is to build robust and reliable model evaluations. Many of today’s existing evaluation suites are limited in their ability to serve as accurate indicators of model capabilities or safety.
At Anthropic, we spend a lot of time building evaluations to better understand our AI systems. We also use evaluations to improve our safety as an organization, as illustrated by our Responsible Scaling Policy. In doing so, we have grown to appreciate some of the ways in which developing and running evaluations can be challenging.
Here, we outline challenges that we have encountered while evaluating our own models to give readers a sense of what developing, implementing, and interpreting model evaluations looks like in practice.
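To give a concrete sense of how small implementation choices can skew results, here is a minimal sketch of a multiple-choice evaluation harness. The `query_model` stand-in, the prompt format, and the fragile answer parsing are hypothetical illustrations of the kind of detail the post discusses, not Anthropic's evaluation code.

```python
# Minimal multiple-choice eval harness sketch. `query_model` is a hypothetical
# stand-in for any model API; this is not any organisation's actual eval framework.
from typing import Callable

def run_mcq_eval(questions: list[dict], query_model: Callable[[str], str]) -> float:
    correct = 0
    for q in questions:
        options = "\n".join(f"({label}) {text}" for label, text in q["options"].items())
        prompt = f"{q['question']}\n{options}\nAnswer with a single letter."
        reply = query_model(prompt).strip()
        # Fragile parsing like this is one common source of misleading scores:
        # a model that answers "The answer is (B)" would be marked wrong here.
        if reply and reply[0].upper() == q["answer"]:
            correct += 1
    return correct / len(questions)

example = [{
    "question": "2 + 2 = ?",
    "options": {"A": "3", "B": "4", "C": "5"},
    "answer": "B",
}]
print(run_mcq_eval(example, query_model=lambda prompt: "B"))  # 1.0 with a toy model stub
```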
Source:
https://www.anthropic.com/news/evaluating-ai-systems
Narrated for AI Safety Fundamentals by Perrin Walker -
The UK recognises the enormous opportunities that AI can unlock across our economy and our society. However, without appropriate guardrails, such technologies can pose significant risks. The AI Safety Summit will focus on how best to manage the risks from frontier AI such as misuse, loss of control and societal harms. Frontier AI organisations play an important role in addressing these risks and promoting the safety of the development and deployment of frontier AI.
The UK has therefore encouraged frontier AI organisations to publish details on their frontier AI safety policies ahead of the AI Safety Summit hosted by the UK on 1 to 2 November 2023. This will provide transparency regarding how they are putting into practice voluntary AI safety commitments and enable the sharing of safety practices within the AI ecosystem. Transparency of AI systems can increase public trust, which can be a significant driver of AI adoption.
This document complements these publications by providing a potential list of frontier AI organisations’ safety policies.
Source:
https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety/emerging-processes-for-frontier-ai-safety
Narrated for AI Safety Fundamentals by Perrin Walker -
Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show. One common proposal is that big companies should incorporate watermarks into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that’s undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.
Unfortunately, watermarking schemes are unlikely to work. So far most have proven easy to remove, and it’s likely that future schemes will have similar problems.
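For a concrete picture of the pixel-tweaking idea above, here is a minimal sketch of a least-significant-bit watermark and of how a trivial transformation strips it. The LSB scheme, helper names, and toy image are illustrative assumptions, not any vendor's actual method; real schemes are more sophisticated but face similar removal problems.

```python
# Sketch: hide a bit pattern in the least significant bit of each pixel
# (invisible to the eye, detectable by a program), then show how easily it is lost.
import numpy as np

def embed_watermark(image: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Overwrite the LSB of the first len(bits) pixels with the payload."""
    flat = image.flatten()
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits
    return flat.reshape(image.shape)

def detect_watermark(image: np.ndarray, bits: np.ndarray) -> bool:
    """Check whether the expected bit pattern is still present."""
    return np.array_equal(image.flatten()[: bits.size] & 1, bits)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)      # toy grayscale image
payload = rng.integers(0, 2, size=256, dtype=np.uint8)          # watermark bits

marked = embed_watermark(img, payload)
print(detect_watermark(marked, payload))            # True: watermark present
print(detect_watermark(marked // 2 * 2, payload))   # False: simple re-quantization strips it
```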
Source:
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Narrated for AI Safety Fundamentals by Perrin Walker -
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria–faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.
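As a loose illustration of the causal-intervention style of analysis the paper relies on, here is a minimal activation-patching sketch using plain PyTorch forward hooks on a toy network. The model, inputs, and metric are assumptions for illustration only, not the paper's GPT-2 small setup or attention-head hooks.

```python
# Activation patching sketch: cache a component's activation on a "clean" input,
# then splice it into a run on a "corrupted" input and see how much of the clean
# behaviour is restored.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
target_layer = model[0]                              # the component we intervene on
clean_x, corrupted_x = torch.randn(1, 8), torch.randn(1, 8)
cache = {}

def cache_hook(module, inputs, output):
    cache["clean_act"] = output.detach()             # save activation from the clean run

def patch_hook(module, inputs, output):
    return cache["clean_act"]                        # overwrite with the cached clean activation

# 1. Run on the clean input and cache the target activation.
handle = target_layer.register_forward_hook(cache_hook)
clean_logits = model(clean_x)
handle.remove()

# 2. Run on the corrupted input, with and without the patch.
corrupted_logits = model(corrupted_x)
handle = target_layer.register_forward_hook(patch_hook)
patched_logits = model(corrupted_x)
handle.remove()

# 3. If patching this one component moves the output back toward the clean run,
#    that is causal evidence the component matters for the behaviour being studied.
print(clean_logits, corrupted_logits, patched_logits, sep="\n")
```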
Source:
https://arxiv.org/pdf/2211.00593.pdf
Narrated for AI Safety Fundamentals by Perrin Walker -
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars. In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.
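Here is a minimal sketch of the training setup described above, assuming a standard PyTorch formulation: an overcomplete autoencoder that reconstructs cached activations under an L1 sparsity penalty on its hidden features. The dimensions, sparsity coefficient, and random stand-in activations are illustrative, not the paper's actual hyperparameters.

```python
# Sparse autoencoder sketch: many more features than neurons, trained to reconstruct
# activations while keeping feature activations sparse via an L1 penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, overcomplete features
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=512, d_features=4096)       # illustrative sizes
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

mlp_acts = torch.randn(64, 512)                             # stand-in for cached MLP activations
recon, feats = sae(mlp_acts)
loss = ((recon - mlp_acts) ** 2).mean() + l1_coeff * feats.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```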
Source:
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Narrated for AI Safety Fundamentals by Perrin Walker -
By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks.
Many important transition points in the history of science have been moments when science “zoomed in.” At these points, we develop a visualization or tool that allows us to see the world in a new level of detail, and a new field of science develops to study the world through this lens.
For example, microscopes let us see cells, leading to cellular biology. Science zoomed in. Several techniques including x-ray crystallography let us see DNA, leading to the molecular revolution. Science zoomed in. Atomic theory. Subatomic particles. Neuroscience. Science zoomed in.
These transitions weren’t just a change in precision: they were qualitative changes in what the objects of scientific inquiry are. For example, cellular biology isn’t just more careful zoology. It’s a new kind of inquiry that dramatically shifts what we can understand.
The famous examples of this phenomenon happened at a very large scale, but it can also be the more modest shift of a small research community realizing they can now study their topic in a finer grained level of detail.
Source:
https://distill.pub/2020/circuits/zoom-in/
Narrated for AI Safety Fundamentals by Perrin Walker -
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively fine-tune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive fine-tuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.
We find that simple methods can often significantly improve weak-to-strong generalization: for example, when fine-tuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
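To make the auxiliary confidence loss idea more concrete, here is a rough sketch of one way such a loss could be written: fit the weak supervisor's labels while also letting the strong model stay consistent with its own hardened predictions. The exact weighting and hardening below are simplifying assumptions, not the paper's precise formulation.

```python
# Weak-to-strong fine-tuning loss sketch (illustrative, not the paper's exact loss).
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_labels, alpha=0.5):
    # Term 1: fit the (possibly noisy) labels produced by the weak supervisor.
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    # Term 2: cross-entropy against the strong model's own hardened predictions,
    # which lets it override weak labels it confidently disagrees with.
    hardened = strong_logits.detach().argmax(dim=-1)
    ce_self = F.cross_entropy(strong_logits, hardened)
    return (1 - alpha) * ce_weak + alpha * ce_self

strong_logits = torch.randn(8, 2, requires_grad=True)   # strong model outputs on a batch
weak_labels = torch.randint(0, 2, (8,))                  # labels from the weak supervisor
loss = weak_to_strong_loss(strong_logits, weak_labels)
loss.backward()
```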
Source:
https://arxiv.org/pdf/2312.09390.pdf
Narrated for AI Safety Fundamentals by Perrin Walker -
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by increasing the abilities of humans to give feedback on complex tasks.
This article briefly recaps some of the challenges faced with human feedback, and introduces the approaches to scalable oversight covered in session 4 of our AI Alignment course.
Source:
https://aisafetyfundamentals.com/blog/scalable-oversight-intro/
Narrated for AI Safety Fundamentals by Perrin Walker -
The two tasks of supervised learning: regression and classification. Linear regression, loss functions, and gradient descent.
How much money will we make by spending more dollars on digital advertising? Will this loan applicant pay back the loan or not? What’s going to happen to the stock market tomorrow?
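As a concrete companion to the episode description above, here is a minimal worked example of linear regression fit by gradient descent on a mean-squared-error loss. The synthetic advertising-spend data is invented purely for illustration.

```python
# Linear regression by gradient descent on a mean-squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
spend = rng.uniform(0, 10, size=100)                  # ad spend (feature x)
revenue = 3.0 * spend + 5.0 + rng.normal(0, 1, 100)   # revenue (target y): true slope 3, intercept 5

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    pred = w * spend + b
    error = pred - revenue
    grad_w = 2 * (error * spend).mean()               # d(MSE)/dw
    grad_b = 2 * error.mean()                         # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}")                # should be close to 3 and 5
```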
Original article:
https://medium.com/machine-learning-for-humans/supervised-learning-740383a2feab
Author:
Vishal Maini -
The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ideas using a series of illustrative examples:
There have been huge jumps in the capabilities of AIs over the last decade, to the point where it’s becoming hard to specify tasks that AIs can’t do.
This progress has been primarily driven by scaling up a handful of relatively simple algorithms (rather than by developing a more principled or scientific understanding of deep learning).
Very few people predicted that progress would be anywhere near this fast; but many of those who did also predict that we might face existential risk from AGI in the coming decades.
I’ll focus on four domains: vision, games, language-based tasks, and science. The first two have more limited real-world applications, but provide particularly graphic and intuitive examples of the pace of progress.
Original article:
https://medium.com/@richardcngo/visualizing-the-deep-learning-revolution-722098eb9c5
Author:
Richard Ngo -
MIRI’s mission is to ensure that the creation of smarter-than-human artificial intelligence has a positive impact. Why is this mission important, and why do we think that there’s work we can do today to help ensure any such thing? In this post and my next one, I’ll try to answer those questions. This post will lay out what I see as the four most important premises underlying our mission. Related posts include Eliezer Yudkowsky’s “Five Theses” and Luke Muehlhauser’s “Why MIRI?”; this is my attempt to make explicit the claims that are in the background whenever I assert that our mission is of critical importance.
Claim #1: Humans have a very general ability to solve problems and achieve goals across diverse domains. We call this ability “intelligence,” or “general intelligence.” This isn’t a formal definition — if we knew exactly what general intelligence was, we’d be better able to program it into a computer — but we do think that there’s a real phenomenon of general intelligence that we cannot yet replicate in code.
Alternative view: There is no such thing as general intelligence. Instead, humans have a collection of disparate special-purpose modules. Computers will keep getting better at narrowly defined tasks such as chess or driving, but at no point will they acquire “generality” and become significantly more useful, because there is no generality to acquire.
Source:
https://intelligence.org/2015/07/24/four-background-claims/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.
-
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
Original article:
https://arxiv.org/abs/2108.07258
Authors:
Bommasani et al. -
Machine learning is touching increasingly many aspects of our society, and its effect will only continue to grow. Given this, I and many others care about risks from future ML systems and how to mitigate them. When thinking about safety risks from ML, there are two common approaches, which I'll call the Engineering approach and the Philosophy approach:
The Engineering approach tends to be empirically-driven, drawing experience from existing or past ML systems and looking at issues that either: (1) are already major problems, or (2) are minor problems, but can be expected to get worse in the future. Engineering tends to be bottom-up and tends to be both in touch with and anchored on current state-of-the-art systems.
The Philosophy approach tends to think more about the limit of very advanced systems. It is willing to entertain thought experiments that would be implausible with current state-of-the-art systems (such as Nick Bostrom's paperclip maximizer) and is open to considering abstractions without knowing many details. It often sounds more "sci-fi like" and more like philosophy than like computer science. It draws some inspiration from current ML systems, but often only in broad strokes.
I'll discuss these approaches mainly in the context of ML safety, but the same distinction applies in other areas. For instance, an Engineering approach to AI + Law might focus on how to regulate self-driving cars, while Philosophy might ask whether using AI in judicial decision-making could undermine liberal democracy.
Original text:
https://bounded-regret.ghost.io/more-is-different-for-ai/
Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.