
AI alignment by analogy

A brief sketch of AI alignment as an engineering problem

TL;DR: It is tempting to think of AI alignment as primarily an ethical problem, but this presupposes that we have some means of making an ethical scheme binding on an AI, which we do not have. At best we have only unreliable means of constraining AI computation or outputs, which are not sufficient to meet any commonly-accepted standards of reliability for critical systems. By analogy with common computing systems such as disk drives and file systems, we can see that we lack the technology and engineering practice to ensure equivalent levels of reliability, and so much of our talk about AI ethics is akin to designing file hierarchies on top of a storage system that randomly forgets or rewrites data.


When discussing the possible risks and benefits of AI technology, the “alignment problem” is frequently raised as a source of negative risk. Unaligned AI, the theory goes, is AI with the capacity to act against human interests in ways that we cannot fully anticipate or mitigate. But that harm is only a consequence of mis- or non-alignment. First, what do we mean by “alignment”?

It is common to talk about an “alignment of interests” between people, such that what is good for person A is also good for person B. Such alignment serves as a solid basis for A and B to cooperate with each other. Or we can talk about an “alignment of intent”, such that A and B have decided that they both wish to pursue some particular outcome. We might also talk about an “alignment of values”, where A and B have some general agreement about the kinds of things that they consider to be good and bad.

Such alignment can also be pursued in larger groups. We could talk about alignment within an organisation. A well-aligned organisation is one in which most people are pursuing a common goal, in ways that are mutually supportive. A poorly-aligned organisation is one in which the actions of individuals conflict with each other, or conflict with the organisation’s goals or purpose.

We can address these problems in a variety of ways. If we wish to achieve an alignment of intent—that is, we want to get people to work together for a shared outcome—then we might seek first to align their interests. If we have a team of people who could work together to achieve something, we should make sure that it’s in each person’s interest to do so, by offering to pay each person something if the outcome is achieved. Or, if payment is not appropriate, we might try to rationally persuade people to collaborate by appealing to their shared values, showing how cooperation is the right thing to do in the circumstances.

There is a copious literature on the various means of aligning people, either by extrinsic incentives or intrinsic cultivation of virtues and ethical norms. It seems reasonable, then, to assume that we can use the same kinds of techniques for aligning AI with human interests.

As a general aim, this is not a bad one. In practice, though, things are a bit more complicated.


“Alignment” in the context of AI has a particular meaning. It is not the question of how we find an ethical system that humans and AIs can share, such that the AIs will behave in a way that the humans find to be generally “good”. It is also not the question of how we incentivise the AI to seek particular outcomes that the humans also regard as valuable. These things are undoubtedly important, but they both rest on a subtle but crucial assumption: that we know how to get the AI to care reliably about the incentives or ethical positions we want it to care about.

The alignment problem, therefore, is not a question of creating, finding, adopting or revising an ethical scheme or incentive system. It’s about creating the kind of AI that can reliably be aligned to such a system in the first place.

We can compare this to other problems in software engineering, at least by analogy. For example, it’s very useful for us to be able to think of a computer as having some storage device that contains “files” and “folders”. One of the most common jobs of an operating system is to make it possible for you to browse, search, open, and modify files. And yet, an unformatted drive contains no files or folders, or any structure to indicate where the files and folders ought to go. The file system is an abstraction layered on top of the physical storage, and it’s the operating system’s job to figure out how to apply that abstraction to a wide variety of different storage media: flash drives, SSDs, spinning hard drives, CDs/DVDs and optical media, even old-school floppy disks.

The operating system does this by imposing a pattern on the storage device: that’s what “formatting a disk” means. Once formatted, a disk becomes a highly reliable means of storing files and folders of all sizes and kinds. There are, of course, many different storage formats: FAT, FAT32, NTFS, ext3, APFS, ZFS, and so on, with different strengths and weaknesses. Each one has a similar job, though: to ensure that when we write some data to a file, and read that data back later, we will get the same data back in the same order we wrote it in. We are confident that we can impose a pattern on the disk, and have that pattern “stick”. The file system “aligns” the different storage media to a scheme for organising bits and bytes into files and folders.
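To make that contract concrete, here is a minimal sketch in Python (my own illustration, not any particular file system’s API) of the round-trip guarantee we take for granted: whatever bytes we write, we expect to read back unchanged, essentially every time.

```python
import os
import tempfile

def round_trip(data: bytes) -> bool:
    """Write some bytes to a file, read them back, and check they match.

    This is the basic contract a formatted file system offers: the pattern
    we impose on the storage "sticks".
    """
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        f.write(data)
    try:
        with open(path, "rb") as f:
            return f.read() == data
    finally:
        os.remove(path)

print(round_trip(b"the bytes we care about"))  # True, essentially always
```

The point is not the code but the expectation behind it: this check is supposed to pass across billions of operations, on storage hardware of wildly different kinds.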

We can think of our ethical and incentive systems as being a bit like the file and folder structures we want to work with. Maybe you want a bit of consequentialism for this kind of problem, or some strong market incentive for that problem, or virtue ethics for something else. There’s a lively philosophical debate about which system is best for which kinds of problem, and you probably have opinions about that yourself - I certainly do! But our ideas are ultimately meaningless if the system we are trying to impose them on does not, in fact, respect them. If you want to store some data in files and folders, you still need to be confident that the underlying mechanism of storing bytes at fixed locations on the drive will work reliably. If you want to get an AI to help you with a problem, and you determine that virtue ethics is the right way for the AI to reason about the uncertainties it encounters while performing the task, then you need to be confident that the underlying machinery of artificially intelligent reasoning will, in fact, stick to the format of virtue ethics. This is what is meant by “AI alignment”.

As things stand, we don’t know how to do this reliably. And “reliability” is a high bar. To continue our analogy a bit: disk drive failures are both very rare and often recoverable, which further reduces their impact when they do occur. Redundant drives can greatly increase reliability, and backups can mitigate the risk of data loss. File systems are also able to recognise data corruption, so the worst-case scenario is losing data that you probably have backed up somewhere. Silently getting incorrect data back almost never happens. This matters because data loss is readily apparent, whereas data corruption is subtle and hard to even notice. Your spreadsheet might appear to be fine, but if some of the numbers are wrong you might not realise, and you might end up taking decisions based on those incorrect numbers. Disk drives and file systems are heavily engineered to avoid this outcome.
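It is worth sketching how a checksumming file system such as ZFS manages to notice corruption at all, because even “merely detecting a wrong answer” takes deliberate machinery. The toy sketch below shows the idea (store a hash alongside the data, refuse to return data that no longer matches it); it is my own illustration, not any real file system’s on-disk format.

```python
import hashlib

def store(data: bytes):
    """Keep a checksum alongside the data, as checksumming file systems do per block."""
    return data, hashlib.sha256(data).hexdigest()

def load(data: bytes, checksum: str) -> bytes:
    """Refuse to hand back silently corrupted data: a detected error beats a wrong answer."""
    if hashlib.sha256(data).hexdigest() != checksum:
        raise IOError("checksum mismatch: data is corrupt")
    return data

payload, digest = store(b"Q3 revenue: 1,204,551")
assert load(payload, digest) == payload      # intact data passes through
corrupted = b"Q3 revenue: 1,204,561"         # a single flipped digit
try:
    load(corrupted, digest)
except IOError:
    print("corruption detected before anyone made a decision on bad numbers")
```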

In contrast, when we tell AI systems things that we think are important, they often seem to either forget or to misunderstand them. This isn’t because our ethical or incentive systems are poor or incomplete, but because we do not know how to enforce their pattern on the underlying system in a reliable way. A typical 2023 generative AI model is a giant network of weighted connections, and we use this network to compute responses to stimuli (such as prompts). We want to be able to shape the network such that only connections which are consistent with a system of incentives or ethical principles are valid. It sounds simple, but we do not yet know how to do this reliably. We know how to do something like this, such as by using reinforcement learning to encourage the model to produce certain kinds of outputs, or avoid other kinds of outputs. But this gets nowhere close to the kinds of reliability that we are used to in our computing systems. It is not a stable foundation on which to build higher-order systems of intrinsic or extrinsic motivation or ethical reasoning.
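To give a flavour of what “something like this” means, here is a deliberately tiny sketch of the reinforcement idea: sample an output, score it, and nudge the weight of whatever was sampled in the direction of its reward. Everything in it is invented for illustration, and it is a crude bandit-style toy rather than anything resembling real RLHF training, but the catch it demonstrates is the real one.

```python
import math
import random

# A toy "model": unnormalised weights (logits) over a few canned responses.
# Real generative models have billions of weights; this only shows the shape of the
# idea: sample an output, score it, nudge the sampled weight towards higher reward.
logits = {"helpful answer": 0.0, "refusal": 0.0, "harmful answer": 0.0}

def reward(response: str) -> float:
    # A stand-in for human feedback: reward outputs we want, penalise ones we don't.
    return {"helpful answer": 1.0, "refusal": 0.0, "harmful answer": -1.0}[response]

def sample() -> str:
    # Softmax sampling over the current weights.
    total = sum(math.exp(v) for v in logits.values())
    r = random.random() * total
    for response, v in logits.items():
        r -= math.exp(v)
        if r <= 0:
            return response
    return response

for _ in range(2000):
    choice = sample()
    logits[choice] += 0.1 * reward(choice)  # crude bandit-style update, not real RLHF

print(logits)
# "helpful answer" ends up strongly favoured and "harmful answer" suppressed, but the
# discouraged output never becomes impossible, only less likely. That is the gap
# between "encouraged" and "aligned".
```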

It can be easy to ignore this with the current generation of generative models. When they work well, they can do things that seem magical. The errors seem forgivable and are easily forgotten in comparison with the things that work. And if all we want to do is to generate some images and small quantities of text, that seems fine. Reinforcement learning seems to be capable of mitigating many kinds of error, and we are happy to simply discard outputs that don’t really serve our purposes. So what if ChatGPT occasionally gets the wrong idea about what you want, or Stable Diffusion sometimes ignores parts of the prompt? Just tweak the prompt and move on!

In the future, though, this might not be so easy. First of all, current models produce very simple outputs. GPT-4 produces only text, and Stable Diffusion produces only images. Yes, there can be many kinds of text, including mathematical equations and software code, and many kinds of images too. But these are very narrow outputs, and relatively easy to inspect. It’s easy for a person to tell, often instinctively, if a generated image doesn’t seem quite right. Text is a bit trickier, and sometimes GPT-4 will hallucinate plausible-looking things that only turn out to be wrong on closer inspection. But at least you can generally do the inspection and notice the error, at the cost of some time spent checking the details.

Future models will produce more complex outputs, and the errors in them will be harder to find. Text outputs will get longer, to the size of scientific papers, or even novels, or beyond. Generated software code will not be snippets or functions, but entire libraries and applications. Models will produce both text and images, and possibly other kinds of data too. Why not have the AI directly output spreadsheets, databases, PowerPoint decks, project plans, product roadmaps, legal contracts? These artefacts are much more subtle, and errors in them are much less likely to be detected casually.

We could also imagine AIs that send output directly to other systems, such that they can send email, access web services, make payments, or interface with smart devices. Perhaps the AI can also receive input or responses from these sources. In these cases, the output is in the form of actions we cannot easily inspect.

Errors might also creep in due to inconsistency between different kinds of output: perhaps the AI generates a comic book, where the script and dialogue describe certain events but the visual art doesn’t fit correctly. Again, these issues might be subtle and hard to detect. This makes reinforcement learning with human feedback harder, because the humans have to work a lot harder to produce good feedback. Eventually the outputs become complex and subtle enough that we need special tools to examine them, which leaves us vulnerable both to bugs in these tools, and to the possibility that the AI learns to create outputs that satisfy the constraints encoded in the tools without actually being “good” from a human perspective.

This is a crucial difference between aligning an AI and aligning something like a storage device. Checking to see if the storage device is working properly is pretty easy: you write some bytes, and later check to see if you can read them again. If you don’t get the original bytes back, something is wrong. An AI is a bit like a storage device, except that we’re requesting things that we didn’t previously store, so when we get the output we have nothing to compare it to. Now we either have to verify that the outputs match our intentions, something that becomes practically impossible as the outputs become larger and more complex, or we have to trust the process by which those outputs were created. This brings us back to the fact that we already know that the process is unreliable, which is to say that it’s untrustworthy.
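The asymmetry can be stated in a few lines (the function names are mine, purely for illustration):

```python
def verify_storage(written: bytes, read_back: bytes) -> bool:
    # Storage is easy to check: we still hold the original, so comparison is trivial.
    return written == read_back

def verify_ai_output(prompt: str, output: str) -> bool:
    # There is no reference copy to compare against, only the intent behind the prompt.
    # Checking means either inspecting the output ourselves, which stops scaling as
    # outputs grow, or trusting the process that produced it.
    raise NotImplementedError("no stored original exists; a human or another tool must judge")
```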

All of this is perhaps more annoying than catastrophic. Imagine that instead of creating a spreadsheet yourself, you ask an AI to create one for you. And sometimes it gets the numbers wrong, and maybe you don’t notice and send it to your investors or your accountants or the tax office, and maybe some inconvenience results. This is bad for you, but it’s not an existential risk for anyone. Probably you pay a fine or spend some time having some annoying conversations you’d rather not have, but the world will keep on turning.

Things do become a bit more concerning when we consider the likely future evolution of AI use. Instead of acting solely in response to prompts, we will expect our AIs to engage in long-running activities or to act in response to environmental triggers - activity on the web, readings from sensors, timed actions, and so on. Instead of receiving text or speech or images from the user and outputting a combination of text, images, and speech in response, AIs will effectively output plans. Given the intent to perform some action, the AI will come up with a sequence of actions to take. Or, rather, it will come up with a rationale for action, and then continually revise its plan as it acts and receives feedback. This process is necessarily unsupervised to some extent, because nobody wants to be an AI babysitter.

By this point, AI models will likely be much larger than they are now. They will have the same fantastical “imagination” that GPT-4 has, but bigger, better, and capable of much greater complexity. We will ask AI to do things for us precisely because it will think of strategies that we wouldn’t have thought of ourselves. The actions it takes, and the rationale it uses to justify them, may well be too complex for us to evaluate. Again, this is partly what we want - a clever assistant who can do the Bond-villain planning necessary to get ahead in the modern world, planning that we don’t have the capacity for ourselves.

The potential for things to go awry is obvious: AI agents acting in ways that are contrary to human interests are a staple of film and literature, so this looks like a familiar problem. You might think “I know, I’ll implement an ethical system for my AI to keep it from misbehaving”. But now you have two problems! You must be able to specify your ethical system, and you must be able to engineer your AI in such a way that the ethical system is fully binding on every action that it takes, in all circumstances. That second problem is the alignment problem.

At the risk of repeating myself: this is not a matter of coming up with a good ethical system. It’s a matter of coming up with the mind that will obey it. This is, in truth, one of the most fascinating problems one could imagine. It may ultimately end up being a small matter of programming, but specifying what this program must do is fiendishly difficult. Nobody has a comprehensive answer to the problem.

When we design file systems, we build on decades of tradition, engineering experience, industry folklore, and scientific research. The SSD in your MacBook is a lineal descendant of punched card storage systems from the 1950s, tape from the 1960s, and the magnetic drives of the 1970s. Many of the principles of the file system go back to 70s-era UNIX, which itself has clear antecedents in the 50s and 60s. The core architecture of the computer is recognisably the same thing imagined by John von Neumann in the 1940s - faster, more complex, more optimised, but not categorically different. We have a lot of depth of knowledge to draw upon when we want to design reliable new file systems, as Apple did when designing APFS in the mid-2010s. Our confidence that our file systems won’t corrupt our data and lie to us about it comes in part from the sheer depth of experience required to get to this point.

AIs are not like this. We can’t start out by saying that our new AI is “like” some preceding thing, in the way that APFS is “like” HFS, or that SSDs are “like” magnetic hard drives, because there simply are no preceding things. It’s for this reason that some people believe that the long-term future for AI is going to involve something like a simulation of the human brain, because at least that will allow us to import the inferential structure we have for reasoning about human behaviour and motivations. Taking some of our current machine learning techniques and scaling them up, or giving them new feedback loops, will probably create something very unlike the human brain in important ways, so we will be flying blind in terms of our heuristics about how well it will stick to the ethical model we design for it.

Again, this is both scary and exhilarating. To be clear: the problem is unsolved! There is a tremendous reward, in both money and prestige, for anyone who can make decent progress in solving it! Current approaches have known and unsolved failure modes, and we don’t know for sure whether to evolve those approaches or try something radically new. More than usual, it might be possible for people coming from outside the discipline of computer science and its close relations to make a major contribution, if only by framing the problem in a new and more productive way.

In the meantime, the risks of giving unreliable systems lots of leverage over resources on our behalf seem large. A misfiring AI could easily end up triggering bizarre actions, and some of those could be very dangerous. The same capacity that allows an AI to envisage grand plans on our behalf could enable it to envisage complex sequences of actions that harm us. Again, this is not really a matter of bad intent, incentives, or ethics, but mostly of unreliable alignment. Whereas file systems are rigorously engineered to avoid data corruption, current AI systems are not engineered to avoid corruption of their intent or behavioural principles, because we don’t yet know how to do that. While any given AI taking a humanity-ending course of action is statistically unlikely, if we have enough of them acting frequently enough, then eventually something bad becomes likely to happen. We ought to be very confident in our alignment mechanisms before we allow such an experiment to occur.

The courses of action open to us are all difficult. Solving the alignment problem is hard. Preventing the existence of unaligned AI may end up requiring deeply unpleasant political choices, which might go against some of our most deeply-held convictions about liberty, decentralisation, and the freedom inherent in general-purpose computing. If we want to avoid those compromises, we must find technical solutions.


If you made it to the end of this, I would appreciate your feedback. I’ve read a lot of arguments for taking the alignment problem seriously that open by spelling out the most alarming possible consequences of failure, and I wanted to see how it would feel to explain the problem in more mundane terms first, only mentioning the x-risk scenarios briefly or in passing. If you found this approach more or less persuasive than other accounts you have read, perhaps you could let me know on Twitter or Bluesky.

The Moonlit Garden is the personal website of Rob Knight.