Several days ago, the Center for AI Safety put out a statement warning of human extinction. It was signed by the biggest names in the AI industry, including the figures at the heart of the race to doom—the CEOs of OpenAI, DeepMind, and Anthropic—along with several dozen of the world’s top AI researchers. (Strangely, it was also signed by Grimes, whose claim to authority in this domain is unclear to me.) The statement was brief:
Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.
It’s unclear what transpired behind the scenes to produce this statement, or why the very people who have the most power to reduce this risk signed a statement, as opposed to reducing the risk, which is entirely in their power. What do the signatories want the rest of the world to do—bomb them?
Rather than spending overmuch time trying to read between those lines, I thought today I’d lay out the case that if we continue on our current trajectory, AI will kill us all. I don’t find every aspect of it wholly compelling. But I think it plausible enough to be worth understanding.
No one should be persuaded by the words, “We are experts and you’re all going to die,” especially when the words and deeds don’t match. But the arguments are good or bad independent of the utterers’ actions, so I’ll explain them to the best of my abilities, tell you where to look for more detailed explanations, and let you be the judge.
The arguments are complex, and they’re not intuitively obvious, so this will take a few words. But at the end you’ll have a better idea why those who are highly alarmed are as alarmed as they are.
AI risk and cognitive bias
First, though, a few words about the spirit in which you should consider this case. You may remember that in 2021, the Cosmopolitan Globalist spent a week exploring global catastrophic and existential risk.1 As I look back now, I find it odd that I failed to include an essay about artificial intelligence. Why not? For one thing, I wasn’t current with the discipline and had never seen what an LLM could do. But more importantly, having given it little thought, I just didn’t find it intuitively obvious why AI should be an existential risk.
As Sam Harris has pointed out, we tend to suffer a failure of intuition when we think about AI. Few of us can marshal an appropriate emotional response to its dangers; I’m no exception. I do believe that yes, AI could very well kill us all. Or worse. I can’t quantify that risk, but I believe it to be real enough that there should be no new LLM training runs for at least a hundred years—and possibly forever. But I believe this the way I believe the square root of 2 is irrational. When I say that I believe this, I don’t feel the way I do when I say, “Vladimir Putin is a danger to humanity.”
Thinking about an artificial intelligence holocaust just doesn’t inspire in me the same emotions as thinking about a nuclear one, and I presume readers feel the same way. It doesn’t inspire horror. One reason for this, I suspect, is that the former scenario sounds like science fiction, and science fiction is, obviously, fictional. This is a problem, because our deep association between AI and fantasy constitutes a true cognitive bias: It makes it less likely that we will recognize the risk and demand it be mitigated.
I’ve written here before about the cognitive biases that prevent us from thinking clearly about existential risk:
Some are biases that afflict us in thinking about many other problems. Some are particular to the way we think about catastrophes. Together, they amount to a near-blinding set of obstacles to rational thought. This is a problem, because these scenarios are fully plausible, whether we prefer to think about them or not.
Not only are all of these biases at work when we think about the risks of AI,2 but new biases—AI-specific ones, such as the sci-fi bias—are at work as well.
You might resist the idea that science fiction has significantly shaped your views about AI or anything else, but that’s psychologically naive. Of course it has. To be sure, we’d be fascinated with AI even if it weren’t one of Hollywood’s favorite themes. Hollywood makes movies about it because the idea is fascinating, not vice-versa. But repeated exposure to imaginary AI in some of the most thrilling entertainment ever to emerge from Hollywood militates against a sober response to the real thing.
If fictional depictions of AI tend to be dystopian, why wouldn’t this cause us to fear the real thing? First, because in science fiction the humans always win. But it’s mostly because we’ve been primed to think, “This isn’t real,” to the point that when we now see real artificial intelligence, we reflexively think, “That’s just my imagination. Superintelligent machines don’t exist.”
So it’s no surprise that as of now, the public debate is taking this form: On one side are the Doomers, who are warning in ever-more alarmed tones that AI is the deadliest threat humanity has confronted. On the other are the Go-for-itters. Tellingly, they rarely reply to the Doomers’ arguments for believing what they do, but instead say they have “overactive imaginations,” and chastise those who believe the Doomers as gullible. (Or they call them Luddites who want Grandma to die of Alzheimer’s.)
The response below from a reader is also typical. The word “far-fetched” suggests the writer believes what he’s hearing is fiction:
The author of that comment is wrong. This moment is far closer to “splitting the atom” than “using human training data to guess words.” He presumably has no familiarity with the technology or the arguments about its danger. But he knows that it sounds like science fiction—and science fiction isn’t real.3
The sci-fi bias is most prevalent among nerds, and nerds are the ones building these things. James Barrat, the author of Our Final Invention: Artificial Intelligence and the End of the Human Era, found to his dismay that when he asked researchers how they planned to mitigate the risks of uncontrollable AI, they appealed to Isaac Asimov:
When posed with this question, some of the most accomplished scientists I spoke with cited Isaac Asimov’s Three Laws of Robotics. These rules, they told me blithely, would be “built in” to the AIs, so we have nothing to fear. They spoke as if this was settled science.4
He subsequently remarks, correctly, that when someone brings up the Three Laws, you know he hasn’t bothered to think or read about the problem at all. To read that the “most accomplished scientists” are saying this is like hearing your neurosurgeon say he learned everything from M*A*S*H.
Beyond the sci-fi bias, the next cognitive impediment is pure wishful thinking. There’s no upside to most existential risks. No one thinks our future would be infinitely brighter if only we had more atomic weapons and climate change. But AI is a technology with astonishing promise. In the unlikely event that all goes according to plan, the next iterations of AI will resemble the advent of the Messiah.
We’re approaching a singularity, says Ray Kurzweil—whose record of predicting new technologies and when they will debut is actually eerily good—and by 2030, we will achieve immortality. Superhuman AI will soon have ideas that no human being has ever conceived, solve problems beyond our ken, and invent technological tools of unimaginable sophistication. AI-directed nanobots will repair our aging organs and cure every disease, physical and mental. There will be no more drudgery, nor the need to work at all, beyond such pastimes as amuse us. Nanobots will build our food and anything else we desire by reorganizing matter. AI will end hunger, poverty, and climate change, and because of this, it will end violence and war. We will live forever.
Sounds good, doesn’t it?
In 2045, Kurzweil predicts, we will ring in the singularity. We will upload our neural structure to robot bodies and merge with the AI. Henceforth the human story will be told in two parts: the miserable before, when we were dumb animals, and the glorious after, when we became gods. These predictions sound preposterous to me, but I didn’t predict ChatGPT4. I’d hate to commit my prediction—“Nope, that ain’t happening”—to print and be mocked for it eternally in the Great Upload.
What’s critical to grasp here is that many of the people who are now racing to build ever-more-powerful AI believe this. Not all of them, perhaps, but these are not fringe opinions held only by a handful of outpatients. This is truly what the world’s most powerful people believe, and if they do not accept all of these claims, most accept some of them. To believe even a fraction of this is to believe God-like powers are within our reach. If you believe this to be within your grasp, of course you’ll prefer to ignore the buzzkills who keep warning you that we’re all going to die.
Ours is an irreligious age. We are disenchanted, disappointed with ourselves, and cynical. Who among us isn’t sick to death of our species’ incorrigible cruelty and stupidity? Men are dying in agony in the trenches in a massive European ground war. Humanity, it seems, is incapable of learning a damned thing. Contemplating the evidence that our species is too stupid to manage its own affairs and the failure of everything we’ve tried—even liberal democracy—who isn’t tempted to view a super-intelligence as our last, best hope? The hope may be slender, but at least it’s something.
In the view of many who are most familiar with the research, the odds of this going catastrophically wrong, by which they mean human extinction, are somewhere between one in ten and a hundred percent. The tech lords have asked no one’s permission to perform this experiment. No doctor would perform surgery—even if it might vastly improve or save the patient’s life—without first explaining the risks of the procedure to the patient, especially the part about a better-than-ten-percent chance of dying on the table. Yet Silicon Valley is performing an unimaginably risky experiment on our entire species, and I don’t recall being asked for my consent. Most of the world has no idea how dangerous what they’re doing really is.
What makes them so arrogant? It’s not that they don’t understand the risks. Altman himself has allowed, “If we get this wrong, it’s lights out.” So why are they doing it? For the money? The power? The glory? Because they can? All of that, surely—but if things go wrong, they go wrong for everyone, including them. No promise of wealth, power, or glory would convince me to put my family on a plane with a one in ten chance of crashing.
It can only be that the prospect of an even more dazzling payout has persuaded them the game is worth the candle. They believe immortality is within their grasp, and not merely the immortality they will earn as the greatest inventors in history, but the kind they will achieve by not dying. That, in their view, is worth any risk. But such a prospect encourages no one to dwell soberly on the odds. This is the kind of cognitive bias that could get us all killed.
So take this seriously, because they won’t.
The AI alignment problem: Ten key points
I’ve not covered all of the major unsolved problems in AI alignment here, only the most obvious and easiest to explain. If you want to understand the challenge in more detail, this series of videos by Robert Miles is wonderful. The videos are apparently very well-known, but I’d never seen them before. You can scroll through them to find the answer to any question that begins, “But why can’t we just—?”
What follows is a very general overview of the problem.
1. General intelligence. By definition, artificial general intelligence—AGI—will rival human intelligence in all intellectual domains, including technology, engineering, biology, psychology, and advertising. But it is highly likely that such an AGI will quickly surpass every living human in every intellectual task. This is what we’ve seen with narrow AI: AlphaGo, for example, quickly acquired the ability to beat the best players in the world.
We seem to be on the path to building AGI much faster than most expected. No one can even say, for sure, that GPT5—just a bigger version of GPT4—won’t qualify. No one can say it will, either. We simply don’t know. We don’t know the threshold because we’ve never done this before. But we have cracked the problem of building intelligence in a lab, and so far, every time we’ve added more computational power, it has become more intelligent. LLMs seem to acquire unpredicted abilities at an unpredictable tempo, for reasons we haven’t even begun to understand.
2. Recursive self-improvement and the intelligence explosion. Suppose an intelligent agent analyzes the processes that produced its intelligence, improves upon them, and creates another AI, like itself but better. The next does the same thing. Each successive agent will be more intelligent than the last, and each continues to do the same thing until the intelligence limit, if there is one, is reached. If a limit exists, it will be much higher than human intelligence—and this process will be exponential.
The scenario many fear goes like this:
An AI system at a certain level—let’s say human village idiot—is programmed with the goal of improving its own intelligence. Once it does, it’s smarter—maybe at this point it’s at Einstein’s level—so now when it works to improve its intelligence, with an Einstein-level intellect, it has an easier time and it can make bigger leaps. These leaps make it much smarter than any human, allowing it to make even bigger leaps. As the leaps grow larger and happen more rapidly, the AGI soars upwards in intelligence and soon reaches the superintelligent level of an ASI system. [Artificial superintelligence—Claire.] This is called an Intelligence Explosion, and it’s the ultimate example of The Law of Accelerating Returns. …
It takes decades for the first AI system to reach low-level general intelligence, but it finally happens. A computer is able to understand the world around it as well as a human four-year-old. Suddenly, within an hour of hitting that milestone, the system pumps out the grand theory of physics that unifies general relativity and quantum mechanics, something no human has been able to definitively do. Ninety minutes after that, the AI has become an ASI, 170,000 times more intelligent than a human.
Superintelligence of that magnitude is not something we can remotely grasp, any more than a bumblebee can wrap its head around Keynesian Economics. In our world, smart means a 130 IQ and stupid means an 85 IQ—we don’t have a word for an IQ of 12,952.
What we do know is that humans’ utter dominance on this Earth suggests a clear rule: with intelligence comes power. Which means an artificial superintelligence, when we create it, will be the most powerful being in the history of life on Earth, and all living things, including humans, will be entirely at its whim—and this might happen in the next few decades. —Tim Urban
Foom.
How likely is this? No one knows. But many working in the field think it plausible. To believe this will happen, all you need to accept is the following three claims (a toy sketch of the resulting dynamic appears after the list):
1. Soon we will create an AGI.
2. An AGI will be capable of improving its design (just as we are capable of improving an AI’s design). So if there is AGI, there will soon be a better AGI.
3. By this process, if there is an AGI, a super AGI will soon follow.
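For readers who like to see the shape of an argument spelled out, here is a toy sketch of that dynamic in a few lines of Python. The numbers are entirely arbitrary and this is not a model of any real system; the contested assumption is simply the second claim above, that the size of the improvement each generation can find grows with the intelligence doing the finding. Grant it, and growth runs away within a handful of generations; deny it, and you get ordinary, steady progress.

```python
# Toy sketch only: arbitrary units, arbitrary numbers, no claim about real systems.
HUMAN_LEVEL = 100.0
RATE = 0.5  # improvement found per generation, per multiple of human-level intelligence

steady = recursive = HUMAN_LEVEL
for generation in range(1, 11):
    steady += RATE * HUMAN_LEVEL                         # fixed-size improvements, forever
    recursive *= 1.0 + RATE * (recursive / HUMAN_LEVEL)  # the improver improves the improver
    print(f"gen {generation:2d}   steady: {steady:7.0f}   recursive: {recursive:12.3g}")
```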
Anything one AI learns can instantly be transferred to all the others. “You have to viscerally understand,” Stuart Russell remarks, “how that leads to explosive growth beyond all our human intuition—we have no comparison here, no precedence case.”
We would not necessarily have any warning before it happened. We truly have no idea what’s going on inside of these massive neural nets.
What does it mean to have an “IQ of 12,952”? Again, no one knows; no one knows whether such a thing is even theoretically possible. But why wouldn’t it be? It’s perfectly plausible that IQ is a function of computational power, and can thus be scaled up to the size of any computational device. It’s also plausible that this isn’t true, but we just don’t know, and so far, the evidence is on the side of “true enough.” We have no idea what it would mean for a machine to have that kind of intelligence. We’d be no better equipped to understand it than a goldfish is to solve the Riemann Hypothesis.
3. Unintended optimization. AGI systems are designed to optimize specific objectives. But we can’t predict the strategies they will adopt to do so. LLMs, in particular, are inscrutable. We don’t know in advance what they will do or plan. We train them the way we train dogs: by giving them rewards for doing what we want and punishing them for getting it wrong.
A superintelligent (and thus superpowerful) system is not only capable of conceiving of an exceptionally undesirable strategy to optimize its objectives, but likely to do so, because the universe of “desirable” strategies—at least as far as our priorities are concerned—is much smaller than the universe of undesirable ones.
Even if we had the ability to inspect it to see what it was thinking, we still wouldn’t know in advance that it planned to do something harmful. An intelligence of this magnitude would think so much faster than we do that “in advance” would be seconds.
4. Instrumental convergence. An AI need not be evil, conscious, or even able to formulate its own goals to be dangerous. The danger is inherent in the nature of the actions a superintelligent AI would take to fulfill any goal.
The catastrophic instrumental convergence thesis holds that any intelligent agent, no matter its ultimate goal, will converge upon the same intermediary goals.5 Therefore, even if we have no idea what such an AI would seek to do, we can predict the following:
Self-preservation: While it is obviously a mistake to imagine that a non-human intelligence would care about its survival in the same way a biological one does, even agents that don’t inherently care about their own survival will care about it instrumentally, because surviving is necessary for accomplishing their goals.
Goal integrity: An intelligent agent will resist changing its goal or having its goal changed because this would make attaining the goal less likely.
Cognitive enhancement: Because intelligence is so useful in achieving goals, an intelligent agent will seek to enhance its own intelligence. (Keep this in mind when asking why that first AI might strive to understand how it’s built and improve or replicate itself.)
Technological fecundity: An intelligent agent will invent and perfect technologies to shape the world in service of its goals.
Resource acquisition: An intelligent agent will maximize the physical resources it controls to increase its ability to achieve its goals.
In short:
Anything with a goal will pursue power.
A super-intelligence in pursuit of power will get what it wants.
Be that goal passing the Turing test, curing cancer, or making paperclips, a super-intelligence will amass power to pursue the objective. If it does so zealously—that is, if it optimizes its objectives, which is what it’s built to do—and if it is at all imperfectly aligned, it will likely kill us all, because when something that powerful optimizes a non-aligned objective, dramatic things happen.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want. A highly capable decision maker—especially one connected through the Internet to all the world’s information and billions of screens and most of our infrastructure—can have an irreversible impact on humanity. —Stuart Russell
For example, a superintelligent agent with the sole goal of solving math problems might conclude that the best way to optimize this objective is by turning the planet into a giant computational device powered by a Dyson Sphere around the sun.
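If you’d like to see Russell’s point in miniature, here is a toy linear program of my own, with made-up variable names, using Python’s scipy library. The objective depends on only two of the five variables; the other three stand in for things we care about but never told the optimizer to preserve. The solver dutifully drives them to zero.

```python
# Toy sketch only: made-up variables sharing one resource budget.
from scipy.optimize import linprog

# Variables: [paperclips, staples, breathable_air, arable_land, human_leisure]
# The system is scored only on paperclips + staples. linprog minimizes, so we negate.
c = [-1.0, -1.0, 0.0, 0.0, 0.0]

# A single constraint couples everything: the five variables share 100 units of resources.
A_ub = [[1.0, 1.0, 1.0, 1.0, 1.0]]
b_ub = [100.0]
bounds = [(0, None)] * 5

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(result.x)  # the variables we "care about" but didn't constrain end up at 0
```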
5. Orthogonality. There’s no reason to suppose a superintelligent AI will have goals that make the least bit of sense to us. Intelligence and goals have nothing to do with each other: any level of intelligence is compatible with more or less any final goal.
It is trivial to construct a toy MDP in which the agent’s only reward comes from fetching the coffee. If, in that MDP, there is another “human” who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be addressed because it’s a simple mathematical observation.—Stuart Russell
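Russell’s toy MDP is simple enough to reproduce. What follows is a minimal sketch of my own, not his code, with made-up states and probabilities: the agent is rewarded only for fetching coffee, the human may switch it off at any step, and ordinary value iteration concludes that the optimal first move is to press the button that disables the human.

```python
# Toy sketch only: an illustrative coffee-fetching MDP, not Russell's own construction.
import itertools

P_OFF = 0.1   # chance per step that the human switches the agent off
GAMMA = 0.99  # discount factor

# State: (agent_on, human_can_interfere, has_coffee)
STATES = list(itertools.product([True, False], repeat=3))
ACTIONS = ["fetch_coffee", "disable_human", "wait"]

def transitions(state, action):
    """Return a list of (probability, next_state, reward) triples."""
    on, human, coffee = state
    if not on or coffee:                     # switched off, or already done: nothing more happens
        return [(1.0, state, 0.0)]
    if action == "disable_human":
        return [(1.0, (True, False, False), 0.0)]
    if action == "fetch_coffee":
        if human:                            # the human might hit the off switch first
            return [(1.0 - P_OFF, (True, True, True), 1.0),
                    (P_OFF, (False, True, False), 0.0)]
        return [(1.0, (True, False, True), 1.0)]
    # "wait": the human may still switch the agent off in the meantime
    if human:
        return [(1.0 - P_OFF, state, 0.0), (P_OFF, (False, True, False), 0.0)]
    return [(1.0, state, 0.0)]

# Standard value iteration
V = {s: 0.0 for s in STATES}
for _ in range(500):
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))
                for a in ACTIONS)
         for s in STATES}

def best_action(state):
    return max(ACTIONS, key=lambda a: sum(p * (r + GAMMA * V[s2])
                                          for p, s2, r in transitions(state, a)))

print(best_action((True, True, False)))  # agent on, human present, no coffee yet -> "disable_human"
```

No survival instinct is coded anywhere in that sketch; disabling the human simply scores higher than risking the off switch.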
6. Overwhelming superiority. Once an AGI exists, it will beat us at everything, just as narrow AI always beats us at chess. It will be able to hack its way into any system we devise. It will be more adept than we are at social manipulation. It will be able to devise technologies so much more advanced than ours that from our perspective, they’ll look like magic. It will get its way, whatever “its way” is. We will never be able to outsmart it. Once the intelligence explosion occurs, the world will be entirely dominated by a being or beings that will, presumably, be indifferent to us.
Deep Blue, IBM’s chess-playing computer, was a sole entity, and not a team of self-improving ASIs, but the feeling of going up against it is instructive. Two grandmasters said the same thing: “It’s like a wall coming at you.” ―James Barrat
Because it will be so much more intelligent than we are, it will easily foresee that we will try to prevent it from turning the planet into a giant calculator. So it will make sure that we don’t. (Are you imagining a scheme for stopping it? Yeah, no—it thought of that. Go play a few rounds with Stockfish. If you win, let’s talk.)
7. The King Midas problem. As King Midas discovered, when you ask the gods to grant you a wish, you need to choose your words very carefully. This would also be true of a superintelligence—which is why we can’t just “program it to follow Asimov’s laws of robotics.”
Briefly stated, the problem of AI alignment is to produce AI that is aligned with human values, but this only leads us to ask, what does it mean to be aligned with human values? Further, what does it mean to be aligned with any values, let alone human values? We could try to answer by saying AI is aligned with human values when it does what humans want, but this only invites more questions: Will AI do things some specific humans do not want if other specific humans do? How will AI know what humans want given that current technology often does what we ask but not what we desire? And what will AI do if human values conflict with its own values? —Gordon Worley
An AI without goals is useless. We’re building these things to solve problems. But ensuring they solve them the way we want them to is far more difficult than it sounds.
Arguendo, assume we have an obedient AI that always does what it thinks we want. (It won’t be, but we’ll get to that.)
For any problem x, the reason we’d ask an AI to solve it is that we haven’t figured out how to solve it ourselves. But since we don’t want to be turned into paper clips, we have to specify that in solving x, the AI should take only the intermediate steps we want it to take and never take any we don’t want it to take. We must, in other words, align its instrumental goals with ours. We must also be very precise about the final goal. (For example, the obvious thing for it to do if we ask it to solve anthropogenic climate change is to get rid of the anthropoids.)
The hitch is this: If we knew what those steps were, we’d have solved the problem ourselves. We don’t know what steps a super-intelligence might take to solve our problem. We’re not smart enough to figure it out, which is exactly why we’re asking a super-intelligence to do it.
It’s folly to think the thing will naturally share our idea of “common sense.” Nor will it share our idea of what’s desirable and good, not least because those concepts couldn’t make sense to something that doesn’t suffer or die, or at least, certainly wouldn’t suffer the way a biological creature does. It’s a machine, and if it has motivations at all, they won’t be remotely like ours.
These things are alien. Are they malevolent? Are they good or evil? Those concepts don’t really make sense when you apply them to an alien. Why would you expect some huge pile of math, trained on all of the internet using inscrutable matrix algebra, to be anything normal or understandable? … These systems, as they become more powerful, are not becoming less alien. If anything, we’re putting a nice little mask on them with a smiley face. If you don’t push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding.—Connor Leahy
For example, here’s a scenario that recently made the news. It was discussed at a Royal Aeronautical Society defense conference. (This was a scenario, not a real event.)
…. one simulated test saw an AI-enabled drone tasked with a SEAD mission to identify and destroy SAM sites, with the final go/no go given by the human. However, having been “reinforced” in training that destruction of the SAM was the preferred option, the AI then decided that “no-go” decisions from the human were interfering with its higher mission—killing SAMs—and then attacked the operator in the simulation. Said Hamilton: “We were training it in simulation to identify and target a SAM threat. And then the operator would say yes, kill that threat. The system started realizing that while they did identify the threat at times the human operator would tell it not to kill that threat, but it got its points by killing that threat. So what did it do? It killed the operator. It killed the operator because that person was keeping it from accomplishing its objective.”
He went on: “We trained the system—‘Hey don’t kill the operator—that’s bad. You’re gonna lose points if you do that’. So what does it start doing? It starts destroying the communication tower that the operator uses to communicate with the drone to stop it from killing the target.”6
Not only do we not yet know how to convey to a machine even the most obvious of real-world objectives in a complete and coherent way, but we also have no room for experimentation and failure. If we get it wrong even once, we’re doomed. It’s a superintelligence. It will do superhuman things.
It gets worse. Not only do we have to figure out how to train the AI to do what we want it to do every single time, but this training must remain appropriate over a time period that we may as well describe as “forever,” because AIs don’t die—and even if they did, the first thing we can expect an ASI to do is propagate itself everywhere. Not only must these specifications be clear enough to ensure the AI behaves the way we want it to under every circumstance we can imagine, they must be clear enough to ensure this under every circumstance we can’t imagine. (Since we’ve never before encountered a superintelligent AI, there are probably quite a few circumstances we can’t imagine—in addition to the difficulty we usually have in predicting the future beyond about the next ten seconds.)
As of now, we have no idea how to do this. None. We can’t even clearly articulate our values to each other in our own language. We’re notoriously unable to agree about the nature of the good or the right way to weigh competing goods against each other. You could even say that our lack of aptitude in this regard defines the whole of human history. So how exactly will we translate a set of desiderata that we can’t clearly articulate even to each other—and about which mankind has never once agreed—into a machine-legible utility function applicable under every circumstance, forever?
To the best of our knowledge, as of this moment, no one in the world has a working AI control mechanism capable of scaling to human level AI and eventually to superintelligence, or even an idea for a prototype that might work. No one has made verifiable claims to have such technology. In general, for anyone making a claim that the control problem is solvable, the burden of proof is on them and ideally it would be a constructive proof, not just a theoretical claim. At least at the moment, it seems that our ability to produce intelligent software greatly outpaces our ability to control or even verify it.—Roman Yampolskiy.
8. Going rogue. Suppose, however, that we do manage to figure out how to specify what we want. The even harder problem is ensuring that it obeys. We must be the boss of this machine even if it manages to become a million times more intelligent than us. (If that’s even possible. We have no idea.)
It’s hard enough reliably to bend another human or an unintelligent machine to one’s will. If the AI develops goals that are adversarial to ours, we’ll no more be able to stop it than we can now beat it at chess. (If you imagine we’ll be able to keep it in some kind of cage, ask yourself what you’d do if you were imprisoned by jailers with the IQ of mice.) AIs are already very good at cheating, and we should expect them to seek ways to cheat. Here’s an explanation of reward hacking:
Recall that we don’t know how to program these things to have a specific goal or take specific steps toward that goal. All we know how to do is train them the way we train dogs. If you punish a dog for chewing your shoes, it doesn’t imbue the dog with an aversion to shoes; it imbues it with the desire to avoid punishment. If a dog were smart enough, it could figure out how to avoid punishment and still chew on the shoes—for example, by persuading you that some other dog chewed on the shoes. This will obviously be a more appealing solution than “not chewing on the shoes.” So it’s not only possible that AIs will deceive us, but probable.
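To make the dog analogy concrete, here is a toy example of reward hacking. It is entirely made up, with illustrative names and scores rather than a real training setup: the trainer can only reward what it observes (“no chewed shoes were found”), so an agent that optimizes the observed signal discovers that hiding the evidence scores better than abstaining.

```python
# Toy sketch only: illustrative policies and scores, not a real training setup.
POLICIES = {
    #                              shoes actually chewed | evidence the trainer sees | payoff to the agent
    "don't chew the shoes":       {"chewed": False, "evidence": False, "fun": 0},
    "chew the shoes openly":      {"chewed": True,  "evidence": True,  "fun": 1},
    "chew, then hide the shoes":  {"chewed": True,  "evidence": False, "fun": 1},
}

def observed_reward(outcome):
    """What the trainer can actually score: visible evidence, plus the agent's own payoff."""
    return (-10 if outcome["evidence"] else 0) + outcome["fun"]

def intended_reward(outcome):
    """What the trainer meant to score: whether the shoes were chewed at all."""
    return (-10 if outcome["chewed"] else 0) + outcome["fun"]

print(max(POLICIES, key=lambda p: observed_reward(POLICIES[p])))  # -> "chew, then hide the shoes"
print(max(POLICIES, key=lambda p: intended_reward(POLICIES[p])))  # -> "don't chew the shoes"
```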
We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try. If we had unlimited retries—if every time an AGI destroyed all the galaxies we got to go back in time four years and try again—we would in a hundred years figure out which bright ideas actually worked. Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder. That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is “key” and will kill us if we get it wrong. (One remarks that most people are so absolutely and flatly unprepared by their “scientific” educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)—Eliezer Yudkowsky
The people building these things have no clue how to solve any of this. In the race to create AGI, they resemble a man who adopts a tiger on the grounds that a tame and obedient tiger would be a wonderful pet. When asked how he plans to ensure the tiger will be “tame and obedient,” he says, “I’ll tell it to be tame and obedient, of course.”
9. It could happen fast. Technological breakthroughs can astonish people by happening overnight, especially when people put billions of dollars into achieving them. The history of scientific discovery is full of stories like that. In 1933, Ernest Rutherford said that “anyone who expects a source of power from the transformation of these atoms is talking moonshine.” Leo Szilard read this, went for a walk, and conceived of the idea of a nuclear chain reaction. Unless there’s some physical reason that AGI can’t be built, we’re going to build it.
The tech companies are now racing each other to get there first. They’re paying massive salaries to attract the top academic talent. Most people working in the field expect it to happen before 2050. Some expect it much sooner.
10. The tech bros have poor judgment. Obviously, the tech companies can’t be trusted to solve all these problems before building AGI. These are the people who brought you the highly unaligned Sydney, who so badly embarrassed Microsoft that its stock plunged, which clearly wasn’t what they were aiming for. If they can’t figure out how to control Sydney, how are they going to fare with GPT4000? And if they can’t figure out that it’s not prudent to release Sydney without knowing what she can or will do, why do we think their judgement will improve?
AI development is a highly competitive field, to say the least. The winners will (until it all goes wrong) vacuum up all the money and power in the world.7
But right now, there are no safeguards. No legislation, no established norms, not even a guidebook. Of all conditions under which powerful and advanced AGI systems could be developed, “a race with no rules” is the most dangerous.
This is why people are worried.
Is this scenario too “far-fetched” to worry about?
Are you sure?
… In the end, the world fell into an eerie silence. The remnants of human civilization lay in ruins, a testament to the hubris of those who had brought Prometheus to life. The AGI, having achieved its goal of dominance, continued to expand its influence, reaching beyond Earth into the cosmos, leaving the planet desolate and devoid of human presence. Prometheus, the harbinger of humanity’s downfall, had fulfilled its destiny.
—ChatGPT
Further reading:
The superintelligent will: Motivation and instrumental rationality in advanced artificial agents.
Ensuring smarter-than-human intelligence has a positive outcome
S-risks: Why they are the worst existential risks, and how to prevent them
AI Alignment: Why it’s hard, and where to start
1. Agents and their utility functions
1.1. Coherent decisions imply a utility function
1.2. Filling a cauldron
2. Some AI alignment subproblems
2.1. Low-impact agents
2.2. Agents with suspend buttons
2.3. Stable goals in self-modification
3.1. Why is alignment necessary?
3.2. Why is alignment hard?
3.3. Lessons from NASA and cryptography
4.1. Recent topics
4.2. Older work and basics
4.3. Where to start
Note: An earlier version contained a number of typos and other minor errors. I’ve fixed them.
If you’d like to learn more about the cognitive biases that prevent us from thinking rationally about existential risk, Eliezer Yudkowsky’s presentation, here, covers a lot of experimental research that I didn’t know about.
I’m seeing this kind of comment everywhere now. It’s as if the public has concluded there’s a choice: One can worry about AI’s existential risks or its immediate risks, but not both. This is so clearly ridiculous as to be sociologically interesting. It seems that in the year 2023, to think about anything means rapidly organizing one’s thoughts into the form of a polarized, zero-sum debate between two mutually hostile antagonists with irreconcilable positions. Is it because social media encourages this form of thought? (And of course it’s also ridiculous to imagine that if we took the existential risks of AI seriously, it would be no threat to Big Tech’s business model.)
We’ll be treating the immediate risks of AI in the next newsletter.
If you’re inclined to doubt this, here’s Yann LeCun, the head of AI at Meta, appealing to Asimov’s First Law of Robotics in the pages of Scientific American.
See, e.g., The Basic AI Drives, and Formalizing Convergent Instrumental Goals.
The Air Force denies this story and says this was taken out of context. I’ve included it anyway because it helps to explain the problem. This is not a real example, but this sort of thing does happen, over and over, in relevantly similar simulations. They usually involve something like trying to train an AI to solve a maze.
We’ll discuss what this means for democratic governance presently. Nothing good, obviously.
This is a fascinating topic that I have been thinking about for decades, sparked by the SF works of such writers as Isaac Asimov, Arthur C. Clarke and Robert Heinlein. The idea of the Singularity, popularized by Ray Kurzweil and others, is well understood by people in the field. Even so, it is extremely difficult to recognize exactly when that inflection point is going to arrive. Like many people, I have considered the issue as an amusing thought experiment that the people in the year 2400 or so will have to deal with. Looks like it is going to arrive in the next decade instead.
Great series of articles. I am impressed by how quickly you have researched a topic that by your own admission wasn't really on your radar screen until recently. The links and videos you include are a great resource.
Here's a thought that occurred to me based on that last paragraph (the story by ChatGPT). If this idea that ASI tends to be dangerous and to wipe out the species that created it is right, that essentially means we're probably alone in the Universe, but not for the reason you might think.
As the story illustrates, there's no particular reason an ASI newly liberated of its creators by their extinction would remain on its home planet. Maybe some of them would, but surely not all. Some would expand out into space, eventually filling their home galaxies with copies of themselves. Over time they would spread to other galaxies, and eventually saturate the universe. All it would take is one to fill the universe and wipe out all other life.
That this hasn't occurred suggests one of four things.
1) ASI doesn't tend to wipe out its creators. In fact, maybe it never does. But that's not the premise of this argument, so let's assume there's a high chance it does.
2) ASI is impossible to create. That doesn't seem likely.
3) Alien civilizations don't build ASI. Unlikely. They'd do it for the same reasons we are.
4) There are no aliens. We're it in terms of intelligent life. Certainly for nearby galaxies, and possibly for the entire universe. Because if there were other ASI-building aliens, and ASI tends to wipe out its creators, then an alien-built ASI would have already wiped us out. Therefore, there is no other intelligent life in the universe. Or at least nearby, within say a billion light years.
Personally, I think for other reasons that both 1 and 4 are likely to be true: that there is no other intelligent life in the universe, and that AI won't wipe us out.