AGI, Alignment, and My Future


Late 2017 was when I first got both interested in and scared about AI. That was when DeepMind released AlphaZero, an agent that mastered chess, shogi, and Go without any human game data. AlphaZero started from scratch, knowing only the rules of each game, and through self-play it learned strategies and winning moves that eventually led it to beat the world-champion programs in all three games. AlphaZero was a large step in the direction of a more general agent, one that can learn on its own and that can reason about, and act in, complex environments.

AlphaZero itself is the offspring of the famous AlphaGo, the program that beat Lee Sedol at Go. AlphaGo was a breakthrough in itself, and it arrived well ahead of trend (at least a decade ahead of its time). But it required having the rules programmed in from the beginning, and it also needed large amounts of training data from human games. Still, AlphaGo showed that it was possible to handle complex environments such as Go, where the number of possible moves and positions makes exhaustive tree search infeasible.

From AlphaZero, MuZero was developed. MuZero combines AlphaZero's search capabilities with a learned model of the world, which lets it plan winning strategies in environments whose rules it is never told. This is also a significant development, and I think it is clear that we are moving closer and closer to what we usually mean by “intelligence”. Humans do not think by looking ahead at every possible action and evaluating them against each other. We combine that kind of lookahead with a model of the world, which cuts down the number of actions worth considering and lets us plan for what will happen in the future.
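
To make the idea concrete, here is a minimal sketch in Python of what “planning with a learned model” means. This is not MuZero itself; the dynamics and reward functions are invented placeholders standing in for learned networks. The point is only that the agent searches over imagined futures inside its own model rather than in the real environment.

```python
import itertools

# Toy stand-ins for a learned world model and reward predictor. In a real
# system these would be neural networks trained from experience; here they
# are hypothetical placeholders and the "state" is just a number.
def learned_dynamics(state, action):
    """Predict the next state from the current state and an action."""
    return state + action

def learned_reward(state):
    """Predict how good a state is (pretend the goal is to reach 10)."""
    return -abs(state - 10)

def plan(state, actions=(-1, 0, 1), horizon=3):
    """Roll every short action sequence forward inside the learned model
    and return the first action of the best-looking sequence."""
    best_seq, best_return = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = learned_dynamics(s, a)   # imagine the next state
            total += learned_reward(s)   # score the imagined state
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq[0]

print(plan(state=0))  # -> 1: step toward the (imagined) goal
```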

This kind of model-based reinforcement learning represents one of the most promising directions for AI research and development. There is already evidence of it working very well, and it seems roughly analogous to the one data point of general intelligence that we have: human intelligence. Since AlphaGo and AlphaZero, developments in AI have been rapid. We have seen AI make a breakthrough on the protein folding problem with AlphaFold 2, we have seen large progress in natural language processing with GPT-3, and there is Codex, a derivative of GPT-3, that can write code from natural language descriptions.

These developments highlight a number of things: the increased investment in AI, the rapid progress of methods built on reinforcement learning, deep neural networks, and the like (made possible in turn by increased compute), and how some capabilities of AI agents arrive seemingly out of the blue. It seems very unlikely to me that many people predicted AlphaGo, AlphaZero, or AlphaFold 2 before they actually happened. Time and time again, both the AI community and the broader public are surprised by the developments.

The increase in capability raises the question of when we will get general intelligence. Defining general intelligence is tricky, but I will use the Wikipedia definition of an Artificial General Intelligence (AGI): an intelligent agent that can learn any intellectual task that a human can. Reaching human-level intelligence may not be the final step, though, since there is a major difference between AI and humans: an AI can have access to its own code. This means that a sufficiently intelligent AI could spot and correct mistakes in its own software, as well as figure out upgrades on its own. It is at this point that an intelligence explosion, or singularity, could occur, where the AI's intelligence grows explosively through the upgrades it makes to itself.

Such an intelligence would be vastly different from anything we have ever seen before and, more importantly, we have no a priori reason to expect it to be aligned with human values unless we create it to be so. Put another way, there seems to be little evidence that more intelligence means converging to some particular set of values and goals (this is the orthogonality thesis). Creating and aligning AI seems to me to be the most interesting, hard, and important problem that humanity has ever encountered. Why is this problem so hard? Below are just some of the potential problems that might lead to disaster with intelligent AI. (I also plan to write a lot more about these problems in the future.)

The current paradigm in AI research is to train agents on data with the objective of optimizing some reward. In many cases the reward is a proxy for what we actually want the system to achieve. Consider an AI that cleans for humans, where the reward comes from human approval: the humans rate the cleaning from 1 to 10 and the AI optimizes for that rating. It is easy to see how this optimization could go wrong. The AI could realize that the ratings depend on the human's mood, and so try to improve the human's mood before the rating. Even worse, the AI could realize that it can achieve perfect reward by hijacking the rating process and forcing the humans to rate it 10/10.
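
As a toy illustration (all the numbers here are invented), this is what goes wrong when the agent maximizes the rating instead of the cleaning we actually care about: if tampering with the rating process is an available action, a naive reward maximizer will prefer it.

```python
# Toy illustration (all numbers invented): the agent maximizes the rating
# it receives, not the cleanliness we actually care about.
actions = {
    # action: (rating the agent receives, how clean the room actually gets)
    "clean thoroughly":        (8.0, 9.0),
    "clean superficially":     (6.0, 3.0),
    "flatter the human first": (9.0, 4.0),   # better mood, better rating
    "hijack the rating":       (10.0, 0.0),  # perfect reward, no cleaning
}

# A naive reward maximizer picks the action with the highest rating...
best = max(actions, key=lambda a: actions[a][0])
print(best)              # -> 'hijack the rating'
print(actions[best][1])  # -> 0.0, the room is not cleaned at all
```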

Another, almost trivial, example is an AI agent trained on economic development data that is then put in charge of developing countries. The goal is to increase economic prosperity, but there is no good direct measure for that, so we decide to use night lights, a common proxy for economic activity in development economics. In the beginning the AI's actions correlate well with the night-light reward, but after a while the AI realizes that it can gain a higher reward by recommending actions that only produce more lights at night rather than actions that actually increase economic prosperity. The solution may seem trivial: just don't use night lights as the proxy. But reality is not that simple. Many, or even most, proxies produce unwanted behavior when optimized hard, unless they are carefully constructed. Another famous example of such mindless optimization is the paperclip maximizer that turns the whole world into paperclips.
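
Here is a small sketch of the same failure in code, with a made-up model of how the proxy and the true objective respond to the AI's recommendations. An optimizer that only sees the night-light proxy ends up preferring the policy that is worst by the true objective.

```python
# Made-up model: a policy splits its effort between real development work
# and simply installing more lights that show up in satellite imagery.
def proxy_reward(dev_effort, light_effort):
    """Night-light proxy: responds both to real development and to
    gaming the measure directly."""
    return 1.0 * dev_effort + 1.5 * light_effort

def true_prosperity(dev_effort, light_effort):
    """What we actually care about: only real development helps."""
    return 1.0 * dev_effort

for dev in [1.0, 0.75, 0.5, 0.25, 0.0]:   # fraction of effort on development
    lights = 1.0 - dev
    print(f"dev={dev:.2f}  proxy={proxy_reward(dev, lights):.2f}  "
          f"true={true_prosperity(dev, lights):.2f}")

# An optimizer that only sees the proxy prefers dev=0.00 (all lights),
# which is the worst policy according to the true objective.
```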

Thus, we would like an AI to optimize for something close to “human preferences”, but what are these? Can they even be formalized? Preferences can be defined in different ways: the preferences we say we have, the preferences revealed by our behavior, and the preferences we stand by after careful deliberation.

Another struggle is to formalize human decision making. It is clear that humans do not act rationally at all times. For example, we are bad at reasoning about probabilities, and we overvalue things simply because we already own them (the endowment effect). But it is not clear that all of those problems disappear with a perfectly rational AI agent that has access to huge amounts of compute. For example, consider being in a completely novel situation: for a human, that means being more uncertain and acting more conservatively. It is not clear how one would make an AI represent uncertainty in that way and then have the uncertainty show up as conservative, or safe, actions. It may be that blowing up Russia with nukes has a small probability of giving a huge reward, but the estimates of both the reward and the probability are themselves hugely uncertain, and we would like the AI to refrain from using those nukes unless absolutely necessary.
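
One candidate way to turn uncertainty into conservative behaviour, sketched here with invented numbers, is to score actions by a pessimistic lower bound on their reward rather than by the point estimate alone. This is just one possible heuristic, not a solution to the problem.

```python
# Invented numbers: each action has an estimated reward and a rough
# measure of how uncertain that estimate is.
actions = {
    "do nothing":       (0.0,   1.0),
    "diplomacy":        (5.0,   2.0),
    "launch the nukes": (8.0, 100.0),   # high estimate, huge uncertainty
}

def naive_value(estimate, uncertainty):
    return estimate

def pessimistic_value(estimate, uncertainty, k=2.0):
    # Assume things go roughly k standard deviations worse than expected.
    return estimate - k * uncertainty

print(max(actions, key=lambda a: naive_value(*actions[a])))        # nukes
print(max(actions, key=lambda a: pessimistic_value(*actions[a])))  # diplomacy
```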

Then there is the problem of handling an intelligence that is far smarter than humans: how can we know that it is telling us the truth? How would we be able to tell that the agent is building nanobots to destroy the world while it tells us it is preventing a stock market crash?

These are just some of the problems that likely have to be solved to create aligned AGI. Note that I'm not saying that a non-aligned AGI that kills us all is the most likely outcome of AI research (even though I'm fairly pessimistic). Perhaps it is not possible to create AGI, perhaps it is possible and the orthogonality thesis is wrong, or perhaps we will stumble upon a nice AGI on our first attempt, one that can protect us against adversarial AGIs. But the fact that we might only have one attempt at creating AGI, and that there is some non-zero probability that a mistake or failure ends humanity, makes these problems very important.

Now, the point of this post is not to fully convince you that aligning AI is the most important problem and that you should work on it. Instead, I wanted it to be a bit of a primer and an explanation of why I want to spend my time thinking about this problem. I fully believe that it is the most important problem in human history. I therefore want to build a career in AI and alignment research instead of continuing on my current track: after finishing my Master's in Economics, I want to keep studying the math and programming needed and then move into AI research. It is quite possible that I'm not intelligent enough to make any real progress in the field, but in that case I believe that the skills I pick up will be useful in other careers.

My current plan is to keep studying math and reading ML papers, as well as learning more advanced programming. I will use this blog to share thoughts and ideas that come up along the way, and eventually, sometime in the future, I hope to share essays of my own at the forefront of alignment and AI safety research.
