The road to AGI has long seemed to be in its very infancy, but the recent success of deep learning has made forecast timelines for AGI shorten significantly (see Metaculus). A significant part of forecasting AI progress comes down to how one expects AGI to be developed. Do we need additional significant algorithmic innovations? Or is it compute, model size, and data availability that are the bottlenecks?
Here, there seem to be two camps in the AI research community. One believes that scaling is the road to AGI, and the other believes that we need algorithmic improvements before getting to AGI, i.e., that general intelligence capabilities will not magically appear from adding parameters, compute, and training data to current approaches. This is important in two main ways. From a practitioner perspective, it is important to know what path one should research. And from an alignment and safety perspective, it is important as it could help in estimating the time until AGI is created. (So far, little progress has been made in alignment for any type of approach to creating AGI, so knowing through which methods AGI may arise seems less important.)
A lot of this seems to come down to what one believes current models are, and what they might be capable of in terms of pattern-matching. What distinguishes human intelligence is the ability to pattern-match deeply across a wide range of domains; let’s call this deep pattern-matching. On the other hand, current ML systems, though advanced and cool, seem to be doing a more shallow and less generalizing type of pattern-matching; let’s call this shallow pattern-matching. The question of scaling or not thus comes down to whether one believes that deep pattern-matching can arise from the same type of systems that are currently only able to do shallow pattern-matching. The past years have shown that the pattern-matching of machine learning systems has become less shallow, but is that enough to believe that it is possible to bridge the gap through current ML techniques?
In this post, I plan to look at the recent progress in AI and determine how I’m updating my views in light of this progress. I will also look at some of the other evidence in favor of either of the two views.
First, some introduction to scaling laws in neural language models. As we will see, much of the recent progress in AI has been made in neural language models. In 2020, Kaplan et al. released ‘Scaling laws for Neural Language Models’, where they investigated how the cross-entropy loss, i.e., performance, of language models scales with parameters, compute, and training data. They find evidence for clean scaling laws that imply that there is room for much improvement by just adding compute and increasing model size.
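To make the shape of such a scaling law concrete, here is a minimal Python sketch of the parameter-count power law, L(N) = (N_c / N)^α_N. The constants are approximately those reported by Kaplan et al. for non-embedding parameters and should be treated as illustrative. The function name `loss_from_params` is my own, and plugging in the total parameter counts of later models is an extrapolation, not something the paper claims.

```python
# Approximate constants for the model-size law, as reported by Kaplan et al.
# Assumption: values are illustrative and apply to non-embedding parameters.
ALPHA_N = 0.076  # power-law exponent for model size
N_C = 8.8e13     # scale constant, in non-embedding parameters


def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss when model size is the only bottleneck."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return (N_C / n_params) ** ALPHA_N


if __name__ == "__main__":
    # Parameter counts chosen for illustration only.
    for n in (1.5e9, 175e9, 540e9):
        print(f"{n:.1e} params -> predicted loss {loss_from_params(n):.3f}")
```

Because the exponent is small, each doubling of model size only multiplies the predicted loss by 2^(-0.076), roughly a 5% reduction. The curve is smooth, but it keeps going down for as long as the law holds.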
In the discussion, the authors finish by commenting on what continued scaling might bring: ‘In the domain of natural language, it will be important to investigate whether continued improvement on the loss translates into improvement on relevant language tasks. Smooth quantitative change can mask major qualitative improvements: “more is different”.’ This summarizes the scaling hypothesis quite well: by scaling up current models we can get more and more capabilities, and eventually we might get to AGI.
One clear example of scaling that looks smooth on a graph is GDP over the past 100 years, even though the underlying society has changed drastically over those years. In the same way, it may be that under some threshold of cross-entropy loss the models will generalize in a new way and add important capabilities. This is also what we are seeing with the latest language models that have been created by DeepMind, OpenAI and Google.
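One way to see how a smooth curve can hide a sharp change is a toy model of my own, not something taken from Kaplan et al.: assume a task only counts as solved if every one of `n_tokens` consecutive tokens is correct, and that tokens are independently correct with the same probability. Both assumptions are simplifications made purely for illustration.

```python
def task_success(per_token_accuracy: float, n_tokens: int) -> float:
    """Probability of getting all n_tokens right, assuming independent tokens."""
    if not 0.0 <= per_token_accuracy <= 1.0:
        raise ValueError("per_token_accuracy must be in [0, 1]")
    return per_token_accuracy ** n_tokens


if __name__ == "__main__":
    # Per-token accuracy improves gradually; success on a 30-token task does not.
    for acc in (0.80, 0.90, 0.95, 0.99):
        print(f"per-token accuracy {acc:.2f} -> task success {task_success(acc, 30):.3f}")
```

Under these assumptions, a gradual improvement in the per-token metric takes the task from almost never solved to usually solved, which is the kind of qualitative shift the quote points at.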
In a sense, GPT-3 from OpenAI is what started the large language model (LM) craze that we have seen in the past years. It is an autoregressive model which uses deep learning to tune its 175 billion parameters. It has a close-to-human ability to write texts about pretty much any subject and from any prompt (especially with creative prompt writing), and it performs very well in answering simple questions. Perhaps the most important capability of GPT-3 is few-shot learning: it can learn to perform new types of tasks and produce new types of text from a small number of examples. This is one of the defining features of human intelligence, namely the ability to learn from just a few observations and use that knowledge in the future and in different contexts.
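In practice, few-shot learning in an LM means placing a handful of worked examples in the prompt and letting the model continue the pattern, with no weight updates. The sketch below only builds such a prompt. The `Q:`/`A:` layout and the helper name `build_few_shot_prompt` are assumptions of mine, not a format prescribed by the GPT-3 paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate solved examples and leave the final answer open for the model."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {query}\nA:")  # the model is expected to continue from here
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [
        ("Translate 'cat' to French.", "chat"),
        ("Translate 'dog' to French.", "chien"),
    ]
    print(build_few_shot_prompt(demos, "Translate 'bird' to French."))
```

The number of tuples in `examples` is the "k" in k-shot: two here, five in the 5-shot evaluations that are often reported.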
GPT-3 is large with its 175B parameters, but the more recent models dwarf it. Google’s latest LM, PaLM, has 540B parameters, roughly three times as many as GPT-3. This model shows the scaling laws in practice by improving on previous state-of-the-art (SOTA) performance from models such as Gopher (280B params), GPT-3 (175B params), and Chinchilla (70B params). It also shows impressive performance in few-shot and even one-shot learning, outperforming the average human in 5-shot performance on a subset of the Beyond the Imitation Game Benchmark (BIG-bench). As PaLM is similar to GPT-3, it shows how just increasing the scale of the model can increase capabilities.
The recent AI progress has not only been about pure language models; progress has also been made in generating images. This is done through pairing text and images in diffusion models, which have been shown to be more efficient than the autoregressive models that are used in LMs. The most famous example of a diffusion image-generator is DALL-E·2, a 3.5B parameter model created by OpenAI. It uses a two-stage model where first a prior generates a CLIP image embedding from the text prompt and then a decoder generates an image conditioned on that embedding. The images it produces from prompts are amazing, and you should check them out at OpenAI’s website. This shows how language models can be used in other tasks as well, something we will see more of below.
Given the interest that DALL-E·2 generated, it is somewhat surprising that Imagen, Google’s latest image-text diffusion model, has not gotten the same amount of attention. This model shows that increasing the scale of the frozen language model that encodes the prompt also increases the performance of the model. Human judges prefer the images generated by Imagen over those from DALL-E·2. This is similar to what PaLM showed compared to GPT-3 and Gopher: scaling improves performance and increases the capabilities of the model.
Another way to mix image and text is through combining frozen language models with visual encoders to get text output. This is what DeepMind have done with Flamingo, an 80B parameter visual language model (VLM). The model is built by combining DeepMind’s compute-optimal 70B Chinchilla model with a visual representation model. This allows for new capabilities and surprisingly good few-shot performance for the small model size. Nando De Freitas, a senior researcher at DeepMind, regularly shares interesting and seemingly very intelligent output from Flamingo and Chinchilla. Among other things, these models are able to speak in made-up languages, which looks like a clear sign of intelligence. Flamingo is also able to gauge emotions and complex cause-and-effect relationships from images, which is also impressive.
Finally, the latest and most general model out of DeepMind: Gato. Gato is a 1.2B parameter LM, i.e., it is pretty small compared to GPT-3 and PaLM. However, it is trained on a wide variety of tasks, such as Atari, robotics, language, and images. It only uses one set of weights for all tasks, and yet it produces remarkable results. For example, it performs better than humans on many Atari games, and it performs well both in dialogue and in controlling a robot arm to stack objects. Overall, this is the closest AI research has come to AGI so far. What’s interesting is that this is achieved with a relatively small model of 1.2B parameters. What happens with scale?
The only reasonable way to interpret the recent progress is to update one’s priors in favor of the scaling hypothesis. It may still be that it has a lower probability than its alternatives, though. All of these models are also evidence in favor of Rich Sutton’s bitter lesson: more general models perform better than specialized models because they can leverage computation more efficiently.
Thus, the recent progress in AI has shown us that scaling might be the answer. Yet, I am still not convinced. It seems like the effectiveness of LMs comes more from the ability to brute-force intelligence through massive compute and shallow pattern-matching than from the deep, generalizing pattern-matching that we think of when we think of intelligence in the domains of science, mathematics, engineering, and so on.
- The performance of large language models seems to follow power laws. Given current model sizes and compute budgets, there seems to be a lot of room for better performance and perhaps even new capabilities.
- Recent models underline the scaling laws by increasing performance with model size. Especially interesting is the use of a single model to complete a large number of tasks, a generalist agent.
- The evidence from recent progress points in favor of the scaling hypothesis, though the models still seem to lack generalizing capabilities.
Returning to the question of shallow vs. deep pattern-matching: it seems as if current AI models are quite intelligent, yet one must also remember that the output shared by many of the creators is cherry-picked to get the most hype for their projects. The greatness of the LMs has mainly been their performance and potential for few-shot learning. Recently, a literature on prompt engineering has also arisen. The ability of few-shot learning in LMs comes from the ability to condition on just a few examples, and thus by engineering these examples one can improve the output of a language model. This literature started out with generally clever prompting to get better output on specific tasks.
More recently, chain-of-thought (CoT) prompting has been shown to increase performance significantly in multi-step arithmetic and logical reasoning. This type of prompting can be seen in the figure below.
Kojima et al. also show that using CoT allows the LM to show zero-shot capabilities in multi-step arithmetic and logical reasoning.  This is interesting in that it shows that these capabilities are latent in the models, but for some reason they cannot be accessed without creative prompting.
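As a rough sketch of what zero-shot CoT looks like in code: the model is first asked to reason, using the trigger phrase "Let's think step by step." from Kojima et al., and is then prompted a second time to state the final answer. The `generate` callable stands in for any LM call and is hypothetical, `fake_generate` is a canned stub so the example runs without a model, and the exact wording of the answer trigger is an assumption on my part.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"  # assumed wording for the second stage


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Two-stage prompting: elicit the reasoning first, then extract the answer."""
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(reasoning_prompt)
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    return generate(answer_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a language model, returning canned text."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 10."
    return "3 boxes times 4 apples is 12 apples. 12 minus 2 is 10."


if __name__ == "__main__":
    question = "I have 3 boxes with 4 apples each and eat 2 apples. How many are left?"
    print(zero_shot_cot(question, fake_generate))
```

Nothing about the model changes between the plain prompt and the CoT prompt. Only the text it is conditioned on differs, which is why the result reads as evidence of latent capability.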
What does this mean for the intelligence of these models? This seems to be evidence in favor of these models only doing some shallow pattern-matching of the type that seems unlikely to be generalizing. But, it also seems that the capacity for the type of deep pattern-matching that generalizes might be latent in the models.
- Prompt engineering improves the performance of large language models on few-shot and zero-shot learning.
- This suggests that capabilities are latent in the models, but it also suggests that the models are mainly doing a shallow type of pattern-matching.
How did evolution create general intelligence?
Now we move on to the biological realm and the evidence that we can find there. The only real evidence we have of how general intelligence can be created is human evolution. Thus, human evolution could provide us with information about how AI could possibly evolve from here. If human intelligence came about through some algorithmic improvement compared to the species before us, then it may be that AI systems also need algorithmic improvements before becoming generally intelligent. On the other hand, if the main difference between human brains and other animal brains is scale, whether in size, neuron count, or anything like it, then perhaps it is reasonable to expect that scaling can create AGI as well.
To make it more tangible, imagine the difference between monkeys and humans. What is the difference between the brains of humans and monkeys that makes humans the monkeys that can chip axes, fly to the moon, and create computers? Is the difference that the scale of the human brain helped evolution find a sufficiently deep generalization that allows human brains to learn anything, or is there some algorithmic innovation that separates monkeys and humans?
From neuroscience, the answer seems to be that human brains are mainly scaled-up versions of primate brains. Herculano-Houzel shows that previous research comparing the human brain to those of other mammals made it look like an extreme outlier, but that accounting for different neuronal scaling laws makes it clear that the human brain is very similar to the brains of other primates, except that it is scaled up in terms of absolute count of neurons and relative brain size. So in some sense, this suggests that humans are scaled-up monkeys with a slightly larger brain, and that it is this simple thing that allows us to walk on the moon, prove mathematical theorems, and write novels.
I think this is also evidence in favor of the theory that Jeff Hawkins writes about in A Thousand Brains (see my review of the book for my thoughts). He argues that the brain is a universal learner because of the “simple” algorithm that governs it and that this works because of the size of the neocortex relative to other animals. The fact that human intelligence does not seem to be a result of an algorithmic innovation is evidence in favor of that simple learning algorithm. It is just that scale has made us walk on the moon while chimpanzees have not.
Note though, that we must avoid the alluring trap of anthropomorphizing artificial intelligence. AI systems consume compute as well as energy in totally different ways than humans do. The value of arguing from biology is thus not as high, but I still think that there’s an update in favor of the scaling hypothesis here.
- The human brain is the only known general intelligence, so how it evolved can contain evidence of how we might expect general intelligence to be developed.
- Neuroscience suggests that the human brain, while impressive, is not much of an outlier. Instead it looks like it is a scaled up version of other primate brains.
- This suggests that scaling the right algorithms and architectures can lead to general intelligence, as it has with humans.
- One should be careful of over-updating from biological evidence, as it may be that an AI will consume compute and energy in very different ways.
Technology is Different
In another way, technology works in ways that are not analogous to biology. Time and time again, innovations have arisen from single changes or single ideas. Nuclear bombs and the first flight are prime examples of technologies that did not need to be scaled up to become useful; they were useful from the moment they were created. From this, one should update towards scaling not being as important as algorithmic innovation. However, there’s a problem with this view. Has the innovation in AI already been made? Perhaps the groundbreaking innovation was backpropagation, neural nets, or the Transformer? It could be analogous to the flight example: the first flight was indeed a huge jump compared to previous attempts, but from there it took many years of scaling to reach the much more useful Boeing 737s. So perhaps the innovation to get to AGI is already out there, and we just need more scaling to get to AGI.
This sounds reasonable, but seems unlikely. I think the more accurate analogy is nuclear bombs. Figuring out the algorithmic innovation in AI will likely be more like creating a highly deadly nuclear bomb from a pile of nuclear waste. We will not have AI systems that are 90% of AGI and create only 90% of the impact, or AI systems that are 30% of AGI and create only 30% of the impact of AGI. Instead, my current belief is that we will see a large discontinuous jump in the ability of AI systems, from shallow pattern-matchers to deep pattern-matchers. This is in line with what Eliezer Yudkowsky, and many of the researchers at MIRI, believe. There are many others, though, who believe that the road to AGI will be one of smoother, more continuous development. Paul Christiano and Nando De Freitas, for example, believe in scaling and that GPT-n, with some orders of magnitude more parameters, will be an AGI.
- Evidence from the domain of technology may be more applicable to understanding AGI.
- Technologies such as the Wright Flyer, the nuclear bomb, and Bitcoin show that the usefulness of a technology very often improves discontinuously.
- This is evidence against the scaling hypothesis, unless we have already seen the critical innovation, such as backprop or the Transformer.
Based on the above, it seems that many arguments point in the direction of scaling being a path to general intelligence and the deep pattern-matching that signifies it. Yet, something seems to be missing. The larger and larger models certainly are smarter than the smaller, older models, but they still do not show signs of any deep pattern-matching or generalizing capabilities.
I think the technology argument is quite strong. Many examples from history show that definitively solving a problem requires a particular solution, not just doing more of the same thing. However much one would have wanted to create a nuclear bomb, it would not have been possible by just creating a larger and larger pile of nuclear material. The same seems to have been true of AI progress in some cases, especially AlphaZero and AlphaGo. (Though these breakthroughs are somewhat less special when one considers the outsized investments from DeepMind that went into creating them. Perhaps progress in AI is quite linear and smooth with respect to investment.)
Biologically, it also seems likely to me that artificial intelligence is completely different from biological intelligence. It can consume energy and compute in ways that are completely different from biological brains. Therefore, the biological arguments are not as convincing as they seem at first glance.
Aside from that, I believe that scaling will be powerful and that it might lead to some real-world impact, although I don’t believe that we will see skyrocketing, or even markedly different, GDP growth rates from scaled models before we see AGI coming out of some new approach. The question that remains is: given the recent progress and the amount of investment into AI capabilities research, how long can we expect it to take for humanity to develop AGI? That’s a hard question, but I’ll try to answer it, along with the common views on the period before AGI, in the next post.
 Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).
 Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.
 Chowdhery, Aakanksha, et al. “PaLM: Scaling language modeling with Pathways.” arXiv preprint arXiv:2204.02311 (2022).
 Srivastava, Aarohi, et al. “Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models.” arXiv preprint arXiv:2206.04615 (2022).
 Ramesh, Aditya, et al. “Hierarchical text-conditional image generation with clip latents.” arXiv preprint arXiv:2204.06125 (2022).
 Saharia, Chitwan, et al. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.” arXiv preprint arXiv:2205.11487 (2022).
 Alayrac, Jean-Baptiste, et al. “Flamingo: a visual language model for few-shot learning.” arXiv preprint arXiv:2204.14198 (2022).
 Reed, Scott, et al. “A generalist agent.” arXiv preprint arXiv:2205.06175 (2022).
 Kojima, Takeshi, et al. “Large Language Models are Zero-Shot Reasoners.” arXiv preprint arXiv:2205.11916 (2022).
 Herculano-Houzel, Suzana. “The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost.” Proceedings of the National Academy of Sciences 109.supplement_1 (2012): 10661-10668.
 Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).