Large Language Models and the end of Moore’s Law
"Thou shalt witness the doubling of computational power at a constant cost every two years."
Moore's Law is a sacred observation in the realm of electronics and computing. This divine law, first uttered by the prophet Gordon Moore in 1965, has guided the miraculous growth of the computing industry, resulting in ever smaller, more powerful, and more affordable electronic devices for the faithful followers of technology.
As with any sacred dogma, there are heretics. I have heard many times in recent years that we are at the end of Moore’s law. They say that transistors are already getting so small that quantum effects have become a major obstacle. Very soon, they say, a transistor could be around the same size as the stuff we build transistors out of, and we’ll run out of tricks to make it any smaller. But let’s not be dogmatic about this: they’re right. All signs point to us being near the end of the sigmoid curve of innovation for silicon-based transistors.
But they are responding to a myopic framing of the problem: the real goal has never been to fit as many transistors on a chip as possible. It suits you to frame the problem this way, however, if you are one of the co-founders of a chip company.
Moore’s original framing was an economic one, but it obscures the true mechanism at work. There’s nothing special about integrated circuits as a domain that makes rapid advancement and innovation uniquely possible there. The problems in this space aren’t easier than the problems elsewhere, and the doubling doesn’t simply happen by accident.
Computation is special for two main reasons:
1. Investment in computation pays off in every industry and problem domain, not just in computation itself.
2. Innovations that lower the cost of compute drive further innovations that lower the cost of compute. Investment in computation therefore compounds on itself to a degree that investment in other industries and problem domains does not.
The problems here may actually be more difficult, but for some reason, humanity is willing to throw greater and greater proportions of our resources and effort at the problems in computation. It’s not that we’re sitting tight and noticing that the same 20 engineers keep making things exponentially better given the same amount of time and money — it’s that for the past 100 years, we have been (gradually and now extremely quickly) shifting society’s resource allocation from other problems into this one. The problems aren’t getting easier, we’re just trying a lot harder.
Machine learning isn’t just getting less expensive. We’re also throwing way more money at it.
What happens when we can’t fit any more transistors on a silicon wafer?
Back in the early 20th century, computers were built from electromechanical devices such as rotor mechanisms. There was a period of rapid progress with these machines, but by the mid-1940s, vacuum tubes were dominant. Discrete transistors took over in the late 1950s, and integrated circuits built with photolithography followed through the 1960s and 70s. In each of these shifts, there was a sigmoid curve of progress, and in each, many must have worried that the paradigm was ending. But in each case, we threw more effort at the problem and found a new way of doing things that had better returns.
The rate of progress in conventional CPU performance has been slowing, while the rate at which developers pump out bulkier and bulkier Electron apps is rapidly accelerating. If we examine the current computing landscape through this lens, it looks like we’re already near the middle of a new paradigm. Most people haven’t noticed unless they have a habit of building their own gaming computers, because this shift hasn’t shown up in mainstream consumer devices the way previous ones did. If anything, mainstream users might feel like software is actively getting slower because of the aforementioned bloat of modern development.
On the other hand, the rate of progress of GPU throughput has been rapidly accelerating. In 2007, the Nvidia 8800 Ultra gave a single precision performance of 384 gigaflops, while in 2023, the 4090 yields 82,600 gigaflops. Of course, the 8800 cost only $650 while the 4090 costs more than double that at $1599 (if you can get it at MSRP). We’re getting a better deal, but at the same time, we’re spending quite a bit more.
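For a sense of scale, here is a quick back-of-envelope calculation with the numbers above. It is a rough sketch that ignores inflation, memory bandwidth, and the fact that peak flops are rarely reached in practice:

```python
import math

# Back-of-envelope: single-precision throughput per dollar, using the launch
# figures quoted above (peak GFLOPS and MSRP; street prices will differ).
old = {"name": "8800 Ultra (2007)", "gflops": 384, "price_usd": 650}
new = {"name": "RTX 4090 (2023)", "gflops": 82_600, "price_usd": 1599}

old_ratio = old["gflops"] / old["price_usd"]    # ~0.6 GFLOPS per dollar
new_ratio = new["gflops"] / new["price_usd"]    # ~52 GFLOPS per dollar

improvement = new_ratio / old_ratio             # ~87x more throughput per dollar
years = 2023 - 2007
doubling_time = years / math.log2(improvement)  # ~2.5 years per doubling

print(f"{improvement:.0f}x GFLOPS per dollar over {years} years "
      f"(one doubling roughly every {doubling_time:.1f} years)")
```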
Machine Learning, Machine Feeling
As it turns out, one technique was particularly well positioned to take advantage of this new trend in compute: machine learning. In 2012, AlexNet marked the beginning of the GPU revolution in ML and of “deep learning” as we know it today. Before this, there was a great deal of uncertainty and doubt about neural networks. To put things in perspective, early on many academics were concerned that a perceptron could not even fit a simple XOR function. Even in the early 2000s, the field of “AI” was mostly focused on techniques that were algorithmic and discrete in nature (think A* search or most chess bots) rather than data-centric (which is what most people would associate with AI today). But as it turns out, neural networks, convolutional ones in particular, allowed for much greater leverage of the available hardware.
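That XOR limitation is easy to reproduce. Below is a minimal numpy sketch: a brute-force search over linear classifiers never gets all four XOR points right (the impossibility itself is a classic result), while a tiny hand-built network with one hidden layer handles it:

```python
import itertools
import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# 1) A single perceptron computes step(w.x + b). Searching a coarse grid of
#    weights shows that no linear boundary classifies all four points.
best = 0
grid = np.linspace(-2, 2, 21)
for w1, w2, b in itertools.product(grid, repeat=3):
    preds = (X @ np.array([w1, w2]) + b > 0).astype(int)
    best = max(best, int((preds == y).sum()))
print(f"best single-layer accuracy: {best}/4")   # 3/4, never all four

# 2) One hidden layer fixes it: XOR(a, b) = OR(a, b) AND NOT AND(a, b).
def two_layer_xor(x):
    h_or = int(x[0] + x[1] - 0.5 > 0)    # hidden unit computing OR
    h_and = int(x[0] + x[1] - 1.5 > 0)   # hidden unit computing AND
    return int(h_or - h_and - 0.5 > 0)   # output: OR and not AND

print([two_layer_xor(x) for x in X])     # [0, 1, 1, 0]
```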
More recently, transformers and diffusion models have consumed deep learning. Interestingly, both of these methods shift computation from training time to inference time. Compared to a CNN, for example, where the model must do all of its computation in a single pass at inference, a diffusion model spreads that decision making across many passes, building on and refining its earlier decisions. The same is true of transformers. Prior sequence models encoded an entire input passage into a fixed-length hidden state from which the whole output was decoded; a transformer instead makes a forward pass for each output token, breaking the problem down and solving it gradually in sequence rather than all at once.
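To make that per-token loop concrete, here is a minimal sketch of autoregressive decoding. The `next_token` argument stands in for a full transformer forward pass; the toy version below exists only so the example runs on its own:

```python
from typing import Callable, List

def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int) -> List[str]:
    """Autoregressive decoding: one forward pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)      # conditions on everything generated so far
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

# Toy "model": echoes the 3-token prompt back once, then stops.
def toy_next_token(tokens: List[str]) -> str:
    generated = len(tokens) - 3       # 3 = length of the prompt below
    return tokens[generated] if generated < 3 else "<eos>"

print(generate(toy_next_token, ["the", "cat", "sat"], max_new_tokens=10))
# ['the', 'cat', 'sat', 'the', 'cat', 'sat']
```

The cost structure follows directly from the loop: generating N tokens means N forward passes, each attending over an ever longer sequence.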
This, again, allows these methods to take advantage of drastically more computation than prior methods that required a single forward pass. Right now, we’re running prompts with 4k and 8k context lengths, with 32k down the road. A single 32k prompt probably requires around $2 worth of electricity alone, to say nothing of the hardware it ties up. And I doubt we’ll stop there. If more and more consumers are willing to pay for intelligence, more and more resources will be put into innovation and into manufacturing more capacity.
LLMs will be incredibly useful, and not just for corporate applications
ChatGPT is AI’s first real killer app. While statistical learning methods are ingrained in many industries, consumer applications have lagged. For decades, engineers have been trying to build AI for consumers, but until recently, the most widely used iterations of the technology addressed only mild inconveniences. The releases of Siri and Alexa were exciting, but it quickly became clear that no one wanted to use them to any meaningful degree.
This time is different. ChatGPT reportedly reached 100 million users within two months of its release, making it the fastest-adopted consumer technology in history. Demand for the service drastically outpaces supply, which is why OpenAI has kept tightening the rate limit on GPT-4 prompts even for users paying $20 per month. On the API side, demand is so high that prompts often take 15-20+ seconds to get a response. That makes great real-time applications infeasible, and that’s considering only a single prompt; many applications need dozens of decisions and answers chained in series, and in most scenarios users won’t tolerate waiting for even one or two. And unlike earlier fast-growing technologies (e.g. Twitter), this one is extremely performance intensive.
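For a sense of the arithmetic, here is a trivial sketch of how serial chaining compounds latency. The per-call latency and chain length are illustrative assumptions, not measurements:

```python
# Rough sketch: latency compounds when prompts must be chained in series.
# The numbers here (15 s per call, 12 calls) are illustrative assumptions.
per_call_seconds = 15
calls_in_chain = 12

total = per_call_seconds * calls_in_chain
print(f"{calls_in_chain} chained calls x {per_call_seconds}s = {total}s "
      f"(~{total / 60:.0f} minutes of waiting)")   # 180s, ~3 minutes
```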
The solution? Gobble up all available compute, then invest heavily in making more. It’s not going to get easier to innovate on chip design, but this unprecedented demand will push us to allocate even more of our resources to compute. We’ll design better, more efficient, and probably even more expensive cards, and we will put them everywhere. Software is eating the world, and AI (probably something similar to what we today call LLM completions) will eat software. Even in places where intelligence isn’t directly needed to enhance an application’s utility, it might be a lot easier to write and maintain a prompt than to write even Python code for a given piece of functionality, as sketched below. In the same way we saw high-level languages replace assembly (even though hand-tuned assembly can be more efficient), and garbage collection replace manual memory management (even though it isn’t as performant), we’ll likely see prompting systems gradually encompass more and more of software engineering.
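As a toy illustration of that trade-off, here is a sketch comparing a hand-written parser to a prompt-based one. The `complete` function is a hypothetical stand-in for whatever LLM completion API you might use, stubbed out here so the example is self-contained:

```python
import re
from typing import Optional

# Hand-written approach: extract a date like "March 5, 2024" with a regex.
# It handles exactly the formats you thought of while writing it, and nothing else.
DATE_RE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

def extract_date_by_hand(text: str) -> Optional[str]:
    m = DATE_RE.search(text)
    return m.group(0) if m else None

# Prompt-based approach: describe the behavior in English and let a model parse.
def complete(prompt: str) -> str:
    return "March 5, 2024"   # hypothetical stub; a real system would call a model

def extract_date_by_prompt(text: str) -> str:
    prompt = (
        "Extract the date mentioned in the following text and return it "
        f"in the form 'Month D, YYYY':\n\n{text}"
    )
    return complete(prompt).strip()

text = "The contract was signed on March 5, 2024 in Berlin."
print(extract_date_by_hand(text))     # March 5, 2024
print(extract_date_by_prompt(text))   # March 5, 2024 (from the stub)
```

Whether that trade is worth it depends on how much the prompt’s flexibility buys you relative to the cost and latency of each call.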
What are the chances that in the near future (5-10 years) Apple sells a pro-max-plus-plus-level iPhone capable of running a (today SoTA) large language model in real time? They could deliver what was originally promised with Siri and Alexa. What if you could easily invoke a prompt, or multiple prompts in series with low latency, directly from the iOS SDK, running locally? We can already run Stable Diffusion locally on an iPhone, and someone has already managed to run LLaMA locally on a Pixel 6.
The 1900s→2000s saw computing go vertical: we increased the density of compute units at a shatteringly exponential rate. But as we reach the end of this sigmoid curve, we’re beginning another (in my opinion, much more exciting) innovation curve. As long as the returns on intelligence stay high, and the demand for intelligence continues to grow rapidly, we’ll continue to reshape society to throw more and more resources at the problem. We are at the end of Moore’s Law, but we are at the beginning of a new paradigm. I’m excited (and also worried) to see where this takes us.