The last couple of weeks have had the large language model (LLM) world all a-flutter about DeepSeek's new R1 model. The quality allegedly rivals that of the previous front-runner, OpenAI's o1. But where OpenAI spent a rumored $500M on training o1 (and is reportedly spending $1 billion on GPT-5, up from the $80-100 million it took to train GPT-4), DeepSeek claims to have spent a relatively frugal $5M on their model.
Around the same time another Chinese company, 01.ai, trained its Yi-Lightning model using 2,000 GPUs of an undisclosed type for $3 million, according to CEO Kai-Fu Lee. Meanwhile, it is believed that OpenAI used 10,000 Nvidia A100 GPUs to train its GPT-3 model and many more H100 processors to train its GPT-4 and GPT-4o models.
And these cost disparities are reflected in their pricing. OpenAI o1 costs $15 per million input tokens and $60 per million output tokens, whereas DeepSeek Reasoner, which is based on the R1 model, costs $0.55 per million input tokens and $2.19 per million output tokens.
So, how are these much smaller companies managing a reduction in costs of well over 90%?
By thinking laterally. Without ready access to the latest H100 GPUs from Nvidia, they had to do something different.
Part of the reason for the lower training cost is that they're piggybacking on other models' training. It turns out that you can use an existing model like Llama to generate large quantities of synthetic training data and then train on those generated datasets. Oddly, you can actually end up with larger training sets than the original models used.
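As a rough illustration of that trick - a minimal sketch, assuming the Hugging Face transformers library, with placeholder model names, prompts, and settings rather than anything DeepSeek or 01.ai actually used:

```python
# Sketch of training on another model's output: a big "teacher" model
# generates synthetic text, and a smaller "student" is then fine-tuned on it.
# The model name, prompts, and sampling settings are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "meta-llama/Llama-2-7b-hf"  # assumed teacher; any capable open model works

tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)

prompts = [
    "Explain, step by step, why the sky is blue.",
    "Write a short proof that there are infinitely many primes.",
]

synthetic_corpus = []
for prompt in prompts:
    inputs = tok(prompt, return_tensors="pt")
    output = teacher.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.8)
    synthetic_corpus.append(tok.decode(output[0], skip_special_tokens=True))

# A smaller student model is then fine-tuned on synthetic_corpus with a
# standard causal-LM training loop, exactly as if it were human-written text.
```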
Ben Thompson speculates in his Stratechery blog post that the older H800s that DeepSeek have, "are Hopper GPUs, they just have much more constrained memory bandwidth than H100s because of U.S. sanctions. Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is impossible to do in CUDA."
So, again - faced with constraints, they had to think laterally and work both smarter and harder.
And they're not the only ones - in "The Short Case for Nvidia" Jeffrey Emanuel lists a whole host of companies like Cerebras and Groq that are taking radically different approaches to chip design in order to deliver performance far beyond that of the flagship H100s.
This is not new - we've seen this pattern of change driven by limitations again and again.
Constraints can lead to revelations
Let's talk about an obscure French literary movement - Oulipo, short for "Ouvroir de Littérature Potentielle", or "The Workshop of Potential Literature". (Not something you'd usually expect to read about in a blog post from a tech company.) Oulipo was a loose collective of writers and mathematicians who used constrained writing techniques as a means of triggering ideas and inspiring creativity.

Co-founder Raymond Queneau wrote "Exercices de Style", a collection of 99 retellings of the same minor altercation on a bus trip, each in a different tone and style.
Georges Perec's "La Disparition" is a 300-page mystery novel written without the letter "e", in which the absence of that letter is also the plot of the story. Remarkably, it was later translated into English by Gilbert Adair as "A Void", still without using a single "e".
From the Dadaists' "cut-up" technique, as popularized by Beat writers like William Burroughs, to Brian Eno's "Oblique Strategies", creatives have used these approaches to push their work forward - to get themselves out of creative ruts, out of local maxima, to borrow a phrase from machine learning.
Oulipo has fascinated me ever since I heard about it. I loved the idea that by being constrained you might get great leaps of insight and inventiveness. As someone with a chronic habit of working at start-ups, I was used to having to do more with less, which I think is one of the reasons it resonated with me.
Web 2.0, Memcached, and the LAMP Stack
One concrete example of this comes from back in the early 2000s, when my Fastly co-founder, Artur Bergman, persuaded me to up sticks from London and work with him at LiveJournal, which had been acquired by Six Apart, the makers of the pioneering Movable Type and TypePad blog platforms.
LiveJournal was revolutionary in its time - not just for inventing a lot of tropes that are now standard in blogging (it was even, originally, limited to 140-character posts. I wish I was joking) - but because it created or popularized a lot of the standard methods we now use for building websites.

LiveJournal founder Brad Fitzpatrick couldn't afford million-dollar Sun Microsystems servers, so he had to make do with commodity hardware. He, and the very small group of people who ran the site (who have since gone on to do incredible things such as fixing the Fail Whale at Twitter, running SRE at YouTube, and keeping Facebook scaled - not least Brad himself, who has been central to the successes of Android, the Go language, and Tailscale), had to think differently.
LiveJournal, and the talks Brad and others gave at conferences, popularized techniques like sharding and partitioning. We created Memcached (now a de facto addition to the LAMP stack, alongside MySQL), which is still hugely popular today despite more featureful alternatives like Redis coming along. We weren't the only ones, for sure - it was a small community of people, and we were close with people at websites like Flickr who used similar techniques. Incidentally, Flickr, and specifically the talk "10+ Deploys Per Day: Dev and Ops Cooperation at Flickr", was the basis for the modern DevOps movement.
Why? Because we had no money. We needed to make things scale on a budget. At a certain point, DMA100 drives run out of read capacity, so you're forced to think around that - why not do data reads from memory? But you can't cache writes... So you need to spread the writes across different disks… Which means you need to shard. You can't wait a month for deployment. So you figure out ways to deploy 10 times a day.
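In code, that pattern looks roughly like this - a minimal sketch, not LiveJournal's actual implementation, using pymemcache, with an assumed shard count and hypothetical db_read/db_write stubs standing in for real database calls:

```python
# Reads come from memory (Memcached); writes go to the database shard that
# owns the key. Shard count, connection details, and the db_* stubs are
# placeholders for illustration.
import hashlib
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))
NUM_SHARDS = 4  # e.g. four MySQL servers, each owning a slice of the users

def shard_for(user_id: str) -> int:
    """Deterministically map a user to a database shard."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SHARDS

def db_read(shard: int, user_id: str) -> bytes:
    ...  # hypothetical stand-in for a SELECT against that shard

def db_write(shard: int, user_id: str, value: bytes) -> None:
    ...  # hypothetical stand-in for an INSERT/UPDATE against that shard

def get_profile(user_id: str) -> bytes:
    """Read-through cache: check memory first, fall back to the owning shard."""
    key = f"profile:{user_id}"
    value = cache.get(key)
    if value is None:  # cache miss: one disk read, then keep it in RAM
        value = db_read(shard_for(user_id), user_id)
        cache.set(key, value, expire=300)
    return value

def save_profile(user_id: str, value: bytes) -> None:
    """Writes can't be cached, so they hit the owning shard, then invalidate."""
    db_write(shard_for(user_id), user_id, value)
    cache.delete(f"profile:{user_id}")
```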
Constraints lead to innovation.
Fastly's Flashes Of Insight
We started Fastly because we were frustrated with the state of CDNs. We had all used CDNs for years - when I was at Yahoo!, we were one of Akamai's earliest and largest customers.

Akamai started in 1998 when the internet was a very different place. This was a time of local ISPs, competing 56k modem standards, and websites that were largely static.
What was amazing was that in 2011, when we were building the company, nothing much had changed. You configured Akamai through an XML file, and deploys took hours. Logs were shipped to you, via FTP (!!!) no less, once a day. "Fast" purges were anything but.
Amazingly, you still couldn't cache dynamic content; you had to rely on "Dynamic Site Acceleration", which lowered latency but did nothing more. TLS, Media Streaming, Dynamic Content, and DDoS Protection were all separate networks.
And all the pretenders to the Akamai throne were just clones of that functionality.
To be fair, AWS's CloudFront, although limited, had brought PAYG (pay as you go) to the industry, and our neighbors down the road, CloudFlare, were doing cool stuff, bringing CDN functionality to the masses and blending it with security.
But it felt stagnant. Akamai had dabbled with EdgeComputing in 2004, but the results were underwhelming - limited to Java and C#, it was slow, expensive, restrictive, and required your code to be vetted by your Sales Engineer.
And nobody had a reason why. Over and over we were told that certain things - instant deploys, instant purging, real-time stats and logging, caching of dynamic content - were impossible. It was known. It was a fact.
We were crazy for even asking.
So we tried to figure it out from first principles. We whiteboarded it out (actually, we used the windows of our apartment - whiteboards are really expensive. Even back then, we were coming up with innovative solutions to constraints).
The bigger problem was - how are we going to build this? We were going to have to compete with established players with giant networks - hundreds of thousands of servers.
In short - how do you build a complete CDN for less than a million dollars?
This is where we had our two major "Aha!" moments.
First - the internet had changed. Profoundly and dramatically. We didn't have local ISPs anymore. We didn't need to build a huge network - in fact, that would be bad. A huge number of cache servers means more cache fragmentation, which means lower performance. And if each cache server is small, it can do less. Not to mention the issues with coordinating that many servers.
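A back-of-the-envelope sketch makes the fragmentation point concrete (the numbers are assumptions for illustration, not our real fleet):

```python
# With a fixed total cache budget, spreading it across more independent servers
# means each hot object is duplicated on every one of them, each server holds a
# smaller slice of the long tail, and origin traffic goes up. Assumed numbers.
TOTAL_CACHE_TB = 96  # illustrative fleet-wide cache capacity

for servers in (8, 96, 960):
    per_server_tb = TOTAL_CACHE_TB / servers
    print(f"{servers:>4} servers: {per_server_tb:6.2f} TB each; "
          f"a hot object is stored {servers}x and fetched from origin up to {servers}x")
```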
The received wisdom was wrong.
The second moment was more profound. And, surprisingly, it related back to our experience with Memcached. SSDs, or solid-state drives, were becoming more popular and widespread. But, in $/GB terms, they were still frighteningly expensive. You would have been nuts to deploy them fleet-wide - they were for specialized applications only, deployed sparingly and carefully.
If you thought of them as $/GB.
But what we realized was that we shouldn't be looking at them that way. We didn't care about the cost per gigabyte. What we needed to think about was dollars per IOPS (input/output operations per second). Throughput. And on that measure, solid-state drives were incredibly cheap. Don't think of SSDs as expensive disks; think of them as really cheap memory.
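The arithmetic looked something like this - the prices and IOPS figures below are rough, assumed circa-2011 numbers for illustration, not actual quotes:

```python
# Judge drives by $/GB and the SSD looks hopeless; judge them by $/IOPS and
# it's the obvious choice. All figures are rough assumptions for illustration.
drives = {
    # name: (price_usd, capacity_gb, random_read_iops)
    "15K RPM disk":   (300, 600, 200),
    "Enterprise SSD": (1500, 300, 20000),
}

for name, (price, gb, iops) in drives.items():
    print(f"{name:15s}  ${price / gb:5.2f}/GB   ${price / iops:7.4f}/IOPS")

# Roughly: the spinning disk wins ~10x on $/GB, but the SSD wins ~20x on $/IOPS.
```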
Not only that, but the rule is that if you're dealing with internet-level traffic, you use routers, not switches. Switches are for internal communications. But routers, typically bought from companies like Cisco and Juniper, are expensive. Each one costs around $500k, and we would probably need three per data center - $1.5M even before we added servers. But we had only raised $1M. The math, as it were, wasn't mathing.
So what were we going to do? We used switches from Arista instead. These were relatively cheap at around $20K apiece. Not cheap in REAL terms, I'll grant you, but SIGNIFICANTLY cheaper than $500k.

But… aren't they ill-suited for internet traffic? True, but Aristas let you write your own software to run on top of them. So we did. And then we told everyone how we'd done it.
Suddenly a lot of things fell into place. Not only could we build a viable network, we could build it cheaper and better. We could cache dynamic content, not just accelerate it. And we could deliver a lower TTFB (Time To First Byte) at the same time.
There were lots of other things we had to work out, both large and small, but these were the major insights that gave us confidence. Everything else was "just" a Simple Matter Of Programming™.
And that's how five people in an office above a pizza shop were able to start competing with a behemoth like Akamai, which probably spent more on soft drinks in a year than we had in funding.
Writers, Mathematicians, and AI walk into a bar…
There's a nice circularity to the whole story.
Interestingly, when I first learned about Oulipo back in the late '90s, I was also playing around with Markov chains - at first I used them to generate Oulipo-style works according to some constraint: writing poems using only four-letter words, for example.
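For flavour, here's a toy version of that experiment - a minimal sketch with a placeholder corpus and seed word: a word-level Markov chain whose transition table only ever contains four-letter words.

```python
# A tiny word-level Markov chain with the OuLiPo-ish constraint baked in:
# only four-letter words ever make it into the transition table.
import random
from collections import defaultdict

corpus = "this tiny toy text does what oulipo fans love best with pure luck".split()

# Build transitions only between consecutive four-letter words.
four = [w for w in corpus if len(w) == 4]
transitions = defaultdict(list)
for a, b in zip(four, four[1:]):
    transitions[a].append(b)

def generate(start: str, length: int = 8) -> str:
    """Walk the chain, falling back to any four-letter word at a dead end."""
    word, out = start, [start]
    for _ in range(length - 1):
        choices = transitions.get(word) or four
        word = random.choice(choices)
        out.append(word)
    return " ".join(out)

print(generate("this"))
```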
But as someone who has spent a long time working on search engines, I also realized that Markov chains represented a form of vector space. Google's PageRank algorithm is based on the same insight. By representing documents as vectors in that N-dimensional space, you could do really interesting information retrieval. After a while, I moved on to different projects, but 20 or so years later, LLMs use these very same vector space techniques to perform tasks that often look like magic.
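The vector-space idea in miniature - toy documents and plain Python, nothing like production-scale retrieval, but the same shape of trick:

```python
# Represent documents as word-count vectors and rank them against a query by
# cosine similarity. Scaled up enormously (and with learned embeddings instead
# of raw counts), this is the same family of techniques LLMs lean on today.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

docs = [
    "the cache stores hot objects",
    "hot objects live in the cache",
    "markov chains generate text",
]
query = "cache hot objects"
for d in sorted(docs, key=lambda d: cosine(query, d), reverse=True):
    print(f"{cosine(query, d):.2f}  {d}")
```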
This only goes to show that you have no idea where these insights are going to take you. An exercise in creative constraint from a French literary movement in the 1960s can have a through line to the magic technology underlying a huge internet giant in the early 2000s, then to a massive technology trend in 2025, and to some $600B being (temporarily?) wiped off a company's value.
The current AI boom (which is not the first, and probably won't be the last) is ripe for scrappy companies to come in and undercut the big dogs. And this could benefit all of us - the newcomers may be optimizing so that they can compete with the incumbents, just like we did in the early days of Fastly - but those innovations may help everyone. Smaller, cheaper, more efficient LLMs are not only good for profit margins, they're good for the environment.
And I don't think we need a Language Model to tell us why that's good.
Sources