# CGO 2024 Keynote

# Record Link

# Full Text

Thank you very much, Fernanda, for that nice introduction. And thank you to all of you for coming here to hear me speak. And it's good to be here in Edinburgh. And for Edinburgh, the weather is great at this time of year.

The title of my talk is Computing Systems for the Foundation Model Era. And I'm going to talk about some of the ideas in software and hardware, in the compilers and the design of the systems, that focus on how we execute these sorts of applications much more efficiently.

Right, so everybody's aware of foundation models. You know, you have to have been living under a rock not to have heard about ChatGPT or used it.

And so the whole idea is that you take a model that is composed of potentially hundreds of billions of parameters, and you train it on huge amounts of data. And you get models like GPT-4, and Midjourney, and Stable Diffusion that can do things that heretofore we didn't think were possible.

One of the key models or kind of key capabilities that these models exhibit is this idea of in-context learning. So the idea is that one model can be adapted for many different tasks. And so it's possible to develop machine learning models in days with undergrads that, formerly, if you wanted to do the same thing, it would have taken you months with PhD students. And so this provides a great capability because you can do this kind of adaptation by using English language prompting. So you don't need to write code, and so this creates incredible capabilities. And, you know, clearly these sorts of models are going to transform all aspects of society.

And so one of the centers at Stanford is the Center for Research in Foundation Models, which is looking at how these models are going to be used in many different aspects in society.

So as we've seen, these models have impressive capabilities. They can fix bugs. You give it some buggy code, and it will tell you how to fix the bug and what mistakes you've made.

It can generate art. This is a picture that was generated using Stable Diffusion. And it can also be used to predict the shape and form of proteins. So it can be used to design drugs.

And so with the whole idea of protein folding, as one biologist explained to me, the ChatGPT moment of the biological world was when AlphaFold debuted. And so these models have incredible capabilities. And as the size of the models grows, the capabilities improve.

So for example, if you look at the rate at which the models have been increasing, the models have been increasing by a factor of 10 every few months. For instance, take a model trying to explain this joke: "I tried 10,000 random restarts of my neural network, but I was accused of overfitting. I guess no good seed goes unpunished." Well, if you gave this joke to a 1.3-billion-parameter model, it wouldn't explain the joke that well. But if you gave the same joke to a 175-billion-parameter model, all of a sudden it can explain the joke and tell you that the punchline is a play on "no good deed goes unpunished."

So what you see is as you get to a certain size, you've got these emergent capabilities in these models.

And of course, training these models is incredibly expensive. Tens of millions or maybe hundreds of millions of dollars to train these very large models.

As we want to continue to increase the scale and capability of these models, we need more computational capability, and this computational capability has to be efficient, right? Because clearly we're going to run out of power otherwise, right? So the idea is, you know, how do we continue to develop systems that are even more efficient and more capable for these foundation models?

And so, if we want to get 100x or 1,000x improvement in performance, we need to focus on both performance, of course, and performance per watt. People have been talking about superintelligence and what that will enable. It will allow you to have your own group of AIs that can help you do your research. And it can also, of course, enable people with a lot fewer resources to take advantage of the sorts of capabilities that AI models can provide.

So we need to continue to develop these efficient models, and we need to make them both efficient and programmable.

You might imagine that machine learning models have coalesced to something that looks like a transformer and won't move. But I'm going to show you that that's not the case.

You need programmability. You need the ability for machine learning application and algorithm developers to innovate and change their models.

And so you want, let's say, programmability together with efficiency and performance, right? So that's one goal that we'd like to achieve as system architects.

And so the solution to providing these sorts of capabilities is a vertically integrated solution that combines innovation in machine learning algorithms that are co-optimized with languages and compilers. And I would argue that these compilers should be based on the ideas of data flow and that these data flow compilers should be married with efficient hardware in the form of reconfigurable data flow architectures.

And so the rest of the talk will follow this outline. I'm going to touch on each of these aspects of the vertically integrated solution for efficient and performant systems for foundation models.

So let's start by looking at the trends in ML algorithms. Right. So I've already showed you the dominant trend, which is that the models are getting dramatically bigger.

And some may argue that this trend is plateauing. But I would argue that, you know, it's going to continue, but it's going to continue in a slightly different way.

Right. So today, if you look at most of the large models, they are monolithic models, right?

So you have the single monolithic large language model and you use it as a general purpose AI capability in which you ask it whatever you would like using English language.

But an alternative is to think of these models as a collection or a combination of a number of specialized models, right?

And so the idea is that instead of one monolithic model, you've got a number of specialist models, and they combine to provide the overall capability of the large model.

And you have an automatic way of routing the prompt that comes in to the model which is best able to answer it.

So this is research being done by a number of people, including Colin Raffel, and one of the reasons that this idea is interesting is because the specialist models can be both cheaper and sometimes better for a specific task than the big general model.

So this example shows the number of flops that you need to do inference for a particular model. The blue squares are general models, like GPT-3, of all different sizes. The green circle and star are these specialized models, which take far fewer flops but give you accuracy which is just as good as or better than the big general model.

So the idea then is this trend that models will get bigger, but they will get bigger potentially by combining a lot of specialist models. Here's some recent work being done at SambaNova, and this idea is the Samba-1 Composition of Experts, right?

So, the idea is that you want to build a 1.3-trillion-parameter model, but you want to do it by combining a large number of smaller models, right? In this case, 54 models of 7 billion to 70 billion parameters.

And these are open-source models, right? So you can take these models from Hugging Face and bring them together.

And with the right kind of router, you can get performance which matches or beats proprietary models from OpenAI, such as GPT-3.5 and GPT-4, on these enterprise benchmarks.

Shown here on the right is a set of enterprise tasks that you want to perform, from text editing to SQL generation to reasoning, showing that the orange bar is better than or comparable to the proprietary GPT-3.5 and GPT-4.

So you've got a collection of open-source models that have been specialized for particular tasks, combined to beat a large closed-source model, right? And so this is a trend that we will see continuing, the fact that you can specialize these models.

And now the big benefit is not only do you have an open model, but serving the models can be dramatically cheaper because you only have to fire up a small portion of the 1.3 trillion parameters, as opposed to the whole model, every time that you get a new prompt. So that's one of the important trends.
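To make the routing idea concrete, here is a minimal sketch. It is not the Samba-1 router: the expert names, keyword sets, and parameter counts are all hypothetical, and a real router would itself be a learned model. It only shows that answering a prompt touches one expert's parameters rather than all of them.

```python
# Toy composition-of-experts router. Expert names, keywords, and parameter
# counts (in billions) are hypothetical and only illustrate the routing idea.
EXPERTS = {
    "sql":   {"params_b": 7,  "keywords": {"select", "table", "join", "sql"}},
    "code":  {"params_b": 34, "keywords": {"bug", "function", "compile", "python"}},
    "legal": {"params_b": 70, "keywords": {"contract", "clause", "liability"}},
}

def route(prompt: str) -> str:
    """Pick the expert whose keyword set overlaps the prompt the most."""
    words = set(prompt.lower().split())
    return max(EXPERTS, key=lambda name: len(EXPERTS[name]["keywords"] & words))

def active_fraction(prompt: str) -> float:
    """Fraction of all parameters that must be active to answer this prompt."""
    total = sum(e["params_b"] for e in EXPERTS.values())
    return EXPERTS[route(prompt)]["params_b"] / total

prompt = "write a sql query to join the orders table"
chosen = route(prompt)
fraction = active_fraction(prompt)
```

Here only the 7-billion-parameter expert out of a 111-billion-parameter collection is needed for the example prompt.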

Now, let's take a look at transformers and attention. As you are well aware, the dominant way that these large language models are developed is by using a transformer architecture, right?

And the core of transformers is attention, right? Attention helps draw connections between different components of the query or the prompt, right?

So for example, in these two sentences, in the first sentence, the it refers to dog, right?

And in the second sentence, the it refers to street, and attention would detect this, right?

So if you look at the attention algorithm, right, you've got three linear maps, QKV, which are called queries, keys, and values. And the whole point of the attention algorithm is to come up with a mapping between these and the output.

And we know that as you increase the sequence length of your model, it typically becomes more capable. And current sequence lengths are between 1K and maybe 64K tokens. And what we'd like to do is increase the sequence length dramatically, maybe by orders of magnitude.

And so why is that important? Well, we want to be able to analyze works of literature, such as books or plays. We want to analyze long legal contracts. We want to look at complete code bases.

And so we want to feed all of this data as the input to the prompt to the model. Also, if you think about analyzing images with vision transformers, you'd like to be able to deal with very high-resolution images; you're going to get better, more information for more resolution. And so your insights will be more robust.

And finally, there are lots of new areas that potentially could be represented as sequences that would be useful to use some sort of attention-based model on, such as time series in financial markets, video, medical imaging, genomics is another area where we could see these sorts of models be very useful.

And here, again, you want longer sequence length. So let's look at the attention algorithm to see where the issues are in increasing sequence length.

So if you look at the operations, you have the matrix multiply to generate the attention matrix, which is followed by masking, softmax, and then dropout.

And at each step, the amount of memory that gets created is quadratic in the sequence length.
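The steps just listed can be sketched as follows. This is a plain, unfused single-head version written for illustration (the causal mask and the function name are assumptions, and dropout is left out); the point is that every step produces an N x N intermediate.

```python
import math

def naive_attention(Q, K, V):
    """Unfused attention: materializes a full N x N matrix at every step."""
    n, d = len(Q), len(Q[0])
    # Step 1: S = Q K^T / sqrt(d), an N x N intermediate.
    S = [[sum(Q[i][k] * K[j][k] for k in range(d)) / math.sqrt(d)
          for j in range(n)] for i in range(n)]
    # Step 2: causal mask (assumed here), another N x N intermediate.
    S = [[S[i][j] if j <= i else float("-inf") for j in range(n)]
         for i in range(n)]
    # Step 3: row-wise softmax, yet another N x N intermediate.
    P = []
    for row in S:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        P.append([x / z for x in e])
    # (Dropout is omitted; it would be one more N x N elementwise pass.)
    # Step 4: O = P V, which brings the result back to N x d.
    dv = len(V[0])
    O = [[sum(P[i][j] * V[j][c] for j in range(n)) for c in range(dv)]
         for i in range(n)]
    return O, n * n  # output, and the element count of each intermediate
```

Doubling the sequence length quadruples the second return value, which is the memory growth described above.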

And if you do a naive implementation, then of course, every step may not have that much computation, but it requires you to move the data off of the accelerator, typically to HBM, and back on again.

So what happens is that the naive implementation of attention is memory-bound. So you're limited by the bandwidth that you can get between, typically, the GPU and the HBM. And so if you want to improve the performance of attention and make it more scalable, then clearly you've got to figure out how to minimize the bandwidth.

And so I teach parallel computing, and I think there are others in the audience too.

And we always talk about how you optimize locality of your algorithms by applying this idea of fusion and tiling. So you tile your matrix computation, and you do fusion in order to minimize memory bandwidth.

And so the idea of flash attention is to apply those techniques to the attention algorithm.

So you take the operations of the attention algorithm, tile the matrices, and then you operate on the matrices a tile at a time, and dramatically reduce the off-chip bandwidth.
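A minimal sketch of the tiling idea, assuming a simplified variant that tiles only the keys and values and keeps a running softmax per query row. The real GPU kernels also tile the queries and manage on-chip memory explicitly, so this shows the technique, not the actual implementation. A plain reference version is included only to check the result.

```python
import math

def reference_attention(Q, K, V):
    """Plain softmax(Q K^T / sqrt(d)) V, used only to check the tiled version."""
    d = len(Q[0])
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(s)
        e = [math.exp(x - m) for x in s]
        z = sum(e)
        out.append([sum(e[j] * V[j][c] for j in range(len(K))) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, tile=2):
    """Visit K/V one tile at a time, keeping only running statistics per row."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running max of the scores seen so far
        z = 0.0             # running softmax denominator
        acc = [0.0] * dv    # running unnormalized output row
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
            m_new = max(m, max(s))
            scale = math.exp(m - m_new)   # rescales earlier partial results
            p = [math.exp(x - m_new) for x in s]
            z = z * scale + sum(p)
            acc = [a * scale + sum(p[j] * v_tile[j][c] for j in range(len(p)))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / z for a in acc])
    return out
```

The tiled version never holds a full row of the N x N score matrix, only one tile of scores plus three running values, which is what lets the intermediates stay on chip.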

And the results are pretty impressive, right? So you can get a 6 to 10X speed up in performance, ranging from the initial implementation, the flash attention in orange, to the one that's more optimized that makes better use of the Tensor Core units on the GPU.

And you get a large fraction of the total compute capability of the Tensor Cores using FlashAttention-2. And you get a dramatic decrease in the amount of memory used. So flash attention is now the predominant way that attention is implemented in all of the major models, at OpenAI, in Stable Diffusion, and everywhere else.

However, the question is, is this enough? Can we do even better? And so if we want to get to sequence lengths of hundreds of thousands or millions, then the quadratic computation requirements of the attention algorithm as currently described won't suffice.

You need to come up with an algorithm that has better asymptotic complexity, one that's sub-quadratic.

And so researchers, including Chris Ré and his students Simran and Sabri, have come up with this idea of using state-space approaches in place of attention, right?

And so they've got a number of signal processing based ideas for implementing the attention or the attention component of transformers. And you essentially replace attention with a convolution, the convolution size being the total length of the sequence.

And then you do the convolution using FFTs. And of course, now you've got something that is better than quadratic.
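A sketch of why the FFT helps, assuming a single channel and a filter as long as the sequence. The small recursive FFT is written out so the example needs only the standard library; it is not how any of the models mentioned here are implemented.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def fft_conv(u, h):
    """Causal convolution of a length-L input with a length-L filter, O(L log L)."""
    L = len(u)
    size = 1
    while size < 2 * L:   # zero-pad so circular convolution equals linear
        size *= 2
    U = fft([complex(x) for x in u] + [0j] * (size - L))
    H = fft([complex(x) for x in h] + [0j] * (size - len(h)))
    y = fft([a * b for a, b in zip(U, H)], invert=True)
    return [(v / size).real for v in y[:L]]

def direct_conv(u, h):
    """Reference O(L^2) causal convolution for comparison."""
    return [sum(h[j] * u[i - j] for j in range(i + 1)) for i in range(len(u))]
```

Both functions compute the same output; the direct version does on the order of L squared multiplies, while the FFT version does on the order of L log L.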

The problem is, of course, FFTs are not that well supported on current GPUs. But the other issue is that the algorithms fundamentally aren't as accurate as attention.

So this chart shows you accuracy as measured by perplexity, where perplexity is a measure of how predictable language is. And what it shows is that the initial ideas for using the signal-processing-based algorithms were not as good as attention. So attention's down here between 9 and 10. And the original approach, S4, was above 13. But subsequent implementations have applied different tricks. And now we're at the point where the state-space-based ideas like Hyena, Mamba, and Based are basically better than attention, right, on some applications.

And so now you can imagine using these sorts of algorithms to model DNA. And so currently, the modeling ideas are using models that have sequence lengths of half a million tokens, and we can expect over a million very soon. And these models are really, really interesting. These models can predict proteins directly from DNA, which is really quite important and dramatic.

So lastly, you know, we all know, as parallel computation people, that one of the key ways of reducing storage and compute is by taking advantage of sparsity.

And so lots of people have looked at sparsity in neural networks. And then come up with ideas for exploiting sparsity using dynamic sparsity.

The problem is that they haven't fundamentally shown that you can both get better accuracy and better performance using these sparse techniques.

And one of the issues is unstructured sparsity doesn't work that well on most hardware.

So one approach is to use block sparsity. The idea is to use Monarch matrices.

And you decompose the computation into these block-sparse matrices, interspersed with permutations.

And these techniques using Monarch matrices have been shown to actually provide high accuracy on certain applications.

And also, because they've got dense matrix blocks, they can be executed efficiently on GPUs.
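As a rough sketch of the structure described here: two block-diagonal multiplies interleaved with a fixed permutation, for a vector of length n = b * b. The function names are hypothetical, and the exact factorization used in the Monarch work may differ in its details.

```python
def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (a list of dense b x b blocks) by x."""
    b = len(blocks[0])
    y = []
    for i, blk in enumerate(blocks):
        seg = x[i * b:(i + 1) * b]
        y.extend(sum(blk[r][c] * seg[c] for c in range(b)) for r in range(b))
    return y

def permute(x, b):
    """View x as a b x b grid and transpose it (a fixed stride permutation)."""
    return [x[c * b + r] for r in range(b) for c in range(b)]

def monarch_matvec(L_blocks, R_blocks, x):
    """y = P L P R x: block-diagonal, permute, block-diagonal, permute."""
    b = len(R_blocks)
    y = block_diag_matvec(R_blocks, x)
    y = permute(y, b)
    y = block_diag_matvec(L_blocks, y)
    return permute(y, b)
```

Each factor stores b blocks of b * b values, so the pair holds 2 * n ** 1.5 parameters instead of the n ** 2 of a dense matrix, and every multiply is a small dense block.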

The last thing I'd like to say about ML algorithms is that fundamentally, of course, these algorithms are data flow, right?

So most of these algorithms are developed using domain-specific languages like TensorFlow and PyTorch.

And what you get then is a data flow graph composed of kernels, where the arrows between the kernels represent tensor data that moves between the kernels.

This is a fundamental aspect of ML models.

So what are the implications for ML systems? Summarizing what we've just talked about, so clearly you need to support very large models, terabytes of parameters, and very long sequences to support the future of foundation models. Sparsity will become important, increasingly, in order to manage the amount of compute and storage required.

Dataflow graphs are kind of a fundamental expression of the way in which these models are described. And so figuring out how to take advantage of all the information that dataflow graphs provide is going to be important. And finally, you know, the amount of computation required is going to be huge, and the only way to effectively provide this amount of computation is to make it as efficient as possible.

All right, so given these insights about ML algorithms, let's turn our attention to data flow compilers, right? We talked about data flow being the fundamental way in which these algorithms are described, right?

As I said, what you get out of a TensorFlow or PyTorch program is a data flow graph where the abstractions are at the tensor algebra level, with kernels that fall into tensor algebra.

And sometimes there are kernels that are outside of tensor algebra. So one of the things that we had looked at at Stanford many years ago was the idea of representing computational graphs like this as a graph of parallel patterns, right?

And you're all aware of parallel patterns like map, zip, reduce, flatMap, and groupBy. The idea is that you can capture both the parallelism and locality by using these patterns.

And that you can convert a graph of ML operators into a hierarchical graph of parallel patterns. And then you can use this intermediate representation as a way of thinking about how you optimize the parallelism and locality in your algorithm.

And not only can you do tensor algebra using this technique, you can also do representations like SQL, graphs, and other things like this. So not only can you use it as an intermediate representation, but you can use it as a programming model, right?

So you can do data flow programming. So here's an example of how you could express both softmax and layer norm using a sequence of parallel patterns. And then you can think about how you optimize the performance of these sorts of graphs by scheduling in space and time.
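The slide itself is not in the transcript, so as a stand-in, here is one way softmax and layer norm decompose into map and reduce patterns. This uses Python's built-in `map` and `functools.reduce` purely as an illustration; it is not the intermediate representation from the talk.

```python
import math
from functools import reduce

def softmax(xs):
    m = reduce(max, xs)                              # reduce: running maximum
    exps = list(map(lambda x: math.exp(x - m), xs))  # map: shifted exponentials
    total = reduce(lambda a, b: a + b, exps)         # reduce: sum
    return list(map(lambda e: e / total, exps))      # map: normalize

def layer_norm(xs, eps=1e-5):
    n = len(xs)
    mean = reduce(lambda a, b: a + b, xs) / n        # reduce: sum for the mean
    sq = map(lambda x: (x - mean) ** 2, xs)          # map: squared deviations
    var = reduce(lambda a, b: a + b, sq) / n         # reduce: sum for the variance
    return list(map(lambda x: (x - mean) / math.sqrt(var + eps), xs))  # map
```

Written this way, each operator is a short chain of maps and reduces, which is the form a compiler can then fuse, tile, and pipeline.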

And one of the things that you can do then is you can do automatic fusion and tiling and an optimization we call hierarchical pipelining or meta pipelining to efficiently execute these hierarchical parallel patterns.

And then once you've done that, you can lower the representation into something that looks closer to hardware. A language that we developed a few years ago and presented at PLDI 2018 is a language called Spatial, which is a representation of parallel patterns as a set of pipelines with explicit memories, right?

So this is an example of how you would do a dot product using Spatial, where you specify a nested reduce, right?

And you notice that you've got explicit memories, SRAMs and DRAMs, and that you have explicit data movement, right? So you've got explicit data movement and this parallel pattern representation. And this is very close to something that you could map to an FPGA, or you could think about other ways of executing this kind of representation.
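The Spatial code from the slide is not reproduced in the transcript. The following Python analogy (not Spatial syntax) shows the same shape: an outer reduce over tiles, an inner reduce within a tile, and an explicit copy from a large "DRAM" array into small "SRAM" buffers. All names are illustrative.

```python
def dot_product(dram_a, dram_b, tile=4):
    """Nested reduce: outer over tiles, inner over elements within a tile."""
    total = 0.0
    for start in range(0, len(dram_a), tile):   # outer reduce over tiles
        # Explicit data movement: copy one tile from "DRAM" into "SRAM" buffers.
        sram_a = dram_a[start:start + tile]
        sram_b = dram_b[start:start + tile]
        partial = 0.0
        for x, y in zip(sram_a, sram_b):        # inner reduce within the tile
            partial += x * y
        total += partial
    return total
```

In hardware the tile load, the inner reduce, and the outer accumulation can overlap as pipeline stages; the sequential loop here only shows the structure.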

So way back when, many of you know a famous Professor at Wisconsin called Jim Smith, right? And Jim Smith was a great proponent of vector architectures. And he once said, you know, if you've got a vector problem, then you should design a vector computer, right?

So we have a data flow problem. So we thought, well, we should design a data flow computer, right?

So the idea that we came up with is the notion of reconfigurable data flow architecture.

And the goal here is both to provide native execution of data flow and to reduce a lot of the overhead that you get with conventional von Neumann architectures, right?

And so the idea is that, you know, you would get rid of instruction fetching and decode and you would configure the architecture.

But of course, it's reconfigurable so that you could change the architecture to match different data flow graphs, right?

And so that was the genesis and motivation for what we developed, an architecture called Plasticine.

And we're in the UK now, so a lot of you know what Plasticine is, right? It's this kind of children's modeling clay that can be constantly reformed, but it also maintains its form and shape.

That was the notion of Plasticine. So it was designed as an architecture that could be programmed by using Spatial.

And it was composed of compute elements called pattern compute units and memory elements called pattern memory units.

And we demonstrated that you could be both much more programmable than a conventional FPGA architecture and much more performant, right? So, compared to an FPGA in the same sort of technology, it was much, much better.

So, you know, in this 2017 timeframe, when we were looking at these architectures, we had previously looked at domain-specific languages for machine learning. And so the ideas came together: "Hey, we could use this kind of idea to accelerate machine learning algorithms." And given all the interest in the Valley in this idea, we decided to form a company based on these ideas. And the company was called SambaNova Systems.

And the first incarnation of the reconfigurable data flow architecture was called the Cardinal SN10. And those of you who know about Stanford know all the athletic teams at Stanford are called Cardinal, after the color, not the bird. And so it was called the SN10.

And so the company was founded in 2017. Co-founders were Rodrigo Liang and Chris Ré. Chris Ré, of course, is a colleague at Stanford. And Rodrigo Liang was one of the early people at the first startup I did, called Afara, which developed the Niagara processor that was talked about in the introduction.

So this chip was TSMC 7 nanometers. It was a big chip for the time, 40 billion transistors, lots of interconnect to provide the connections between the different units.

It had 640 PCUs and a peak compute capability of 312 teraflops in brain float, BF16.

And it also had a bunch of other data types, and 320 megabytes of on-chip memory. If you think about how it was different from the research prototype, the research idea, of course, only had floating point.

It didn't have very sophisticated pattern memory units. And it also never got implemented.

So there was a lot of effort, hundreds of engineers, between an idea that you can present at ISCA and something that you can commercially develop, something that's commercially viable.

So if you look at the pattern compute unit, just to dive a little bit into the details, since there are some architects in the room who might want to know about this.

So if you look at it, it is, as I said, this pipelined data path, right? And in the research prototype, we didn't really have systolic capability, but it's easy to change this pipelined data path into something that can execute systolic algorithms.

And so it can work in two modes. It can either be in pipelined data path mode or in systolic mode. It supports a lot of different data types. It's got a lot of counters, right? So there's no explicit control flow in the traditional sense, right? You can't do branches on this architecture.

And so you've got a lot of counters because, of course, your focus is on executing tensor data. And so branches aren't as useful. And, of course, we all know that branches get in the way. And you also have a tail unit which does other functions that aren't supported in the data path, like exponential and sigmoid.

If you look at the pattern memory unit, which of course is feeding the data to the PCUs, it's a bunch of multi-banked SRAM arrays to give you the bandwidth. And it has a bunch of address calculation capability to support all sorts of complex addressing modes directly in hardware, and also the ability to do some indirection for supporting sparsity.

An important component is the data alignment capability, because the data alignment gives you high-throughput tensor layout operations such as transpose, right, which are really important when you want to execute these kinds of algorithms.

Okay. What is the difference between executing on a conventional GPU and executing on a data flow architecture like the RDU?

Well, traditional GPUs execute the graph one kernel at a time, and the intermediate data moves back and forth between the GPU and the HBM, whereas, of course, with data flow, you lay out the whole computation graph in space. And then you tile and metapipeline things such that you reduce bandwidth.
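A small sketch of the difference in off-chip traffic, assuming a chain of elementwise kernels and counting one unit per element read from or written to off-chip memory. The function names and the traffic model are illustrative, and sequential Python does not capture the kernels running concurrently in space.

```python
def kernel_at_a_time(x, kernels):
    """Run each kernel over the whole tensor; intermediates go off chip."""
    traffic = 0
    data = x
    for k in kernels:
        traffic += len(data)            # read the input from off-chip memory
        data = [k(v) for v in data]
        traffic += len(data)            # write the result back off chip
    return data, traffic

def fused_pipeline(x, kernels, tile=4):
    """Stream tiles through all kernels; intermediates stay in on-chip buffers."""
    traffic = 0
    out = []
    for start in range(0, len(x), tile):
        buf = x[start:start + tile]
        traffic += len(buf)             # read one input tile
        for k in kernels:
            buf = [k(v) for v in buf]   # stays on chip between kernels
        out.extend(buf)
        traffic += len(buf)             # write one output tile
    return out, traffic

kernels = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
a, traffic_a = kernel_at_a_time(list(range(8)), kernels)
b, traffic_b = fused_pipeline(list(range(8)), kernels)
```

For three kernels over eight elements, the kernel-at-a-time version moves 48 elements and the fused version moves 16, with identical results.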

And the result is more efficient execution, right? So just to give you an example, going back to this Monarch example, which we said was important for sparse execution, this shows a GEMM followed by a permute, followed by a GEMM, followed by a permute. And the permutes are transposes that get done within the PMU.

And what we see is that as you increase the sparsity, the improvement that you get by running this graph spatially increases. And what's more, the benefit of pipelining improves, because of course the amount of time you spend doing GEMM goes down relative to the amount of time you spend doing the permute.

And so overlapping and pipelining become more important. And so we see the blue bar increasing relative to the yellow bar as the sparsity increases. The other thing to note, of course, is that to support this kind of execution, you need a lot of on-chip SRAM, right?

And so the amount of SRAM you need is dramatically more than you might need if you want to execute the graph one kernel at a time.

So you can take an algorithm like Flash Attention and you can apply this data flow spatial execution operation and you get to overlap the different components of the Flash Attention algorithm and you can do this automatically in the compiler, right?

And so the compiler can do this transformation for you and all of a sudden you get very efficient execution. And you never had to write an explicitly fused kernel.

You could write the individual kernels, and then the compiler would do the tiling and fusion for you. And it works properly, it works correctly, it works easily because you've got this underlying data flow execution in your hardware.

So in order to run large models, it's not enough to have a single chip. You need multiple chips. These chips need to be connected together by a high-bandwidth network. You can support lots of memory if you connect DDR directly to these processors. And so that's what we have with the systems we call DataScale. The latest incarnation of the RDU is called the SN40L. The distinction that this chip has compared to previous RDUs is it has HBM. So formerly, all the other chips did not have HBM.

The idea was, hey, you could reduce the bandwidth enough that you could use DDR, and then you could get much higher capacity.

Of course, if you want to do very efficient inference, you do need a lot of memory bandwidth, and so, you know, the former chips were kind of focused on training.

The SN40L is focused on inference, and so you need the three-tier memory. You need on-chip SRAM, lots of on-chip SRAM, and I've argued why that's important. You need high-bandwidth HBM, and then you need high-capacity DDR. And so you look at the three tiers, and with a node composed of eight chips, you get four gigabytes of on-chip SRAM, 512 gigabytes of high-bandwidth memory, and 12 terabytes of DDR.

And this is optimized for running models, such as that 1.3-trillion-parameter model that I showed you at the beginning of the talk, which are a composition of experts.

The key thing is, of course, you can switch between the experts which live in DDR very efficiently because you've got a lot of DDR bandwidth.
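One way to picture expert switching across the memory tiers is a small cache: all experts live in a large, slower tier (standing in for DDR), and a few recently used ones are kept in a small, faster tier (standing in for HBM). The class name and the least-recently-used eviction policy are assumptions made for this sketch; the talk does not say how the actual system chooses which experts to keep resident.

```python
from collections import OrderedDict

class ExpertCache:
    """Hold recently used experts in a small fast tier backed by a large slow tier."""

    def __init__(self, ddr_experts, hbm_slots):
        self.ddr = ddr_experts          # every expert lives here
        self.hbm = OrderedDict()        # the few experts currently resident
        self.hbm_slots = hbm_slots
        self.copies_from_ddr = 0

    def get(self, name):
        if name in self.hbm:
            self.hbm.move_to_end(name)        # mark as most recently used
        else:
            if len(self.hbm) >= self.hbm_slots:
                self.hbm.popitem(last=False)  # evict the least recently used
            self.hbm[name] = self.ddr[name]   # the DDR-to-HBM copy
            self.copies_from_ddr += 1
        return self.hbm[name]

cache = ExpertCache({"a": 1, "b": 2, "c": 3}, hbm_slots=2)
for name in ["a", "b", "a", "c", "b"]:
    cache.get(name)
```

Every miss costs one copy from the slow tier, so the bandwidth of that tier sets how quickly the system can switch to an expert that is not resident.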

Compared to GPUs, you can get a 2 to 3x improvement in serving these kinds of composition-of-experts models.

So what's next for reconfigurable data flow? Well, I think the ideas have been well shown to work for very regular sorts of applications. And going forward, the interesting thing will be to see how well these ideas work for the irregular sorts of data flow that you might see in highly irregular and sparse applications such as graph analytics, data processing, and dynamic ML models.

And what we'd also like to do is to see how we can go beyond what's possible in traditional threaded architectures like GPUs, which have the issues of inter-warp divergence and intra-warp divergence, and the fact that they've got static thread launch.

And so this is the idea of data flow threads. And there are two key papers that we developed.

So the idea is, of course, that you turn control flow into data flow. And so the idea is that you go down both paths at the same time.

So Yogi Berra said, when you get to a fork in the road, you take it. So that's what this architecture does. It goes down both ways and runs them at full speed. The paper that focused on that was a paper that showed how you could use Dataflow Threads to improve the performance of a bunch of database algorithms.
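A toy illustration of turning control flow into data flow, assuming the simplest form of the idea: both sides of a branch are computed for every element, and a select picks the result. The actual Dataflow Threads design is described in the papers mentioned here and may steer data between paths differently; the function names are hypothetical.

```python
def branchy(xs):
    """Conventional control flow: each element takes exactly one path."""
    return [x * 2 if x % 2 == 0 else x + 100 for x in xs]

def dataflow(xs):
    """Dataflow style: both paths run for every element, then a select merges them."""
    pred = [x % 2 == 0 for x in xs]      # predicate stream
    then_path = [x * 2 for x in xs]      # "taken" path, always evaluated
    else_path = [x + 100 for x in xs]    # "not taken" path, always evaluated
    return [t if p else e for p, t, e in zip(pred, then_path, else_path)]
```

Both versions give the same answers; the second has no branch, so each path can be laid out as its own pipeline and kept busy.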

And the most recent paper was just presented by Alex Rucker yesterday here at HPCA.

And this is a full language compiler for Dataflow Threads. You as a programmer could think about how to develop algorithms using this idea.

And so you can imagine then that you want to maintain the efficiencies on regular applications while providing the capability of increasing the range of applications that you can address with this style of reconfigurable data flow architecture.

There's also important work to do on compilers for sparse machine learning, and working with Fredrik Kjolstad and students, we are focusing on that.

One idea is to figure out how to extract, you know, tensor expressions from a program and map them to different libraries. So this was the Mosaic paper that was presented at PLDI this year.

Another idea is the idea of a sparse abstract machine, which represents a model and compiler for sparse tensor algebra, and represents a streaming data flow. And so we presented the initial idea focusing on Tensor Algebra, but Tensor Algebra is not enough for ML applications, so you need to augment that with other operations.

And so working on that, we've developed sort of SAM++, and we're working on both new compilers for sparse abstract machine and new hardware implementations and also of course mapping these kinds of sparse data flow representations efficiently to reconfigurable data flow architectures.

So lastly, I want to talk about some of the work that we're doing to use ML to improve systems, thinking about how one uses these foundation models to improve the overall way that the system works.

And the focus here is on networking. And as many of you know, networking is an area where you need to think about how you do things at the rate at which the network is running in real time. So if you look at networking today, it's composed of both the control plane and a data plane, and the control plane runs slowly on conventional CPUs and can run complicated algorithms, but isn't close to line rate.

And the data plane is specialized hardware that can only do very simple things, can do simple heuristics, and can run at line rate at full speed.

And so the idea of Taurus, working with Muhammad Shahbaz, who's a professor at Purdue, and Tushar Swamy, is to take a data plane, which currently is kind of fast and dumb and only uses simple heuristics, and replace it with ML inference, where every packet can be inferred upon. Right? And so in order to make that work, you need to change the architecture of the data plane. Right? So in the data plane architecture currently, the compute is done by using specialized match-action tables. But match-action tables aren't much good at implementing ML algorithms or deep neural networks.

And so of course, you want to replace some of the match-action table resources with a reconfigurable data flow architecture accelerator block. And now, all of a sudden, you can run simple ML algorithms at line rate.

And so you apply some number of features to the model, and you get some decisions that you use to decide what to do with the packets: if you're doing some sort of security operation, whether to send the packet onwards or to drop it.
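The features-in, decision-out step can be sketched in a few lines. This is only an illustration that assumes a tiny logistic-regression model; the weights, bias, threshold, and the three unnamed per-packet features are all invented for the example and are not taken from Taurus.

```python
import math
from typing import Sequence

# Hypothetical parameters for three made-up per-packet features.
WEIGHTS = (0.8, -0.5, 1.2)
BIAS = -1.0
THRESHOLD = 0.5


def anomaly_score(features: Sequence[float]) -> float:
    """Logistic regression: a score in (0, 1), higher means more suspicious."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def decide(features: Sequence[float]) -> str:
    """Map the model output to a per-packet action."""
    return "drop" if anomaly_score(features) >= THRESHOLD else "forward"


print(decide((0.1, 0.9, 0.0)))  # forward
print(decide((2.0, 0.1, 1.5)))  # drop
```

In hardware the same computation would be laid out as a fixed pipeline so that every packet gets a decision, rather than being executed as a software function call.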

Right? And so as an example of why this is a good thing, here's anomaly detection for security, where the green is what happens when you try and do this kind of operation in the control plane: you can't keep up with the line rate.

But if you do it in the data plane with Taurus, then you can keep up at the rate at which the packets are coming into the network.

And what's more, then, is that the accuracy of the model is much better than software because you get to see every packet, right?

And so the number of bad packets that you detect is much higher, and your overall F1 score is better. Right? So great.
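For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the two sets of counts are hypothetical numbers chosen only to show how missing bad packets lowers recall and therefore F1, and they are not results from the talk.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: a detector that sees only a sample of the traffic
# misses most bad packets, while one that sees every packet misses few.
sampled = f1_score(true_pos=30, false_pos=10, false_neg=70)
every_packet = f1_score(true_pos=80, false_pos=20, false_neg=20)
print(f"sampled: {sampled:.2f}, every packet: {every_packet:.2f}")
```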

The problem is, of course, in order for this idea to work, the models have to be simple, which means they're not as accurate and capable as you would like.

And so the question is, could we get much higher accuracy for the model while still operating at line rate?

That is the work of Qizheng Zhang. And the idea is to use slow and fast models.

And so, if you look at the set of models that you get, at the top you've got these foundation models that we've been talking about that have hundreds of millions of parameters and are very expensive to run and could never keep up in real time at line rate. And then at the bottom, you've got heuristics, which are basically dumb and stupid.

And then you've got smaller models, which, as we've shown, could be useful, but are not going to be nearly as accurate as the foundation models. So the idea then is to use the slow models, slow accurate models, as a way of providing labels and supervision to the faster, simpler models.

And so it's got a whole environment that takes data from the incoming traffic and determines when to update the fast model, how often to do it, what data to use, and how to label the data to retrain the small model.

And so, if you look at the overall system, then you've got some idea of Taurus capability, the ability to do inference at line rate.

But these models are small and fast and not as accurate as you would like. And then periodically, you want to be able to update those models by using information from a slower, more accurate model that is using information to provide labeled data for the training of the smaller model.
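The slow-labels-fast loop can be sketched as follows. This is a toy illustration under stated assumptions, not the system described in the talk: `FastModel` is a hypothetical one-threshold classifier, `slow_model` is a stand-in for an accurate but expensive model, and the traffic is a list of made-up feature values.

```python
from typing import Callable, List


class FastModel:
    """A one-threshold classifier, cheap enough to evaluate on every packet."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, x: float) -> int:
        return int(x > self.threshold)

    def retrain(self, xs: List[float], labels: List[int]) -> None:
        # Choose the observed value that, used as a threshold, makes the
        # fewest mistakes against the labels supplied by the slow model.
        def errors(t: float) -> int:
            return sum(int(x > t) != y for x, y in zip(xs, labels))

        self.threshold = min(sorted(xs), key=errors)


def accuracy(model: FastModel, xs: List[float], labels: List[int]) -> float:
    correct = sum(model.predict(x) == y for x, y in zip(xs, labels))
    return correct / len(xs)


# Hypothetical stand-in for a slow, accurate model. Traffic has drifted, so
# its decision boundary (1.2) differs from the fast model's stale one (0.5).
slow_model: Callable[[float], int] = lambda x: int(x > 1.2)

recent_traffic = [i / 10 for i in range(20)]      # sampled feature values
labels = [slow_model(x) for x in recent_traffic]  # labels from the slow model

fast = FastModel(threshold=0.5)
before = accuracy(fast, recent_traffic, labels)
fast.retrain(recent_traffic, labels)              # the periodic update
after = accuracy(fast, recent_traffic, labels)
print(f"agreement with slow model: before={before:.2f}, after={after:.2f}")
```

The slow model never sits on the per-packet path here; it only runs offline on sampled traffic to produce labels, which is what lets the fast model stay at line rate.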

And so, the results are kind of shown here with anomaly detection, where the blue model is the fast model that doesn't get retrained, and the red is the model that gets retrained using the slow model, the slow accurate model.

And we see that the F1 score, the accuracy, is much better. So that's the idea, is that you can have this hierarchy of models that can give you the best of both worlds, right? And in system design, as we all understand it, you know, caches kind of, multi-level caches work that way, right?

Ultimately you want to be both fast and accurate, and the way to do it is by using a hierarchy. So with that, let me conclude by saying computing systems in the foundation model era will be a co-design, which is perfect for this audience.

You've got HPCA. You've got CGO. You've got PPoPP. So this is the audience that focuses on how we bring these multiple ideas and communities together to create systems that combine new ideas in ML algorithms, programming languages, and compilers and accelerated architectures into an overall system that is much better than the parts, right?

So, you know, in terms of ML algorithms, we talked about composition of experts and new attention algorithms. So this idea that attention, as currently described, is the endpoint of ML algorithms is false.

You know, algorithms will continue to morph. And so committing LLM architectures to hardware is a bad idea, because that means they can't be changed. Or they could be changed only at the great expense of making a new chip. Sparsity, as I said, is going to be an important area, and so is this idea of model hierarchy, the ability to use multiple models at the same time in order to get the benefits of smaller, faster models, right?

And we've seen speculative decoding is another idea along those same lines. In programming languages: parallel patterns and data flow compilers. Data flow is going to be an increasingly important aspect of getting high performance on these accelerators, and, of course, being able to use sparsity is also an important area.

Then lastly, hardware: we need hardware that is both flexible and very efficient. And the key ideas of reconfigurable data flow are focused on achieving those aims. But this is just one idea. There are many other ideas that people will come up with and work on.

# Chapter Summary

Chapter 1 (00:00:00,191 ~ 00:01:14,203): The introduction sets the stage for a conference, mentioning logistical details such as room assignments and the schedule for the student research competition. It also highlights the importance of sponsor support for the event's success and introduces Fernando Ferreira, who will present the keynote speaker.

Chapter 2 (00:01:15,043 ~ 00:03:59,791): Professor Kunle Olukotun is introduced as the keynote speaker, discussing his work at Stanford and his involvement with successful projects like Niagara. He is set to talk about AI advancements, specifically in software and hardware optimization for generative AI technologies like ChatGPT and Gemini.

Chapter 3 (00:04:00,211 ~ 00:09:36,950): The keynote speech delves into generative AI, illustrating its impact on society and various sectors. The talk transitions into discussing the implications of foundation models in computing, emphasizing the need for more efficient computational methods and infrastructure. It concludes with the introduction of a vertically integrated solution that combines advances in ML algorithms, compilers, and reconfigurable data flow architectures.

Chapter 4 (00:09:37,991 ~ 00:13:40,964): The discussion continues with the evolution of ML models, from monolithic to more specialized, efficient models tailored for specific tasks. This shift towards specialized models promises to improve computational efficiency and accessibility of AI capabilities across different sectors.

Chapter 5 (00:13:41,284 ~ 00:20:44,274): Attention is given to the technical aspects of ML models, such as transformers and attention mechanisms. The talk covers the challenges of scaling up models and introduces innovations like flash attention and state space methods to address these issues. Despite the improvements in efficiency and performance, the quest for more advanced and cost-effective solutions continues.

Chapter 6 (00:20:44,595 ~ 00:25:01,436): This section further discusses the advancements in attention mechanisms and the possibilities of enhancing ML models' capabilities. It also addresses the increasing importance of sparsity and dataflow graphs in managing computational demands, underlining the necessity of efficient computation in future ML developments.

Chapter 7 (00:25:02,569 ~ 00:30:51,144): The chapter focuses on data flow compilers and the concept of reconfigurable data flow architecture as a promising solution for executing ML algorithms more efficiently. The narrative illustrates the transition from theoretical frameworks to practical implementations, highlighted by the creation of SambaNova Systems and the development of the Cardinal SN10 architecture.

Chapter 8 (00:30:52,625 ~ 00:34:35,313): The narrative elaborates on the technical evolution and commercialization of reconfigurable data flow architectures, detailing the development process and the improvements made from research concepts to commercial products. It lays out the significant features of the Cardinal SN10 chip and its potential impact on ML applications.

Chapter 9 (00:34:36,453 ~ 00:38:59,897): The discussion shifts towards the execution differences between traditional GPUs and reconfigurable data flow architectures, emphasizing the advantages of the latter in ML applications, including efficiency and programmability. The chapter also mentions the continued evolution of these architectures to better accommodate ML algorithms.

Chapter 10 (00:39:00,918 ~ 00:44:47,228): This section highlights the applications and implications of reconfigurable data flow architectures in networking, introducing the concept of Taurus for ML inference at line rate within data planes. The aim is to enhance networking operations by integrating ML algorithms directly into the networking hardware.

Chapter 11 (00:44:47,508 ~ 00:52:35,314): The conclusion reflects on the advancements in computing systems tailored for foundation models. It encapsulates discussions on ML algorithms, programming languages, and hardware architecture innovations, underscoring the synergistic approach required to develop efficient and flexible next-generation computing systems. The narrative emphasizes the importance of collaboration among different disciplines to realize these technological advancements.