# PPoPP 2024 Keynote

# Full Text

He's a professor at the Massachusetts Institute of Technology. I imagine many of you are familiar with his work because he has done a great deal of research. He received the 2012 Dijkstra Prize in Distributed Computing for the introduction of software transactional memory.

More recently, he became interested in understanding the relationship between deep learning and how neural tissue computes and is part of the effort in extracting the connectivity maps of the brain, a field called connectomics.

So this effort also led him to co-found the company called Neural Magic. And I imagine today he will touch on some of the work that's happening there.

So with that, please welcome our speaker, Nir Shavit.

Hi, everybody. Can you guys hear me? Yes? Yeah. Thank you very much for having me. Yeah, I'm going to talk about sparsity and deep neural networks.

My career started in multi-core programming, and it's a thing I do, but the brains indeed do interest me.

And so I want to show you this photo. You can tell this is Mount Fuji. It takes you about a tenth of a second to do that. Some say even a hundredth of a second. So very few neurons are able to fire in that time, right?

And yet, you manage to recognize it. And you're doing that with neural tissue, with neurons and synapses in the head.

My research field now is about getting connectivity maps of neural tissue. The way we do this is very labor intensive, and computer intensive, and machine learning intensive.

So we take a little sliver of brain, we put it in a resin. It sits on top of a microtome that goes up and down, tiny slices of this thing, onto a tape so that when you get this kind of brain on tape, you take the slices and you put them on a wafer.

So each one of these little things is a piece of brain, 30 nanometers thick. And then we run it through an electron microscope.

This image is supposed to show you how short Jeff Lichtman, my colleague, is. And after we move it through the electron microscope, we have these thousands and thousands of images, which collectively actually capture a tiny sliver, a cube of brain. This is a 100-terabyte block of brain. If you want to understand what it really is like, this little thing, this is what it looks like on a pen.

This thing was sliced 10,000 times and then imaged. And then what we do, right, is we take these slices, which look something like this, right, and what this is is a slice through the brain, and we can actually use machinery to reconstruct, you know, from the slices what the brain will look like.

And so what happens if you look at this, what we're doing really is we're kind of carving a piece at a time and identifying in each piece where the neurons are. And we then use a lot of computing to reconstruct it. And so here you see a connection between neurons, a synapse.

And this is what it looks like in the actual slide panel. Now, when we finish this process, we actually get the ability to reconstruct real pieces of brain.

And this is an example of two apical dendrites from the brain of a mouse. And if you look around, so each one of these is potentially a connection to a synapse, right?

This is a synapse. And so around those dendrites, these are all the other dendrites that are sitting in this kind of near volume. And in the same volume, these are all the axons that go through this volume. And these are all the glia, the support cells that go into there.

So the whole thing, all these axons and dendrites and glia cells, they actually fill the volume. So it looks like a really, really dense thing. So I've been doing this for a while, and I want to kind of talk to you about what my conclusions are from doing this computational neurobiology.

So let's talk about ML hardware, which is why we're here. So the idea is that we're going to build computer machines that are going to be like the brains. The brain is all about blue blood. And this is going to be an enormous market. We're already feeling it.

The basic idea is I have, you know, a computing device, a GPU or a TPU, I have thousands of cores, right, and I have high bandwidth memory, typically 16 to 32 gigs of high bandwidth memory, right, and I take my model and my data. I fit them into my memory. And then I apply hundreds or thousands of tera-ops of compute onto this in order to deliver the kind of segmentation, in the case of my technology.

And so, OK. The way we're kind of supposed to think about this is that we're going to put these things together, and then we're going to have this massive compute capability: hundreds of petaflops of machine learning power.

But I want to go back and think about this a little bit from the point of view of the brain.

So if we think about compute in the brain, a human cortex is about 16 billion neurons. Okay? Typically, OK, a neuron will fire, this is from energy considerations, about 0.16 times per second, and you know, each neuron has, in a mouse or a human, typically 7,000 synapses. But 7,000 synapses doesn't mean that I connect to 7,000 other neurons. Okay? It could be 700. It could be 70. We don't really know yet. And the reason I say that is because neurons connect to one another. These are all synapses. They connect to one another multiple times. And so we don't really know if it's 7,000 or 700.

Let's just assume for a second that we're talking 700. So 16 billion times 0.16 per second times 700 is about 2 trillion operations per second.
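The estimate is simple enough to check directly. The sketch below just multiplies the figures quoted in the talk (16 billion neurons, 0.16 spikes per second, and a fan-out that could be 70, 700, or 7,000); the function name is made up for illustration.

```python
# Back-of-envelope estimate of cortical "operations per second",
# using the figures quoted in the talk.
neurons = 16e9          # neurons in a human cortex
firing_rate_hz = 0.16   # average spikes per neuron per second


def cortex_ops_per_second(fan_out):
    """One 'operation' = one spike delivered to one downstream neuron."""
    return neurons * firing_rate_hz * fan_out


# The talk says the true fan-out is unknown, so try all three candidates.
for fan_out in (70, 700, 7000):
    print(f"fan-out {fan_out:>5}: {cortex_ops_per_second(fan_out):.2e} ops/s")
```

With a fan-out of 700 this lands just under 2 trillion operations per second, and the uncertainty in fan-out moves it by only one order of magnitude either way.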

That's what your brain is doing, your cortex is doing. So this is less than an iPhone.

So in terms of compute, the cortex is many orders of magnitude less compute than what we do on our GPU pods.

What about memory size? What's the memory of our brain? Well, if I want to represent it, I have in the cortex about 300 trillion synapses.

So if I want to make a graph that just represents the connectivity, not even the weights, then that's about a petabyte.

So I need a petabyte to represent this graph. And your typical GPU has 16 to 32 gigs.

Right? So, you know, even if I put them together, I'm still three or four orders of magnitude less than that. Okay? So, if you will, to kind of combine these thoughts, right? So you know what we're building, right, is a lot of flops applied to essentially an iPhone of memory. Right? But what we need, right, what we need, is an iPhone of compute applied to a petabyte of memory.
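To make the memory gap concrete, here is a rough calculation. The synapse count and GPU sizes come from the talk; the 4 bytes per connection is an assumption chosen only because it puts the connectivity graph near the quoted petabyte.

```python
import math

synapses = 300e12      # synapses in the cortex, as quoted in the talk
bytes_per_edge = 4     # assumption: one 32-bit index per connection, no weights
graph_bytes = synapses * bytes_per_edge

gaps = {}
for gpu_gb in (16, 32):
    gpu_bytes = gpu_gb * 1e9
    # How many orders of magnitude larger the graph is than one GPU's memory.
    gaps[gpu_gb] = math.log10(graph_bytes / gpu_bytes)
    print(f"{gpu_gb} GB GPU: graph is {gaps[gpu_gb]:.1f} orders of magnitude larger")
```

Against a single device the gap is between four and five orders of magnitude, so ganging a handful of devices together still leaves the three-to-four-order shortfall mentioned above.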

We're kind of building the wrong thing if we want to mimic the brain. And why? Well, because we don't know the algorithm. So not knowing the algorithm has led us to go in this direction.

Now, what is the hope here? One thing to really understand is we talk about parallelism in the brain and some massive parallel machine and so on and so forth. But the truth is, why do we need so much parallelism in the brain? Well, if I want to do a billion instructions per second, and all I have are these very slow neurons, then I need a billion of them to work at the same time.

But on a modern processor, I can do a billion instructions per second without parallelism. So we're fixated on the parallelism, but that's not really the main point. That's not the main thing that I think we should take away from brains.

The main thing I think we should take away from brains are two things. One is that they're sparse. Their computation is very sparse, both in connectivity and in activity. And the other thing is that there's incredible locality of reference. A neuron fires into the neighboring neuron, into the neighboring neuron. So hopefully in the future, we can find a way to mimic this efficiency.

But in this talk, I would like to talk about how we apply these principles for inference on today's hardware and software.

OK. So this. This is my version of what you've probably seen a million times. It's this amazing ability that NVIDIA has shown from one generation to the next to increase the amount of flops that are delivered by their hardware.

So this is five generations of NVIDIA chips over the last six years, seven years.

Amazing. And then at the same time, right? The memory hasn't grown very much. HBM memory is hard. It doesn't scale. It's expensive. A lot of bad things you can say about it, right?

So there's this kind of gap between the, you know, we have a machine with a lot of compute and little memory, right? How little? Well, this is your desktop, right? You can have a terabyte on your desktop, right? But right now, not on your GPU.

So, okay, so we've got this gap. So why is this a problem?

Well, that's part of what I'd like to address today. So part of it is, again, a slide that you've seen multiple times.

It's because the models are huge, and you can't fit them. But it's more than that, right?

With these new generative models, the fact that they're generative is causing us a lot of trouble.

So this is where we are. And I wish we could just have SRAM scale like this, but it doesn't, right?

It's unfortunate. OK. So the near future of machine learning is going to look something like, I'm going to have CPUs at the edge. I'm going to have GPUs in the data center, probably add some GPU capability to the CPU, CPU capability to the GPU, and in fact, the next generation of chips from Nvidia and AMD is going to actually combine CPU and GPU to get the best properties of both. So that's kind of where we're headed in the near future, before we really understand what we want to build.

OK. This slide is from my class in my first lecture in my multi-core processing class.

OK, I show this picture of Kunle. Here he is with the Niagara 1, the first commercial multi-processor chip.

And I ask the students, why is Kunle smiling? And he's smiling because he doesn't have to write the software.

Now, his lecture is the actual one. I've got to get to that. For the computer architects, it's still the case that you guys, we're suffering the decisions that the hardware designers are making.

And so let me go to this hardware designer and show you. If I'm going to run a program, what are the differences between what I get from a CPU and what I get from a GPU?

A GPU typically, or a TPU for that matter, or a SambaNova, whichever one you want: small cores, lots of caches, tiny caches also per core. You've got a lot of bandwidth to memory, but you have limited memory.

The CPU, on the other hand: very large caches, faster than the GPU's, powerful cores, limited bandwidth to memory, but lots of memory.

OK, so what was the miracle of the GPU and machine learning? Well, it turns out that when you have something that looks like a layered neural network, right, I can read a layer in, compute, write it out, read, write, read, write, do that. Because of the bandwidth in the GPU, it works fantastically well. And so GPUs are awesome at executing neural networks.

If I try to do the same thing on the CPU, I'm not going to get something impressive. And the reason it's not going to be impressive, of course, is that reading and writing from the low bandwidth memory of the CPU is kind of not going to help us. And also, the GPU has a lot more flops. And since the introduction of tensor cores, a lot more flops.

So what do we do? I actually need to reduce the size of the model to sparsify it.

And I need to execute in cache. And you can do that. So you can actually kind of sparsify the models, fuse the layers, and you get really great performance.

But still, the GPU has a lot of advantages. So I'm not saying one or the other. I'm not doing the comparison yet.

But this is kind of the feel of what computation looks like on these two devices.

So those are the devices. That's the hardware. And then we have the computation that we want to do in the machine.

So the world started with computer vision. We have large inputs, and we have, you know, few parts of the computation that are memory bound. This is perfect for a GPU.

Then we have NLP, right? We get lots of compute per weight, small inputs, and some parts are memory bound, but a lot of parts are still compute bound.

Then came generative LLMs, right? Suddenly, you know, I'm doing vector-matrix multiplication, and everything is memory bound.

And in the multimodal models, this is kind of going again back into the place here.

I have both memory-bound and compute-bound parts. So keep this kind of in mind: our applications, the models that we're thinking of, are not uniform. They differ in what they need. So what we want to do in all these cases is kind of learn a little bit from what brains do. I want to actually reduce, compress the model. And one thing I can do is distill a new model instead of the one that I have. That's a lot of compute and a lot of work, but it is doable.

The jury is out on whether you can distill every model into something that is as small as what you would get if you did quantization or sparsification. Quantization: reducing the size of the weights. Sparsification: removing weights altogether, or removing neurons.
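As a minimal sketch of the two ideas, the code below prunes a small weight vector by magnitude and then quantizes what is left to signed 8-bit integers. Magnitude pruning and symmetric per-tensor scaling are assumptions picked for brevity; the talk does not say which methods were actually used, and the function names are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Sparsification: zero out the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Quantization: map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


weights = [0.82, -0.05, 0.33, -0.91, 0.02, 0.47, -0.11, 0.64]
pruned = magnitude_prune(weights, sparsity=0.5)   # half the weights become zero
q, scale = quantize_int8(pruned)                  # survivors stored in 8 bits
restored = dequantize(q, scale)
```

Pruning shrinks the number of values to store and multiply; quantization shrinks the size of each one. The two compose, which is why they are usually applied together.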

So this is the idea, right? I have an accurate neural network. I want to compress it while preserving the accuracy. Now, there's a lot to be said on the ML side. I'm going to touch a little bit on it. But I really want to focus on the parallel programming. So you have typically two types of sparsity.

And for all the stuff that I'm saying, a lot of you may know it or not. I have a neural network. I have my layers that these are essentially matrix multiplications. And then if I look at a given layer of computation at the end of it, after I get my computation of input times weights, I apply a non-linearity.

And the type of non-linearity that I apply, whether it's a ReLU or a GELU, will determine how many zeros I get, right? But we don't know if one is as accurate as the other. The jury is out.

There are lots of beautiful papers written in the very last few months about this topic. So we're just at the beginning of understanding activation sparsity. So we've got weight sparsity, reducing the number of weights here, and activation sparsity kind of getting more zeros.
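The difference between the two non-linearities is easy to see on random inputs. This sketch assumes standard-normal pre-activations, which is only a stand-in for real layer outputs, and uses the exact erf form of GELU.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def zero_fraction(values):
    return sum(v == 0.0 for v in values) / len(values)


random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_zeros = zero_fraction([relu(x) for x in pre_activations])
gelu_zeros = zero_fraction([gelu(x) for x in pre_activations])
print(f"ReLU exact zeros: {relu_zeros:.1%}, GELU exact zeros: {gelu_zeros:.1%}")
```

ReLU clamps every negative input to exactly zero, so about half of these activations can be skipped. GELU maps negative inputs to small non-zero values, so it yields essentially no exact zeros unless a threshold is applied on top.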

So, if you will, the way I want to think about this is that I have a weight matrix and, sorry, an activation matrix. Every column is an input.

So what I'm going to do is multiply them. I'm going to multiply the row by the column. And that gives me an output entry.

So kernel sparsity, weight sparsity, is going to be this matrix. And activation sparsity will be this matrix. OK, what types of sparsity do I have?

Well, if I go to do weight sparsity, not all types of weight sparsity are the same. It could be unstructured: the non-zeros could be anywhere. Structured means I force them to have a certain structure.

And typically, a structured sparsification is less accurate than an unstructured one. You can get more sparsity, in fact 2x more sparsity as a rule of thumb, by being unstructured than you can by being structured, let's say, in 4-blocks. And it's different for other kinds of structure. So this is not a proof. It's just a kind of plain rule of thumb of somebody that's been practicing this.
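A toy illustration of why structure costs accuracy: at the same sparsity level, an unstructured mask is free to keep the largest weights wherever they sit, while a block mask has to keep or drop four neighbours together. The row of weights and the magnitude-based selection below are made up for illustration and say nothing about the actual 2x rule of thumb.

```python
def unstructured_mask(row, keep):
    """Keep the `keep` largest-magnitude entries, wherever they are."""
    order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
    top = set(order[:keep])
    return [i in top for i in range(len(row))]


def block_mask(row, keep_blocks, block=4):
    """Keep whole blocks of `block` consecutive entries, ranked by total magnitude."""
    n_blocks = len(row) // block

    def score(b):
        return sum(abs(v) for v in row[b * block:(b + 1) * block])

    top = set(sorted(range(n_blocks), key=score, reverse=True)[:keep_blocks])
    return [(i // block) in top for i in range(len(row))]


def kept_magnitude(row, mask):
    return sum(abs(v) for v, m in zip(row, mask) if m)


row = [0.9, 0.1, 0.0, 0.2,  0.1, 0.8, 0.1, 0.0,
       0.7, 0.1, 0.2, 0.0,  0.1, 0.0, 0.6, 0.1]

u = unstructured_mask(row, keep=4)     # 75% sparse, non-zeros anywhere
b = block_mask(row, keep_blocks=1)     # 75% sparse, one surviving 4-block
print(kept_magnitude(row, u), kept_magnitude(row, b))
```

Both masks keep four of sixteen weights, but the unstructured one retains far more of the total weight magnitude, which is a crude proxy for how much of the layer's behaviour survives.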

So I've got my activation sparsity, I've got my weight sparsity, and I now want to kind of use this with the amount of memory so that I can actually get better performance. So three things, right?

So reduce memory footprint, allows you to go and fit onto devices on the edge, tiny devices on your phone, right? Or also fit a very large model into the data center. So that's one footprint. Another one is data movement, right?

If I make things sparse, I can reduce the amount of data that I have to bring in. And finally, I can reduce the amount of compute altogether if I have fewer weights to multiply, or fewer activations.

OK. So we know how. There's a lot of research there. And for the very large part, to sparsify a 175 billion parameter model, we have techniques that are one-shot techniques.

One of them, SparseGPT, is essentially where you combine sparsity and quantization in your compression approach. And what this will allow you to do: if you just did quantization, you can go down and down and down, and eventually you hit a limit. But with sparsity and quantization together, you can get beyond that. So, if you will, SparseGPT allows you to do a one-shot sparsity plus quantization, preserving accuracy.

So here's a graph as we go up in the billions of parameters, and we have 2:4 and 4:8 sparsities. You can see that as the model kind of gets bigger and bigger, the perplexity goes down.

So you can actually do that. And you can also combine this with quantization. What you see there is that if you have a 3-bit quantization, you can actually get impressive performance by doing 50% sparsity.

So this is what we can do. And of course, there's a lot of other technology here.

At this point, I'm going to assume that we know how to quantize and sparsify. Let's talk about the execution of these things.

How do I actually execute these models? Because uncompressing is very important.

And you have to move fast. So one way is to use a bitmask compressed format where basically I take my non-zeros, and I just keep a bitmap of where they are.

So this is really good when I'm at the 50% sparsity range. Higher up in sparsity, I might want to do what we call compressed sparse row.

I'm sure everybody here has seen compressed sparse row endlessly. In the middle range, this is really good.

But as the sparsity goes up, it becomes less impressive. And at that point, it's potentially good to use instruction streaming, which is the idea that I'm going to just encode the zeros in the instructions themselves.

So we're just going to keep pointing to the actual data and execute just the instruction. And then we're going to stream the instructions in easily. So this is kind of for high sparsities.

So I've got my bitmaps. I've got my compressed sparse row. And I've got my instruction streaming. With these compression techniques, right, I can actually get, okay, my model to really shrink and get the footprint down.
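The first two formats can be sketched in a few lines. The encoders below are generic textbook versions, not the ones used in any particular runtime, and the `storage_bits` helper assumes 8-bit values and 16-bit column indices purely to show where the crossover comes from.

```python
def to_bitmask(row):
    """Bitmask format: one presence bit per entry plus the packed non-zeros."""
    mask, values = 0, []
    for i, v in enumerate(row):
        if v != 0:
            mask |= 1 << i
            values.append(v)
    return mask, values


def from_bitmask(mask, values, length):
    """Expand a bitmask-compressed row back to dense form."""
    it = iter(values)
    return [next(it) if (mask >> i) & 1 else 0 for i in range(length)]


def to_csr(matrix):
    """Compressed sparse row: values, their column indices, and row offsets."""
    values, cols, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                cols.append(j)
        row_ptr.append(len(values))
    return values, cols, row_ptr


def storage_bits(n, nnz, value_bits=8, index_bits=16):
    """Rough size of each format for n entries with nnz non-zeros."""
    bitmask = n + nnz * value_bits             # fixed one bit per entry
    csr = nnz * (value_bits + index_bits)      # row pointers ignored
    return bitmask, csr


matrix = [[0, 3, 0, 0, 5, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 7, 0],
          [1, 0, 0, 2, 0, 0, 0, 4]]

print(to_bitmask(matrix[0]))
print(to_csr(matrix))
print(storage_bits(1000, 500), storage_bits(1000, 20))
```

The bitmask pays a fixed bit per entry no matter how sparse the data is, while compressed sparse row pays an index per non-zero. That is why the bitmask is attractive around 50% sparsity and index-based formats take over as the non-zeros thin out.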

This is an example from the Neural Magic technology, right, compressing a model from, you know, this is a Llama, what does this say?

Yeah, 7 billion Llama. And so you can take it down from 26 gigs to 2.5, so 10x reduction. And I'm still at 8 to 8 here. So I haven't kept on applying quantization. So this is kind of what you can achieve with sparsity. So reducing the memory footprint so that I can actually fit this on better devices.

I can fit three new models and so on. The second thing, right, that I might want to do is, you know, improve data movement, okay? So let's talk about that in the context of large language models, okay, generative large language models. So you've probably seen this image, many of you, right? In the prefill, I have kind of my history of what I want to compute.

So I take queries times keys times values. This is kind of what you would say is a compute-bound kind of thing. This looks like computer vision. It's really very, very, very compute-intensive.

And then, if I actually want to do the decode, I only am feeding you a vector, right?

So I multiply it by one key and one value, and I can get all the others from essentially caching them, because they haven't changed.
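A minimal single-head sketch of that decode step, under the assumption that the query, key, and value vectors have already been produced by the model's projections (which are omitted here). The class and method names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values):
    """Softmax attention of one query vector over a list of keys and values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]     # numerically stable softmax
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def prefill(self, keys, values):
        """Store keys and values for the whole prompt in one go."""
        self.keys.extend(keys)
        self.values.extend(values)

    def decode_step(self, q, k, v):
        """Add only the new token's key and value, then attend over the cache."""
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


cache = KVCache()
cache.prefill(keys=[[1.0, 0.0], [0.0, 1.0]], values=[[1.0, 2.0], [3.0, 4.0]])
out = cache.decode_step(q=[1.0, 1.0], k=[1.0, 1.0], v=[5.0, 6.0])
```

Each decode step computes one new key and one new value and reuses everything else, so the work per token is a handful of vector products over data that has to be streamed from memory.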

So at this point, it's matrix-vector computation, right? And it's memory bound. If you will, if I'm a huge generative LLM running on a GPU.

Right? Me, the GPU, right? Well, for me, this part, the pre-fill, that's what I'm built for.

All right, that's like training, okay? That's, okay, I have all the flops you need, and I'm just gonna go through this thing, okay?

But now, if you ask me to decode, right, I have a lot of flops, and I have a little memory, right?

And this is not exactly fitting with what I do best as a GPU. And so what we need to do, what we're doing, because we have these devices that have a lot of flops and little memory, is essentially we've decided to contort our whole software infrastructure to fit this.

And so what we're doing, if you will, is combining GPUs together so that we have more memory. Then we have more flops. Then we have to saturate the flops. So we're increasing the batch size.

So we increase the batch size that we have to have. We do KV caches and so on and so forth. And we're struggling. There's so many beautiful papers in these conferences that I saw that struggle with the scheduling and all the reintroduction that has to do with the fact that I.

OK, so that was my little rant about that. But that's what we have. And then if I want to talk about this decoding, and I want to talk about the effect of data movement, then this is on a CPU. This is kind of an eight-core machine. And what you can see is that quantization will take you from FP32 all the way to 4 bits, 3 bits. And as you go down the bits, you can decode more and more things faster. And that's because it's all memory bound.

And because it's mostly memory, you manage to get a beautiful effect from this. And you can see that sparsity will take you beyond that.

This is INT8 at 80%. So you can kind of keep on going beyond what you would get just from quantization.
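When decoding is memory bound, a crude upper bound on speed is memory bandwidth divided by the bytes of weights that must be streamed per token. The sketch below uses that bound; the 50 GB/s bandwidth is a hypothetical figure, not one from the talk, and the model ignores sparse-index overhead, the KV cache, and any compute limits.

```python
def decode_tokens_per_second(params, bits_per_weight, sparsity, bandwidth_bytes_per_s):
    """Upper bound when every token must stream all remaining weights once."""
    model_bytes = params * (1.0 - sparsity) * bits_per_weight / 8
    return bandwidth_bytes_per_s / model_bytes


PARAMS = 7e9        # a 7-billion-parameter model
BANDWIDTH = 50e9    # hypothetical memory bandwidth in bytes per second

configs = {
    "FP32 dense": (32, 0.0),
    "INT8 dense": (8, 0.0),
    "4-bit dense": (4, 0.0),
    "INT8, 80% sparse": (8, 0.8),
}
rates = {name: decode_tokens_per_second(PARAMS, bits, sparsity, BANDWIDTH)
         for name, (bits, sparsity) in configs.items()}
for name, rate in rates.items():
    print(f"{name:>18}: {rate:5.1f} tokens/s")
```

Under this model, every halving of the bytes per weight doubles the token rate, and removing 80% of the weights multiplies it by five on top of whatever quantization already gave.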

OK. So that's in terms of reducing traffic by compressing the model. We can also do activation compression.

So in activation-sparse decoding, what I'm going to do is this. I have this huge weight matrix here. And I said I'm going to multiply this by this and get that result. But if what I have here is a lot of 0's, I don't need to bring in all of the weights that are going to be multiplied by 0.

So instead, what I can do is I can just bring in, you know, the specific columns that go with the weights, I'm sorry, with the inputs that are non-zero, right? Do this computation and get my result. So if I can actually look at my input and see where the zeros are, I can mask these guys out and not bring in the whole weight matrix every time. That will give me great performance. So activation sparsity, and there are many beautiful algorithms that people are inventing in these very days, it's a very fresh kind of thing that people are doing today, is a way of actually reducing memory traffic.
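In code, the idea is to walk the input vector and touch a weight column only when its input is non-zero. This sketch stores the weight matrix column by column, which is an assumption made so that "bringing in a column" is a single list access.

```python
def sparse_input_matvec(weight_columns, x):
    """Compute y = W @ x, reading only columns of W whose input is non-zero.

    weight_columns[j] is column j of W. Returns (y, columns_touched).
    """
    rows = len(weight_columns[0])
    y = [0.0] * rows
    touched = 0
    for j, xj in enumerate(x):
        if xj == 0.0:
            continue                   # this column is never loaded
        touched += 1
        column = weight_columns[j]
        for i in range(rows):
            y[i] += column[i] * xj
    return y, touched


# A 3x4 weight matrix stored as four columns of length three.
W_columns = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0],
             [7.0, 8.0, 9.0],
             [10.0, 11.0, 12.0]]
x = [0.0, 2.0, 0.0, 1.0]               # half of the activations are zero

y, touched = sparse_input_matvec(W_columns, x)
print(y, touched)
```

With half the inputs at zero, only two of the four columns are read, so the memory traffic for the weights is cut in half without changing the result.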

We're going to win by reducing memory traffic. Another way to reduce memory traffic is I take my model, it's compressed. Let's say I use a, you know, I can bring it into my GPU. And I then open it up and do a dense computation.

So I'm not trying to, I have enough flops that I might need to do that. I'm not trying to benefit by reducing the flops. I just want to reduce that data movement. So on the object, object to object, and then that.

And that will give you, you know, more, okay. Now, as you know, again, this is a, so, you know, so we talk about how data movement, right, reducing data movement is going to help me, but of course, it has its limitations.

I'm going to have to increase the batch size. And what you can see here is that if I'm batch size 1, a single user on a CPU, I can get 3x improvement by cutting down the amount of data that I move.

As the batch size increases and things become more compute bound, then I win less. So here I won't be winning 2x.

So there's a limit. If I'm doing batch computation in a direction which we're going in, then I win less by actually reducing the data movement traffic.

And last but not least, it's about flops. So sparsity allows us to reduce flops.

And so let's talk about that. Let's see what that does. So I'm waiting to see a real tensor core that does really get benefit from high level of sparsity.

People are working on this. I hope that soon you will see something meaningful. But we're still not there. So tensor cores are still delivering dense computation well, but it's hard to get them to deliver performance from sparsity.

There are a lot of beautiful attempts, but it's still hard. And I would say that there are all kinds of ways of permuting things and so on. It gets you something, but it's not as if I designed this thing for sparsity.

But I do have, on almost every CPU that you will see today, vector operations. And so we can use the vector operations. And it turns out that vector operations are really very useful for doing this kind of sparse computation.

So what you can do, if you have your calculation of the weight. So what I want to do, let me just say, this is the calculation that supposedly we're supposed to do. Take the dot product of the activations and the weights, get the output of the neuron, and put it into the output.

So what I'd like to do is kind of walk you through a detailed evolution of how we've gotten the algorithm to do this. So let's just do that. That'll be kind of the final thing that I talk about. It's a little technical, but I think it will be interesting.

So this is what I want to compute. But of course, with my vector operations, what I'm really going to do is for every sparse weight that I have here, I'm going to, you know, take a piece of the activation matrix and I'm going to compute a piece of the output.

I'm going to essentially do this by taking my weight, broadcasting it to be the same width as my output, then multiply, you know, this broadcast thing times all these input values, and that I'm going to then add into C, right?

So broadcast of A times B plus C, that's what I'm going to do. And that can be done really efficiently.
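Written out, the scheme looks like the sketch below: for each non-zero weight in a row of the sparse matrix A, scale the matching row of the dense activation matrix B and add it into an accumulator for that row of C. Plain Python lists stand in for vector registers here, so this shows the data flow rather than the performance.

```python
def sparse_dense_matmul(sparse_rows, B, n_cols):
    """C = A @ B where A is given row by row as lists of (k, a_ik) non-zeros."""
    C = []
    for nonzeros in sparse_rows:
        acc = [0.0] * n_cols                   # plays the role of the C register
        for k, a in nonzeros:
            b_row = B[k]                       # the slice of activations for channel k
            # broadcast(a) * B[k, :] + acc, one lane per output column
            acc = [a * b + c for b, c in zip(b_row, acc)]
        C.append(acc)
    return C


# A = [[0, 2, 0],
#      [1, 0, 3]] stored as its non-zeros only.
A_sparse = [[(1, 2.0)],
            [(0, 1.0), (2, 3.0)]]
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

C = sparse_dense_matmul(A_sparse, B, n_cols=2)
print(C)
```

The inner loop never sees a zero weight, and every lane of the accumulator is updated by the same multiply-add, which is what makes the pattern a good fit for vector units.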

Now, okay, that's one thing we need to notice. Another thing to kind of understand is that we've got to use tricks equivalent to the ones that we use in dense matrix multiply, which is, you know, in the input matrix, right, we want to take a nice, long chunk of it, right, so that wherever we hit, right, wherever we hit values, you know, in this unstructured, sparse matrix, I'll find them here.

In my cache, in my registers. I'll find the values of B in my register. So in this case, I go and I execute this row.

And then after that, I will execute the next row. And every time, what I kind of win from is that when I go and run through this column, I'm hitting this specific row, everybody's reading from the same row, in a register, and then I get parallelism here because they're writing to different places. Each one of these is landing in a different place in C, and so they're not competing.

Now, if I have structured sparsity, let's say 4-blocks, then if I'm streaming the instructions in, I can stream an instruction, but then if I'm in a 4-block, I can do all of the entries of the block. I can do all of them essentially with the same instruction, just adding one to the index. If I run the red guys, then they land in one, two, and six. And now if I just use the same instruction and I do a for loop on this kind of frontier, if you will, then the next one will be 2, 3, and 7.

So it's just adding 1 to the index of where my data is. So I can essentially bring the instruction in and do all four of these locations at the same time. Now, this I would do on a machine that didn't have VNNI instruction, but most processors today, most general-purpose processors have vector operations, VNNI vector operations.

And so in a VNNI vector operation, I actually have, let's say, 8 or 16-bit words. What I can do is I can take my block and multiply it separately by each one of these columns here. And then that will accumulate into my register C. So this looks something like, I take this, I broadcast it out. Then each one of these is multiplied by one of these kind of four values, each of them 8 bits right here. And then it's all accumulated into the specific locations. So each one of these is multiplied this by this, and then accumulated into the proper places.

So that gives you really great performance. But of course, we said, "OK, if you're going to do 4-block, you're going to lose, and that's going to cost you 2x." That's a lot. 2x. That's one CPU instead of two. OK? So not good. And so instead, you want to avoid doing structured sparsity. And the jury is out. I don't want to say I'm against structured sparsity. I'm not. OK? Then I will really be able to deliver higher levels of sparsity.

So OK, so what do I do? So if it's unstructured, I can still get away with it. And the way I get away with unstructured is the following.

I'm still going to use my VNNI. What I'm going to do is I'm going to take each one.

I'm going to go along my row. I'm going to find the places where I have non-zeros, even though they might be far away from one another. Each one of these corresponds to a channel here in the input.

And what I'll do is this. I'll broadcast. I'm just doing it for two, if you can think of doing it for three.

So I'll take these two values, 16 bits each, let's say, and I will broadcast them. Then I'll multiply them.

I'll take each one of these channels and, using standard interleaving instructions that are available on the multi-force level, I will interleave these two channels.

So I'm getting an interleaving where each one of these two guys is getting one from here and one from here, and being multiplied by each other.

And so on and so forth. So this and this multiplied by the first one, and the second one, and so on and so forth.

And yes, the interleaving has a cost, but as you'll see, this cost is not that large, and therefore I get really beautiful performance in this kind of thing.

And I do get unstructured sparsity. So I can actually execute an unstructured computation.
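The interleaving trick can be emulated in a few lines. The sketch assumes a pairwise multiply-accumulate of the kind VNNI-style instructions provide (two products summed into each output lane); `pairwise_fma` is a Python stand-in for that behaviour, not a real intrinsic, and the channel numbers are arbitrary.

```python
def interleave(row_a, row_b):
    """Pair up two activation channels lane by lane: [(a0, b0), (a1, b1), ...]."""
    return list(zip(row_a, row_b))


def pairwise_fma(acc, weight_pair, lanes):
    """Emulated pairwise multiply-accumulate: acc[j] += w0 * x0 + w1 * x1."""
    w0, w1 = weight_pair
    return [c + w0 * x0 + w1 * x1 for c, (x0, x1) in zip(acc, lanes)]


# Two non-zero weights in one row of the sparse matrix, far apart (channels 1 and 5).
weights = {1: 2, 5: 3}
# Only the activation channels that those weights touch.
B = {1: [1, 2, 3, 4],
     5: [10, 20, 30, 40]}

(k0, w0), (k1, w1) = sorted(weights.items())
lanes = interleave(B[k0], B[k1])        # the extra shuffle that has a cost
acc = pairwise_fma([0, 0, 0, 0], (w0, w1), lanes)
print(acc)
```

The two weights do not need to be adjacent in the matrix: the interleave step brings their activation channels side by side so that one pairwise instruction handles both, which is how unstructured non-zeros can still use hardware built for packed products.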

And here's what you get in terms of performance. This is on a CPU with sparsity.

This is non-generative AI. ResNet-50, which used to be the benchmark. So ResNet-50, this is what an A100 will do, executing max.

And you can see that on a Genoa dual socket machine, you can be competitive with that.

I think, here's another one that's kind of a little bit shocking, and that is that, you know, in terms of latency, right, this is on BERT, okay, I can make a 4-core Intel MacBook do what the A100 does, okay, when we're talking just sheer latency. So sparsity, really cutting down computation when I'm talking about non-generative AI, is really awesome.

So let's talk about generative AI. The story here is very fresh and young. So I, I, you know, I put together some slides and some numbers.

We're at the very beginning of this. So, so it's not to, you know, barely be here on the site.

I myself don't know exactly how to think about it, but I will say this. The blue is the DeepSparse INT8 model, 90% sparse. You have four A10s and an A100.

And I'm looking at different batch sizes. And the reason I'm looking at different batch sizes, we said this before, is because GPUs have so much compute available. We want to increase the batch size because that allows us to utilize their compute.

And so what we get is this kind of thing. A small company would probably have around, you know, the average is five, six, seven users per second. And then, you know, the big guys have these massive batches, maybe even larger than 64, maybe 120 or something. Right?

And you can see, right, that you know, the GPUs do really well when the batch size increases, but not so great when we're in a small batch size, right? So in fact, I'm sorry, looking at this, right, tokens per user, right, so there's a lot of throughput from the GPU, but still what we're kind of seeing here, this, by the way, this is an 80%. This thing is an 8-core CPU.

For a single user at the edge, your 8-core CPU is good enough. It's going to deliver you great performance. As you go deeper and deeper into the data center, into the massive compute, then you need other solutions.

Right? There's an issue also here of what you are paying for your CPU and what you're paying for your GPU. Maybe you have your CPU. You're not paying anything for it. Maybe you're taking it in a cloud. What you can see is that if what I want to generate is about 15 tokens per second per user, then the CPU offers a really great price point, even in the data center. But the moment I go to 30 tokens per second, then I need the GPU power. And essentially, the CPU becomes much less attractive.

This is an 8-core CPU. You can increase it to a 16-core, a 24-core, and so on and so forth. But I'm just trying to capture the idea of a very tiny, weak CPU device. But in the end, it still makes sense when I'm at large batch sizes to use all the computation that is available to me from that.

So, to put it all together. I tried to show you that we don't yet have the architecture that is going to allow us to do brain-like things, to be as efficient as birds. But sparsity is a kind of tool that we can use today to push the frontier and get us a little bit closer.

And there's a lot of research going on right now. I'm assuming this is going to keep on going, because it's a really beautiful, important direction.

# Chapter Summary

Chapter 1 (00:00:07,810~ 00:01:26,080): An introduction to the speaker, Nir Shabi, a professor at MIT renowned for his research in distributed computing and, more recently, for his work on how neural tissue computes. He is part of the effort to extract connectivity maps of the brain, a field called connectomics, which also led him to co-found the company Neuromagic.

Chapter 2 (00:01:26,120~ 00:02:41,982): Shabi discusses his interest in sparsity within deep neural networks and how the brain's efficiency in recognition tasks, such as identifying Mount Fuji rapidly, has inspired his research in neural tissue connectivity mapping.

Chapter 3 (00:02:42,922~ 00:05:14,883): The process of creating connectivity maps of neural tissue through meticulous laboratory procedures involving slicing brain tissue, imaging with electron microscopes, and computer-intensive reconstruction to understand neuron connections and brain structure.

Chapter 4 (00:05:14,963~ 00:07:10,637): Shabi explains the intricate structure of the brain revealed through connectomics, highlighting the dense network of neurons, synapses, and glial cells. He connects this complexity to the development of computational models aimed at emulating brain function, touching on the hardware and memory requirements.

Chapter 5 (00:07:11,077~ 00:08:39,442): A comparison between brain compute capabilities and modern GPUs, with emphasis on the disparity in memory size and computational power. Shabi outlines the efficiency of the human cortex and its lower computational needs compared to GPUs.

Chapter 6 (00:08:44,712~ 00:11:16,629): Further examination of the computational differences between the human brain and GPUs. The discussion shifts towards the advantages of sparsity in computation within the brain and its potential application in hardware and software inference.

Chapter 7 (00:11:16,869~ 00:14:30,903): Exploring the principles of sparsity and locality in neural computation, including potential implications for future hardware development. Shabi highlights the limitations of current technology in achieving brain-like efficiency and the necessity of novel approaches.

Chapter 8 (00:14:31,203~ 00:16:12,751): An insight into the challenges faced by computer architects due to decisions made by hardware designers. The discussion contrasts CPU and GPU capabilities, particularly regarding memory and computational power.

Chapter 9 (00:16:15,600~ 00:17:44,005): The importance of sparsity in reducing model size for efficient execution in various computing environments, from edge devices to data centers. Shabi emphasizes the need to adapt models for better performance on current hardware.

Chapter 10 (00:17:44,205~ 00:20:35,269): The evolution of machine learning models and the challenges posed by large, generative models due to their memory-bound nature. An argument for learning from brain efficiency to improve model performance and reduce computational load.

Chapter 11 (00:20:37,431~ 00:25:28,665): An in-depth examination of different types of sparsity in machine learning models, including weight sparsity and activation sparsity. Shabi discusses the potential for combining sparsity with quantization for enhancing model performance.

Chapter 12 (00:25:30,906~ 00:29:48,590): Outlining various techniques and technologies for implementing sparsity in neural networks, including compressed sparse row and bit mask formats. The goal is to reduce memory footprint and enhance execution efficiency.

Chapter 13 (00:29:49,822~ 00:33:06,214): The significance of reducing data movement in large language models through activation compression and model compression strategies. Shabi discusses the challenges and solutions for optimizing data movement in computational models.

Chapter 14 (00:33:07,014~ 00:37:28,483): Advanced strategies for optimizing neural network execution through sparsity, focusing on activation sparsity and memory traffic reduction methods. Challenges in leveraging tensor cores for sparse computations are addressed.

Chapter 15 (00:37:28,523~ 00:44:31,029): A technical overview of algorithmic optimizations for sparse matrix operations on CPUs, exploring vector operations and structured versus unstructured sparsity. Shabi critiques the limitations and potentials of structured sparsity implementations.

Chapter 16 (00:44:32,089~ 00:52:06,255): A summary of performance gains achieved through sparsity in generative AI models, comparing CPU and GPU efficiency across different batch sizes. Shabi concludes with reflections on the current state of computational architecture and the ongoing quest for brain-like computational efficiency.