What's the deal with LlamaCPP and caching?

j4k3@lemmy.world · 9 months ago

What's the deal with LlamaCPP and caching?

j4k3@lemmy.world · 9 months ago

GPUs are improving in architecture to a small extent across generations, but that is limited in its relevance to AI stuff. Most GPUs are not made primarily for AI.

Here is the fundamental compute architecture in a nutshell impromptu class… The CPU on a fundamental level is almost like an old personal computer from the early days of the microprocessor in every core. It is kinda like an Apple II’s 6502 in every core. The whole multi core structure is like a bunch of those Apple II’s working together in a way. If you’ve ever seen the mother boards for one of these computers or had any interest in building bread board computers, there are a lot of other chips that are needed to support the actual microprocessor. Almost all of these chips that were needed in the past are still needed and are indeed present inside the CPU. This has made computers much more simple as far as building complete computers.

You may have seen the classic ad (now ancient meme) where Bill Gates says computers will never need more than 64Kb of memory. This has to do with how many bits of memory can be directly addressed by the CPU. The spirit of this problem, how much memory can be directly addressed by the processor is still around today. This problem is one reason why system memory is slow compared to on-die caches. The physical distance plays a big role, but each processor is still limited in how much memory it can address directly. The solution is really quite simple. If you only have let’s say 4 bits to address memory locations in binary 0000, then you are able to count to 15 (1111b = 15) and can access the bits stored in those 15 locations. This is a physical hardware input/output thing where there are physical wires coming out from the die. The solution to get more physical storage space is simply to create a way to use the last memory location as an additional flag register that tells you what additional things need to be done to access more memory. So if the location at 1111b is a register, and that register has 4 bits, we lost an addressable memory location so we only have 14 available locations in directly addressable memory, but if we look at the contents of memory location 1111b and then use that to engage some external circuitry that will hold this bit state, (so like 0001b is detected, and external circuits are used to hold that extra 1 bit high), now we effectively have 0000 & 0000 (-1) addressable memory locations available to us. But with the major caveat that we have to do a bunch of extra work to access those additional bits. The earliest personal computers with processors like the 6502, manually created this kind of memory extension on the circuit board. Later computers of the next few generations used a more powerful memory control chip that handled all of the extra bits that the CPU could not directly address without it taking so much CPU time to manage the memory and it started to allow other peripherals to store stuff directly in memory without involving the CPU. To this day, the fundamental way memory is accessed is done the same way with modern computers. The processor has a limited amount of address space it can access and a peripheral memory controller tries to make the block of memory the processor sees as relevant as possible as fast as it can. This is a problem when you want to do something all at once that is much larger than this addressing structure can throughput.

So why not just add more addressing pins? Speed and power are the main issues. When you start getting all bits set high, it uses a lot of power and it starts to impact the die in terms of heat and electrical properties (this is as far as my hobbyist understanding takes me comfortably).

This is where we get to the GPU. A GPU basically doesn’t have a memory controller like a CPU. A GPU is very limited in other ways as far as instruction architecture and overall speed. However, a GPU combines memory directly with compute hardware. This means the memory size is directly correlated with the compute hardware. These are some of the largest chunks of silicon you can buy and they are produced on cutting edge fab nodes from the foundries. It isn’t market gatekeeping like it may seem at first. Things like his Nvidia sells a 3080 and 3080Ti as 8 and 16 GBV is just garbage marketing idiots ruling the consumer world. In reality the 16GBV version is twice the silicon of the 8GBV.

The main bottle neck for addressing space, as previously mentioned, is the L2 to L1 bus width and speed. That is hard info to come across.

The AVX instructions were specifically created for AI type workloads. Llama.cpp supports several of these instructions. This is ISA or instruction set architecture, aka assembly language. It means this can work much more quickly when a single instruction call can do a complex task. In this case AVX512 is a single instruction that is supposed to load 512 bits from memory all at one time. In practice, it seems most implementations may do two loads of 256 bits with one instruction, but my only familiarity with this if from reading a blog post a couple of months ago about benchmarks and AVX512. This instruction set is really only available on enterprise (server class) hardware or in other words a true workstation (tower with a server like motherboard and enterprise level CPU and memory.

I can’t say how much this hardware can or can not do. I only know about what I have tried. Indeed, no one I have heard of is marketing their server CPU hardware with AVX512 as a way to run AI in the cloud. This may be due to power efficiency, or it may just be impractical.

The 24GBV consumer level cards are the largest practically available. The lowest enterprise level card I know of is the A6000 at 48GBV. That will set you back around $3K used and in dubious condition. You can get two new 24GBV consumer cards for that much. If you look at the enterprise gold standard of A/H100’s your going to spend $15K for 100GBV. With consumer cards and $15k, if you could find a tower that cost 1k and could fit 2 cards in each you could get 4 comps, 8 GPUs, and have 192GBV. I think the only reason for the enterprise cards is for major training of large models with massive datasets.

The reason I think a workstation setup is maybe a good deal for larger models is simply the ability to load large models into memory at a ~$2k price point. I am curious if I could do training for a 70B and a setup like this.

A laptop with my setup is rather ridiculous. The battery life is a joke with the GPU running. Like I can’t use it for 1 hour with AI on the battery. If I want to train a LoRA, I have to put it in front of a window AC unit that is turned up to max cool and leave it there for hours. Almost everything AI is already setup to run on a server/web browser. I like the laptop because I’m disabled with a bedside stand that makes a laptop ergonomic for me. Even with this limitation, a dedicated AI desktop would have been better.

As far as I can tell, running AI on the CPU does not need super fast clock speeds it needs more data bus width. This means more cores are better, but not just consumer cores nonsense with a single system memory bus channel.

Hope this helps with the fundamentals outside if the consumer marketing BS.

I would not expect anything black Friday related to be relevant IMO.

webghost0101@sopuli.xyz · 9 months ago

Are you secretly buildzoid from actual hardcore overclocking?

I feel like i mentally leveled up just from reading that! I am not sure how to apply all of it to my desktop upgrade plans but being a life long learning you just pushed me a lot closer to one day fully understanding how computers compute.

I really enjoyed reading it. <3

j4k3@lemmy.world · edit-2 9 months ago

Thanks I never know if I am totally wasting my time with this kind of thing. Feel free to ask questions or talk any time. I got into Arduino and breadboard computer stuff after a broken neck and back 10 years ago. I figured it was something to waste time on while recovering and the interest kinda stuck. I don’t know a ton but I’m dumb and can usually over explain anything I think I know.

As far as compute, learn about the arithmetic logic unit (ALU). That is where the magic happens as far as the fundamentals are concerned. Almost everything else is just registers (aka memory), and these are just arbitrarily assigned to tasks. Like one is holding the next location in running software (program counter), others are for flags with special meaning like interrupts for hardware or software that mean special things if bits are high or low. Ultimately everything getting moved around is just arbitrary meaning applied to memory locations built into the processor. The magic is in the ALU because it is the one place where “stuff” happens like math, comparisons of register values, logic; the fun stuff is all in the ALU.

Ben Eater’s YT stuff is priceless for his exploration of how computers really work at this level.