Does anyone understand this claim from the press release?
> M4 Max supports up to 128GB of fast unified memory and up to 546GB/s of memory bandwidth, which is 4x the bandwidth of the latest AI PC chip. This allows developers to easily interact with large language models that have nearly 200 billion parameters.
Having more memory bandwidth doesn't directly help you run larger LLMs. A 200B-parameter model needs at least 200GB of RAM even after it has been quantized down from its original precision (e.g. "bf16") to "q8" (8 bits per parameter), and these laptops don't have the 200GB of RAM required to run inference on that quantized version.
How can you "easily interact with" 200GB of data, in real-time, on a machine with 128GB of memory??
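For anyone who wants to check the arithmetic, here is a rough sketch of the weight footprint at each precision. It counts weights only (no KV cache, activations, or OS overhead), treats 1GB as 10^9 bytes, and the function name is just something I made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 200e9   # "nearly 200 billion parameters" from the press release
RAM_GB = 128       # maximum unified memory on M4 Max

for name, bits in [("bf16", 16), ("q8", 8), ("q4", 4)]:
    gb = weight_footprint_gb(N_PARAMS, bits)
    verdict = "fits" if gb <= RAM_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in {RAM_GB} GB")
```

Under these assumptions bf16 comes to about 400GB and q8 to about 200GB, so only something around 4 bits per parameter gets under 128GB, and that is before leaving any room for context.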
Wouldn't it be incredibly misleading to say you can interact with an LLM, when they really mean that you can lossy-compress it to like 25% size where it becomes way less useful and then interact with that?
(Isn't that kind of like saying you can do real-time 4k encoding when you actually mean it can do real-time 720p encoding and then interpolate the missing pixels?)
Yes, the size is much reduced, and quality does drop as a result, but it isn't as bad as you're implying. Just a few days ago Meta released q4 versions of their Llama models. Quantization is an active research topic.
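As for where the bandwidth number comes in, here is a back-of-the-envelope sketch. It assumes a dense model whose weights are all read from memory once per generated token, and it ignores compute, KV cache traffic, and any real-world inefficiency, so treat the result as an optimistic ceiling rather than a benchmark. The function name is hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token requires one full pass over the weights."""
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 546      # peak memory bandwidth quoted in the press release
Q4_WEIGHTS_GB = 100       # 200B parameters at 4 bits each, weights only

ceiling = max_tokens_per_second(BANDWIDTH_GB_S, Q4_WEIGHTS_GB)
print(f"q4 200B model: at most ~{ceiling:.1f} tokens/s")
```

With those inputs the ceiling is a little over 5 tokens per second for a q4 200B model, which is about the most generous reading of "easily interact with" I can come up with.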