Quick version for the non-engineers.
When an AI model generates text, it keeps a running memory of the conversation called the KV cache. The longer the conversation, the bigger the cache, and the more GPU memory it eats.
For a large model serving 512 users at once, that cache alone can consume 512 GB of memory. Four times more than the model itself.
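The arithmetic behind numbers like that is simple to sketch. The model dimensions below are illustrative assumptions, not the configuration of any specific model, but they show how per-token cache cost multiplies out across layers, context length, and concurrent users:

```python
# Back-of-the-envelope KV cache sizing. All dimensions are hypothetical.
n_layers = 80        # transformer layers
n_kv_heads = 8       # key/value heads (grouped-query attention)
head_dim = 128       # dimension per head
context_len = 4096   # tokens kept in the cache per user
n_users = 512        # concurrent conversations

# Each token stores one key and one value vector per layer.
values_per_token = 2 * n_layers * n_kv_heads * head_dim

bytes_fp16 = values_per_token * context_len * n_users * 2      # 16 bits = 2 bytes
bytes_3bit = values_per_token * context_len * n_users * 3 / 8  # 3 bits per value

gib = 1024 ** 3
print(f"fp16 cache:  {bytes_fp16 / gib:.0f} GiB")
print(f"3-bit cache: {bytes_3bit / gib:.0f} GiB")
print(f"compression: {bytes_fp16 / bytes_3bit:.1f}x")
```

With these assumed numbers the fp16 cache lands in the same hundreds-of-gigabytes range as the figure above, and the 3-bit version shrinks it by 16/3, about 5.3x.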
TurboQuant compresses that cache to 3 bits per value, down from 16. Near-zero accuracy loss. Up to 8x speedup on NVIDIA H100s. No retraining required.
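To make "3 bits per value" concrete, here is a plain uniform quantizer with one scale per group of values. This is not TurboQuant's actual algorithm (which adds random rotations and distortion-optimal codebooks to keep accuracy loss near zero); it is only a minimal sketch of what storing a float in 3 bits means:

```python
# A plain 3-bit uniform quantizer, one (offset, scale) pair per group.
# Illustrative only; NOT TurboQuant's algorithm.
def quantize_3bit(values, group_size=8):
    """Map each float to an integer code in [0, 7] plus per-group metadata."""
    groups = []
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 7 or 1.0  # 2**3 gives 8 levels: codes 0..7
        codes = [round((v - lo) / scale) for v in group]
        groups.append((lo, scale, codes))
    return groups

def dequantize_3bit(groups):
    """Reconstruct approximate floats from codes and per-group metadata."""
    return [lo + c * scale for lo, scale, codes in groups for c in codes]

vals = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.05, 0.29]
restored = dequantize_3bit(quantize_3bit(vals))
max_err = max(abs(a - b) for a, b in zip(vals, restored))
print(f"max reconstruction error: {max_err:.3f}")
```

Each group of eight floats collapses to eight 3-bit codes plus a small amount of metadata, and the rounding error stays bounded by half the group's scale. The engineering work in a real scheme is keeping that error from compounding across a full attention computation.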
That's why memory stocks got hammered. If AI needs roughly 5x less working memory (16 bits down to 3), maybe the market doesn't need as much HBM as everyone assumed.
Clean logic. But incomplete.