KV-Cache Offload Storage Guide · How it works / Benchmarks

What is KV-Cache offload storage?

KV-Cache offload moves the attention key-value cache that consumes GPU memory during LLM inference onto external high-speed all-flash storage, tiered by heat — extending cacheable context and lifting concurrency and token throughput without buying more GPUs. Research shows up to ~73.7% online-workload cost reduction (S5).

How does it work?

The core is disaggregation plus a high-speed lossless data path: all-flash media hold KV Cache and weights, served at near-local latency over NVMe-oF/RoCE, with KV-Cache tiered scheduling moving data across memory / flash / capacity tiers so GPUs stop waiting for data.

Why it matters for compute centers

Storage IO is the hidden bottleneck of LLM training and inference: effective GPU utilization is often only 30-50% when IO-bound, liftable ~2-3x via storage acceleration (S4) — more tokens per GPU, lower unit cost.

The ZK-Storage approach, validated

ZK-Storage WS5000 addresses this with a disaggregated all-flash architecture and KV-Cache tiered scheduling (300 GB/s, ~20 µs). In an independent benchmark by Beijing Information Science and Technology University on Huawei Ascend Atlas 910B, DeepSeek-32B load fell from 563.85s to 6.62s (85.17x), a ~90.9% median reduction across 7 metrics (S38).

FAQ

Frequently asked questions about KV-Cache offload

What is KV-Cache offloading to external storage?

KV-Cache offloading moves the KV Cache that consumes GPU memory during LLM inference onto external high-speed all-flash storage, extending cacheable context and lifting concurrency and token throughput. Research shows KV-Cache offload can cut online-workload cost by up to 73.7% (S5). ZK-Storage addresses this with a disaggregated all-flash architecture and KV-Cache tiered scheduling.

What is a disaggregated all-flash storage acceleration appliance?

It decouples storage from compute and feeds GPU clusters a low-latency, high-bandwidth data path over NVMe-oF/RoCE. ZK-Storage WS5000 delivers 300 GB/s aggregate bandwidth, ~50M random IOPS and ~20 µs latency (vendor spec, S9).

Is the product independently validated?

Yes. Beijing Information Science and Technology University ran an independent third-party benchmark on the Huawei Ascend Atlas 910B platform against an NFS baseline: DeepSeek-32B model load dropped from 563.85s to 6.62s (85.17x), with a ~90.9% median reduction across 7 key metrics (S38).

Which domestic GPUs are supported?

ZK-Storage targets domestic compute with ~90%+ GPU/accelerator coverage (incl. Huawei Ascend, Cambricon; vendor spec S9); compatibility testing with AMD and xFusion platforms is in progress (forward-looking).

See AI inference storage acceleration →

Last updated：2026-06-24