Today's Paper Review - DeepSeek-V4
Agentic workflows and test-time scaling are forcing LLMs to process massive contexts. The problem? Standard attention scales quadratically with sequence length, making ultra-long context reasoning incredibly expensive and memory-intensive.
DeepSeek tackles this bottleneck head-on with DeepSeek-V4.
Instead of forcing the model to attend to every single token, DeepSeek-V4 compresses the vast majority of the context and focuses heavy computation only where it matters most.
Why It Matters
If million-token context windows become highly efficient and practical, it fundamentally changes what AI agents can do.
DeepSeek-V4 represents meaningful progress in this direction, drastically lowering the inference barrier for ultra-long context reasoning.
DeepSeek-V4 High-Level Architecture
DeepSeek-V4 replaces full attention with an interleaved system of specialized attention mechanisms, supported by a powerful residual stream:
- Manifold-Constrained Hyper-Connections (mHC): Widens the residual stream to give it more expressive power. The representation expands into a larger dimension in the stream and compresses back down when passing through individual Transformer layers to keep compute low. Our full mHC review.
- Heavily Compressed Attention (HCA): This module compresses groups of 128 tokens into a single entry using a learned token-level compressor component. To prevent losing local context, this global summary is concatenated with a sliding window of the most recent tokens.
- Compressed Sparse Attention (CSA): A more granular approach that compresses tokens in blocks of 4, and then keeps only the most important compressed entries. These are combined with the last 128 uncompressed tokens for local context.
- CSA picks the most important entries using an indexer attention component, which runs attention over a lower-dimensional version of the compressed entries.
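To get a feel for what this buys at long context, here is a rough back-of-the-envelope sketch in Python that counts how many key/value entries a single query attends to under each scheme. The group size of 128 (HCA), the block size of 4 (CSA), and the 128-token local window (CSA) come from the description above. The HCA window size, the `top_k` budget, the choice to compress only complete groups, and all function names are illustrative assumptions, not values or APIs from the paper. The `select_top_k` helper only mimics the selection step on precomputed scores; it does not model the learned indexer itself.

```python
import heapq


def full_attention_entries(context_len: int) -> int:
    """Full attention: a query attends to every token in the context."""
    return context_len


def hca_entries(context_len: int, group: int = 128, window: int = 128) -> int:
    """HCA-style count: one compressed entry per complete group of tokens,
    plus a sliding window of recent raw tokens.

    `window=128` is an assumption; the text does not give the HCA window size.
    """
    compressed = context_len // group      # complete groups only (assumption)
    local = min(window, context_len)       # recent uncompressed tokens
    return compressed + local


def csa_compressed_entries(context_len: int, block: int = 4) -> int:
    """Number of compressed entries the CSA indexer has to score."""
    return context_len // block


def csa_entries(context_len: int, block: int = 4,
                top_k: int = 1024, window: int = 128) -> int:
    """CSA-style count: the top-k compressed entries plus the last
    `window` raw tokens.

    `top_k=1024` is a made-up budget for illustration only.
    """
    selected = min(top_k, csa_compressed_entries(context_len, block))
    local = min(window, context_len)
    return selected + local


def select_top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest scores, returned in positional order.

    Stands in for the selection step; real scores would come from the indexer.
    """
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


if __name__ == "__main__":
    n = 1_000_000
    print("full:", full_attention_entries(n))
    print("HCA :", hca_entries(n))
    print("CSA :", csa_entries(n), "(indexer scores", csa_compressed_entries(n), "entries)")
    print("top-2 of toy scores:", select_top_k([0.1, 0.9, 0.3, 0.7], 2))
```

Under these assumed settings, a query at 1,000,000 tokens touches 7,940 entries with the HCA-style layout and 1,152 with the CSA-style layout, versus 1,000,000 for full attention. Note that the CSA indexer still has to score all 250,000 compressed entries, which is presumably why it runs in a lower dimension.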
Results Highlights
The paper demonstrates impressive results. The above figure shows the highlights:
- Competitive performance with top proprietary models. (left)
- Dramatic reduction of compute and memory consumption. (right)
Check out the full breakdown below.