fix: correct fused KV cache stride in paged MQA logits #311
Open

JasonOA888 wants to merge 1 commit into deepseek-ai:main
Conversation
The `from_blob` calls for `kv_cache` and `kv_cache_sf` used wrong strides that assumed a flat layout (all values, then all SF), but the API assertion enforces an interleaved layout (`[values | SF]` per row).

- FP4 path: `stride(1)` was `head_dim / 2` (64); should be `fp4_with_sf_bytes` (68).
- FP8 path: `stride(1)` was `head_dim` (128); should be `head_dim_with_sf` (132).
- The `kv_cache_sf` offset was also wrong (`block_kv * head_dim` instead of just `head_dim`), and its `stride(1)` was 1 instead of the row stride divided by the SF element size.

Also fixes the test helpers to construct the correct interleaved layout instead of the flat layout that only worked by coincidence for small block sizes.
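The corrected row strides follow directly from the layout: each cache row stores its packed values followed immediately by its scale factor. A minimal sketch of that arithmetic (assuming `head_dim = 128` and a single 4-byte per-row scale factor, which is what the 64 → 68 and 128 → 132 deltas imply):

```python
# Row-stride arithmetic for the interleaved [values | SF] layout.
# Assumptions: head_dim = 128 and one 4-byte scale factor per row
# (consistent with the 64 -> 68 and 128 -> 132 deltas in the description).
HEAD_DIM = 128
SF_BYTES = 4

# FP4 packs two 4-bit values per byte.
fp4_value_bytes = HEAD_DIM // 2                 # 64: old (flat-layout) stride
fp4_with_sf_bytes = fp4_value_bytes + SF_BYTES  # 68: correct interleaved stride

# FP8 stores one byte per value.
fp8_value_bytes = HEAD_DIM                      # 128: old (flat-layout) stride
head_dim_with_sf = fp8_value_bytes + SF_BYTES   # 132: correct interleaved stride

print(fp4_with_sf_bytes, head_dim_with_sf)
```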
Bug
`fp8_fp4_paged_mqa_logits` decomposes the fused KV cache into separate `kv_cache` and `kv_cache_sf` tensors via `torch::from_blob`. The strides passed to `from_blob` assume a flat layout (all values, then all SF), but the API assertion enforces an interleaved layout (`[values | SF]` per row).

What was wrong
kv_cache stride(1):
- Was: `head_dim / 2` (64) for FP4, `head_dim` (128) for FP8
- Should be: `fp4_with_sf_bytes` (68) for FP4, `head_dim_with_sf` (132) for FP8

kv_cache_sf offset:
- Was: `block_kv * head_dim`, pointing past all data rows into undefined territory
- Should be: `head_dim` (or `head_dim / 2` for FP4), pointing to the SF of the first row

kv_cache_sf stride(1):
- Was: `1` (contiguous), but SF values are interleaved every `fp4_with_sf_bytes` / `head_dim_with_sf` bytes

Impact
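The last point is worth spelling out: because `kv_cache_sf` is a typed tensor rather than raw bytes, its `stride(1)` is expressed in SF elements, so it must be the interleaved row stride divided by the SF element size. A sketch under the same assumption of a 4-byte per-row scale factor:

```python
# stride(1) of kv_cache_sf, expressed in SF elements rather than bytes.
# Assumption: the per-row scale factor is a single 4-byte value.
SF_ITEMSIZE = 4
fp4_with_sf_bytes = 68   # interleaved row stride, FP4 path
head_dim_with_sf = 132   # interleaved row stride, FP8 path

# There is one SF per row, so consecutive SFs are a full row apart:
fp4_sf_stride = fp4_with_sf_bytes // SF_ITEMSIZE  # element stride, not 1
fp8_sf_stride = head_dim_with_sf // SF_ITEMSIZE   # element stride, not 1
print(fp4_sf_stride, fp8_sf_stride)
```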
TMA descriptors constructed with wrong strides cause misaligned reads for any row > 0 in the paged KV cache. This silently corrupts attention logits for multi-row blocks.
The test helper had the same flat-layout assumption, masking the bug for small block sizes where corruption fell in the last 1-2 rows.
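To illustrate what the corrected test helper has to do, here is a hypothetical NumPy sketch (the function name, shapes, and float32 SF type are assumptions, not the repository's actual helper) that interleaves values and scale factors per row instead of concatenating two flat regions:

```python
import numpy as np

def make_interleaved_block(values: np.ndarray, sf: np.ndarray) -> np.ndarray:
    """Pack each row as [value bytes | SF bytes], matching the API's layout.

    values: (block_kv, value_bytes) uint8 -- packed FP4 or FP8 data
    sf:     (block_kv,) float32           -- one scale factor per row (assumed)
    """
    block_kv, value_bytes = values.shape
    row_bytes = value_bytes + sf.itemsize
    buf = np.empty((block_kv, row_bytes), dtype=np.uint8)
    buf[:, :value_bytes] = values
    # Reinterpret the SFs as raw bytes and place each right after its row's values.
    buf[:, value_bytes:] = sf.view(np.uint8).reshape(block_kv, sf.itemsize)
    return buf.reshape(-1)
```

A flat-layout helper would instead emit all value rows followed by all SFs, which matches the interleaved layout only when the discrepancy lands in rows a small test never reads.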
Fix
- Use the interleaved row stride (`fp4_with_sf_bytes` / `head_dim_with_sf`) for both the `kv_cache` and `kv_cache_sf` `from_blob` calls
- Set the `kv_cache_sf` offset to `head_dim` / `head_dim / 2` (the start of the first row's SF)
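The fix can be visualized with a NumPy analogue of the two `from_blob` views (illustrative only; the function name, shapes, and float32 SF type are assumptions): both views walk the buffer with the full interleaved row stride, and the SF view starts right after the first row's values rather than past the whole block.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def split_fused_kv(buf: np.ndarray, block_kv: int, value_bytes: int,
                   sf_itemsize: int = 4):
    """Split a fused [values | SF]-per-row uint8 buffer into two strided views.

    Mirrors the corrected from_blob calls: both views use the full interleaved
    row stride, and the SF view is offset by value_bytes (the first row's SF),
    not by block_kv * value_bytes.
    """
    row_bytes = value_bytes + sf_itemsize
    # Values: offset 0, but stride over the whole row, skipping each row's SF.
    kv = as_strided(buf, shape=(block_kv, value_bytes), strides=(row_bytes, 1))
    # SFs: offset value_bytes, same row stride; reassembled as float32 here.
    sf_rows = as_strided(buf[value_bytes:], shape=(block_kv, sf_itemsize),
                         strides=(row_bytes, 1))
    sf = sf_rows.copy().view(np.float32).reshape(block_kv)
    return kv, sf
```

With the old flat strides, the value view of row 1 would land partly on row 0's SF bytes, which is exactly the misaligned-read corruption described above.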