
feat: support bf16 output and plain TMA writes in k_grouped_gemm on SM90 #298

Open

fedorovgv wants to merge 1 commit into deepseek-ai:main from fedorovgv:feat/k_grouped_sm90_bfp16
Conversation

@fedorovgv

This PR adds two features to the SM90 FP8 1D1D k-grouped GEMM kernel:

  • Support a plain TMA store as an alternative to atomic accumulation, selected by the presence of the c tensor
  • Support a BF16 output dtype, casting the WGMMA FP32 accumulators to BF16 before the TMA store
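The two paths above can be modeled with a short Python sketch. This is a hypothetical host-side model of the epilogue semantics, not the kernel's actual code: the function names are illustrative, and the FP32-to-BF16 cast is assumed to use round-to-nearest-even truncation to the top 16 bits.

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Truncate FP32 to BF16 (keep the top 16 bits) with
    # round-to-nearest-even; returns the raw 16-bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # nearest-even bias
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_fp32(b: int) -> float:
    # Expanding BF16 back to FP32 is exact (zero-fill low 16 bits).
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def epilogue_store(accum, c=None, bf16_out=False):
    # Hypothetical model of the dispatch described in the PR:
    # - c present: accumulation path (d = c + accum), FP32 output
    # - c absent:  plain store, optionally casting accumulators to BF16
    if c is not None:
        return [ci + ai for ci, ai in zip(c, accum)]
    if bf16_out:
        return [bf16_bits_to_fp32(fp32_to_bf16_bits(a)) for a in accum]
    return list(accum)
```

For example, `epilogue_store([1/3], bf16_out=True)` returns the BF16-rounded value 0.333984375, while passing `c` keeps full FP32 accumulation.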
