[Proxy mirror] [Fix] fix B200 cu128 NVCC compilation failed by FlamingoPg · Pull Request #173 · deepseek-ai/DeepGEMM

FlamingoPg · 2025-08-26T18:32:08Z

On a B200 with cu128, tests/test_bf16.py fails with:

➜  DeepGEMM git:(main) python3 /sgl-workspace/DeepGEMM/tests/test_bf16.py                                           
Library path:
 > ['/usr/local/lib/python3.12/dist-packages/deep_gemm']

Testing GEMM:
Warning: please use at least NVCC 12.9 for the best DeepGEMM performanceNVCC compilation failed: nvcc fatal   : Unsupported gpu architecture 'sm_100f'

Traceback (most recent call last):
  File "/sgl-workspace/DeepGEMM/tests/test_bf16.py", line 123, in <module>
    test_gemm()
  File "/sgl-workspace/DeepGEMM/tests/test_bf16.py", line 30, in test_gemm
    getattr(deep_gemm, func_name)(a, b, d, c=c)
RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/compiler.hpp:176): false and "NVCC compilation failed"
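
The error above comes from passing the family-specific target `sm_100f` to an NVCC that predates it; family-specific targets arrived with CUDA 12.9, while the cu128 toolchain only accepts the architecture-specific `sm_100a` for this GPU. A minimal sketch of a version-gated choice follows. It is an illustration only: `parse_nvcc_version` and `pick_arch_flag` are hypothetical helper names, not DeepGEMM functions, and the actual change in this PR may be structured differently.

```python
import re


def parse_nvcc_version(version_text: str) -> tuple[int, int]:
    """Extract (major, minor) from the text printed by `nvcc --version`.

    Hypothetical helper: it relies on the "release X.Y" phrase that
    nvcc prints in its version banner.
    """
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("could not find an NVCC release number")
    return int(match.group(1)), int(match.group(2))


def pick_arch_flag(nvcc_version: tuple[int, int], base_arch: str = "100") -> str:
    """Choose an -arch value that the given NVCC can compile.

    Assumption: NVCC >= 12.9 understands the family-specific "f" suffix;
    older versions fall back to the architecture-specific "a" suffix.
    """
    suffix = "f" if nvcc_version >= (12, 9) else "a"
    return f"sm_{base_arch}{suffix}"


banner = "Cuda compilation tools, release 12.8, V12.8.93"
print(pick_arch_flag(parse_nvcc_version(banner)))  # sm_100a
```

With a 12.9 banner the same call returns `sm_100f`, so newer toolchains keep the family-specific target.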

After the fix:

➜  DeepGEMM git:(main) ✗ python3 /sgl-workspace/DeepGEMM/tests/test_bf16.py                                         
Library path:
 > ['/usr/local/lib/python3.12/dist-packages/deep_gemm']

Testing GEMM:
 > Perf (m=  128, n= 2112, k= 7168, layout=NT, BF16, acc=0):   18 us |  221 TFLOPS | 1860 GB/s | 0.79x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NT, FP32, acc=0):   18 us |  212 TFLOPS | 1815 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NT, FP32, acc=1):   18 us |  212 TFLOPS | 1871 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NN, BF16, acc=0):   18 us |  218 TFLOPS | 1833 GB/s | 0.78x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NN, FP32, acc=0):   19 us |  207 TFLOPS | 1774 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NN, FP32, acc=1):   19 us |  206 TFLOPS | 1821 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TT, BF16, acc=0):   18 us |  220 TFLOPS | 1851 GB/s | 0.79x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TT, FP32, acc=0):   18 us |  212 TFLOPS | 1820 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TT, FP32, acc=1):   18 us |  210 TFLOPS | 1855 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TN, BF16, acc=0):   18 us |  218 TFLOPS | 1837 GB/s | 0.77x cuBLAS
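
As a rough sanity check on the TFLOPS column, the figure can be reproduced from the shape and the printed time. This sketch assumes the usual 2·m·n·k FLOP count for a GEMM; `gemm_tflops` is a hypothetical helper, not part of the test script, and the printed time is rounded to whole microseconds, so the result only approximates the logged value.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS, counting one multiply and one add per MAC."""
    flops = 2 * m * n * k
    return flops / seconds / 1e12


# Shape from the log above, with the rounded 18 us timing.
print(round(gemm_tflops(128, 2112, 7168, 18e-6), 1))  # 215.3
```

The logged 221 TFLOPS corresponds to a time slightly under 18 us, which is consistent with the rounding of the time column.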

johnnynunez · 2025-08-26T23:47:11Z

@LyricZhao Could you merge it?

FlamingoPg mentioned this pull request Aug 26, 2025

[Bug] cu128 test_bf16 error #172

Closed

zhyncs force-pushed the main branch from 198d857 to 89b4089 Compare August 26, 2025 23:29

fix B200 cu128 NVCC compilation failed

2074fa7

LyricZhao merged commit 3a93f4e into deepseek-ai:main Aug 27, 2025

LyricZhao pushed a commit that referenced this pull request Apr 16, 2026

Fix B200 cu128 NVCC compilation failed (#173)

ac09fc9

LyricZhao added a commit that referenced this pull request Apr 16, 2026

Add PyTorch symmetric memory and mega MoE interfaces (#173)

ab451f3

* Add symmetric memory tests
* Add symm buffer class
* Rename files

LyricZhao merged 1 commit into deepseek-ai:main from sgl-project:main
