[Fix] fix B200 cu128 NVCC compilation failed #173

Merged
LyricZhao merged 1 commit into deepseek-ai:main from sgl-project:main on Aug 27, 2025

@FlamingoPg
Contributor

On a B200 with CUDA 12.8 (cu128), the BF16 test fails with:

➜  DeepGEMM git:(main) python3 /sgl-workspace/DeepGEMM/tests/test_bf16.py                                           
Library path:
 > ['/usr/local/lib/python3.12/dist-packages/deep_gemm']

Testing GEMM:
Warning: please use at least NVCC 12.9 for the best DeepGEMM performanceNVCC compilation failed: nvcc fatal   : Unsupported gpu architecture 'sm_100f'

Traceback (most recent call last):
  File "/sgl-workspace/DeepGEMM/tests/test_bf16.py", line 123, in <module>
    test_gemm()
  File "/sgl-workspace/DeepGEMM/tests/test_bf16.py", line 30, in test_gemm
    getattr(deep_gemm, func_name)(a, b, d, c=c)
RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../../jit/compiler.hpp:176): false and "NVCC compilation failed"
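
The log above shows NVCC rejecting `sm_100f` shortly after a warning that recommends NVCC 12.9 or newer, which suggests the JIT is requesting an architecture target that the CUDA 12.8 toolchain does not know. The diff itself is not shown on this page, so the following is only a minimal sketch of one way to guard the target by compiler version. The helper names `parse_nvcc_version` and `select_sm100_arch`, and the choice of `sm_100a` as the fallback, are assumptions for illustration and not DeepGEMM's actual code.

```python
import re


def parse_nvcc_version(version_output: str) -> tuple[int, int]:
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find an NVCC release number")
    return int(match.group(1)), int(match.group(2))


def select_sm100_arch(nvcc_version: tuple[int, int]) -> str:
    """Pick an SM100 target the given NVCC is expected to accept.

    Assumption: the family-specific target `sm_100f` is only requested for
    NVCC 12.9 or newer; older toolchains fall back to `sm_100a`.
    """
    return "sm_100f" if nvcc_version >= (12, 9) else "sm_100a"


if __name__ == "__main__":
    sample = "Cuda compilation tools, release 12.8, V12.8.93"
    version = parse_nvcc_version(sample)
    print(version, select_sm100_arch(version))
```

Comparing `(major, minor)` tuples keeps the check correct across major versions, which a minor-only comparison would not.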

After the fix, the same test runs to completion:

➜  DeepGEMM git:(main) ✗ python3 /sgl-workspace/DeepGEMM/tests/test_bf16.py                                         
Library path:
 > ['/usr/local/lib/python3.12/dist-packages/deep_gemm']

Testing GEMM:
 > Perf (m=  128, n= 2112, k= 7168, layout=NT, BF16, acc=0):   18 us |  221 TFLOPS | 1860 GB/s | 0.79x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NT, FP32, acc=0):   18 us |  212 TFLOPS | 1815 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NT, FP32, acc=1):   18 us |  212 TFLOPS | 1871 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NN, BF16, acc=0):   18 us |  218 TFLOPS | 1833 GB/s | 0.78x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NN, FP32, acc=0):   19 us |  207 TFLOPS | 1774 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=NN, FP32, acc=1):   19 us |  206 TFLOPS | 1821 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TT, BF16, acc=0):   18 us |  220 TFLOPS | 1851 GB/s | 0.79x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TT, FP32, acc=0):   18 us |  212 TFLOPS | 1820 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TT, FP32, acc=1):   18 us |  210 TFLOPS | 1855 GB/s | 0.00x cuBLAS
 > Perf (m=  128, n= 2112, k= 7168, layout=TN, BF16, acc=0):   18 us |  218 TFLOPS | 1837 GB/s | 0.77x cuBLAS
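
As a rough sanity check on the figures above, a GEMM of shape m x n x k performs about 2·m·n·k floating-point operations, so the reported TFLOPS implies a kernel time that can be compared with the rounded `us` column. The sketch below assumes the benchmark counts FLOPs this conventional way, which the log does not confirm; the function names are illustrative.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Conventional FLOP count for a GEMM: one multiply and one add per term."""
    return 2 * m * n * k


def implied_time_us(m: int, n: int, k: int, tflops: float) -> float:
    """Kernel time in microseconds implied by a reported TFLOPS figure."""
    return gemm_flops(m, n, k) / (tflops * 1e12) * 1e6


if __name__ == "__main__":
    # First row of the log: m=128, n=2112, k=7168, 221 TFLOPS, 18 us.
    print(gemm_flops(128, 2112, 7168))  # 3875536896
    print(round(implied_time_us(128, 2112, 7168, 221.0), 2))
```

For the first row this gives roughly 17.5 us, consistent with the 18 us shown once rounding is taken into account.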

@johnnynunez

@LyricZhao Could you merge it?

@LyricZhao merged commit 3a93f4e into deepseek-ai:main on Aug 27, 2025
LyricZhao added a commit that referenced this pull request Apr 16, 2026
* Add symmetric memory tests

* Add symm buffer class

* Rename files