Fix JIT cache race condition with multi-process compilation#302

Open
Gregory-Pereira wants to merge 1 commit into deepseek-ai:main from Gregory-Pereira:fix/jit-cache-race-condition

Conversation

@Gregory-Pereira

addresses: #301

Signed-off-by: greg pereira <grpereir@redhat.com>
@Gregory-Pereira
Author

Gregory-Pereira commented Apr 11, 2026

How I deployed this

Pretty confident this is working; I recreated the user's deployment. Going to share how I tested for posterity. My deployment:

apiVersion: v1
kind: Pod
metadata:
  generation: 1
  labels:
    app: test-deepgemm-jit-fix
    topology.kubernetes.io/region: US-EAST-04
    topology.kubernetes.io/zone: "377"
  name: test-deepgemm-jit-fix
  namespace: grpereir-dev
spec:
  containers:
    - command:
        - bash
        - -c
        - |
          set -e

          # Install cuobjdump (required by DeepGEMM JIT to extract kernel symbols)
          echo "=== Installing cuobjdump ==="
          dnf install -y cuda-cuobjdump-12-9 2>&1 || yum install -y cuda-cuobjdump-12-9 2>&1
          which cuobjdump && echo "cuobjdump installed" || { echo "FAIL: cuobjdump not found"; exit 1; }

          # Ensure cold DeepGEMM JIT cache
          export DG_JIT_CACHE_DIR=/tmp/deep_gemm_cache
          rm -rf "$DG_JIT_CACHE_DIR"

          echo "=== Starting vLLM with DP=8 + expert parallel ==="
          python3 -m vllm.entrypoints.openai.api_server \
            --model deepseek-ai/DeepSeek-V3.2 \
            --download-dir /weights \
            --data-parallel-size 8 \
            --enable-expert-parallel \
            --max-model-len 32768 \
            --port 8000 \
            --trust-remote-code 2>&1 &
          VLLM_PID=$!

          # Wait for server to be ready (model download + startup)
          echo "=== Waiting for vLLM to start (this may take a while for model download) ==="
          MAX_WAIT=7200
          ELAPSED=0
          while [ $ELAPSED -lt $MAX_WAIT ]; do
            if curl -s http://localhost:8000/health > /dev/null 2>&1; then
              echo "=== vLLM is ready after ${ELAPSED}s ==="
              break
            fi
            if ! kill -0 $VLLM_PID 2>/dev/null; then
              echo "=== FAIL: vLLM process died during startup ==="
              echo "=== Dumping JIT cache state ==="
              find "$DG_JIT_CACHE_DIR" -type f 2>/dev/null | head -50
              ls -la "$DG_JIT_CACHE_DIR"/locks/ 2>/dev/null | head -20
              wait $VLLM_PID
              exit 1
            fi
            sleep 10
            ELAPSED=$((ELAPSED + 10))
          done

          if [ $ELAPSED -ge $MAX_WAIT ]; then
            echo "=== FAIL: vLLM did not start within ${MAX_WAIT}s ==="
            kill $VLLM_PID 2>/dev/null
            exit 1
          fi

          # Send a test request
          echo "=== Sending test request ==="
          curl -s http://localhost:8000/v1/completions \
            -H "Content-Type: application/json" \
            -d '{
              "model": "deepseek-ai/DeepSeek-V3.2",
              "prompt": "Hello, how are you?",
              "max_tokens": 64
            }' | python3 -m json.tool

          echo "=== TEST PASSED ==="
          kill $VLLM_PID 2>/dev/null
          wait $VLLM_PID 2>/dev/null || true
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: HF_TOKEN
              name: hf-token
        - name: VLLM_RPC_TIMEOUT
          value: "300"
        - name: DG_JIT_DEBUG
          value: "1"
        - name: DG_JIT_PRINT_COMPILER_COMMAND
          value: "1"
      image: ghcr.io/llm-d/llm-d-cuda-dev:sha-4de2f73
      imagePullPolicy: IfNotPresent
      name: vllm
      resources:
        limits:
          cpu: "64"
          memory: 512Gi
          nvidia.com/gpu: "8"
        requests:
          cpu: "32"
          memory: 256Gi
          nvidia.com/gpu: "8"
      securityContext:
        runAsUser: 0
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /weights
          name: weights
        - mountPath: /dev/shm
          name: dshm
  enableServiceLinks: true
  nodeSelector:
    gpu.nvidia.com/model: H200
  serviceAccount: default
  serviceAccountName: default
  volumes:
    - name: weights
      persistentVolumeClaim:
        claimName: deepseek-v3-2-weights
    - emptyDir:
        medium: Memory
        sizeLimit: 64Gi
      name: dshm

Some relevant stuff to note about this deployment:

  1. Image: ghcr.io/llm-d/llm-d-cuda-dev:sha-4de2f73 (vLLM 0.1.dev1+g780ba3745, DeepGEMM 2.3.0+8b58b01). I built this from test DeepGEMM Serialize concurrent JIT compilation per kernel llm-d/llm-d#1132. It's based on the vLLM version the issue reporter stated in [Bug]: Deepseek v3.2 RuntimeError: Worker failed with error "Assertion error" vllm-project/vllm#39057. For DeepGEMM I used https://github.com/Gregory-Pereira/DeepGEMM/tree/test-fix/jit-cache-race-condition, WHICH IS BEHIND MAIN. This was done to match the version of DeepGEMM that vLLM was installing, so it's 2 commits behind and 1 commit ahead.
  2. Startup script shenanigans: there are two relevant things. First, cuobjdump was missing from my llm-d image. I added it to the next build, but while waiting I re-ran with the startup script installing it:
          # Install cuobjdump (required by DeepGEMM JIT to extract kernel symbols)
          echo "=== Installing cuobjdump ==="
          dnf install -y cuda-cuobjdump-12-9 2>&1 || yum install -y cuda-cuobjdump-12-9 2>&1
          which cuobjdump && echo "cuobjdump installed" || { echo "FAIL: cuobjdump not found"; exit 1; }

Additionally, the key test here is that we're purging the cache, to make sure we're actually JIT recompiling:

          # Ensure cold DeepGEMM JIT cache
          export DG_JIT_CACHE_DIR=/tmp/deep_gemm_cache
          rm -rf "$DG_JIT_CACHE_DIR"

The relevant log follows as proof.

Stats

# Total successful kernel loads (each "Loading CUBIN" = one worker loading one kernel)
$ grep -c "Loading CUBIN" test-deepgemm-jit-fix-debug3.log
383

# Unique kernels compiled
$ grep "Loading CUBIN" test-deepgemm-jit-fix-debug3.log | sort -u | wc -l
57

# JIT compilation errors (race conditions, cuobjdump failures, etc.)
$ grep -c "Assertion error\|compilation failed\|cuobjdump failed" test-deepgemm-jit-fix-debug3.log
0

Per-kernel load distribution

Every kernel is loaded exactly 8 times — once per DP worker — confirming the lock serializes compilation and all workers share the result:

$ grep "Loading CUBIN" test-deepgemm-jit-fix-debug3.log \
    | sed 's|.*/cache/||; s|/kernel.cubin||' \
    | sort | uniq -c | sort -rn
   8 kernel.transpose_fp32.c1a3f22add777280fda8196ff449b7d2
   8 kernel.transpose_fp32.b07a133da70913f23440f7e5253337d7
   8 kernel.transpose_fp32.a43b5b54b5352f122e6b7ae407a9af45
   8 kernel.smxx_paged_mqa_logits_metadata.d5c235650c093c5a615485090cf8f725
   8 kernel.smxx_fp8_paged_mqa_logits.fac5f5fad385e5fe806ab12e5aca33b5
   8 kernel.sm90_fp8_gemm_1d2d.f589a00989cc89fcad66f0fbbc7d7a40
   8 kernel.sm90_fp8_gemm_1d2d.ea19d2564b863c4eb0b0dff96d2c735b
   8 kernel.sm90_fp8_gemm_1d2d.dd95206d9baeb237a7594c2b9412bd79
   8 kernel.sm90_fp8_gemm_1d2d.dcbcc161389b25f03637dfc35c409e3c
   8 kernel.sm90_fp8_gemm_1d2d.d9234f95a3c07cf6c808a87d6e30f024
   8 kernel.sm90_fp8_gemm_1d2d.d566be34011a2ad0285c4d61ee4dcf5e
   8 kernel.sm90_fp8_gemm_1d2d.ce78d1e38d717b3e289b0df9f009d21d
   8 kernel.sm90_fp8_gemm_1d2d.b7c4db398b32042bfaaaa4401fba61c4
   8 kernel.sm90_fp8_gemm_1d2d.96e29548574cf55b0c4e27312ebb9f1a
   8 kernel.sm90_fp8_gemm_1d2d.8d8e8da86bedd9e56d2f844cab2d470b
   8 kernel.sm90_fp8_gemm_1d2d.732ac67c3e656fc37c0afb05545d2fc9
   8 kernel.sm90_fp8_gemm_1d2d.6def38d091e15c20db81413b142af7a5
   8 kernel.sm90_fp8_gemm_1d2d.60696ba8ad90bcfbb9615bfc4d3a3180
   8 kernel.sm90_fp8_gemm_1d2d.5c01a6837d43ce0187a7365c01982316
   8 kernel.sm90_fp8_gemm_1d2d.5bdd8703555276ddd2641e0d263c3e0b
   8 kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e
   8 kernel.sm90_fp8_gemm_1d2d.343998dba1f0bde4cef2f4f8422dd463
   8 kernel.sm90_fp8_gemm_1d2d.2b939be087cfb2ed49e26f352b9ef1ea
   8 kernel.sm90_fp8_gemm_1d2d.2b7b92f42eb2d389b8b6e7d1bf18440c
   8 kernel.sm90_fp8_gemm_1d2d.28fcae3d8d549cb228148e9b5d8c5885
   8 kernel.sm90_fp8_gemm_1d2d.211524f966efaea8addf41b4e10000f6
   8 kernel.sm90_fp8_gemm_1d2d.142600968af483ae36513d812fa0eefd
   8 kernel.sm90_fp8_gemm_1d2d.11257298673ca627647f4711af881aa2
   8 kernel.sm90_fp8_gemm_1d2d.0b6915cabf0c5ca7856729210465ea68
   8 kernel.sm90_fp8_gemm_1d2d.08696995c27cb4c44e04a0bc336f3fb5
   7 kernel.sm90_fp8_gemm_1d2d.fc7f99e4e1360325fb27aefe9f62e4fc
   7 kernel.sm90_fp8_gemm_1d2d.f3d64bf9b09c46e5822820355da67e7c
   7 kernel.sm90_fp8_gemm_1d2d.e5e9b9a79bd5ffc6f7a17b8f4e4f6f5e
   ...

First kernel load sequence — showing lock behavior

The first kernel (371756...) is loaded 8 times — one process compiled it while the other 7 waited on the file lock, then loaded the cached cubin:

$ grep "Loading CUBIN.*371756" test-deepgemm-jit-fix-debug3.log
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin

Lock file state after JIT (from failure dump of earlier run)

$ ls -la /tmp/deep_gemm_cache/locks/
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.08696995c27cb4c44e04a0bc336f3fb5.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.0b6915cabf0c5ca7856729210465ea68.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.11257298673ca627647f4711af881aa2.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.142600968af483ae36513d812fa0eefd.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.211524f966efaea8addf41b4e10000f6.lock
...

One lock file per unique kernel — different kernels compile in parallel, only the same kernel is serialized.

Finally, the pod successfully completes:

$ k get pods
NAME                    READY   STATUS      RESTARTS   AGE
test-deepgemm-jit-fix   0/1     Completed   0          35m

@Gregory-Pereira
Author

These were all the full logs with debug logging from the following:

        - name: DG_JIT_DEBUG
          value: "1"
        - name: DG_JIT_PRINT_COMPILER_COMMAND
          value: "1"

NOTE: THESE ARE BIG — each log file is ~800k lines. To get around GitHub's 25 MB file upload limit I zipped them:

test-deepgemm-jit-fix-debug2.log.gz
test-deepgemm-jit-fix-debug3.log.gz

Log 2 actually crashed due to an OOM; I had to set the max model length appropriately, resulting in the success on run 3. Log 1 was omitted entirely because it crashed almost immediately; this was how I found out my image was missing cuobjdump.

Comment thread csrc/jit/compiler.hpp
@@ -18,6 +19,29 @@

namespace deep_gemm {


Exception-unsafe fd in constructor

Line 28: fd_ = open(lock_path.c_str(), O_CREAT | O_RDWR, 0666);

If open() succeeds but the DG_HOST_ASSERT on the next line throws, the constructor never finishes and the destructor never runs, so the fd leaks. Either wrap the constructor body in try/catch and close the fd before re-throwing, or use an initializer-list pattern that throws from open() directly, so that no valid fd exists when a later assertion fails.
