Fix JIT cache race condition with multi-process compilation#302

Open
Gregory-Pereira wants to merge 1 commit into deepseek-ai:main from Gregory-Pereira:fix/jit-cache-race-condition

Conversation

@Gregory-Pereira

addresses: #301

Signed-off-by: greg pereira <grpereir@redhat.com>
@Gregory-Pereira
Author

Gregory-Pereira commented Apr 11, 2026

How I deployed this

Pretty confident this is working; I recreated the user's deployment. Going to share how I tested for posterity. My deployment:

apiVersion: v1
kind: Pod
metadata:
  generation: 1
  labels:
    app: test-deepgemm-jit-fix
    topology.kubernetes.io/region: US-EAST-04
    topology.kubernetes.io/zone: "377"
  name: test-deepgemm-jit-fix
  namespace: grpereir-dev
spec:
  containers:
    - command:
        - bash
        - -c
        - |
          set -e

          # Install cuobjdump (required by DeepGEMM JIT to extract kernel symbols)
          echo "=== Installing cuobjdump ==="
          dnf install -y cuda-cuobjdump-12-9 2>&1 || yum install -y cuda-cuobjdump-12-9 2>&1
          which cuobjdump && echo "cuobjdump installed" || { echo "FAIL: cuobjdump not found"; exit 1; }

          # Ensure cold DeepGEMM JIT cache
          export DG_JIT_CACHE_DIR=/tmp/deep_gemm_cache
          rm -rf "$DG_JIT_CACHE_DIR"

          echo "=== Starting vLLM with DP=8 + expert parallel ==="
          python3 -m vllm.entrypoints.openai.api_server \
            --model deepseek-ai/DeepSeek-V3.2 \
            --download-dir /weights \
            --data-parallel-size 8 \
            --enable-expert-parallel \
            --max-model-len 32768 \
            --port 8000 \
            --trust-remote-code 2>&1 &
          VLLM_PID=$!

          # Wait for server to be ready (model download + startup)
          echo "=== Waiting for vLLM to start (this may take a while for model download) ==="
          MAX_WAIT=7200
          ELAPSED=0
          while [ $ELAPSED -lt $MAX_WAIT ]; do
            if curl -s http://localhost:8000/health > /dev/null 2>&1; then
              echo "=== vLLM is ready after ${ELAPSED}s ==="
              break
            fi
            if ! kill -0 $VLLM_PID 2>/dev/null; then
              echo "=== FAIL: vLLM process died during startup ==="
              echo "=== Dumping JIT cache state ==="
              find "$DG_JIT_CACHE_DIR" -type f 2>/dev/null | head -50
              ls -la "$DG_JIT_CACHE_DIR"/locks/ 2>/dev/null | head -20
              wait $VLLM_PID
              exit 1
            fi
            sleep 10
            ELAPSED=$((ELAPSED + 10))
          done

          if [ $ELAPSED -ge $MAX_WAIT ]; then
            echo "=== FAIL: vLLM did not start within ${MAX_WAIT}s ==="
            kill $VLLM_PID 2>/dev/null
            exit 1
          fi

          # Send a test request
          echo "=== Sending test request ==="
          curl -s http://localhost:8000/v1/completions \
            -H "Content-Type: application/json" \
            -d '{
              "model": "deepseek-ai/DeepSeek-V3.2",
              "prompt": "Hello, how are you?",
              "max_tokens": 64
            }' | python3 -m json.tool

          echo "=== TEST PASSED ==="
          kill $VLLM_PID 2>/dev/null
          wait $VLLM_PID 2>/dev/null || true
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              key: HF_TOKEN
              name: hf-token
        - name: VLLM_RPC_TIMEOUT
          value: "300"
        - name: DG_JIT_DEBUG
          value: "1"
        - name: DG_JIT_PRINT_COMPILER_COMMAND
          value: "1"
      image: ghcr.io/llm-d/llm-d-cuda-dev:sha-4de2f73
      imagePullPolicy: IfNotPresent
      name: vllm
      resources:
        limits:
          cpu: "64"
          memory: 512Gi
          nvidia.com/gpu: "8"
        requests:
          cpu: "32"
          memory: 256Gi
          nvidia.com/gpu: "8"
      securityContext:
        runAsUser: 0
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /weights
          name: weights
        - mountPath: /dev/shm
          name: dshm
  enableServiceLinks: true
  nodeSelector:
    gpu.nvidia.com/model: H200
  serviceAccount: default
  serviceAccountName: default
  volumes:
    - name: weights
      persistentVolumeClaim:
        claimName: deepseek-v3-2-weights
    - emptyDir:
        medium: Memory
        sizeLimit: 64Gi
      name: dshm

Some relevant stuff to note about this deployment:

  1. Image: ghcr.io/llm-d/llm-d-cuda-dev:sha-4de2f73 (vLLM 0.1.dev1+g780ba3745, DeepGEMM 2.3.0+8b58b01). I built this from test DeepGEMM Serialize concurrent JIT compilation per kernel llm-d/llm-d#1132. It's based on the vLLM version the issue reporter stated in [Bug]: Deepseek v3.2 RuntimeError: Worker failed with error "Assertion error" vllm-project/vllm#39057. For DeepGEMM I used https://github.com/Gregory-Pereira/DeepGEMM/tree/test-fix/jit-cache-race-condition, WHICH IS BEHIND MAIN. This was done to match the version of DeepGEMM that vLLM was installing, so it's 2 commits behind and 1 commit ahead.
  2. Startup script shenanigans: there are two relevant things. First, cuobjdump was missing from my llm-d image. I added it to the next build, but while waiting I re-ran with the startup script installing it:
          # Install cuobjdump (required by DeepGEMM JIT to extract kernel symbols)
          echo "=== Installing cuobjdump ==="
          dnf install -y cuda-cuobjdump-12-9 2>&1 || yum install -y cuda-cuobjdump-12-9 2>&1
          which cuobjdump && echo "cuobjdump installed" || { echo "FAIL: cuobjdump not found"; exit 1; }

Additionally, the key test here is that we're purging the cache, to make sure we're actually JIT recompiling:

          # Ensure cold DeepGEMM JIT cache
          export DG_JIT_CACHE_DIR=/tmp/deep_gemm_cache
          rm -rf "$DG_JIT_CACHE_DIR"

The relevant log follows as proof.

Stats

# Total successful kernel loads (each "Loading CUBIN" = one worker loading one kernel)
$ grep -c "Loading CUBIN" test-deepgemm-jit-fix-debug3.log
383

# Unique kernels compiled
$ grep "Loading CUBIN" test-deepgemm-jit-fix-debug3.log | sort -u | wc -l
57

# JIT compilation errors (race conditions, cuobjdump failures, etc.)
$ grep -c "Assertion error\|compilation failed\|cuobjdump failed" test-deepgemm-jit-fix-debug3.log
0

Per-kernel load distribution

Every kernel is loaded exactly 8 times — once per DP worker — confirming the lock serializes compilation and all workers share the result:

$ grep "Loading CUBIN" test-deepgemm-jit-fix-debug3.log \
    | sed 's|.*/cache/||; s|/kernel.cubin||' \
    | sort | uniq -c | sort -rn
   8 kernel.transpose_fp32.c1a3f22add777280fda8196ff449b7d2
   8 kernel.transpose_fp32.b07a133da70913f23440f7e5253337d7
   8 kernel.transpose_fp32.a43b5b54b5352f122e6b7ae407a9af45
   8 kernel.smxx_paged_mqa_logits_metadata.d5c235650c093c5a615485090cf8f725
   8 kernel.smxx_fp8_paged_mqa_logits.fac5f5fad385e5fe806ab12e5aca33b5
   8 kernel.sm90_fp8_gemm_1d2d.f589a00989cc89fcad66f0fbbc7d7a40
   8 kernel.sm90_fp8_gemm_1d2d.ea19d2564b863c4eb0b0dff96d2c735b
   8 kernel.sm90_fp8_gemm_1d2d.dd95206d9baeb237a7594c2b9412bd79
   8 kernel.sm90_fp8_gemm_1d2d.dcbcc161389b25f03637dfc35c409e3c
   8 kernel.sm90_fp8_gemm_1d2d.d9234f95a3c07cf6c808a87d6e30f024
   8 kernel.sm90_fp8_gemm_1d2d.d566be34011a2ad0285c4d61ee4dcf5e
   8 kernel.sm90_fp8_gemm_1d2d.ce78d1e38d717b3e289b0df9f009d21d
   8 kernel.sm90_fp8_gemm_1d2d.b7c4db398b32042bfaaaa4401fba61c4
   8 kernel.sm90_fp8_gemm_1d2d.96e29548574cf55b0c4e27312ebb9f1a
   8 kernel.sm90_fp8_gemm_1d2d.8d8e8da86bedd9e56d2f844cab2d470b
   8 kernel.sm90_fp8_gemm_1d2d.732ac67c3e656fc37c0afb05545d2fc9
   8 kernel.sm90_fp8_gemm_1d2d.6def38d091e15c20db81413b142af7a5
   8 kernel.sm90_fp8_gemm_1d2d.60696ba8ad90bcfbb9615bfc4d3a3180
   8 kernel.sm90_fp8_gemm_1d2d.5c01a6837d43ce0187a7365c01982316
   8 kernel.sm90_fp8_gemm_1d2d.5bdd8703555276ddd2641e0d263c3e0b
   8 kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e
   8 kernel.sm90_fp8_gemm_1d2d.343998dba1f0bde4cef2f4f8422dd463
   8 kernel.sm90_fp8_gemm_1d2d.2b939be087cfb2ed49e26f352b9ef1ea
   8 kernel.sm90_fp8_gemm_1d2d.2b7b92f42eb2d389b8b6e7d1bf18440c
   8 kernel.sm90_fp8_gemm_1d2d.28fcae3d8d549cb228148e9b5d8c5885
   8 kernel.sm90_fp8_gemm_1d2d.211524f966efaea8addf41b4e10000f6
   8 kernel.sm90_fp8_gemm_1d2d.142600968af483ae36513d812fa0eefd
   8 kernel.sm90_fp8_gemm_1d2d.11257298673ca627647f4711af881aa2
   8 kernel.sm90_fp8_gemm_1d2d.0b6915cabf0c5ca7856729210465ea68
   8 kernel.sm90_fp8_gemm_1d2d.08696995c27cb4c44e04a0bc336f3fb5
   7 kernel.sm90_fp8_gemm_1d2d.fc7f99e4e1360325fb27aefe9f62e4fc
   7 kernel.sm90_fp8_gemm_1d2d.f3d64bf9b09c46e5822820355da67e7c
   7 kernel.sm90_fp8_gemm_1d2d.e5e9b9a79bd5ffc6f7a17b8f4e4f6f5e
   ...

First kernel load sequence — showing lock behavior

The first kernel (371756...) is loaded 8 times — one process compiled it while the other 7 waited on the file lock, then loaded the cached cubin:

$ grep "Loading CUBIN.*371756" test-deepgemm-jit-fix-debug3.log
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin
Loading CUBIN: /tmp/deep_gemm_cache/cache/kernel.sm90_fp8_gemm_1d2d.371756703f6a0c121d20ad6a4d46139e/kernel.cubin

Lock file state after JIT (from failure dump of earlier run)

$ ls -la /tmp/deep_gemm_cache/locks/
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.08696995c27cb4c44e04a0bc336f3fb5.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.0b6915cabf0c5ca7856729210465ea68.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.11257298673ca627647f4711af881aa2.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.142600968af483ae36513d812fa0eefd.lock
-rw-r--r-- 1 root root 0 Apr 11 15:23 kernel.sm90_fp8_gemm_1d2d.211524f966efaea8addf41b4e10000f6.lock
...

One lock file per unique kernel — different kernels compile in parallel, only the same kernel is serialized.

Finally, the pod successfully completes:

$ k get pods
NAME                    READY   STATUS      RESTARTS   AGE
test-deepgemm-jit-fix   0/1     Completed   0          35m

@Gregory-Pereira
Author

These were all the full logs with debug logging from the following:

        - name: DG_JIT_DEBUG
          value: "1"
        - name: DG_JIT_PRINT_COMPILER_COMMAND
          value: "1"

NOTE: THESE ARE BIG — each log file is ~800k lines. To get around GitHub's 25 MB file upload limit I zipped them:

test-deepgemm-jit-fix-debug2.log.gz
test-deepgemm-jit-fix-debug3.log.gz

Log 2 actually crashed due to an OOM; I had to set the max model length appropriately, resulting in the success on run 3. Log 1 was omitted entirely because it crashed almost immediately; this was how I found out my image was missing cuobjdump.

Comment thread csrc/jit/compiler.hpp
@@ -18,6 +19,29 @@

namespace deep_gemm {


Exception-unsafe fd in constructor

Line 28: fd_ = open(lock_path.c_str(), O_CREAT | O_RDWR, 0666);

If open() succeeds but the DG_HOST_ASSERT on the next line throws, the constructor never finishes and the destructor never runs, so the fd leaks. Either wrap the constructor body in try/catch and close the fd before re-throwing, or use an initializer-list pattern that throws from open() directly, so that no valid fd exists when a later assertion fails.
