Fix JIT cache race condition with multi-process compilation #302
Gregory-Pereira wants to merge 1 commit into deepseek-ai:main from
Conversation
Signed-off-by: greg pereira <grpereir@redhat.com>
**How I deployed this**

Pretty confident this is working; I recreated the user's deployment. Going to share how I tested for posterity. My deployment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  generation: 1
  labels:
    app: test-deepgemm-jit-fix
    topology.kubernetes.io/region: US-EAST-04
    topology.kubernetes.io/zone: "377"
  name: test-deepgemm-jit-fix
  namespace: grpereir-dev
spec:
  containers:
  - command:
    - bash
    - -c
    - |
      set -e
      # Install cuobjdump (required by DeepGEMM JIT to extract kernel symbols)
      echo "=== Installing cuobjdump ==="
      dnf install -y cuda-cuobjdump-12-9 2>&1 || yum install -y cuda-cuobjdump-12-9 2>&1
      which cuobjdump && echo "cuobjdump installed" || { echo "FAIL: cuobjdump not found"; exit 1; }
      # Ensure cold DeepGEMM JIT cache
      export DG_JIT_CACHE_DIR=/tmp/deep_gemm_cache
      rm -rf "$DG_JIT_CACHE_DIR"
      echo "=== Starting vLLM with DP=8 + expert parallel ==="
      python3 -m vllm.entrypoints.openai.api_server \
        --model deepseek-ai/DeepSeek-V3.2 \
        --download-dir /weights \
        --data-parallel-size 8 \
        --enable-expert-parallel \
        --max-model-len 32768 \
        --port 8000 \
        --trust-remote-code 2>&1 &
      VLLM_PID=$!
      # Wait for server to be ready (model download + startup)
      echo "=== Waiting for vLLM to start (this may take a while for model download) ==="
      MAX_WAIT=7200
      ELAPSED=0
      while [ $ELAPSED -lt $MAX_WAIT ]; do
        if curl -s http://localhost:8000/health > /dev/null 2>&1; then
          echo "=== vLLM is ready after ${ELAPSED}s ==="
          break
        fi
        if ! kill -0 $VLLM_PID 2>/dev/null; then
          echo "=== FAIL: vLLM process died during startup ==="
          echo "=== Dumping JIT cache state ==="
          find "$DG_JIT_CACHE_DIR" -type f 2>/dev/null | head -50
          ls -la "$DG_JIT_CACHE_DIR"/locks/ 2>/dev/null | head -20
          wait $VLLM_PID
          exit 1
        fi
        sleep 10
        ELAPSED=$((ELAPSED + 10))
      done
      if [ $ELAPSED -ge $MAX_WAIT ]; then
        echo "=== FAIL: vLLM did not start within ${MAX_WAIT}s ==="
        kill $VLLM_PID 2>/dev/null
        exit 1
      fi
      # Send a test request
      echo "=== Sending test request ==="
      curl -s http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "deepseek-ai/DeepSeek-V3.2",
          "prompt": "Hello, how are you?",
          "max_tokens": 64
        }' | python3 -m json.tool
      echo "=== TEST PASSED ==="
      kill $VLLM_PID 2>/dev/null
      wait $VLLM_PID 2>/dev/null || true
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          key: HF_TOKEN
          name: hf-token
    - name: VLLM_RPC_TIMEOUT
      value: "300"
    - name: DG_JIT_DEBUG
      value: "1"
    - name: DG_JIT_PRINT_COMPILER_COMMAND
      value: "1"
    image: ghcr.io/llm-d/llm-d-cuda-dev:sha-4de2f73
    imagePullPolicy: IfNotPresent
    name: vllm
    resources:
      limits:
        cpu: "64"
        memory: 512Gi
        nvidia.com/gpu: "8"
      requests:
        cpu: "32"
        memory: 256Gi
        nvidia.com/gpu: "8"
    securityContext:
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /weights
      name: weights
    - mountPath: /dev/shm
      name: dshm
  enableServiceLinks: true
  nodeSelector:
    gpu.nvidia.com/model: H200
  serviceAccount: default
  serviceAccountName: default
  volumes:
  - name: weights
    persistentVolumeClaim:
      claimName: deepseek-v3-2-weights
  - emptyDir:
      medium: Memory
      sizeLimit: 64Gi
    name: dshm
```

Some relevant things to note about this deployment:
```bash
# Install cuobjdump (required by DeepGEMM JIT to extract kernel symbols)
echo "=== Installing cuobjdump ==="
dnf install -y cuda-cuobjdump-12-9 2>&1 || yum install -y cuda-cuobjdump-12-9 2>&1
which cuobjdump && echo "cuobjdump installed" || { echo "FAIL: cuobjdump not found"; exit 1; }
```

Additionally, the key test here is that we're purging the cache, so we make sure we're JIT recompiling:

```bash
# Ensure cold DeepGEMM JIT cache
export DG_JIT_CACHE_DIR=/tmp/deep_gemm_cache
rm -rf "$DG_JIT_CACHE_DIR"
```

Relevant log follows for proof.

**Stats**

```shell
# Total successful kernel loads (each "Loading CUBIN" = one worker loading one kernel)
$ grep -c "Loading CUBIN" test-deepgemm-jit-fix-debug3.log
383

# Unique kernels compiled
$ grep "Loading CUBIN" test-deepgemm-jit-fix-debug3.log | sort -u | wc -l
57

# JIT compilation errors (race conditions, cuobjdump failures, etc.)
$ grep -c "Assertion error\|compilation failed\|cuobjdump failed" test-deepgemm-jit-fix-debug3.log
0
```

**Per-kernel load distribution**

Every kernel is loaded exactly 8 times — once per DP worker — confirming the lock serializes compilation and all workers share the result:

```shell
$ grep "Loading CUBIN" test-deepgemm-jit-fix-debug3.log \
  | sed 's|.*/cache/||; s|/kernel.cubin||' \
  | sort | uniq -c | sort -rn
```

**First kernel load sequence — showing lock behavior**

The first kernel (matching `371756`):

```shell
$ grep "Loading CUBIN.*371756" test-deepgemm-jit-fix-debug3.log
```

**Lock file state after JIT (from failure dump of earlier run)**

```shell
$ ls -la /tmp/deep_gemm_cache/locks/
```

One lock file per unique kernel — different kernels compile in parallel; only the same kernel is serialized.

Finally, the pod successfully completes:

```shell
$ k get pods
NAME                    READY   STATUS      RESTARTS   AGE
test-deepgemm-jit-fix   0/1     Completed   0          35m
```
These are the full logs, with debug logging, from the runs above. NOTE: THESE ARE BIG, each log file is ~800k lines. To get around GitHub's file-size upload limit of 25 MB I zipped them: test-deepgemm-jit-fix-debug2.log.gz. Log 2 actually crashed due to an OOM; I had to set the max model length to serve appropriately, resulting in the success on run 3. Log 1 was omitted entirely because it crashed almost immediately; this was how I found out my image was missing cuobjdump.
Review comment on the diff (hunk `@@ -18,6 +19,29 @@`, context `namespace deep_gemm {`):
**Exception-unsafe fd in constructor**

Line 28: `fd_ = open(lock_path.c_str(), O_CREAT | O_RDWR, 0666);`

If `open()` succeeds but the `DG_HOST_ASSERT` on the next line throws, the constructor never finishes, so the destructor never runs and the fd leaks. Either wrap the constructor body in a try-catch that closes the fd before re-throwing, or use an initializer-list pattern that throws from `open()` directly, so no valid fd exists if a later assertion fails.

addresses: #301