Summary
A step that throws RetryableError emits a step_retrying event with a valid retryAfter timestamp, but the scheduled wake-up never fires. The step remains status: pending and the run stays running with no further events until manually cancelled. Intermittent — the same step retries successfully many times before a single retry is silently dropped. Reproduces on workflow-core 4.2.1 and 4.2.2, across two different projects with two different workflow definitions and step functions.
Possibly related to #1735 (same specVersion 3 / 4.2.x hang surface area), but the signature is distinct — no CBOR errors, most retries on the same run fire correctly, only an isolated retry is lost.
Environment
- Framework: Nitro (Node.js)
- Deployment: Vercel Functions (Fluid Compute)
- SDK:
workflow@4.2.1 and workflow@4.2.2
Three concrete examples
All three hung with the same signature: step_retrying emitted → no further events → run frozen for 11–20 hours → manually cancelled.
| Run ID |
workflow-core |
specVer |
Step role |
Attempt |
Last event |
retryAfter Δ |
Silence |
wrun_01KPC5WFQPHDB4Z08SKBK70E4P |
4.2.1 |
3 |
external HTTP POST |
2 |
2026-04-16T23:27:02.857Z |
+2.3s |
19h 44m |
wrun_01KPA0ABPZSZR250R2581YPM30 |
4.2.1 |
3 |
status poll |
1 |
2026-04-16T10:48:24.870Z |
+20s |
11h 23m |
wrun_01KPC0HP2GR212CHF9K9JV47RF |
4.2.2 |
2 |
external HTTP POST |
5 |
2026-04-17T03:46:40.797Z |
+33s |
16h 33m |
Two separate projects, two different workflow definitions, three different deployments — same Vercel team (account info available on request).
Last event on the hung step (run 1, upstream error message redacted)
{
"eventType": "step_retrying",
"correlationId": "step_01KPC5WFTXCZRG2K9CD5XZDA05",
"specVersion": 3,
"eventData": {
"error": { "message": "<redacted — HTTP 429 rate-limit from upstream>" },
"retryAfter": "2026-04-16T23:27:05.183Z"
},
"createdAt": "2026-04-16T23:27:02.857Z",
"runId": "wrun_01KPC5WFQPHDB4Z08SKBK70E4P"
}
No subsequent event of any kind on this run — no step_started, no step_completed, no step_failed, no timer or wakeup event. updatedAt froze at 23:27:02.840Z.
Healthy baseline (same run)
Run 1 had 238 other retries with identical event shape — all fired (median 925ms after retryAfter, p99 1.58s, max 7.88s). Run 3's hung step had 4 prior retries that fired cleanly before the 5th was dropped. Not a "scheduler not implemented" bug — an intermittent miss.
Scope
- Not version-specific: reproduces on both 4.2.1 and 4.2.2
- Not workflow-specific: two different projects, different workflow definitions, different step functions
- Not attempt-specific: dropped on attempts 1, 2, and 5
- Not retryAfter-delta-specific: 2s, 20s, 33s all observed
- Intermittent: small fraction of retries are lost; most fire normally
More examples available — happy to share additional run IDs.
Workaround
Wrapping step calls in Promise.race([step(), sleep(watchdogMs)]) and treating the sleep winning as "invoke a fresh step". Relies on idempotent downstream writes. Not a pattern we want to keep permanently.
Observability feedback
No public event type exposes retry-timer scheduling or firing — no wait_scheduled, timer_set, timer_fired, etc. From the event log alone it's impossible to distinguish "retry enqueued but never fired" from "retry never enqueued". Surfacing these internals would make dropped retries self-evident instead of requiring wall-clock inference.
Expected behavior
When step_retrying is emitted with a retryAfter, the scheduled wake-up should fire at or near that time, or emit a visible failure event. Silent drops leave the run indistinguishable from "legitimately waiting" except by elapsed wall-clock time.
Summary
A step that throws
RetryableErroremits astep_retryingevent with a validretryAftertimestamp, but the scheduled wake-up never fires. The step remainsstatus: pendingand the run staysrunningwith no further events until manually cancelled. Intermittent — the same step retries successfully many times before a single retry is silently dropped. Reproduces on workflow-core 4.2.1 and 4.2.2, across two different projects with two different workflow definitions and step functions.Possibly related to #1735 (same specVersion 3 / 4.2.x hang surface area), but the signature is distinct — no CBOR errors, most retries on the same run fire correctly, only an isolated retry is lost.
Environment
workflow@4.2.1andworkflow@4.2.2Three concrete examples
All three hung with the same signature:
step_retryingemitted → no further events → run frozen for 11–20 hours → manually cancelled.wrun_01KPC5WFQPHDB4Z08SKBK70E4Pwrun_01KPA0ABPZSZR250R2581YPM30wrun_01KPC0HP2GR212CHF9K9JV47RFTwo separate projects, two different workflow definitions, three different deployments — same Vercel team (account info available on request).
Last event on the hung step (run 1, upstream error message redacted)
{ "eventType": "step_retrying", "correlationId": "step_01KPC5WFTXCZRG2K9CD5XZDA05", "specVersion": 3, "eventData": { "error": { "message": "<redacted — HTTP 429 rate-limit from upstream>" }, "retryAfter": "2026-04-16T23:27:05.183Z" }, "createdAt": "2026-04-16T23:27:02.857Z", "runId": "wrun_01KPC5WFQPHDB4Z08SKBK70E4P" }No subsequent event of any kind on this run — no
step_started, nostep_completed, nostep_failed, no timer or wakeup event.updatedAtfroze at23:27:02.840Z.Healthy baseline (same run)
Run 1 had 238 other retries with identical event shape — all fired (median 925ms after
retryAfter, p99 1.58s, max 7.88s). Run 3's hung step had 4 prior retries that fired cleanly before the 5th was dropped. Not a "scheduler not implemented" bug — an intermittent miss.Scope
More examples available — happy to share additional run IDs.
Workaround
Wrapping step calls in
Promise.race([step(), sleep(watchdogMs)])and treating the sleep winning as "invoke a fresh step". Relies on idempotent downstream writes. Not a pattern we want to keep permanently.Observability feedback
No public event type exposes retry-timer scheduling or firing — no
wait_scheduled,timer_set,timer_fired, etc. From the event log alone it's impossible to distinguish "retry enqueued but never fired" from "retry never enqueued". Surfacing these internals would make dropped retries self-evident instead of requiring wall-clock inference.Expected behavior
When
step_retryingis emitted with aretryAfter, the scheduled wake-up should fire at or near that time, or emit a visible failure event. Silent drops leave the run indistinguishable from "legitimately waiting" except by elapsed wall-clock time.