Retry silently dropped: step stays pending indefinitely after step_retrying event (workflow-core 4.2.1 and 4.2.2) #1806

@admin-vt

Summary

A step that throws RetryableError emits a step_retrying event with a valid retryAfter timestamp, but the scheduled wake-up never fires. The step remains status: pending and the run stays running, with no further events, until manually cancelled. The failure is intermittent: the same step retries successfully many times before a single retry is silently dropped. It reproduces on workflow-core 4.2.1 and 4.2.2, across two different projects with two different workflow definitions and step functions.

Possibly related to #1735 (same specVersion 3 / 4.2.x hang surface area), but the signature is distinct: there are no CBOR errors, most retries on the same run fire correctly, and only an isolated retry is lost.

Environment

  • Framework: Nitro (Node.js)
  • Deployment: Vercel Functions (Fluid Compute)
  • SDK: workflow@4.2.1 and workflow@4.2.2

Three concrete examples

All three hung with the same signature: step_retrying emitted → no further events → run frozen for 11–20 hours → manually cancelled.

| Run ID | workflow-core | specVer | Step role | Attempt | Last event | retryAfter Δ | Silence |
| --- | --- | --- | --- | --- | --- | --- | --- |
| wrun_01KPC5WFQPHDB4Z08SKBK70E4P | 4.2.1 | 3 | external HTTP POST | 2 | 2026-04-16T23:27:02.857Z | +2.3s | 19h 44m |
| wrun_01KPA0ABPZSZR250R2581YPM30 | 4.2.1 | 3 | status poll | 1 | 2026-04-16T10:48:24.870Z | +20s | 11h 23m |
| wrun_01KPC0HP2GR212CHF9K9JV47RF | 4.2.2 | 2 | external HTTP POST | 5 | 2026-04-17T03:46:40.797Z | +33s | 16h 33m |

Two separate projects, two different workflow definitions, three different deployments — same Vercel team (account info available on request).

Last event on the hung step (run 1, upstream error message redacted)

{
  "eventType": "step_retrying",
  "correlationId": "step_01KPC5WFTXCZRG2K9CD5XZDA05",
  "specVersion": 3,
  "eventData": {
    "error": { "message": "<redacted — HTTP 429 rate-limit from upstream>" },
    "retryAfter": "2026-04-16T23:27:05.183Z"
  },
  "createdAt": "2026-04-16T23:27:02.857Z",
  "runId": "wrun_01KPC5WFQPHDB4Z08SKBK70E4P"
}

No subsequent event of any kind on this run — no step_started, no step_completed, no step_failed, no timer or wakeup event. updatedAt froze at 23:27:02.840Z.

Healthy baseline (same run)

Run 1 had 238 other retries with identical event shape, and all of them fired (median 925ms after retryAfter, p99 1.58s, max 7.88s). Run 3's hung step had 4 prior retries that fired cleanly before the 5th was dropped. This is not a "scheduler not implemented" bug but an intermittent miss.
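For anyone who wants to compute the same baseline on their own runs, here is a rough sketch of how the fire-delay figures above can be derived. It assumes each retry has been paired offline with the createdAt of the next event on the same step; the RetryObservation shape and the function names are hypothetical and not part of the SDK. It uses the nearest-rank percentile, so with an even sample count the median is the lower of the two middle values.

```typescript
// Hypothetical shape: one entry per retry that did fire.
interface RetryObservation {
  retryAfter: string;  // ISO 8601 timestamp from step_retrying.eventData.retryAfter
  nextEventAt: string; // ISO 8601 createdAt of the next event on the same step (assumed available)
}

// Nearest-rank percentile over an ascending-sorted array, with p in (0, 100].
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  const clamped = Math.min(sortedAsc.length, Math.max(1, rank));
  return sortedAsc[clamped - 1];
}

// Delay between the requested retryAfter and the next observed event, in ms, sorted ascending.
function fireDelaysMs(observations: RetryObservation[]): number[] {
  return observations
    .map((o) => Date.parse(o.nextEventAt) - Date.parse(o.retryAfter))
    .sort((a, b) => a - b);
}

function summarize(observations: RetryObservation[]) {
  const delays = fireDelaysMs(observations);
  return {
    count: delays.length,
    medianMs: percentile(delays, 50),
    p99Ms: percentile(delays, 99),
    maxMs: delays[delays.length - 1],
  };
}
```

A dropped retry has no next event, so it never appears in this list; it only shows up as a gap between the number of step_retrying events and the number of observations.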

Scope

  • Not version-specific: reproduces on both 4.2.1 and 4.2.2
  • Not workflow-specific: two different projects, different workflow definitions, different step functions
  • Not attempt-specific: dropped on attempts 1, 2, and 5
  • Not retryAfter-delta-specific: 2s, 20s, 33s all observed
  • Intermittent: small fraction of retries are lost; most fire normally

More examples available — happy to share additional run IDs.

Workaround

We wrap step calls in Promise.race([step(), sleep(watchdogMs)]) and treat the sleep winning as a signal to invoke a fresh step. This relies on idempotent downstream writes and is not a pattern we want to keep permanently.
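A minimal sketch of the watchdog pattern described above, written against plain promises rather than the SDK. The sleep here is an ordinary setTimeout wrapper standing in for whatever sleep primitive the workflow runtime provides, and withWatchdog, invokeStep, and maxInvocations are hypothetical names introduced for illustration.

```typescript
// Sentinel that lets us tell "the watchdog won" apart from any step result.
const WATCHDOG = Symbol("watchdog");

// Plain timer-based sleep; an assumption, not the workflow runtime's own sleep.
function sleep(ms: number): Promise<typeof WATCHDOG> {
  return new Promise((resolve) => setTimeout(() => resolve(WATCHDOG), ms));
}

// Races each step invocation against a watchdog timer. If the timer wins,
// a fresh invocation is started, up to maxInvocations in total.
async function withWatchdog<T>(
  invokeStep: () => Promise<T>,
  watchdogMs: number,
  maxInvocations: number,
): Promise<T> {
  for (let attempt = 1; attempt <= maxInvocations; attempt++) {
    // A rejection from invokeStep propagates to the caller unchanged.
    const winner = await Promise.race([invokeStep(), sleep(watchdogMs)]);
    if (winner !== WATCHDOG) return winner as T;
    // Watchdog won. The earlier invocation is abandoned, not cancelled, so it
    // may still complete later; downstream writes therefore must be idempotent.
  }
  throw new Error(`step did not settle within ${maxInvocations} invocations`);
}
```

Given the healthy-run maximum of 7.88s reported above, watchdogMs would need to sit comfortably above the longest expected retryAfter plus that delay, otherwise the watchdog will start duplicate invocations for retries that were merely slow.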

Observability feedback

No public event type exposes retry-timer scheduling or firing — no wait_scheduled, timer_set, timer_fired, etc. From the event log alone it's impossible to distinguish "retry enqueued but never fired" from "retry never enqueued". Surfacing these internals would make dropped retries self-evident instead of requiring wall-clock inference.

Expected behavior

When step_retrying is emitted with a retryAfter, the scheduled wake-up should fire at or near that time, or emit a visible failure event. Silent drops leave the run indistinguishable from "legitimately waiting" except by elapsed wall-clock time.
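Until such an event exists, the wall-clock inference can be approximated offline. The sketch below assumes the event log can be exported as an array of objects shaped like the step_retrying example in this report, and that every event for a step carries the same correlationId (an assumption, not something confirmed by the SDK). findStalledRetries and graceMs are hypothetical names.

```typescript
// Subset of the event fields shown in this report; other fields are ignored.
interface WorkflowEvent {
  eventType: string;
  correlationId: string;
  createdAt: string; // ISO 8601
  eventData?: { retryAfter?: string };
}

interface StalledRetry {
  correlationId: string;
  retryAfter: string;
  overdueMs: number; // how far past retryAfter the step is at nowMs
}

// Flags steps whose most recent event is step_retrying and whose retryAfter
// is more than graceMs in the past at nowMs.
function findStalledRetries(
  events: WorkflowEvent[],
  nowMs: number,
  graceMs: number,
): StalledRetry[] {
  // Keep only the latest event per step, keyed by correlationId (assumed per-step).
  const latest = new Map<string, WorkflowEvent>();
  for (const e of events) {
    const prev = latest.get(e.correlationId);
    if (!prev || Date.parse(e.createdAt) >= Date.parse(prev.createdAt)) {
      latest.set(e.correlationId, e);
    }
  }

  const stalled: StalledRetry[] = [];
  latest.forEach((e, correlationId) => {
    const retryAfter = e.eventData?.retryAfter;
    if (e.eventType !== "step_retrying" || !retryAfter) return;
    const overdueMs = nowMs - Date.parse(retryAfter);
    if (overdueMs > graceMs) stalled.push({ correlationId, retryAfter, overdueMs });
  });
  return stalled;
}
```

A graceMs of a minute or so would sit well above the 7.88s maximum seen on healthy retries while still catching the multi-hour hangs reported here.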
