Retry silently dropped: step stays pending indefinitely after step_retrying event (workflow-core 4.2.1 and 4.2.2)

## Summary

A step that throws `RetryableError` emits a `step_retrying` event with a valid `retryAfter` timestamp, but the scheduled wake-up never fires. The step remains `status: pending` and the run stays `running` with no further events until manually cancelled. Intermittent — the same step retries successfully many times before a single retry is silently dropped. Reproduces on workflow-core **4.2.1 and 4.2.2**, across two different projects with two different workflow definitions and step functions.

Possibly related to #1735 (same specVersion 3 / 4.2.x hang surface area), but the signature is distinct — no CBOR errors, most retries on the same run fire correctly, only an isolated retry is lost.

## Environment

- Framework: Nitro (Node.js)
- Deployment: Vercel Functions (Fluid Compute)
- SDK: `workflow@4.2.1` and `workflow@4.2.2`

## Three concrete examples

All three hung with the same signature: `step_retrying` emitted → no further events → run frozen for 11–20 hours → manually cancelled.

| Run ID | workflow-core | specVer | Step role | Attempt | Last event | retryAfter Δ | Silence |
|---|---|---|---|---|---|---|---|
| `wrun_01KPC5WFQPHDB4Z08SKBK70E4P` | 4.2.1 | 3 | external HTTP POST | 2 | 2026-04-16T23:27:02.857Z | +2.3s | 19h 44m |
| `wrun_01KPA0ABPZSZR250R2581YPM30` | 4.2.1 | 3 | status poll | 1 | 2026-04-16T10:48:24.870Z | +20s | 11h 23m |
| `wrun_01KPC0HP2GR212CHF9K9JV47RF` | 4.2.2 | 2 | external HTTP POST | 5 | 2026-04-17T03:46:40.797Z | +33s | 16h 33m |

Two separate projects, two different workflow definitions, three different deployments — same Vercel team (account info available on request).

## Last event on the hung step (run 1, upstream error message redacted)

```json
{
  "eventType": "step_retrying",
  "correlationId": "step_01KPC5WFTXCZRG2K9CD5XZDA05",
  "specVersion": 3,
  "eventData": {
    "error": { "message": "<redacted — HTTP 429 rate-limit from upstream>" },
    "retryAfter": "2026-04-16T23:27:05.183Z"
  },
  "createdAt": "2026-04-16T23:27:02.857Z",
  "runId": "wrun_01KPC5WFQPHDB4Z08SKBK70E4P"
}
```

No subsequent event of any kind on this run — no `step_started`, no `step_completed`, no `step_failed`, no timer or wakeup event. `updatedAt` froze at `23:27:02.840Z`.

## Healthy baseline (same run)

Run 1 had **238 other retries with identical event shape** — all fired (median 925ms after `retryAfter`, p99 1.58s, max 7.88s). Run 3's hung step had 4 prior retries that fired cleanly before the 5th was dropped. Not a "scheduler not implemented" bug — an intermittent miss.

## Scope

- **Not version-specific:** reproduces on both 4.2.1 and 4.2.2
- **Not workflow-specific:** two different projects, different workflow definitions, different step functions
- **Not attempt-specific:** dropped on attempts 1, 2, and 5
- **Not retryAfter-delta-specific:** 2s, 20s, 33s all observed
- **Intermittent:** small fraction of retries are lost; most fire normally

More examples available — happy to share additional run IDs.

## Workaround

Wrapping step calls in `Promise.race([step(), sleep(watchdogMs)])` and treating the sleep winning as "invoke a fresh step". Relies on idempotent downstream writes. Not a pattern we want to keep permanently.

## Observability feedback

No public event type exposes retry-timer scheduling or firing — no `wait_scheduled`, `timer_set`, `timer_fired`, etc. From the event log alone it's impossible to distinguish "retry enqueued but never fired" from "retry never enqueued". Surfacing these internals would make dropped retries self-evident instead of requiring wall-clock inference.

## Expected behavior

When `step_retrying` is emitted with a `retryAfter`, the scheduled wake-up should fire at or near that time, or emit a visible failure event. Silent drops leave the run indistinguishable from "legitimately waiting" except by elapsed wall-clock time.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry silently dropped: step stays pending indefinitely after step_retrying event (workflow-core 4.2.1 and 4.2.2) #1806

Summary

Environment

Three concrete examples

Last event on the hung step (run 1, upstream error message redacted)

Healthy baseline (same run)

Scope

Workaround

Observability feedback

Expected behavior

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Run ID	workflow-core	specVer	Step role	Attempt	Last event	retryAfter Δ	Silence
`wrun_01KPC5WFQPHDB4Z08SKBK70E4P`	4.2.1	3	external HTTP POST	2	2026-04-16T23:27:02.857Z	+2.3s	19h 44m
`wrun_01KPA0ABPZSZR250R2581YPM30`	4.2.1	3	status poll	1	2026-04-16T10:48:24.870Z	+20s	11h 23m
`wrun_01KPC0HP2GR212CHF9K9JV47RF`	4.2.2	2	external HTTP POST	5	2026-04-17T03:46:40.797Z	+33s	16h 33m

Retry silently dropped: step stays pending indefinitely after step_retrying event (workflow-core 4.2.1 and 4.2.2) #1806

Description

Summary

Environment

Three concrete examples

Last event on the hung step (run 1, upstream error message redacted)

Healthy baseline (same run)

Scope

Workaround

Observability feedback

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions