[Proxy mirror] fix(core): add hard 180s timeout to AI HTTP calls by quanru · Pull Request #2350 · web-infra-dev/midscene

quanru · 2026-04-15T03:54:58Z

Summary

aiAct was observed hanging on the final planning step for ~37 minutes in production (doubao-seed-2-0-lite) with no error, no retry, and no abort. Root cause turned out to be a latent bug in the openai Node SDK, not a Midscene bug — but Midscene has to defend against it because upgrading the SDK does not fix it.

Root cause — openai SDK (open source, Apache-2.0, https://github.com/openai/openai-node)

openai/client.mjs → fetchWithTimeout (verified identical in 6.3.0, our current pinned version, and 6.34.0, the latest on npm at the time of this PR):

async fetchWithTimeout(url, init, ms, controller) {
  ...
  const timeout = setTimeout(abort, ms);           // start the request timer
  try {
    return await this.fetch.call(undefined, url, fetchOptions); // resolves at *headers*
  } finally {
    clearTimeout(timeout);                          // ← cancelled as soon as headers arrive
  }
}

The timeout option only covers the TTFB (response-header arrival) phase. The JSON body is read later via await response.json() with no timer attached. If a server returns 200 OK fast and then stalls while streaming the body — which is what doubao-seed did for 37 minutes — the SDK sits there forever and the caller has no signal of failure.
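To make the gap concrete, here is a minimal TypeScript sketch of a timer that stays armed through the body read. It is not the SDK's code: `fetchJsonWithBodyDeadline` and the `FetchLike` type are hypothetical names, and the fetch function is injected so the sketch does not depend on a particular HTTP client.

```typescript
// Hypothetical minimal shape of a fetch-like function (an assumption for this sketch).
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ json(): Promise<unknown> }>;

// Unlike the fetchWithTimeout excerpt quoted above, the timer here is cleared
// only after the body has been parsed, so a stalled body still trips it.
async function fetchJsonWithBodyDeadline(
  fetchImpl: FetchLike,
  url: string,
  ms: number,
): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error(`no complete response within ${ms}ms`)),
    ms,
  );
  try {
    // Resolves when the response headers arrive.
    const response = await fetchImpl(url, { signal: controller.signal });
    // Body read: the timer is still armed during this await.
    return await response.json();
  } finally {
    // Cleared only once the body is done or the call has failed.
    clearTimeout(timer);
  }
}
```

With Node's built-in `fetch` this could be called as `fetchJsonWithBodyDeadline((u, init) => fetch(u, init), url, 180_000)`; aborting the signal then rejects the pending `response.json()` instead of leaving it to wait indefinitely.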

We filed this upstream as openai/openai-node#1825; in the meantime Midscene has to apply its own guard.

Fix

packages/core/src/ai-model/service-caller/index.ts

- New helper buildRequestAbortSignal(timeoutMs, userSignal?) composes the caller's abortSignal with a hard-timeout AbortController and forwards the result as signal on both completion.create calls (streaming and non-streaming). The SDK forwards this signal into the underlying ReadableStream, so it actually covers the body-read phase that the SDK timer misses.
- Default hard timeout DEFAULT_AI_CALL_TIMEOUT_MS = 180_000. Overridable per intent via MIDSCENE_MODEL_TIMEOUT, MIDSCENE_INSIGHT_MODEL_TIMEOUT, MIDSCENE_PLANNING_MODEL_TIMEOUT, or modelConfig.timeout. Setting timeout: 0 disables the hard timeout (only abortSignal can cancel).
- resolveEffectiveTimeoutMs() is factored out so that createChatClient (SDK-level timeout option) and callAI (our injected AbortSignal) share one source of truth.
- The streaming branch now uses try/finally so the timer and abort listener are always cleaned up, even if the SSE iterator throws mid-stream.
- On non-streaming calls, the existing retryCount loop naturally converts a hard timeout into a retry, so a single stalled body read no longer terminates the whole task.
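The signal composition described above could look roughly like the following. This is a sketch under assumptions, not the merged implementation: only the name and the `(timeoutMs, userSignal?)` parameters come from this PR, while the `{ signal, cleanup }` return shape and the error message are invented for illustration, and the error-code tagging is omitted for brevity.

```typescript
// Assumed return shape: the combined signal plus a cleanup hook for try/finally.
interface RequestAbortHandle {
  signal: AbortSignal;
  cleanup: () => void;
}

function buildRequestAbortSignal(
  timeoutMs: number,
  userSignal?: AbortSignal,
): RequestAbortHandle {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Forward a caller abort (and its reason) to the combined signal.
  const onUserAbort = () => controller.abort(userSignal?.reason);
  if (userSignal?.aborted) {
    controller.abort(userSignal.reason);
  } else {
    userSignal?.addEventListener('abort', onUserAbort, { once: true });
  }

  // timeoutMs <= 0 means "no hard timeout": only the caller's signal can cancel.
  if (timeoutMs > 0 && !controller.signal.aborted) {
    timer = setTimeout(() => {
      const err = new Error(`AI call exceeded hard timeout of ${timeoutMs}ms`);
      err.name = 'AbortError';
      controller.abort(err);
    }, timeoutMs);
  }

  // Call from a finally block so the timer and listener never outlive the request.
  const cleanup = () => {
    if (timer !== undefined) clearTimeout(timer);
    userSignal?.removeEventListener('abort', onUserAbort);
  };

  return { signal: controller.signal, cleanup };
}
```

A caller would pass `handle.signal` as the request's `signal` option and invoke `handle.cleanup()` in `finally`, which is the pattern the streaming branch adopts.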

Observability

- Exported AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT'. The AbortError raised by our timer carries this code, distinguishing it from caller aborts, network errors, and 5xx retries.
- Exported isHardTimeoutError(err) so monitors and tests can branch without string-matching the message.
- When the timeout trips, warnCall logs AI call hit hard timeout (${ms}ms, attempt X/Y, model Z, intent W) so production can count how often the safety net actually fires.
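As an illustration of the tagging scheme, the check can be as small as a code comparison. The constant name, its value, and `isHardTimeoutError` are taken from this PR; `createHardTimeoutError` and the exact error fields are assumptions made for the sketch.

```typescript
const AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT';

// Hypothetical factory: builds the tagged error a timer could pass to abort().
function createHardTimeoutError(timeoutMs: number): Error & { code: string } {
  const err = new Error(`AI call hit hard timeout (${timeoutMs}ms)`);
  err.name = 'AbortError';
  return Object.assign(err, { code: AI_CALL_HARD_TIMEOUT_CODE });
}

// Branch on the code rather than on the message text.
function isHardTimeoutError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { code?: unknown }).code === AI_CALL_HARD_TIMEOUT_CODE
  );
}
```

A plain caller abort carries no such string code, so a monitor can count hard timeouts separately from user cancellations.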

Behavior change (read before merging)

- Prior default: the OpenAI SDK's 10-minute header timeout (ineffective against body-read stalls).
- New default: 180 s end-to-end hard timeout. Users running models with single-call latency > 180 s must raise MIDSCENE_MODEL_TIMEOUT or set it to 0 to opt out. Documented in apps/site/docs/{en,zh}/model-config.mdx.
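The default and opt-out rules can be summarised in a few lines. This is a hedged sketch only: the real resolveEffectiveTimeoutMs signature is not shown in this PR description, so the single `configured` parameter, the assumption that values are milliseconds, and the fallback for unparsable input are all illustrative choices.

```typescript
const DEFAULT_AI_CALL_TIMEOUT_MS = 180_000;

// Returns the timeout in milliseconds; 0 means the hard timeout is disabled.
function resolveEffectiveTimeoutMs(configured?: number | string): number {
  // Nothing configured: fall back to the 180 s default.
  if (configured === undefined || configured === '') {
    return DEFAULT_AI_CALL_TIMEOUT_MS;
  }
  const parsed = Number(configured);
  // Assumption: unparsable values fall back to the default rather than throwing.
  if (!Number.isFinite(parsed)) {
    return DEFAULT_AI_CALL_TIMEOUT_MS;
  }
  // timeout <= 0 is treated as an explicit opt-out.
  return parsed <= 0 ? 0 : parsed;
}
```

For example, `resolveEffectiveTimeoutMs(process.env.MIDSCENE_MODEL_TIMEOUT)` would give 180000 when the variable is unset and 0 when it is set to `0`.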

Docs

apps/site/docs/en/model-config.mdx and apps/site/docs/zh/model-config.mdx — rewrote the MIDSCENE_MODEL_TIMEOUT row to state the new default, the 0 opt-out, and a brief explanation of why the hard timeout exists.

Test plan

npx vitest --run tests/unit-test/service-caller tests/unit-test/gpt-image-detail — 47/47 pass. New cases:
- reproduces the original hang by mocking completion.create to never resolve and asserts the call rejects with the hard-timeout error
- asserts the DEFAULT_AI_CALL_TIMEOUT_MS = 180_000 is applied to the OpenAI client when not configured
- asserts a caller-supplied abortSignal still cancels the request before the timeout fires
- retries after a hard timeout and returns the next successful response (verifies timeout participates in retry path)
- asserts timeout: 0 disables the hard timeout (no SDK timeout, no auto-abort)
- asserts the raised error is tagged with AI_CALL_HARD_TIMEOUT (isHardTimeoutError() returns true)
pnpm run lint
Published a prepatch (1.7.3-beta-20260415071518.0) for downstream smoke-testing.
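The timeout-then-retry case can be pictured with a small stand-alone loop. This is not the callAI retry loop itself: `callWithRetry` and `maxAttempts` are hypothetical names, the log text is shortened, and the sketch assumes every failure is retried until attempts run out.

```typescript
const AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT';

async function callWithRetry<T>(
  attempt: (attemptNo: number) => Promise<T>,
  maxAttempts: number,
): Promise<T> {
  let lastError: unknown;
  for (let attemptNo = 1; attemptNo <= maxAttempts; attemptNo++) {
    try {
      return await attempt(attemptNo);
    } catch (err) {
      lastError = err;
      // Log hard timeouts so they can be counted separately from other failures.
      if ((err as { code?: unknown } | null)?.code === AI_CALL_HARD_TIMEOUT_CODE) {
        console.warn(`AI call hit hard timeout (attempt ${attemptNo}/${maxAttempts})`);
      }
    }
  }
  // Every attempt failed: surface the most recent error.
  throw lastError;
}
```

A first attempt that rejects with the tagged error followed by a successful second attempt resolves with the second attempt's value, which is the behaviour the retry test asserts.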

…te hangs

Production logs showed the OpenAI SDK request-level timeout failing to fire when the underlying socket stalls, leaving aiAct stuck in planning for ~37 minutes with no error or retry. Inject an AbortSignal-backed hard timeout (default 180s, overridable via MIDSCENE_*_TIMEOUT) into both streaming and non-streaming completion.create calls so the request is actually cancelled and the existing retry path can take over.

cloudflare-workers-and-pages · 2026-04-15T03:57:39Z

Deploying midscene with Cloudflare Pages

Latest commit:	`5ca3cee`
Status:	✅ Deploy successful!
Preview URL:	https://cdbdd810.midscene.pages.dev
Branch Preview URL:	https://fix-ai-call-default-timeout.midscene.pages.dev


- Wrap streaming branch in try/finally so the AbortSignal listener and hard-timeout timer are always cleaned up, even if the SSE iterator throws mid-stream.
- Extract resolveEffectiveTimeoutMs so both createChatClient (SDK timeout option) and callAI (injected AbortSignal) share a single source of truth instead of piggybacking the value on the client's return shape.
- Treat timeout <= 0 as an explicit opt-out: skip the SDK timeout option and do not start a timer, so only the caller-provided abortSignal can cancel the request. Lets users extend or disable the hard limit without recompiling.
- Expand JSDoc on DEFAULT_AI_CALL_TIMEOUT_MS to document the behavior change (SDK default 600s -> hard 180s) and the opt-out.
- Add tests for timeout-then-retry-succeeds and the timeout=0 opt-out path.

- Tag the hard-timeout AbortError with code=AI_CALL_HARD_TIMEOUT and export isHardTimeoutError() so callers/metrics can branch on it instead of string-matching the message.
- Warn-log when the hard timeout fires inside callAI's retry loop, including attempt / model / intent so production can measure how often we actually save the day.
- Rewrite DEFAULT_AI_CALL_TIMEOUT_MS JSDoc to point at the real root cause: openai SDK's fetchWithTimeout clears its own timer in the finally block the moment response headers arrive, leaving the subsequent response.json() body read with no timeout. Our injected AbortSignal is what actually protects the body-read phase.
- Update en/zh model-config docs: default changed from SDK's 10-min default to 180s hard timeout; explain the 0-disables semantics and the reason for the extra layer.

github-actions bot added the change: fix label Apr 15, 2026

quanru added 3 commits April 15, 2026 14:05

test(core): update gpt-image-detail assertions for new AbortSignal ar…

735459c

…gument The timeout fix now always passes a { signal } options object to completion.create, so the existing tests that asserted the second arg was undefined need to accept the injected AbortSignal.

quanru mentioned this pull request Apr 15, 2026

SDK timeout option does not cover response-body read phase (can hang forever) openai/openai-node#1825

Open

refactor(core): extract AI request timeout helper

9fd915d

yuyutaotao approved these changes Apr 16, 2026

quanru merged commit 2b9e0ef into main Apr 16, 2026
8 checks passed

quanru deleted the fix/ai-call-default-timeout branch April 16, 2026 06:12

fix(core): add hard 180s timeout to AI HTTP calls #2350
quanru merged 5 commits into main from fix/ai-call-default-timeout
