
fix(core): add hard 180s timeout to AI HTTP calls #2350

Merged: quanru merged 5 commits into main from fix/ai-call-default-timeout on Apr 16, 2026

quanru (Collaborator) commented Apr 15, 2026

Summary

aiAct was observed hanging on the final planning step for ~37 minutes in production (doubao-seed-2-0-lite) with no error, no retry, and no abort. Root cause turned out to be a latent bug in the openai Node SDK, not a Midscene bug — but Midscene has to defend against it because upgrading the SDK does not fix it.

Root cause — openai SDK (open source, Apache-2.0, https://github.com/openai/openai-node)

fetchWithTimeout in openai/client.mjs (verified identical in 6.3.0, our current pinned version, and 6.34.0, the latest on npm at the time of this PR):

async fetchWithTimeout(url, init, ms, controller) {
  ...
  const timeout = setTimeout(abort, ms);           // start the request timer
  try {
    return await this.fetch.call(undefined, url, fetchOptions); // resolves at *headers*
  } finally {
    clearTimeout(timeout);                          // ← cancelled as soon as headers arrive
  }
}

The timeout option only covers the TTFB (response-header arrival) phase. The JSON body is read later via await response.json() with no timer attached. If a server returns 200 OK fast and then stalls while streaming the body — which is what doubao-seed did for 37 minutes — the SDK sits there forever and the caller has no signal of failure.

We filed this as an upstream concern; in the meantime Midscene has to apply its own cover.

Fix

packages/core/src/ai-model/service-caller/index.ts

  • New helper buildRequestAbortSignal(timeoutMs, userSignal?) that composes the caller's abortSignal with a hard-timeout AbortController, and forwards it as signal on both completion.create calls (streaming + non-streaming). The SDK forwards this signal into the underlying ReadableStream, so it actually covers the body-read phase the SDK timer misses.
  • Default hard timeout DEFAULT_AI_CALL_TIMEOUT_MS = 180_000. Overridable per intent via MIDSCENE_MODEL_TIMEOUT, MIDSCENE_INSIGHT_MODEL_TIMEOUT, MIDSCENE_PLANNING_MODEL_TIMEOUT, or modelConfig.timeout. Setting timeout: 0 disables the hard timeout (only abortSignal can cancel).
  • resolveEffectiveTimeoutMs() factored out so createChatClient (SDK-level timeout option) and callAI (our injected AbortSignal) share one source of truth.
  • Streaming branch now uses try/finally so the timer + abort listener are always cleaned up even if the SSE iterator throws mid-stream.
  • On non-streaming, the existing retryCount loop naturally converts a hard timeout into a retry — so a single stalled body read no longer terminates the whole task.
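
A minimal sketch of how a signal-composing helper like the one described above could look, assuming a Node runtime with global AbortController and support for abort reasons. The return shape ({ signal, cleanup }), the withHardTimeout wrapper, and the way the error is constructed are assumptions made for illustration; the actual code in packages/core/src/ai-model/service-caller/index.ts may differ.

```typescript
// Hypothetical sketch, not the merged implementation.
const AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT';

interface RequestAbortHandle {
  // Signal to pass as the request's `signal` option (undefined if nothing can abort).
  signal: AbortSignal | undefined;
  // Clears the timer and detaches the listener; safe to call more than once.
  cleanup: () => void;
}

function buildRequestAbortSignal(
  timeoutMs: number,
  userSignal?: AbortSignal,
): RequestAbortHandle {
  // timeout <= 0 is the opt-out: only the caller's own signal can cancel.
  if (timeoutMs <= 0) {
    return { signal: userSignal, cleanup: () => {} };
  }

  const controller = new AbortController();

  // Hard timer: unlike a header-only timer, it keeps running through the body read.
  const timer = setTimeout(() => {
    const err = Object.assign(
      new Error(`AI call exceeded hard timeout of ${timeoutMs}ms`),
      { name: 'AbortError', code: AI_CALL_HARD_TIMEOUT_CODE },
    );
    controller.abort(err);
  }, timeoutMs);

  // Forward a caller-initiated abort to the composed signal.
  const onUserAbort = () => controller.abort(userSignal?.reason);
  if (userSignal) {
    if (userSignal.aborted) {
      onUserAbort();
    } else {
      userSignal.addEventListener('abort', onUserAbort, { once: true });
    }
  }

  const cleanup = () => {
    clearTimeout(timer);
    userSignal?.removeEventListener('abort', onUserAbort);
  };

  return { signal: controller.signal, cleanup };
}

// try/finally guarantees cleanup even when `run` throws mid-stream.
async function withHardTimeout<T>(
  timeoutMs: number,
  userSignal: AbortSignal | undefined,
  run: (signal: AbortSignal | undefined) => Promise<T>,
): Promise<T> {
  const { signal, cleanup } = buildRequestAbortSignal(timeoutMs, userSignal);
  try {
    return await run(signal);
  } finally {
    cleanup();
  }
}
```

In this sketch a caller would wrap each attempt, for example withHardTimeout(180_000, abortSignal, (signal) => completion.create(body, { signal })), so that a retry loop sees a rejected promise instead of a request that never settles.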

Observability

  • Exported AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT'. The AbortError raised by our timer carries this code, distinguishing it from caller aborts / network errors / 5xx retries.
  • Exported isHardTimeoutError(err) so monitors / tests can branch without string-matching the message.
  • On timeout trip, warnCall logs AI call hit hard timeout (${ms}ms, attempt X/Y, model Z, intent W) so production can count how often the safety net actually fires.
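
A short sketch of how a monitor or test might branch on the exported code rather than on the error message. The constant value comes from this PR; the body of isHardTimeoutError and the classifyFailure helper are assumptions for illustration, not the merged code.

```typescript
// Hypothetical sketch: the real isHardTimeoutError may be implemented differently.
const AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT';

function isHardTimeoutError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { code?: unknown }).code === AI_CALL_HARD_TIMEOUT_CODE
  );
}

// Illustrative consumer (hypothetical name): bucket failures for metrics.
function classifyFailure(err: unknown): 'hard-timeout' | 'other' {
  return isHardTimeoutError(err) ? 'hard-timeout' : 'other';
}
```

Checking a code property keeps the classification stable even if the wording of the log or error message changes later.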

Behavior change (read before merging)

Prior default: OpenAI SDK's 10-minute header timeout (ineffective against body-read stalls).
New default: 180 s end-to-end hard timeout. Users running models with single-call latency > 180 s must raise MIDSCENE_MODEL_TIMEOUT or set it to 0 to opt out. Documented in apps/site/docs/{en,zh}/model-config.mdx.
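
The default and opt-out semantics above can be sketched as a single resolver. This is an illustration only: the parameter type, the handling of unparsable values, and the normalization of negative values to 0 are assumptions, and the sketch does not model how the per-intent MIDSCENE_*_TIMEOUT variables and modelConfig.timeout are prioritized against each other.

```typescript
// Hypothetical sketch of the "one source of truth" for the effective timeout.
const DEFAULT_AI_CALL_TIMEOUT_MS = 180_000;

// `configured` stands for whatever value was resolved from env or modelConfig.
// Returns 0 when the hard timeout is disabled, otherwise a positive number of ms.
function resolveEffectiveTimeoutMs(configured?: number | string): number {
  // Not configured: fall back to the 180 s default.
  if (configured === undefined || configured === '') {
    return DEFAULT_AI_CALL_TIMEOUT_MS;
  }
  const ms = Number(configured);
  // Assumption: an unparsable value is treated as "not configured".
  if (!Number.isFinite(ms)) {
    return DEFAULT_AI_CALL_TIMEOUT_MS;
  }
  // Values <= 0 are an explicit opt-out.
  return ms <= 0 ? 0 : ms;
}
```

Under these assumptions, a user with a slow model would set MIDSCENE_MODEL_TIMEOUT=300000 to get a 5-minute limit, or MIDSCENE_MODEL_TIMEOUT=0 to disable the hard limit entirely.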

Docs

apps/site/docs/en/model-config.mdx and apps/site/docs/zh/model-config.mdx — rewrote the MIDSCENE_MODEL_TIMEOUT row to state the new default, the 0 opt-out, and a brief explanation of why the hard timeout exists.

Test plan

  • npx vitest --run tests/unit-test/service-caller tests/unit-test/gpt-image-detail — 47/47 pass. New cases:
    • reproduces the original hang by mocking completion.create to never resolve and asserts the call rejects with the hard-timeout error
    • asserts the DEFAULT_AI_CALL_TIMEOUT_MS = 180_000 is applied to the OpenAI client when not configured
    • asserts a caller-supplied abortSignal still cancels the request before the timeout fires
    • retries after a hard timeout and returns the next successful response (verifies timeout participates in retry path)
    • asserts timeout: 0 disables the hard timeout (no SDK timeout, no auto-abort)
    • asserts the raised error is tagged with AI_CALL_HARD_TIMEOUT (isHardTimeoutError() returns true)
  • pnpm run lint
  • Published a prepatch (1.7.3-beta-20260415071518.0) for downstream smoke-testing.

…te hangs

Production logs showed the OpenAI SDK request-level timeout failing to fire
when the underlying socket stalls, leaving aiAct stuck in planning for ~37
minutes with no error or retry. Inject an AbortSignal-backed hard timeout
(default 180s, overridable via MIDSCENE_*_TIMEOUT) into both streaming and
non-streaming completion.create calls so the request is actually cancelled
and the existing retry path can take over.

cloudflare-workers-and-pages bot commented Apr 15, 2026

Deploying midscene with Cloudflare Pages

Latest commit: 5ca3cee
Status: ✅ Deploy successful!
Preview URL: https://cdbdd810.midscene.pages.dev
Branch Preview URL: https://fix-ai-call-default-timeout.midscene.pages.dev

quanru added 3 commits April 15, 2026 14:05
…gument

The timeout fix now always passes a { signal } options object to
completion.create, so the existing tests that asserted the second arg was
undefined need to accept the injected AbortSignal.

- Wrap streaming branch in try/finally so the AbortSignal listener and
  hard-timeout timer are always cleaned up, even if the SSE iterator
  throws mid-stream.
- Extract resolveEffectiveTimeoutMs so both createChatClient (SDK
  timeout option) and callAI (injected AbortSignal) share a single
  source of truth instead of piggybacking the value on the client's
  return shape.
- Treat timeout <= 0 as an explicit opt-out: skip the SDK timeout
  option and do not start a timer, so only the caller-provided
  abortSignal can cancel the request. Lets users extend or disable
  the hard limit without recompiling.
- Expand JSDoc on DEFAULT_AI_CALL_TIMEOUT_MS to document the
  behavior change (SDK default 600s -> hard 180s) and the opt-out.
- Add tests for timeout-then-retry-succeeds and the timeout=0
  opt-out path.
- Tag the hard-timeout AbortError with code=AI_CALL_HARD_TIMEOUT and
  export isHardTimeoutError() so callers/metrics can branch on it
  instead of string-matching the message.
- Warn-log when the hard timeout fires inside callAI's retry loop,
  including attempt / model / intent so production can measure how
  often we actually save the day.
- Rewrite DEFAULT_AI_CALL_TIMEOUT_MS JSDoc to point at the real root
  cause: openai SDK's fetchWithTimeout clears its own timer in the
  finally block the moment response headers arrive, leaving the
  subsequent response.json() body read with no timeout. Our injected
  AbortSignal is what actually protects the body-read phase.
- Update en/zh model-config docs: default changed from SDK's 10-min
  default to 180s hard timeout; explain the 0-disables semantics and
  the reason for the extra layer.
@quanru quanru merged commit 2b9e0ef into main Apr 16, 2026
8 checks passed
@quanru quanru deleted the fix/ai-call-default-timeout branch April 16, 2026 06:12