fix(core): add hard 180s timeout to AI HTTP calls #2350
Merged
…te hangs
Production logs showed the OpenAI SDK request-level timeout failing to fire when the underlying socket stalls, leaving aiAct stuck in planning for ~37 minutes with no error or retry. Inject an AbortSignal-backed hard timeout (default 180s, overridable via MIDSCENE_*_TIMEOUT) into both streaming and non-streaming completion.create calls so the request is actually cancelled and the existing retry path can take over.
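A minimal sketch of what an AbortSignal-backed hard timeout can look like, assuming a helper that merges the caller's signal with a timer-driven `AbortController`. The name `composeHardTimeoutSignal` and its return shape are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical helper: the name and return shape are assumptions for illustration.
interface RequestAbort {
  signal: AbortSignal;
  cleanup: () => void; // clears the timer and detaches the listener
}

function composeHardTimeoutSignal(
  timeoutMs: number,
  userSignal?: AbortSignal,
): RequestAbort {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Forward a caller-initiated abort to the combined signal.
  const onUserAbort = () => controller.abort(userSignal?.reason);
  if (userSignal?.aborted) {
    onUserAbort();
  } else {
    userSignal?.addEventListener('abort', onUserAbort, { once: true });
  }

  // Start the hard-timeout timer only for a positive limit.
  if (timeoutMs > 0 && !controller.signal.aborted) {
    timer = setTimeout(() => {
      controller.abort(new Error(`AI call hit hard timeout (${timeoutMs}ms)`));
    }, timeoutMs);
  }

  const cleanup = () => {
    if (timer !== undefined) clearTimeout(timer);
    userSignal?.removeEventListener('abort', onUserAbort);
  };

  return { signal: controller.signal, cleanup };
}
```

A caller would pass `signal` in the request options of `completion.create` and invoke `cleanup()` in a `finally` block, so neither the timer nor the listener outlives the request.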
Deploying midscene with
| Latest commit: | 5ca3cee |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://cdbdd810.midscene.pages.dev |
| Branch Preview URL: | https://fix-ai-call-default-timeout.midscene.pages.dev |
…gument
The timeout fix now always passes a { signal } options object to
completion.create, so the existing tests that asserted the second arg was
undefined need to accept the injected AbortSignal.
- Wrap streaming branch in try/finally so the AbortSignal listener and hard-timeout timer are always cleaned up, even if the SSE iterator throws mid-stream.
- Extract resolveEffectiveTimeoutMs so both createChatClient (SDK timeout option) and callAI (injected AbortSignal) share a single source of truth instead of piggybacking the value on the client's return shape.
- Treat timeout <= 0 as an explicit opt-out: skip the SDK timeout option and do not start a timer, so only the caller-provided abortSignal can cancel the request. Lets users extend or disable the hard limit without recompiling.
- Expand JSDoc on DEFAULT_AI_CALL_TIMEOUT_MS to document the behavior change (SDK default 600s -> hard 180s) and the opt-out.
- Add tests for timeout-then-retry-succeeds and the timeout=0 opt-out path.
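The shared resolution step and the `timeout <= 0` opt-out could be sketched as follows. Only the names `resolveEffectiveTimeoutMs` and `DEFAULT_AI_CALL_TIMEOUT_MS` come from the commit message; the signature, the `undefined` return for the opt-out, and the fallback for non-numeric input are assumptions.

```typescript
const DEFAULT_AI_CALL_TIMEOUT_MS = 180_000;

// Assumed signature: accepts a number (config) or a string (environment variable).
// Returns undefined when the hard timeout is disabled.
function resolveEffectiveTimeoutMs(
  configured?: number | string,
): number | undefined {
  // Nothing configured: fall back to the 180s default.
  if (configured === undefined || configured === '') {
    return DEFAULT_AI_CALL_TIMEOUT_MS;
  }
  const ms = Number(configured);
  // Assumption: unparsable values also fall back to the default.
  if (!Number.isFinite(ms)) {
    return DEFAULT_AI_CALL_TIMEOUT_MS;
  }
  // timeout <= 0 is the explicit opt-out: no SDK timeout option, no timer.
  return ms <= 0 ? undefined : ms;
}
```

Having both the client factory and the call site read from one function like this keeps the SDK-level option and the injected signal from drifting apart.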
- Tag the hard-timeout AbortError with code=AI_CALL_HARD_TIMEOUT and export isHardTimeoutError() so callers/metrics can branch on it instead of string-matching the message.
- Warn-log when the hard timeout fires inside callAI's retry loop, including attempt / model / intent so production can measure how often we actually save the day.
- Rewrite DEFAULT_AI_CALL_TIMEOUT_MS JSDoc to point at the real root cause: openai SDK's fetchWithTimeout clears its own timer in the finally block the moment response headers arrive, leaving the subsequent response.json() body read with no timeout. Our injected AbortSignal is what actually protects the body-read phase.
- Update en/zh model-config docs: default changed from SDK's 10-min default to 180s hard timeout; explain the 0-disables semantics and the reason for the extra layer.
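Tagging the error with a code and checking it through a guard might look like the sketch below. `isHardTimeoutError` and the `AI_CALL_HARD_TIMEOUT` code are named in the commit message; `createHardTimeoutError` and the constant name are hypothetical here, and the real implementation may differ.

```typescript
const AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT';

// Hypothetical factory: builds an AbortError-like error carrying the code.
function createHardTimeoutError(timeoutMs: number): Error & { code: string } {
  const err = new Error(
    `AI call hit hard timeout (${timeoutMs}ms)`,
  ) as Error & { code: string };
  err.name = 'AbortError';
  err.code = AI_CALL_HARD_TIMEOUT_CODE;
  return err;
}

// Branch on the code instead of string-matching the message.
function isHardTimeoutError(err: unknown): boolean {
  return (
    typeof err === 'object' &&
    err !== null &&
    (err as { code?: unknown }).code === AI_CALL_HARD_TIMEOUT_CODE
  );
}
```

A retry loop or a metrics hook can then call `isHardTimeoutError(err)` in its `catch` block to count hard timeouts separately from caller aborts and network errors.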
yuyutaotao approved these changes Apr 16, 2026
Summary
`aiAct` was observed hanging on the final planning step for ~37 minutes in production (doubao-seed-2-0-lite) with no error, no retry, and no abort. The root cause turned out to be a latent bug in the `openai` Node SDK, not a Midscene bug, but Midscene has to defend against it because upgrading the SDK does not fix it.
Root cause: openai SDK (open source, Apache-2.0, https://github.com/openai/openai-node)
In `openai/client.mjs` → `fetchWithTimeout` (verified identical in 6.3.0, our current pinned version, and 6.34.0, the latest on npm at the time of this PR), the `timeout` option only covers the TTFB (response-header arrival) phase. The JSON body is read later via `await response.json()` with no timer attached. If a server returns `200 OK` fast and then stalls while streaming the body, which is what doubao-seed did for 37 minutes, the SDK waits forever and the caller gets no signal of failure.
We filed this as an upstream concern; in the meantime Midscene has to apply its own cover.
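To make the timing gap concrete, here is an illustration of the pattern described above. It is not the SDK's source: `fetchWithHeaderOnlyTimeout` and `FetchLike` are hypothetical names, and the sketch only mirrors the described behavior of a timer that is cleared once headers arrive.

```typescript
// Illustration only: NOT the SDK's code. Names here are hypothetical.
type FetchLike<R> = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<R>;

async function fetchWithHeaderOnlyTimeout<R>(
  fetchFn: FetchLike<R>,
  url: string,
  timeoutMs: number,
): Promise<R> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Resolves as soon as the response headers are available.
    return await fetchFn(url, { signal: controller.signal });
  } finally {
    // The timer is cleared here, so nothing can abort a later body read.
    clearTimeout(timer);
  }
}
```

With a wrapper shaped like this, a body read that starts after the function returns is never cancelled by the timeout, however long the server stalls. An externally supplied signal that stays armed for the whole request closes that gap.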
Fix
`packages/core/src/ai-model/service-caller/index.ts`
- `buildRequestAbortSignal(timeoutMs, userSignal?)` composes the caller's `abortSignal` with a hard-timeout `AbortController` and forwards the result as `signal` on both `completion.create` calls (streaming + non-streaming). The SDK forwards this signal into the underlying `ReadableStream`, so it actually covers the body-read phase the SDK timer misses.
- `DEFAULT_AI_CALL_TIMEOUT_MS = 180_000`. Overridable per intent via `MIDSCENE_MODEL_TIMEOUT`, `MIDSCENE_INSIGHT_MODEL_TIMEOUT`, `MIDSCENE_PLANNING_MODEL_TIMEOUT`, or `modelConfig.timeout`. Setting `timeout: 0` disables the hard timeout (only `abortSignal` can cancel).
- `resolveEffectiveTimeoutMs()` factored out so `createChatClient` (SDK-level `timeout` option) and `callAI` (our injected `AbortSignal`) share one source of truth.
- The streaming branch is wrapped in `try/finally` so the timer + abort listener are always cleaned up even if the SSE iterator throws mid-stream.
- The existing `retryCount` loop naturally converts a hard timeout into a retry, so a single stalled body read no longer terminates the whole task.
Observability
- `AI_CALL_HARD_TIMEOUT_CODE = 'AI_CALL_HARD_TIMEOUT'`. The AbortError raised by our timer carries this `code`, distinguishing it from caller aborts / network errors / 5xx retries.
- `isHardTimeoutError(err)` is exported so monitors / tests can branch without string-matching the message.
- `warnCall` logs `AI call hit hard timeout (${ms}ms, attempt X/Y, model Z, intent W)` so production can count how often the safety net actually fires.
Behavior change (read before merging)
Prior default: OpenAI SDK's 10-minute header timeout (ineffective against body-read stalls).
New default: 180 s end-to-end hard timeout. Users running models with single-call latency > 180 s must raise `MIDSCENE_MODEL_TIMEOUT` or set it to `0` to opt out. Documented in `apps/site/docs/{en,zh}/model-config.mdx`.
Docs
`apps/site/docs/en/model-config.mdx` and `apps/site/docs/zh/model-config.mdx`: rewrote the `MIDSCENE_MODEL_TIMEOUT` row to state the new default, the `0` opt-out, and a brief explanation of why the hard timeout exists.
Test plan
- `npx vitest --run tests/unit-test/service-caller tests/unit-test/gpt-image-detail`: 47/47 pass. New cases:
  - `completion.create` is made to never resolve, and the test asserts the call rejects with the hard-timeout error
  - `DEFAULT_AI_CALL_TIMEOUT_MS = 180_000` is applied to the OpenAI client when not configured
  - the caller's `abortSignal` still cancels the request before the timeout fires
  - `timeout: 0` disables the hard timeout (no SDK `timeout`, no auto-abort)
  - the hard-timeout error is tagged `AI_CALL_HARD_TIMEOUT` (`isHardTimeoutError()` returns `true`)
- `pnpm run lint`
- Beta build (`1.7.3-beta-20260415071518.0`) for downstream smoke-testing.