fix(BA-1929): Surface AppProxy client endpoint errors as domain exceptions#11333
fix(BA-1929): Surface AppProxy client endpoint errors as domain exceptions#11333
Conversation
The AppProxy coordinator's exception middleware returned an HTML error page when the request did not specify an Accept header, which caused endpoint create/delete failures to surface as unparseable HTML in the manager logs (see issue #5228). The status check path already sets Accept: application/json correctly; this aligns the create/delete and bulk endpoint calls with the same behavior by routing all requests through a shared header helper.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR hardens AppProxyClient error handling so coordinator/AppProxy failures are consistently surfaced as domain exceptions while preserving upstream error bodies for diagnostics.
Changes:
- Introduces
_requestto map connection and non-2xx responses intoAppProxyConnectionError/AppProxyResponseErrorwithextra_data. - Introduces
_parse_jsonto translate JSON parsing failures on success paths intoAppProxyResponseError. - Routes all mutating endpoint methods through the new helpers (fixing previously-silent delete failures).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/ai/backend/manager/clients/appproxy/client.py | Adds request/JSON helpers and updates endpoint CRUD methods to use domain exception mapping and preserve error bodies. |
| changes/11333.fix.md | Adds changelog entry describing the new domain exception behavior and preserved error bodies. |
| changes/11328.fix.md | Adds changelog entry for sending Accept: application/json from the manager’s AppProxy client. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Issue an authenticated request and translate transport errors. | ||
|
|
||
| Connection failures become ``AppProxyConnectionError``. Non-2xx | ||
| responses become ``AppProxyResponseError`` with the upstream body | ||
| attached as ``extra_data`` so a structured ``BackendAIError`` | ||
| payload returned by the coordinator survives the translation. | ||
| """ | ||
| try: | ||
| async with self._client_session.request( | ||
| method, | ||
| path, | ||
| headers=self._auth_headers(), | ||
| json=json_body, | ||
| ) as resp: | ||
| if resp.status >= 400: | ||
| text = await resp.text() | ||
| try: | ||
| error_body: Any = json.loads(text) if text else None | ||
| except json.JSONDecodeError: | ||
| error_body = text | ||
| log.error( | ||
| "AppProxy at {} returned {} during {}: {!r}", | ||
| self._address, | ||
| resp.status, | ||
| operation, | ||
| error_body, | ||
| ) | ||
| raise AppProxyResponseError( | ||
| extra_msg=(f"AppProxy returned HTTP {resp.status} during {operation}"), | ||
| extra_data={"status": resp.status, "body": error_body}, | ||
| ) | ||
| yield resp | ||
| except aiohttp.ClientConnectorError as e: | ||
| log.error( | ||
| "Failed to connect to AppProxy at {} during {}: {}", | ||
| self._address, | ||
| operation, | ||
| e, | ||
| ) | ||
| raise AppProxyConnectionError( | ||
| extra_msg=f"Failed to connect to AppProxy at {self._address}" | ||
| ) from e |
There was a problem hiding this comment.
The docstring says “translate transport errors”, but the implementation only maps aiohttp.ClientConnectorError. Timeouts (e.g., asyncio.TimeoutError) and other aiohttp.ClientError subclasses can still leak out as non-domain exceptions, which undermines the goal of preserving AppProxy domain context. Consider either (a) broadening the exception mapping to include timeouts and relevant aiohttp.ClientError connection/payload exceptions, or (b) narrowing the docstring to match the actual behavior.
| text = await resp.text() | ||
| try: |
There was a problem hiding this comment.
await resp.text() can raise (e.g., decoding errors or payload read errors). If that happens, the code will skip raising AppProxyResponseError and instead leak the low-level exception. To keep error translation reliable, handle failures when reading the response body (e.g., use resp.text(errors="replace") and/or catch read/decode exceptions and fall back to a placeholder body).
| text = await resp.text() | |
| try: | |
| try: | |
| text = await resp.text(errors="replace") | |
| except Exception as e: | |
| text = f"<failed to read response body: {e}>" | |
| try: |
| "AppProxy at {} returned {} during {}: {!r}", | ||
| self._address, | ||
| resp.status, | ||
| operation, | ||
| error_body, | ||
| ) |
There was a problem hiding this comment.
Logging the full upstream error_body at error level can leak sensitive details into logs (e.g., user identifiers, request details, internal stack traces, or other coordinator-provided context). Consider logging a truncated/sanitized body, logging only metadata (status/operation/request id), and/or moving the full body to debug-level logging while still attaching it to extra_data for programmatic diagnostics.
| "AppProxy at {} returned {} during {}: {!r}", | |
| self._address, | |
| resp.status, | |
| operation, | |
| error_body, | |
| ) | |
| "AppProxy at {} returned {} during {}", | |
| self._address, | |
| resp.status, | |
| operation, | |
| ) | |
| if text: | |
| preview = text[:512] + ("..." if len(text) > 512 else "") | |
| log.debug( | |
| "AppProxy error body preview at {} during {}: {!r}", | |
| self._address, | |
| operation, | |
| preview, | |
| ) |
| ) | ||
| raise AppProxyResponseError( | ||
| extra_msg=(f"AppProxy returned HTTP {resp.status} during {operation}"), | ||
| extra_data={"status": resp.status, "body": error_body}, |
There was a problem hiding this comment.
extra_data currently omits request metadata such as HTTP method and request path, which can make diagnosing bulk operations harder. Consider including method and path (and possibly operation) in extra_data so downstream logs/telemetry can attribute failures without relying on parsing the error message string.
| extra_data={"status": resp.status, "body": error_body}, | |
| extra_data={ | |
| "status": resp.status, | |
| "body": error_body, | |
| "method": method, | |
| "path": path, | |
| "operation": operation, | |
| }, |
| try: | ||
| return await resp.json() | ||
| except (aiohttp.ContentTypeError, json.JSONDecodeError) as e: |
There was a problem hiding this comment.
aiohttp.ClientResponse.json() raises ContentTypeError when the response is valid JSON but the Content-Type header is not application/json. If you want to be tolerant of upstream mislabeling (which is common with proxies and some error handlers), consider parsing JSON regardless of content type (while still treating non-JSON bodies as invalid via JSONDecodeError). This reduces false-negative “Invalid response” errors when the payload is actually JSON.
The previous commit grouped Accept and X-BackendAI-Token in a _auth_headers helper, but Accept is content negotiation rather than authentication, and the grouping is misleading. Drop the helper and inline the headers dict at each of the four endpoint methods so the intent at each call site is local and explicit.
…tions The four mutating methods on AppProxyClient (create_endpoint, create_endpoints_bulk, delete_endpoint, delete_endpoints_bulk) either silently swallowed non-2xx responses (delete_endpoint) or leaked raw aiohttp.ClientResponseError / ContentTypeError to callers, neither of which inherits from BackendAIError. As a result, deletion failures were lost and other failures arrived at the deployment executor as non-domain exceptions, dropping the AppProxy error context. Introduce a shared `_request` async context manager that wraps ClientConnectorError into AppProxyConnectionError and any non-2xx status into AppProxyResponseError, attaching the upstream response body (parsed JSON when possible, raw text otherwise) as `extra_data` so a structured BackendAIError payload from the coordinator survives the translation. Add a `_parse_json` helper for the success path that maps ContentTypeError / JSONDecodeError to AppProxyResponseError. `fetch_status` keeps its existing handler since it talks to a different endpoint and is already aligned with the domain exceptions. Refs #11331, builds on #11328.
d7ff5b4 to
295882c
Compare
There was a problem hiding this comment.
It appears that a news snippet from another PR has been mixed in; this needs to be verified.
| import json | ||
| import logging | ||
| from collections.abc import AsyncIterator | ||
| from contextlib import asynccontextmanager |
There was a problem hiding this comment.
nitpick:
| from contextlib import asynccontextmanager | |
| from contextlib import asynccontextmanager as actxmgr |
| try: | ||
| async with self._client_session.request( | ||
| method, | ||
| path, | ||
| headers={ | ||
| "Accept": "application/json", | ||
| "X-BackendAI-Token": self._token, | ||
| }, | ||
| json=json_body, | ||
| ) as resp: | ||
| if resp.status >= 400: | ||
| text = await resp.text() | ||
| try: | ||
| error_body: Any = json.loads(text) if text else None | ||
| except json.JSONDecodeError: | ||
| error_body = text | ||
| log.error( | ||
| "AppProxy at {} returned {} during {}: {!r}", | ||
| self._address, | ||
| resp.status, | ||
| operation, | ||
| error_body, |
There was a problem hiding this comment.
Wouldn't a structure where we use yield when resp.status // 100 == 2 (or up to 300) and handle errors for the rest be more readable? Also, rather than parsing the JSON again after resp.text, wouldn’t it be better to try resp.json first and fall back to resp.text only if that fails? You might find some useful code patterns by looking at the StorageProxyHTTPClient code.
fregataa
left a comment
There was a problem hiding this comment.
I think an error handling layer is needed, rather than a request proxy thing
Summary
Closes #11331. Builds on #11328 (LTS hotfix) so this branch contains its commits — once #11328 merges, the diff here will collapse to just the error-mapping change.
AppProxyClient's four mutating methods had inconsistent error handling:delete_endpointdidasync with ... as resp: pass— non-2xx responses were silently dropped, so a failed deletion looked successful to the manager.create_endpoint,create_endpoints_bulk,delete_endpoints_bulkcalledresp.raise_for_status()thenawait resp.json(), leaking rawaiohttp.ClientResponseError/aiohttp.ContentTypeErrorto callers. Neither inherits fromBackendAIError, so the deployment executor saw a non-domain exception and the eventualDeploymentExecutionErrorlost the AppProxy domain context.BackendAIErrorJSON body returned by the coordinator (now reliable after fix(BA-1929): Parse AppProxy errors as JSON instead of HTML in manager client #11328 / fix(BA-1929): Return JSON instead of HTML for coordinator API errors #11329) was discarded.This PR introduces two small helpers on
AppProxyClient:_request(async context manager): wrapsClientConnectorError→AppProxyConnectionError, and any non-2xx status →AppProxyResponseErrorwith the upstream body attached asextra_data(parsed JSON when possible, raw text otherwise)._parse_json: mapsContentTypeError/JSONDecodeErrorfrom the success path →AppProxyResponseError.All four endpoint methods now route through
_request.fetch_statusis left alone — it already maps to the domain exceptions and exercises a different code path.Suggested merge order
Accept: application/json(LTS hotfix half)#11328 and #11329 are independent of each other and resolve the user-visible #5228 symptom on their own. #11333 depends on #11328 and should land last.
Why this can wait for the next normal cycle (not a hotfix)
#11328 + #11329 already eliminate the user-reported HTML-vs-JSON symptom from #5228. This PR is the broader hardening so the manager doesn't lose error context the next time the coordinator returns a non-2xx — separately reviewable, separately revertable.
Test plan
CreateEndpointRequestBody) and confirm the manager raisesAppProxyResponseErrorwith the coordinator'sBackendAIErrorJSON body attached asextra_data.AppProxyResponseErrorcarries the raw text inextra_data["body"].AppProxyConnectionErroris raised, not rawClientConnectorError.delete_endpointto receive a 4xx and confirm it now raises (was silently swallowed before).pants check/pants lintpass.Notes for reviewer
fix/BA-1929-appproxy-client-accept-json(fix(BA-1929): Parse AppProxy errors as JSON instead of HTML in manager client #11328) so theAccept: application/jsonchange is a parent commit, not a duplicate.🤖 Generated with Claude Code