fix(BA-1929): Surface AppProxy client endpoint errors as domain exceptions#11333

Open
rapsealk wants to merge 5 commits into main from fix/appproxy-client-error-mapping

Conversation

Member

@rapsealk rapsealk commented Apr 27, 2026

Summary

Closes #11331. Builds on #11328 (LTS hotfix) so this branch contains its commits — once #11328 merges, the diff here will collapse to just the error-mapping change.

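AppProxyClient's four mutating methods (create_endpoint, create_endpoints_bulk, delete_endpoint, delete_endpoints_bulk) had inconsistent error handling: they either silently swallowed non-2xx responses (delete_endpoint) or leaked raw aiohttp.ClientResponseError / ContentTypeError to callers, neither of which inherits from BackendAIError.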

This PR introduces two small helpers on AppProxyClient:

  • _request (async context manager): maps ClientConnectorError → AppProxyConnectionError, and any non-2xx status → AppProxyResponseError with the upstream body attached as extra_data (parsed JSON when possible, raw text otherwise).
  • _parse_json: maps ContentTypeError / JSONDecodeError from the success path → AppProxyResponseError.

All four endpoint methods now route through _request. fetch_status is left alone — it already maps to the domain exceptions and exercises a different code path.
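
The body-attachment rule described above ("parsed JSON when possible, raw text otherwise") can be sketched with the standard library alone. The helper name `build_extra_data` is hypothetical and does not appear in the PR; only the `{"status", "body"}` shape is taken from the diff under review.

```python
import json
from typing import Any


def build_extra_data(status: int, text: str) -> dict[str, Any]:
    """Hypothetical helper: shape an upstream error body for ``extra_data``."""
    try:
        # An empty body becomes None; a JSON body is parsed into Python objects.
        body: Any = json.loads(text) if text else None
    except json.JSONDecodeError:
        # A non-JSON body (e.g. an HTML page from a reverse proxy) stays raw text.
        body = text
    return {"status": status, "body": body}


json_case = build_extra_data(400, '{"title": "Invalid request"}')
text_case = build_extra_data(502, "<html>Bad Gateway</html>")
empty_case = build_extra_data(500, "")
```

The three cases correspond to the JSON, non-JSON, and empty-body responses that the test plan exercises.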

Suggested merge order

  1. fix(BA-1929): Parse AppProxy errors as JSON instead of HTML in manager client #11328 — manager-side Accept: application/json (LTS hotfix half)
  2. fix(BA-1929): Return JSON instead of HTML for coordinator API errors #11329 — coordinator-side default-to-JSON (LTS hotfix half)
  3. fix(BA-1929): Surface AppProxy client endpoint errors as domain exceptions #11333 — this PR (broader hardening)

#11328 and #11329 are independent of each other and resolve the user-visible #5228 symptom on their own. #11333 depends on #11328 and should land last.

Why this can wait for the next normal cycle (not a hotfix)

#11328 + #11329 already eliminate the user-reported HTML-vs-JSON symptom from #5228. This PR is the broader hardening so the manager doesn't lose error context the next time the coordinator returns a non-2xx — separately reviewable, separately revertable.

Test plan

  • Force a 400 from the coordinator (invalid CreateEndpointRequestBody) and confirm the manager raises AppProxyResponseError with the coordinator's BackendAIError JSON body attached as extra_data.
  • Force a 500 from the coordinator and confirm the same.
  • Force a non-JSON 4xx response (e.g. via reverse proxy) and confirm AppProxyResponseError carries the raw text in extra_data["body"].
  • Take down the AppProxy and confirm AppProxyConnectionError is raised, not raw ClientConnectorError.
  • Force delete_endpoint to receive a 4xx and confirm it now raises (was silently swallowed before).
  • Verify pants check / pants lint pass.

Notes for reviewer

🤖 Generated with Claude Code

The AppProxy coordinator's exception middleware returned an HTML error
page when the request did not specify an Accept header, which caused
endpoint create/delete failures to surface as unparseable HTML in the
manager logs (see issue #5228). The status check path already sets
Accept: application/json correctly; this aligns the create/delete and
bulk endpoint calls with the same behavior by routing all requests
through a shared header helper.
Copilot AI review requested due to automatic review settings April 27, 2026 04:02
@rapsealk rapsealk added this to the 26.4 milestone Apr 27, 2026
@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component labels Apr 27, 2026
rapsealk added a commit that referenced this pull request Apr 27, 2026
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens AppProxyClient error handling so coordinator/AppProxy failures are consistently surfaced as domain exceptions while preserving upstream error bodies for diagnostics.

Changes:

  • Introduces _request to map connection and non-2xx responses into AppProxyConnectionError / AppProxyResponseError with extra_data.
  • Introduces _parse_json to translate JSON parsing failures on success paths into AppProxyResponseError.
  • Routes all mutating endpoint methods through the new helpers (fixing previously-silent delete failures).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ai/backend/manager/clients/appproxy/client.py | Adds request/JSON helpers and updates endpoint CRUD methods to use domain exception mapping and preserve error bodies. |
| changes/11333.fix.md | Adds changelog entry describing the new domain exception behavior and preserved error bodies. |
| changes/11328.fix.md | Adds changelog entry for sending Accept: application/json from the manager’s AppProxy client. |

Comment on lines +117 to +158
"""Issue an authenticated request and translate transport errors.

Connection failures become ``AppProxyConnectionError``. Non-2xx
responses become ``AppProxyResponseError`` with the upstream body
attached as ``extra_data`` so a structured ``BackendAIError``
payload returned by the coordinator survives the translation.
"""
try:
    async with self._client_session.request(
        method,
        path,
        headers=self._auth_headers(),
        json=json_body,
    ) as resp:
        if resp.status >= 400:
            text = await resp.text()
            try:
                error_body: Any = json.loads(text) if text else None
            except json.JSONDecodeError:
                error_body = text
            log.error(
                "AppProxy at {} returned {} during {}: {!r}",
                self._address,
                resp.status,
                operation,
                error_body,
            )
            raise AppProxyResponseError(
                extra_msg=(f"AppProxy returned HTTP {resp.status} during {operation}"),
                extra_data={"status": resp.status, "body": error_body},
            )
        yield resp
except aiohttp.ClientConnectorError as e:
    log.error(
        "Failed to connect to AppProxy at {} during {}: {}",
        self._address,
        operation,
        e,
    )
    raise AppProxyConnectionError(
        extra_msg=f"Failed to connect to AppProxy at {self._address}"
    ) from e

Copilot AI Apr 27, 2026

The docstring says “translate transport errors”, but the implementation only maps aiohttp.ClientConnectorError. Timeouts (e.g., asyncio.TimeoutError) and other aiohttp.ClientError subclasses can still leak out as non-domain exceptions, which undermines the goal of preserving AppProxy domain context. Consider either (a) broadening the exception mapping to include timeouts and relevant aiohttp.ClientError connection/payload exceptions, or (b) narrowing the docstring to match the actual behavior.
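
Option (a) might look roughly like the sketch below. It uses only the standard library so that it runs on its own: `AppProxyConnectionError` is a local stand-in for the manager's domain exception, and `translate_transport_errors` and its `transport_excs` parameter are hypothetical names. In the real client the tuple would presumably also include `aiohttp.ClientError`.

```python
import asyncio
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager


class AppProxyConnectionError(Exception):
    """Local stand-in for the manager's domain exception (assumption)."""


@asynccontextmanager
async def translate_transport_errors(
    operation: str,
    transport_excs: tuple[type[BaseException], ...] = (asyncio.TimeoutError, OSError),
) -> AsyncIterator[None]:
    """Hypothetical wrapper: re-raise transport failures as a domain error."""
    try:
        yield
    except transport_excs as e:
        # Keep the original exception as __cause__ so no context is lost.
        raise AppProxyConnectionError(
            f"Transport failure during {operation}: {type(e).__name__}"
        ) from e


async def _demo() -> str:
    try:
        async with translate_transport_errors("create_endpoint"):
            raise asyncio.TimeoutError()  # simulate a request timeout
    except AppProxyConnectionError as e:
        return str(e)
    return "no error"


result = asyncio.run(_demo())
```

Passing the exception types as a parameter keeps the sketch independent of aiohttp; whether timeouts should map to `AppProxyConnectionError` or to a separate domain exception is a design decision for the PR author.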

Comment on lines +132 to +133
text = await resp.text()
try:

Copilot AI Apr 27, 2026

await resp.text() can raise (e.g., decoding errors or payload read errors). If that happens, the code will skip raising AppProxyResponseError and instead leak the low-level exception. To keep error translation reliable, handle failures when reading the response body (e.g., use resp.text(errors="replace") and/or catch read/decode exceptions and fall back to a placeholder body).

Suggested change
- text = await resp.text()
- try:
+ try:
+     text = await resp.text(errors="replace")
+ except Exception as e:
+     text = f"<failed to read response body: {e}>"
+ try:

Comment on lines +138 to +143
"AppProxy at {} returned {} during {}: {!r}",
self._address,
resp.status,
operation,
error_body,
)

Copilot AI Apr 27, 2026

Logging the full upstream error_body at error level can leak sensitive details into logs (e.g., user identifiers, request details, internal stack traces, or other coordinator-provided context). Consider logging a truncated/sanitized body, logging only metadata (status/operation/request id), and/or moving the full body to debug-level logging while still attaching it to extra_data for programmatic diagnostics.

Suggested change
- "AppProxy at {} returned {} during {}: {!r}",
- self._address,
- resp.status,
- operation,
- error_body,
- )
+ "AppProxy at {} returned {} during {}",
+ self._address,
+ resp.status,
+ operation,
+ )
+ if text:
+     preview = text[:512] + ("..." if len(text) > 512 else "")
+     log.debug(
+         "AppProxy error body preview at {} during {}: {!r}",
+         self._address,
+         operation,
+         preview,
+     )

)
raise AppProxyResponseError(
    extra_msg=(f"AppProxy returned HTTP {resp.status} during {operation}"),
    extra_data={"status": resp.status, "body": error_body},

Copilot AI Apr 27, 2026

extra_data currently omits request metadata such as HTTP method and request path, which can make diagnosing bulk operations harder. Consider including method and path (and possibly operation) in extra_data so downstream logs/telemetry can attribute failures without relying on parsing the error message string.

Suggested change
- extra_data={"status": resp.status, "body": error_body},
+ extra_data={
+     "status": resp.status,
+     "body": error_body,
+     "method": method,
+     "path": path,
+     "operation": operation,
+ },

Comment on lines +166 to +168
try:
    return await resp.json()
except (aiohttp.ContentTypeError, json.JSONDecodeError) as e:

Copilot AI Apr 27, 2026

aiohttp.ClientResponse.json() raises ContentTypeError when the response is valid JSON but the Content-Type header is not application/json. If you want to be tolerant of upstream mislabeling (which is common with proxies and some error handlers), consider parsing JSON regardless of content type (while still treating non-JSON bodies as invalid via JSONDecodeError). This reduces false-negative “Invalid response” errors when the payload is actually JSON.
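
A content-type-tolerant parse could be sketched as follows. In aiohttp itself, passing `content_type=None` to `ClientResponse.json()` disables the content-type check; the standalone sketch below shows the same idea on a plain string. `parse_json_lenient` is a hypothetical name and `AppProxyResponseError` is a local stand-in for the manager's domain exception.

```python
import json
from typing import Any


class AppProxyResponseError(Exception):
    """Local stand-in for the manager's domain exception (assumption)."""


def parse_json_lenient(body: str, declared_content_type: str) -> Any:
    """Hypothetical helper: parse JSON whatever Content-Type was declared."""
    try:
        # The declared content type is never consulted for the parse itself;
        # it is only reported when the body turns out not to be JSON.
        return json.loads(body)
    except json.JSONDecodeError as e:
        raise AppProxyResponseError(
            f"Invalid JSON from AppProxy (Content-Type: {declared_content_type!r})"
        ) from e


# Valid JSON served with a misleading Content-Type is still accepted.
mislabeled = parse_json_lenient('{"ok": true}', "text/html")

# A genuinely non-JSON body is still rejected as a domain error.
try:
    parse_json_lenient("<html>oops</html>", "text/html")
    rejected = False
except AppProxyResponseError:
    rejected = True
```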

@rapsealk rapsealk changed the title fix(manager): Surface AppProxy client endpoint errors as domain exceptions fix(BA-1929): Surface AppProxy client endpoint errors as domain exceptions Apr 27, 2026
@rapsealk rapsealk changed the base branch from main to fix/BA-1929-appproxy-client-accept-json April 27, 2026 04:06
The previous commit grouped Accept and X-BackendAI-Token in a
_auth_headers helper, but Accept is content negotiation rather than
authentication, and the grouping is misleading. Drop the helper and
inline the headers dict at each of the four endpoint methods so the
intent at each call site is local and explicit.
…tions

The four mutating methods on AppProxyClient (create_endpoint,
create_endpoints_bulk, delete_endpoint, delete_endpoints_bulk) either
silently swallowed non-2xx responses (delete_endpoint) or leaked raw
aiohttp.ClientResponseError / ContentTypeError to callers, neither of
which inherits from BackendAIError. As a result, deletion failures
were lost and other failures arrived at the deployment executor as
non-domain exceptions, dropping the AppProxy error context.

Introduce a shared `_request` async context manager that wraps
ClientConnectorError into AppProxyConnectionError and any non-2xx
status into AppProxyResponseError, attaching the upstream response
body (parsed JSON when possible, raw text otherwise) as `extra_data`
so a structured BackendAIError payload from the coordinator survives
the translation. Add a `_parse_json` helper for the success path that
maps ContentTypeError / JSONDecodeError to AppProxyResponseError.

`fetch_status` keeps its existing handler since it talks to a
different endpoint and is already aligned with the domain exceptions.

Refs #11331, builds on #11328.
@rapsealk rapsealk force-pushed the fix/appproxy-client-error-mapping branch from d7ff5b4 to 295882c Compare April 27, 2026 04:11
Base automatically changed from fix/BA-1929-appproxy-client-accept-json to main April 28, 2026 07:05
Comment thread changes/11328.fix.md
Contributor

It appears that a news snippet from another PR has been mixed in; this needs to be verified.

import json
import logging
from collections.abc import AsyncIterator
from contextlib import asynccontextmanager
Contributor

nitpick:

Suggested change
- from contextlib import asynccontextmanager
+ from contextlib import asynccontextmanager as actxmgr

Comment on lines +118 to +139
try:
    async with self._client_session.request(
        method,
        path,
        headers={
            "Accept": "application/json",
            "X-BackendAI-Token": self._token,
        },
        json=json_body,
    ) as resp:
        if resp.status >= 400:
            text = await resp.text()
            try:
                error_body: Any = json.loads(text) if text else None
            except json.JSONDecodeError:
                error_body = text
            log.error(
                "AppProxy at {} returned {} during {}: {!r}",
                self._address,
                resp.status,
                operation,
                error_body,
Contributor

Wouldn't a structure where we use yield when resp.status // 100 == 2 (or up to 300) and handle errors for the rest be more readable? Also, rather than parsing the JSON again after resp.text, wouldn’t it be better to try resp.json first and fall back to resp.text only if that fails? You might find some useful code patterns by looking at the StorageProxyHTTPClient code.
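
One possible shape for that suggestion is sketched below, with the caveat that it has not been checked against StorageProxyHTTPClient. `FakeResponse`, `is_success`, and `read_error_body` are hypothetical names; `FakeResponse` only mimics the `json()` and `text()` coroutines of a response object. A real aiohttp-based version would presumably also need to catch `aiohttp.ContentTypeError` in the fallback.

```python
import asyncio
import json
from typing import Any


class FakeResponse:
    """Hypothetical stand-in exposing the two coroutine methods used below."""

    def __init__(self, status: int, body: str) -> None:
        self.status = status
        self._body = body

    async def json(self) -> Any:
        return json.loads(self._body)

    async def text(self) -> str:
        return self._body


def is_success(status: int) -> bool:
    # Treat only 2xx as the success branch, i.e. the one that would yield.
    return status // 100 == 2


async def read_error_body(resp: FakeResponse) -> Any:
    # Try structured JSON first; fall back to raw text only if parsing fails.
    try:
        return await resp.json()
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return await resp.text()


async def _demo() -> tuple[Any, Any]:
    structured = await read_error_body(FakeResponse(400, '{"error": "bad"}'))
    raw = await read_error_body(FakeResponse(502, "<html>Bad Gateway</html>"))
    return structured, raw


structured, raw = asyncio.run(_demo())
```

With this split, the context manager body would reduce to "yield if `is_success(resp.status)`, otherwise raise with `read_error_body(resp)` attached".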

Member

@fregataa fregataa left a comment

I think an error handling layer is needed, rather than a request proxy thing

Labels

comp:manager Related to Manager component size:L 100~500 LoC

Development

Successfully merging this pull request may close these issues.

AppProxyClient endpoint methods do not surface error responses as domain exceptions

4 participants