Skip to content

Make Postgres usage easier to debug and safer when writes fail#2955

Draft
ff137 wants to merge 6 commits intoodota:masterfrom
ff137:postgres-connection-safety-slice
Draft

Make Postgres usage easier to debug and safer when writes fail#2955
ff137 wants to merge 6 commits intoodota:masterfrom
ff137:postgres-connection-safety-slice

Conversation

@ff137
Copy link
Copy Markdown
Contributor

@ff137 ff137 commented May 1, 2026

What we’re changing

OpenDota talks to PostgreSQL from several places at once: the main API, the SQL explorer, and background workers that pull jobs from a queue. This PR does three small things:

  1. Give each kind of connection a clear name so that when someone looks at database activity, they can tell whether traffic is coming from the normal API, the explorer, or a queue worker—not just a bunch of anonymous connections.

  2. Add optional knobs for timeouts on the main database pool (off by default). If the team turns them on via environment variables, queries and idle-in-transaction sessions can be capped so a stuck job is less likely to hold a connection forever.

  3. When a batch of database writes fails midway, we cancel that batch properly (rollback) in the places we touched, instead of leaving the database session in an awkward state. That reduces risk of pool weirdness after errors.

Why we think this helps

We’ve seen symptoms that look like the database running out of capacity or connections (too many clients already, slow periods that improve after restarts). That can come from many processes sharing one server, heavy endpoints, or errors mid-write. This PR doesn’t try to fix that—it improves observability (named connections), adds optional guardrails (timeouts you can enable gradually), and fixes a real footgun (failed transactions not always rolled back). Those are low-risk, reviewable steps before bigger changes (explorer limits, worker concurrency, etc.).

How it works

  • Each connection type sets a PostgreSQL application_name like odota:<service>:knex, ...:explorer, or ...:queue:<queue>:<slot>.
  • New env vars (POSTGRES_STATEMENT_TIMEOUT_MS, POSTGRES_LOCK_TIMEOUT_MS, POSTGRES_IDLE_IN_TRANSACTION_SESSION_TIMEOUT_MS) are empty by default; when set to a number, new connections run the matching SET ... so ops can tune without code changes.
  • insertMatch, the rater loop, and Stripe subscriber sync wrap their manual transactions in try / catch, call rollback on failure, then rethrow so existing behavior and error handling stay the same—only the cleanup path is explicit.

What we’re not doing here

  • No change to public API behavior, explorer SQL surface, or rate limits.
  • No mandatory timeout rollout—defaults stay “off” until you set env vars.

Suggested review focus

  • Are the application_name strings and env var names acceptable for your deployment?
  • Any objection to enabling timeouts in staging first (e.g. statement timeout only)?

Co-authored-by: Cursor cursoragent@cursor.com

ff137 and others added 6 commits May 1, 2026 17:56
- Add POSTGRES_*_TIMEOUT_MS env knobs (empty = disabled) for statement,
  lock, and idle-in-transaction limits on new pool connections.
- Set PostgreSQL application_name to odota:<APP_NAME>:knex for pg_stat_activity.
- Log pool max, application_name, and whether session timeouts are configured.

Co-authored-by: Cursor <cursoragent@cursor.com>
Expose explorer sessions in pg_stat_activity as odota:<APP_NAME>:explorer.
Log pool identity alongside existing 15s timeouts.

Co-authored-by: Cursor <cursoragent@cursor.com>
Each runReliableQueue executor connects with odota:<APP_NAME>:queue:<name>:<slot>
so queue-driven locks appear distinctly in pg_stat_activity.

Co-authored-by: Cursor <cursoragent@cursor.com>
If work after db.transaction() throws, release the client with rollback
before rethrowing so the pool is not left with an open transaction.

Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror insertMatch behavior so exceptions after db.transaction() always
call trx.rollback() before propagating.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread svc/store/db.ts
/** pg application_name is limited to 63 bytes; keep ASCII-safe for pg_stat_activity. */
export function pgApplicationName(role: string): string {
const app = config.APP_NAME || "unknown";
const safeApp = app.replace(/[^\w.-]/g, "_").substring(0, 40);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like having this length validation is an overkill (an LLM-ism, I presume)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100%, it is agent-assisted coding. Here it's one of those "solving a problem before it's a problem" things. So it's good, any name will work now, because it normalises the input. But at the same time, it's not really helpful when it wasn't a problem yet. I didn't ask it to solve it, but it's technically good, and once it's produced, we may as well keep it 🤷‍♂️

@howardchung
Copy link
Copy Markdown
Member

I think we could accept the try/catch wrapping in a separate PR--I think the current behavior is that the thrown exception causes the process to crash which should also end the transaction, but probably good to roll it back earlier so we don't have to wait for timeout.

The other changes are a bit more involved so I'd like to hold off on those

@ff137 ff137 marked this pull request as draft May 1, 2026 19:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants