Make Postgres usage easier to debug and safer when writes fail by ff137 · Pull Request #2955 · odota/core

ff137 · 2026-05-01T16:26:25Z

What we’re changing

OpenDota talks to PostgreSQL from several places at once: the main API, the SQL explorer, and background workers that pull jobs from a queue. This PR does three small things:

Give each kind of connection a clear name so that when someone looks at database activity, they can tell whether traffic is coming from the normal API, the explorer, or a queue worker—not just a bunch of anonymous connections.
Add optional knobs for timeouts on the main database pool (off by default). If the team turns them on via environment variables, queries and idle-in-transaction sessions can be capped so a stuck job is less likely to hold a connection forever.
When a batch of database writes fails midway, we cancel that batch properly (rollback) in the places we touched, instead of leaving the database session in an awkward state. That reduces risk of pool weirdness after errors.

Why we think this helps

We’ve seen symptoms that look like the database running out of capacity or connections (too many clients already, slow periods that improve after restarts). That can come from many processes sharing one server, heavy endpoints, or errors mid-write. This PR doesn’t try to fix that—it improves observability (named connections), adds optional guardrails (timeouts you can enable gradually), and fixes a real footgun (failed transactions not always rolled back). Those are low-risk, reviewable steps before bigger changes (explorer limits, worker concurrency, etc.).

How it works

Each connection type sets a PostgreSQL application_name like odota:<service>:knex, ...:explorer, or ...:queue:<queue>:<slot>.
New env vars (POSTGRES_STATEMENT_TIMEOUT_MS, POSTGRES_LOCK_TIMEOUT_MS, POSTGRES_IDLE_IN_TRANSACTION_SESSION_TIMEOUT_MS) are empty by default; when set to a number, new connections run the matching SET ... so ops can tune without code changes.
insertMatch, the rater loop, and Stripe subscriber sync wrap their manual transactions in try / catch, call rollback on failure, then rethrow so existing behavior and error handling stay the same—only the cleanup path is explicit.

What we’re not doing here

No change to public API behavior, explorer SQL surface, or rate limits.
No mandatory timeout rollout—defaults stay “off” until you set env vars.

Suggested review focus

Are the application_name strings and env var names acceptable for your deployment?
Any objection to enabling timeouts in staging first (e.g. statement timeout only)?

Co-authored-by: Cursor cursoragent@cursor.com

- Add POSTGRES_*_TIMEOUT_MS env knobs (empty = disabled) for statement, lock, and idle-in-transaction limits on new pool connections. - Set PostgreSQL application_name to odota:<APP_NAME>:knex for pg_stat_activity. - Log pool max, application_name, and whether session timeouts are configured. Co-authored-by: Cursor <cursoragent@cursor.com>

Expose explorer sessions in pg_stat_activity as odota:<APP_NAME>:explorer. Log pool identity alongside existing 15s timeouts. Co-authored-by: Cursor <cursoragent@cursor.com>

Each runReliableQueue executor connects with odota:<APP_NAME>:queue:<name>:<slot> so queue-driven locks appear distinctly in pg_stat_activity. Co-authored-by: Cursor <cursoragent@cursor.com>

If work after db.transaction() throws, release the client with rollback before rethrowing so the pool is not left with an open transaction. Co-authored-by: Cursor <cursoragent@cursor.com>

Mirror insertMatch behavior so exceptions after db.transaction() always call trx.rollback() before propagating. Co-authored-by: Cursor <cursoragent@cursor.com>

builder-247 · 2026-05-01T17:29:36Z

+/** pg application_name is limited to 63 bytes; keep ASCII-safe for pg_stat_activity. */
+export function pgApplicationName(role: string): string {
+  const app = config.APP_NAME || "unknown";
+  const safeApp = app.replace(/[^\w.-]/g, "_").substring(0, 40);


I feel like having this length validation is an overkill (an LLM-ism, I presume)

100%, it is agent-assisted coding. Here it's one of those "solving a problem before it's a problem" things. So it's good, any name will work now, because it normalises the input. But at the same time, it's not really helpful when it wasn't a problem yet. I didn't ask it to solve it, but it's technically good, and once it's produced, we may as well keep it 🤷‍♂️

howardchung · 2026-05-01T18:16:31Z

I think we could accept the try/catch wrapping in a separate PR--I think the current behavior is that the thrown exception causes the process to crash which should also end the transaction, but probably good to roll it back earlier so we don't have to wait for timeout.

The other changes are a bit more involved so I'd like to hold off on those

ff137 and others added 6 commits May 1, 2026 17:56

feat(api): set application_name on readonly /explorer pg pool

c9cdcd0

Expose explorer sessions in pg_stat_activity as odota:<APP_NAME>:explorer. Log pool identity alongside existing 15s timeouts. Co-authored-by: Cursor <cursoragent@cursor.com>

feat(queue): application_name for reliable queue pg clients

31aeddb

Each runReliableQueue executor connects with odota:<APP_NAME>:queue:<name>:<slot> so queue-driven locks appear distinctly in pg_stat_activity. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(insert): rollback insertMatch transaction on any failure

cf0be48

If work after db.transaction() throws, release the client with rollback before rethrowing so the pool is not left with an open transaction. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(db): rollback rater and syncSubs transactions on error

5b0fc5e

Mirror insertMatch behavior so exceptions after db.transaction() always call trx.rollback() before propagating. Co-authored-by: Cursor <cursoragent@cursor.com>

🎨

47a5f03

builder-247 reviewed May 1, 2026

View reviewed changes

ff137 marked this pull request as draft May 1, 2026 19:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Postgres usage easier to debug and safer when writes fail#2955

Make Postgres usage easier to debug and safer when writes fail#2955
ff137 wants to merge 6 commits intoodota:masterfrom
ff137:postgres-connection-safety-slice

ff137 commented May 1, 2026

Uh oh!

builder-247 May 1, 2026

Uh oh!

ff137 May 1, 2026

Uh oh!

howardchung commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ff137 commented May 1, 2026

Uh oh!

builder-247 May 1, 2026

Choose a reason for hiding this comment

Uh oh!

ff137 May 1, 2026

Choose a reason for hiding this comment

Uh oh!

howardchung commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants