Make Postgres usage easier to debug and safer when writes fail#2955
Make Postgres usage easier to debug and safer when writes fail#2955ff137 wants to merge 6 commits intoodota:masterfrom
Conversation
- Add POSTGRES_*_TIMEOUT_MS env knobs (empty = disabled) for statement, lock, and idle-in-transaction limits on new pool connections. - Set PostgreSQL application_name to odota:<APP_NAME>:knex for pg_stat_activity. - Log pool max, application_name, and whether session timeouts are configured. Co-authored-by: Cursor <cursoragent@cursor.com>
Expose explorer sessions in pg_stat_activity as odota:<APP_NAME>:explorer. Log pool identity alongside existing 15s timeouts. Co-authored-by: Cursor <cursoragent@cursor.com>
Each runReliableQueue executor connects with odota:<APP_NAME>:queue:<name>:<slot> so queue-driven locks appear distinctly in pg_stat_activity. Co-authored-by: Cursor <cursoragent@cursor.com>
If work after db.transaction() throws, release the client with rollback before rethrowing so the pool is not left with an open transaction. Co-authored-by: Cursor <cursoragent@cursor.com>
Mirror insertMatch behavior so exceptions after db.transaction() always call trx.rollback() before propagating. Co-authored-by: Cursor <cursoragent@cursor.com>
| /** pg application_name is limited to 63 bytes; keep ASCII-safe for pg_stat_activity. */ | ||
| export function pgApplicationName(role: string): string { | ||
| const app = config.APP_NAME || "unknown"; | ||
| const safeApp = app.replace(/[^\w.-]/g, "_").substring(0, 40); |
There was a problem hiding this comment.
I feel like having this length validation is an overkill (an LLM-ism, I presume)
There was a problem hiding this comment.
100%, it is agent-assisted coding. Here it's one of those "solving a problem before it's a problem" things. So it's good, any name will work now, because it normalises the input. But at the same time, it's not really helpful when it wasn't a problem yet. I didn't ask it to solve it, but it's technically good, and once it's produced, we may as well keep it 🤷♂️
|
I think we could accept the try/catch wrapping in a separate PR--I think the current behavior is that the thrown exception causes the process to crash which should also end the transaction, but probably good to roll it back earlier so we don't have to wait for timeout. The other changes are a bit more involved so I'd like to hold off on those |
What we’re changing
OpenDota talks to PostgreSQL from several places at once: the main API, the SQL explorer, and background workers that pull jobs from a queue. This PR does three small things:
Give each kind of connection a clear name so that when someone looks at database activity, they can tell whether traffic is coming from the normal API, the explorer, or a queue worker—not just a bunch of anonymous connections.
Add optional knobs for timeouts on the main database pool (off by default). If the team turns them on via environment variables, queries and idle-in-transaction sessions can be capped so a stuck job is less likely to hold a connection forever.
When a batch of database writes fails midway, we cancel that batch properly (rollback) in the places we touched, instead of leaving the database session in an awkward state. That reduces risk of pool weirdness after errors.
Why we think this helps
We’ve seen symptoms that look like the database running out of capacity or connections (
too many clients already, slow periods that improve after restarts). That can come from many processes sharing one server, heavy endpoints, or errors mid-write. This PR doesn’t try to fix that—it improves observability (named connections), adds optional guardrails (timeouts you can enable gradually), and fixes a real footgun (failed transactions not always rolled back). Those are low-risk, reviewable steps before bigger changes (explorer limits, worker concurrency, etc.).How it works
application_namelikeodota:<service>:knex,...:explorer, or...:queue:<queue>:<slot>.POSTGRES_STATEMENT_TIMEOUT_MS,POSTGRES_LOCK_TIMEOUT_MS,POSTGRES_IDLE_IN_TRANSACTION_SESSION_TIMEOUT_MS) are empty by default; when set to a number, new connections run the matchingSET ...so ops can tune without code changes.insertMatch, the rater loop, and Stripe subscriber sync wrap their manual transactions in try / catch, call rollback on failure, then rethrow so existing behavior and error handling stay the same—only the cleanup path is explicit.What we’re not doing here
Suggested review focus
application_namestrings and env var names acceptable for your deployment?Co-authored-by: Cursor cursoragent@cursor.com