
feat(pubsub): expose RetryableErrors for customized retry configuration #5411

Open
suzmue wants to merge 1 commit into googleapis:main from suzmue:expose-pubsub-retry-policies

Conversation

@suzmue
Contributor

@suzmue suzmue commented Apr 15, 2026

This change introduces a new public retry_policy module in the google-cloud-pubsub crate, exporting the RetryableErrors struct. Users can now explicitly reference and decorate this policy to customize retry limits and durations on Publisher instances.

Fixes #5366

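The decorator pattern the description refers to can be modeled with a small, self-contained sketch. The names below mirror the crate's (`RetryableErrors`, an attempt-limit decorator) but every type here is an illustrative stand-in, not the real google-cloud-pubsub or google-cloud-gax API:

```rust
// Illustrative stand-ins only: the real `RetryableErrors` lives in the
// crate's new `retry_policy` module, and limit/duration decorators come
// from google-cloud-gax. This models the composition, not the real API.

#[derive(Debug, PartialEq)]
enum RetryResult {
    Continue,
    Permanent,
    Exhausted,
}

trait RetryPolicy {
    /// Decide what to do after a failed attempt.
    fn on_error(&self, attempt: u32, retryable: bool) -> RetryResult;
}

/// Stand-in for `RetryableErrors`: keep retrying any retryable error.
struct RetryableErrors;

impl RetryPolicy for RetryableErrors {
    fn on_error(&self, _attempt: u32, retryable: bool) -> RetryResult {
        if retryable {
            RetryResult::Continue
        } else {
            RetryResult::Permanent
        }
    }
}

/// Decorator that caps the number of attempts, analogous to wrapping the
/// base policy before handing it to a Publisher.
struct WithAttemptLimit<P> {
    inner: P,
    limit: u32,
}

impl<P: RetryPolicy> RetryPolicy for WithAttemptLimit<P> {
    fn on_error(&self, attempt: u32, retryable: bool) -> RetryResult {
        if attempt >= self.limit {
            return RetryResult::Exhausted;
        }
        self.inner.on_error(attempt, retryable)
    }
}

fn main() {
    let policy = WithAttemptLimit { inner: RetryableErrors, limit: 3 };
    assert_eq!(policy.on_error(1, true), RetryResult::Continue);
    assert_eq!(policy.on_error(2, false), RetryResult::Permanent);
    assert_eq!(policy.on_error(3, true), RetryResult::Exhausted);
    println!("ok");
}
```

The point of exposing the base policy is exactly this kind of composition: users name the default, wrap it, and pass the result to the publisher builder.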
@suzmue suzmue requested a review from a team as a code owner April 15, 2026 00:33
@product-auto-label product-auto-label bot added the api: pubsub Issues related to the Pub/Sub API. label Apr 15, 2026
@codecov

codecov bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.72%. Comparing base (330dc2f) to head (006a8cd).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5411   +/-   ##
=======================================
  Coverage   97.72%   97.72%           
=======================================
  Files         216      217    +1     
  Lines       47239    47239           
=======================================
+ Hits        46165    46166    +1     
+ Misses       1074     1073    -1     


return RetryResult::Continue(error);
}

if error.is_io() || error.is_timeout() {
Member


(I know this code already existed, but)

Why would we retry a timeout error? That is at best weird and at worst a bug.

Member


Nevermind, I think you are right and I am wrong.

Why would we retry a timeout error?

There is an attempt timeout and an operation timeout. If we hit an attempt timeout, we should keep retrying. If we hit the operation timeout, we should stop.

This code only applies to attempt timeouts.

Operation timeouts are set using RetryPolicyExt::with_time_limit. And that has its own logic to override a RetryResult::Continue with a RetryResult::Exhausted.

RetryResult::Continue(e) => {
    if tokio::time::Instant::now().into_std() >= state.start + self.maximum_duration {
        RetryResult::Exhausted(e)
use google_cloud_gax::error::rpc::Code;
return match status.code {
    Code::Aborted
    | Code::Cancelled
Member


This is suspicious. We don't have request cancelling, so not sure when we would receive a CANCELLED error. The docs seem to indicate they only expect CANCELLED coming from the client. They also say this could lead to a double publish, so I'd remove it.

"The client can retry the operation, but the operation may have been executed on the previous call."

return match status.code {
    Code::Aborted
    | Code::Cancelled
    | Code::DeadlineExceeded
Member


This is suspicious.

The docs say "Typically the client would retry the operation. Keep in mind that this may result in the operation being executed more than once on the server."

I think it is better to avoid a double publish than retry on this error.

Member


Thinking out loud... Might be worth bringing up in our team meeting or the pubsub meeting. You probably understand this better than I do.

double publish

The more I think about publish retries, the more confused I am. Typically, we do not retry POSTs like the Publish RPC because they are not idempotent.

The application knows whether the message is idempotent, but we don't in general. e.g. if the message is some event like "your build passed", that could be published twice without side effects. But if the message is for some chat, you would not want it published twice.

My guess is that applications probably set a nonce or a request ID in the pubsub message attributes if they want retries + "exactly once" publish. And let their subscribers check for dups. (Aside: then why is exactly-once delivery a subscriber feature when applications need to do this anyway for "exactly-once" publish?? 🫨)

So I think what I convinced myself of is that retrying on DEADLINE_EXCEEDED is no different from retrying on UNAVAILABLE. You could get a double publish in either case.


I am not sure if there is any action we should take. We could add a blurb in the docs saying the publisher retries by default and that applications can use NeverRetry to turn off retries? Maybe here:

https://docs.rs/google-cloud-pubsub/0.33.2/google_cloud_pubsub/builder/publisher/struct.PublisherBuilder.html#method.with_retry_policy
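The nonce / request-ID scheme speculated about above can be sketched in a self-contained way: the publisher stamps each message with a caller-chosen ID in its attributes, and the subscriber drops IDs it has already processed. `Message`, `Deduper`, and the `"request-id"` attribute name are all hypothetical, not part of any pubsub API:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical message type: just data plus string attributes.
struct Message {
    data: String,
    attributes: HashMap<String, String>,
}

impl Message {
    fn with_request_id(data: &str, id: &str) -> Self {
        let mut attributes = HashMap::new();
        attributes.insert("request-id".to_string(), id.to_string());
        Message { data: data.to_string(), attributes }
    }
}

/// Subscriber-side deduplication keyed on the request-id attribute.
struct Deduper {
    seen: HashSet<String>,
}

impl Deduper {
    fn new() -> Self {
        Deduper { seen: HashSet::new() }
    }

    /// Returns true the first time an ID is seen; false for a message
    /// redelivered because a publish was retried.
    fn should_process(&mut self, msg: &Message) -> bool {
        match msg.attributes.get("request-id") {
            Some(id) => self.seen.insert(id.clone()),
            None => true, // no ID: cannot deduplicate, process it
        }
    }
}

fn main() {
    let mut dedup = Deduper::new();
    let msg = Message::with_request_id("your build passed", "nonce-123");
    let _ = &msg.data; // data would be handled by the real subscriber
    assert!(dedup.should_process(&msg)); // first delivery: process
    assert!(!dedup.should_process(&msg)); // double publish: dropped
    println!("ok");
}
```

This is exactly why retrying on DEADLINE_EXCEEDED ends up equivalent to retrying on UNAVAILABLE from the client library's perspective: only the application can make the publish idempotent.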

// - 429: Resource Exhausted
// - 499: Cancelled Request
// - 5xx: Internal Server Error, Bad Gateway, etc.
if let Some(408 | 429 | 499 | 500..600) = error.http_status_code() {
Member


500..600

Can we enumerate the errors we should retry on?

e.g. 501 is "not implemented", 505 is "HTTP version not supported". While Pub/Sub probably doesn't return them, they definitely shouldn't be retried.
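The suggested enumeration could look like the sketch below. The exact set is a judgment call: this version keeps the transient-looking codes implied by the original comment and omits 501, 505, and (per the CANCELLED discussion above) 499:

```rust
// Enumerated retryable HTTP status codes instead of a blanket 500..600.
// Which codes belong here is an assumption for illustration, not what
// the PR actually ships.
fn is_retryable(status: u16) -> bool {
    matches!(
        status,
        408 // Request Timeout
        | 429 // Too Many Requests / Resource Exhausted
        | 500 // Internal Server Error
        | 502 // Bad Gateway
        | 503 // Service Unavailable
        | 504 // Gateway Timeout
    )
}

fn main() {
    assert!(is_retryable(503));
    assert!(is_retryable(429));
    assert!(!is_retryable(501)); // Not Implemented: never retry
    assert!(!is_retryable(505)); // HTTP Version Not Supported: never retry
    assert!(!is_retryable(400)); // client error: never retry
    println!("ok");
}
```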

@suzmue
Contributor Author

suzmue commented Apr 15, 2026

Member

@dbolduc dbolduc left a comment


I agree that we want to expose this exact API.


Labels

api: pubsub Issues related to the Pub/Sub API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expose RetryableErrors used for pubsub publisher

2 participants