feat(pubsub): expose RetryableErrors for customized retry configuration #5411

suzmue wants to merge 1 commit into googleapis:main
Conversation
This change introduces a new public `retry_policy` module in the `google-cloud-pubsub` crate, exporting the `RetryableErrors` struct. Users can now explicitly reference and decorate this policy to customize retry limits and durations on Publisher instances.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main    #5411   +/-  ##
=======================================
  Coverage   97.72%   97.72%
=======================================
  Files         216      217      +1
  Lines       47239    47239
=======================================
+ Hits        46165    46166      +1
+ Misses       1074     1073      -1
```
```rust
if error.is_io() || error.is_timeout() {
    return RetryResult::Continue(error);
}
```
(I know this code already existed, but)
Why would we retry a timeout error? That is at best weird and at worst a bug.
Nevermind, I think you are right and I am wrong.
> Why would we retry a timeout error?
There is an attempt timeout and an operation timeout. If we hit an attempt timeout, we should keep retrying. If we hit the operation timeout, we should stop.
This code only applies to attempt timeouts.
Operation timeouts are set using `RetryPolicyExt::with_time_limit`. And that has its own logic to override a `RetryResult::Continue` with a `RetryResult::Exhausted`.
google-cloud-rust/src/gax/src/retry_policy.rs, lines 466 to 468 in 52388cc:
```rust
use google_cloud_gax::error::rpc::Code;
return match status.code {
    Code::Aborted
    | Code::Cancelled
    // ...
```
This is suspicious. We don't have request cancelling, so not sure when we would receive a CANCELLED error. The docs seem to indicate they only expect CANCELLED coming from the client. They also say this could lead to a double publish, so I'd remove it.
> "The client can retry the operation, but the operation may have been executed on the previous call."
```rust
return match status.code {
    Code::Aborted
    | Code::Cancelled
    | Code::DeadlineExceeded
    // ...
```
This is suspicious.
The docs say "Typically the client would retry the operation. Keep in mind that this may result in the operation being executed more than once on the server."
I think it is better to avoid a double publish than retry on this error.
Thinking out loud... Might be worth bringing up in our team meeting or the pubsub meeting. You probably understand this better than I do.
> double publish
The more I think about publish retries, the more confused I am. Typically, we do not retry POSTs like the Publish RPC because they are not idempotent.
The application knows whether the message is idempotent, but we don't in general. e.g. if the message is some event like "your build passed", that could be published twice without side effects. But if the message is for some chat, you would not want it published twice.
My guess is that applications probably set a nonce or a request ID in the pubsub message attributes if they want retries + "exactly once" publish. And let their subscribers check for dups. (Aside: then why is exactly-once delivery a subscriber feature when applications need to do this anyway for "exactly-once" publish?? 🫨)
So I think what I convinced myself of is that retrying on DEADLINE_EXCEEDED is no different from retrying on UNAVAILABLE. You could get a double publish in either case.
I am not sure if there is any action we should take. We could add a blurb in the docs saying the publisher retries by default and that applications can use `NeverRetry` to turn off retries? Maybe here:
```rust
// - 429: Resource Exhausted
// - 499: Cancelled Request
// - 5xx: Internal Server Error, Bad Gateway, etc.
if let Some(408 | 429 | 499 | 500..600) = error.http_status_code() {
```
> `500..600`
Can we enumerate the errors we should retry on?
e.g. 501 is "not implemented", 505 is "HTTP version not supported". While Pub/Sub probably doesn't return them, they definitely shouldn't be retried.
@dbolduc Will go through each of your comments too, but this is also based on the retry implementations for Go and Java: https://github.com/googleapis/google-cloud-go/blob/9d47d1ca53dc1a3183352f67ebeb174b1d194846/pubsub/v2/service.go#L51 and https://github.com/googleapis/java-pubsub/blob/880f28a11d7dc2864ae958e59940bb24bb6fa555/google-cloud-pubsub/src/main/java/com/google/cloud/pubsub/v1/StatusUtil.java#L48
dbolduc left a comment:
I agree that we want to expose this exact API.
Fixes #5366