[slop]fix(workflow-engine): only commit step state after success#5010

Draft
abcxff wants to merge 1 commit into 05-12-temp_fix_wasm-pack from 05-09-fix_workflow-engine_only_commit_step_state_after_success

Conversation

@abcxff
Contributor

@abcxff abcxff commented May 9, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5010 environment in rivet-frontend

Service | Status | Web | Updated (UTC)
kitchen-sink | ❌ Build Failed (View Logs) | Web | May 12, 2026 at 6:45 pm
frontend-inspector | ❌ Build Failed (View Logs) | Web | May 12, 2026 at 3:27 pm
frontend-cloud | ❌ Build Failed (View Logs) | Web | May 12, 2026 at 3:25 pm
website | 😴 Sleeping (View Logs) | Web | May 12, 2026 at 3:59 am
ladle | ❌ Build Failed (View Logs) | Web | May 9, 2026 at 3:40 pm
mcp-hub | ✅ Success (View Logs) | Web | May 9, 2026 at 3:40 pm

@abcxff abcxff changed the title from "fix(workflow-engine): only commit step state after success" to "[slop]fix(workflow-engine): only commit step state after success" May 9, 2026
@claude
Contributor

claude Bot commented May 9, 2026

Code Review: PR #5010

Note: This is a DRAFT PR with the [slop] prefix in its title, an empty checklist, and no description.

Overview

Two logically distinct changes:

  1. State rollback on step failure -- when a workflow step throws, the actor state and vars are rolled back to their pre-step snapshot via structuredClone.
  2. Defer storage flushes after step failures -- removes all await this.flushStorage() calls from the step error paths, letting dirty entries be flushed by the next natural checkpoint instead.
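The rollback in item 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `RunContext` and `runStepWithRollback` are hypothetical names, the context shape is invented for the example, and the step is synchronous only to keep the sketch short.

```typescript
// Hypothetical context shape; the engine's real types are not shown on this page.
interface RunContext {
  state: Record<string, unknown>;
  vars: Record<string, unknown>;
}

// Runs a step and restores state/vars to their pre-step snapshot if it throws.
function runStepWithRollback<T>(
  ctx: RunContext,
  step: (ctx: RunContext) => T,
): T {
  // Deep-copy before the step runs so later mutations cannot leak into the snapshot.
  const stateSnapshot = structuredClone(ctx.state);
  const varsSnapshot = structuredClone(ctx.vars);
  try {
    return step(ctx);
  } catch (error) {
    // Step failed: discard whatever it wrote and rethrow for the retry logic.
    ctx.state = stateSnapshot;
    ctx.vars = varsSnapshot;
    throw error;
  }
}
```

A failing step that sets `ctx.state.count = 2` leaves `ctx.state.count` at its pre-step value once the error propagates; a successful step keeps its writes.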

Concerns

1. Durability regression -- crash between retries loses retry count

Before this PR, after every step failure flushStorage() persisted metadata.status, metadata.attempts, and metadata.error. After this PR those values are only in memory until the next natural flush.

A crash between a step failure and the next flush causes the workflow to restart with the previous metadata.attempts value, meaning maxRetries enforcement can be bypassed. A step with maxRetries: 1 could silently execute more than twice if the process crashes after each failed attempt.
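The failure mode can be simulated with a toy model. Everything here is hypothetical (`FakeStorage`, `StepMetadata`, `countExecutions`), and it assumes that `maxRetries` counts retries beyond the first attempt, which this page does not confirm.

```typescript
// Hypothetical stand-ins for the engine's step metadata and durable storage.
interface StepMetadata {
  attempts: number;
}

class FakeStorage {
  private persisted: StepMetadata = { attempts: 0 };

  flush(meta: StepMetadata): void {
    this.persisted = { ...meta };
  }

  load(): StepMetadata {
    return { ...this.persisted };
  }
}

// Simulates up to `crashes` process lifetimes, each ending in a crash right
// after one failed attempt. Returns how many times the step body ran.
function countExecutions(
  flushOnFailure: boolean,
  maxRetries: number,
  crashes: number,
): number {
  const storage = new FakeStorage();
  let executions = 0;
  for (let i = 0; i < crashes; i++) {
    const meta = storage.load(); // metadata as seen after a restart
    if (meta.attempts > maxRetries) break; // retry budget exhausted
    executions++; // the step runs and fails
    meta.attempts++;
    if (flushOnFailure) storage.flush(meta);
    // The process crashes here; unflushed metadata is lost.
  }
  return executions;
}
```

In this model, `countExecutions(true, 1, 5)` stops after two executions, while `countExecutions(false, 1, 5)` runs the step on every restart because the attempt counter never reaches storage.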

The removed StripStepHistoryErrorDriver test was specifically designed to catch this class of bug. Removing it along with the defensive stepData.error ?? metadata.error fallback strips the crash-resilience story without any explanation in the PR description.

Question for the author: does the StepFailedError -> sleep path guarantee a flush before process exit? If so, document it here.

2. Removed defensive error fallback with no justification

The diff changes const lastError = stepData.error ?? metadata.error to const lastError = metadata.error. The removed comment explained this fallback existed for partial writes/crashes between attempts. Both variables hold String(error) so the change is probably safe in practice, but the test covering this scenario was also removed.
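To make the difference concrete, here is a small sketch of the two expressions. The `StepData` and `StepMetadata` shapes are assumptions for illustration; only the two expressions themselves come from the diff as described above.

```typescript
// Assumed minimal shapes; the real types are not shown in this review.
interface StepData {
  error?: string;
}

interface StepMetadata {
  error?: string;
}

// Before the PR: prefer the step's own error, fall back to metadata.
function lastErrorWithFallback(
  stepData: StepData,
  metadata: StepMetadata,
): string | undefined {
  return stepData.error ?? metadata.error;
}

// After the PR: metadata is the only source.
function lastErrorWithoutFallback(
  _stepData: StepData,
  metadata: StepMetadata,
): string | undefined {
  return metadata.error;
}
```

The two functions agree whenever `metadata.error` is populated and `stepData.error` is absent or identical. They diverge only when one record was persisted without the other, which is the partial-write case the removed comment described.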

3. Misleading test name

The test "should not commit step error data to entry on failure" asserts expect(entry.kind.data.error).toBe("Error: step failed"), so the error IS committed. The intent appears to be "should not commit step OUTPUT data on failure." Rename accordingly.

4. structuredClone limitation for state rollback

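structuredClone throws a DataCloneError for values it cannot clone (functions, including function-valued own properties on class instances, DOM nodes, etc.). Actor state containing any of these would cause the step to fail before it runs, with a confusing error unrelated to the step logic. Class instances without such properties clone without throwing but lose their prototype, so their methods would be missing after a rollback. These constraints should be documented on the RAW_STATE_SYMBOL interface entry.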
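The behavior of `structuredClone` on such values can be checked directly. The helper `cloneErrorName` and the `Counter` class are hypothetical names for this sketch; it assumes a runtime with a global `structuredClone` (for example Node 17 or later).

```typescript
class Counter {
  value = 0;

  increment(): void {
    this.value++;
  }
}

// Returns the name of the error thrown by structuredClone, or null if cloning succeeds.
function cloneErrorName(value: unknown): string | null {
  try {
    structuredClone(value);
    return null;
  } catch (error) {
    return error instanceof Error ? error.name : String(error);
  }
}

// Clones a class instance and reports whether the copy kept its prototype.
function cloneKeepsPrototype(): boolean {
  const copy = structuredClone(new Counter());
  return copy instanceof Counter;
}
```

Plain data clones cleanly, an object holding a function is rejected, and a class instance whose methods live on the prototype clones to a plain object, so `cloneKeepsPrototype()` reports that the prototype is gone.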

5. vars rollback vs. held references

this.#runCtx.vars = varsSnapshot replaces the object reference. Any code that captured a reference to the old vars before the step ran will not see the rollback. Worth a brief comment noting this limitation.
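A small sketch of the held-reference problem, together with one possible in-place alternative. All names here (`Vars`, `rollbackByReplacing`, `rollbackInPlace`, `observeHeldReference`) are hypothetical; the in-place variant is a suggestion, not something the PR does.

```typescript
type Vars = Record<string, unknown>;

// Rollback by swapping the reference, as in `this.#runCtx.vars = varsSnapshot`.
function rollbackByReplacing(ctx: { vars: Vars }, snapshot: Vars): void {
  ctx.vars = snapshot;
}

// Alternative: restore in place so previously captured references see the rollback.
function rollbackInPlace(ctx: { vars: Vars }, snapshot: Vars): void {
  for (const key of Object.keys(ctx.vars)) delete ctx.vars[key];
  Object.assign(ctx.vars, snapshot);
}

// Simulates a failing step that sets `counter` to 42, rolls back with the given
// strategy, and returns what a reference captured before the step observes.
function observeHeldReference(
  rollback: (ctx: { vars: Vars }, snapshot: Vars) => void,
): unknown {
  const ctx: { vars: Vars } = { vars: { counter: 0 } };
  const held = ctx.vars; // captured before the step runs
  const snapshot = structuredClone(ctx.vars);
  ctx.vars.counter = 42; // the step mutates vars, then fails
  rollback(ctx, snapshot);
  return held.counter;
}
```

With `rollbackByReplacing`, the captured reference still sees the failed step's write; with `rollbackInPlace`, it sees the restored value. The in-place approach trades that consistency for mutating an object other code may be iterating over, so either choice deserves a comment in the code.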

6. Loop test expectation shift unexplained

The expectation changed from iterationsExecuted[0] being 2 to 3, with no explanation for why the behavior shifted. Is this intentional or a regression?


Positives

  • Correct rollback intent. Rolling back actor state/vars on step failure is the right semantic. The RAW_STATE_SYMBOL approach to bypass the write-through proxy before snapshotting is clean.
  • Code deduplication. Hoisting the duplicate entry.dirty = true and error-string assignments above the error-type if ladder removes real repetition.
  • New test. The new step test verifies the output-vs-error commit distinction, even if the name needs fixing.

Recommendations Before Merge

  1. Fill in the PR description explaining the motivation for deferring flushes.
  2. Verify and document that the StepFailedError -> sleep path guarantees a flush before process exit.
  3. Restore the StripStepHistoryErrorDriver test (adapted to new semantics) or document why the partial-write scenario is no longer a concern.
  4. Rename the misleading test.
  5. Remove the [slop] prefix before converting from draft.

@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 11, 2026 03:41
@abcxff abcxff force-pushed the graphite-base/5010 branch from 56af1d1 to aac9634 Compare May 12, 2026 12:55
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from cc1baa6 to 9304f04 Compare May 12, 2026 12:55
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan May 12, 2026 12:55
@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 12, 2026 13:21
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the graphite-base/5010 branch from aac9634 to 5f7bac7 Compare May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9304f04 to 0668fa7 Compare May 12, 2026 15:24
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-12-temp_fix_wasm-pack May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 0668fa7 to 2c029c2 Compare May 12, 2026 17:46
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2c029c2 to 9b64fa5 Compare May 12, 2026 18:03
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9b64fa5 to 0eb8f51 Compare May 12, 2026 18:44