fix(wait): poll partially_resumed rows so chained waits resume #4514
TheodoreSpeaks wants to merge 2 commits into staging
The chained-pause flow leaves a row in 'partially_resumed' status (wait1 done, wait2 still waiting). The poll's WHERE filter only matched 'paused', so wait2 was never picked up. Include 'partially_resumed' in the filter.
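The filter change described above can be illustrated with a small in-memory model. This is only a sketch under assumptions: `PausedRow`, `isDue`, `POLLABLE_STATUSES`, and the field names are hypothetical stand-ins, not the repository's schema or query code.

```typescript
// Hypothetical, simplified model of a paused-execution row.
interface PausedRow {
  id: string;
  status: string;
  nextResumeAt: Date | null;
}

// Statuses the poll considers after this change (before: only 'paused').
const POLLABLE_STATUSES: readonly string[] = ["paused", "partially_resumed"];

// In-memory equivalent of: status IN (...) AND next_resume_at <= now.
function isDue(row: PausedRow, now: Date): boolean {
  return (
    POLLABLE_STATUSES.includes(row.status) &&
    row.nextResumeAt !== null &&
    row.nextResumeAt.getTime() <= now.getTime()
  );
}

const pollTime = new Date("2024-01-01T00:10:00Z");
const sampleRows: PausedRow[] = [
  // Single wait whose timer has elapsed.
  { id: "single", status: "paused", nextResumeAt: new Date("2024-01-01T00:05:00Z") },
  // Chained wait: wait1 already resumed, wait2's timer has elapsed.
  { id: "chained", status: "partially_resumed", nextResumeAt: new Date("2024-01-01T00:09:00Z") },
  // Timer still in the future, so not due yet.
  { id: "future", status: "paused", nextResumeAt: new Date("2024-01-01T01:00:00Z") },
  // No timer set at all.
  { id: "untimed", status: "paused", nextResumeAt: null },
];

const dueIds = sampleRows.filter((r) => isDue(r, pollTime)).map((r) => r.id);
```

With only 'paused' in the list, the "chained" row would be skipped on every tick, which matches the stall described in this PR.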
PR Summary
Medium Risk
Overview
Makes the Wait block's in-process sleep threshold configurable via new
Reviewed by Cursor Bugbot for commit 91f88a7.
Greptile Summary
This PR fixes a bug where chained-wait workflows (
Confidence Score: 3/5
The chained-wait stall is correctly fixed, but dispatched partially_resumed rows are never removed from the cron queue, potentially starving new paused rows under load.
The apps/sim/lib/workflows/executor/human-in-the-loop-manager.ts
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Cron as Cron Poll
participant DB as pausedExecutions
participant HITL as PauseResumeManager
Note over DB: wait1 fires, row becomes partially_resumed
Cron->>DB: "SELECT WHERE status IN ('paused','partially_resumed') AND next_resume_at <= now"
DB-->>Cron: row (partially_resumed, elapsed next_resume_at)
Cron->>HITL: enqueueOrStartResume(wait2 contextId)
HITL->>DB: "UPDATE resumeStatus='resuming', INSERT resumeQueue"
HITL-->>Cron: "status='starting'"
Cron->>HITL: setNextResumeAt(null)
Note over HITL: WHERE status='paused' guard - no-op for partially_resumed!
Note over DB: next_resume_at remains elapsed (BUG)
Note over Cron,DB: Next cron tick
Cron->>DB: SELECT - same row matches again
DB-->>Cron: row returned (no eligible points to dispatch)
Note over Cron: dispatched=0, batch slot wasted every tick until execution completes
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ab96907. Configure here.
// 'partially_resumed' rows occur when a chained-pause workflow advanced past
// an earlier wait (e.g. wait1 → agent → wait2) and now wait2's time pause
// is the one waiting for the cron. Include it alongside fresh 'paused' rows.
inArray(pausedExecutions.status, ['paused', 'partially_resumed']),
setNextResumeAt guard misses partially_resumed rows
Medium Severity
setNextResumeAt contains a WHERE status = 'paused' guard that silently skips updates for partially_resumed rows. Since the cron poll now fetches both paused and partially_resumed rows, the post-dispatch call to setNextResumeAt in dispatchRow does nothing for partially_resumed rows. Their nextResumeAt stays at the old elapsed value, so every subsequent cron tick re-fetches them, hits an empty duePoints list (all already-dispatched points have resumeStatus beyond 'paused'), does no useful work, and still can't clear the timestamp — creating an infinite no-op scan loop until the execution fully resolves.
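The re-fetch loop described in this comment can be modeled in memory. This is a sketch under stated assumptions: `Row`, `setNextResumeAtGuarded`, `pollMatches`, and the `allowedStatuses` parameter are hypothetical, and widening the guard is one possible remedy, not necessarily the fix adopted in this PR.

```typescript
// Hypothetical, simplified row: only the fields the guard and poll look at.
interface Row {
  status: string;
  nextResumeAt: number | null; // epoch milliseconds
}

// Mirrors a guarded UPDATE: writes only when the row's status is allowed.
// Returns true when the row was actually updated.
function setNextResumeAtGuarded(
  row: Row,
  value: number | null,
  allowedStatuses: readonly string[],
): boolean {
  if (!allowedStatuses.includes(row.status)) return false;
  row.nextResumeAt = value;
  return true;
}

// In-memory version of the widened poll filter.
function pollMatches(row: Row, now: number): boolean {
  const pollable = row.status === "paused" || row.status === "partially_resumed";
  return pollable && row.nextResumeAt !== null && row.nextResumeAt <= now;
}

const tick = 1000000;

// Guard restricted to 'paused': the clear is a no-op for partially_resumed.
const narrowRow: Row = { status: "partially_resumed", nextResumeAt: tick - 1 };
const clearedNarrow = setNextResumeAtGuarded(narrowRow, null, ["paused"]);
const refetchedNarrow = pollMatches(narrowRow, tick + 60000);

// Guard widened to both statuses: the timestamp is cleared and the row drops out.
const wideRow: Row = { status: "partially_resumed", nextResumeAt: tick - 1 };
const clearedWide = setNextResumeAtGuarded(wideRow, null, ["paused", "partially_resumed"]);
const refetchedWide = pollMatches(wideRow, tick + 60000);
```

In this model the narrow guard leaves the elapsed timestamp in place, so the row matches the poll again on the next tick; the widened guard clears it and the row stops matching.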
Adds WAIT_INPROCESS_MAX_MS env var (default 300000ms = 5 min). Lower it locally (e.g. 5000) to exercise the suspend/cron-resume path with short waits.
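A hedged sketch of how such a threshold might be read and applied. Only the variable name and the 300000 ms default come from the commit message; `resolveInProcessMaxMs` and `shouldSleepInProcess` are hypothetical helpers, and both the fallback for invalid values and the inclusive comparison are assumptions.

```typescript
// Default taken from the commit message: 300000 ms = 5 min.
const DEFAULT_WAIT_INPROCESS_MAX_MS = 300000;

// Parses the raw env value; falls back to the default when the value is
// missing, empty, non-numeric, or negative (assumed behavior).
function resolveInProcessMaxMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_WAIT_INPROCESS_MAX_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_WAIT_INPROCESS_MAX_MS;
}

// Assumed rule: waits at or below the threshold sleep in-process;
// longer waits suspend and rely on the cron-resume path.
function shouldSleepInProcess(waitMs: number, maxMs: number): boolean {
  return waitMs <= maxMs;
}

const defaultMax = resolveInProcessMaxMs(undefined);
const localMax = resolveInProcessMaxMs("5000");
const invalidMax = resolveInProcessMaxMs("not-a-number");

// A 10 second wait sleeps in-process with the default threshold,
// but takes the suspend path when the threshold is lowered to 5000 ms.
const tenSecondsInProcessByDefault = shouldSleepInProcess(10000, defaultMax);
const tenSecondsInProcessLocally = shouldSleepInProcess(10000, localMax);
```

In a Node environment the raw value would typically come from `process.env.WAIT_INPROCESS_MAX_MS`; it is passed as a parameter here so the sketch stays self-contained.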


Summary
persistPauseResult merges new pause points with existing ones, so a chained-wait workflow ends up with a row in status = 'partially_resumed' (wait1 marked resumed, wait2 still paused) once wait1 finishes.
The poll's WHERE status = 'paused' filter excluded those rows, so wait2 was never dispatched. Verified in prod logs: execution 2e9e4780... had wait1 fire, agent run, then suspended for wait2, and wait2 sat for 1h+ with next_resume_at long elapsed while every cron tick reported claimedRows: 0.
Include partially_resumed alongside paused in the filter. Single-pause workflows are unaffected.
Type of Change
Testing
Tested manually against prod logs.
bun run lint clean.
bun run check:api-validation:strict passing. Type-check clean.
The existing partial index is gated on status = 'paused', so partially_resumed rows fall through to a sequential scan; volume is small enough that this is acceptable. Widening the index predicate to cover both statuses can be a follow-up if scan time becomes an issue.
Checklist