feat(db) resource version cas #1292

Open
derekwaynecarr wants to merge 2 commits into NVIDIA:main from derekwaynecarr:feat/db-resource-version-cas

@derekwaynecarr
Collaborator

Summary

Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations
and migrate critical paths to use it. This prevents lost updates in HA
deployments with multiple gateway replicas.

Core infrastructure:

  • Add resource_version field to ObjectMeta proto (uint64)
  • Add resource_version column to objects table (SQLite: INTEGER, Postgres: BIGINT)
  • Add WriteCondition enum (MustCreate, MatchResourceVersion, Unconditional)
  • Add PersistenceError::Conflict variant for version mismatch
  • Add Store::put_if() and Store::delete_if() CAS methods
  • Add Store::update_message_cas() with bounded retry for mutations
  • Implement CAS operations for both SQLite and Postgres backends
  • Hydrate resource_version on all typed reads (defaults to 1 for backfill)
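As a rough illustration of how the conditional writes listed above might behave, here is a minimal in-memory sketch. It is not the code from this PR: `MemStore`, the `u64` payload on `MatchResourceVersion`, the method signatures, and the choice to start new objects at version 1 are all assumptions made for the example. Only the names `WriteCondition`, its three variants, `PersistenceError::Conflict`, `put_if`, and `delete_if` come from the summary.

```rust
use std::collections::HashMap;

/// Precondition attached to a write. Variant names follow the PR summary;
/// carrying the expected version as a u64 payload is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WriteCondition {
    /// Succeed only if no object with this id exists yet.
    MustCreate,
    /// Succeed only if the stored resource_version equals this value.
    MatchResourceVersion(u64),
    /// Always write, whatever the stored version is.
    Unconditional,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PersistenceError {
    /// The write precondition did not hold.
    Conflict,
}

/// Hypothetical in-memory stand-in for the objects table:
/// id -> (resource_version, payload).
#[derive(Default)]
pub struct MemStore {
    rows: HashMap<String, (u64, Vec<u8>)>,
}

impl MemStore {
    /// Conditional write. Returns the new resource_version on success.
    pub fn put_if(
        &mut self,
        id: &str,
        payload: Vec<u8>,
        cond: WriteCondition,
    ) -> Result<u64, PersistenceError> {
        let current = self.rows.get(id).map(|(v, _)| *v);
        let next = match (cond, current) {
            (WriteCondition::MustCreate, None) => 1,
            (WriteCondition::MustCreate, Some(_)) => return Err(PersistenceError::Conflict),
            (WriteCondition::MatchResourceVersion(want), Some(have)) if want == have => have + 1,
            (WriteCondition::MatchResourceVersion(_), _) => return Err(PersistenceError::Conflict),
            (WriteCondition::Unconditional, None) => 1,
            (WriteCondition::Unconditional, Some(have)) => have + 1,
        };
        self.rows.insert(id.to_string(), (next, payload));
        Ok(next)
    }

    /// Conditional delete. In this sketch MustCreate never permits a delete.
    pub fn delete_if(&mut self, id: &str, cond: WriteCondition) -> Result<(), PersistenceError> {
        let current = self.rows.get(id).map(|(v, _)| *v);
        let allowed = match (cond, current) {
            (WriteCondition::Unconditional, _) => true,
            (WriteCondition::MatchResourceVersion(want), Some(have)) => want == have,
            _ => false,
        };
        if !allowed {
            return Err(PersistenceError::Conflict);
        }
        self.rows.remove(id);
        Ok(())
    }

    /// Current resource_version of an object, if it exists.
    pub fn version(&self, id: &str) -> Option<u64> {
        self.rows.get(id).map(|(v, _)| *v)
    }
}
```

In this sketch a writer holding a stale version gets `Conflict` instead of silently overwriting a newer object, which is the lost-update case the summary describes.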

Migrations:

  • Migrate policy mutations to CAS (draft operations, settings)
  • Migrate provider updates to CAS (credentials, config merging)
  • Migrate sandbox updates to CAS (phase transitions, status reconciliation)
  • Migrate compute status updates to CAS (driver watch event handling)

Database migrations backfill existing rows with resource_version = 1.
CAS updates increment atomically: resource_version = resource_version + 1.

gRPC handlers map PersistenceError::Conflict to the ABORTED status code,
signaling clients to retry with fresh data. Server-side retries are
bounded (5 attempts) and re-read the object on each iteration.
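The retry behavior can be sketched as a read-modify-write loop. This is an illustrative stand-in, not the PR's `Store::update_message_cas()`: `MemStore`, `put_if_version`, `update_cas`, `to_grpc_code`, the local `Code` enum, and the `NotFound` variant are hypothetical names introduced for the example. The limit of 5 attempts, the fresh read per iteration, and the mapping of `Conflict` to ABORTED are taken from the description above.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, PartialEq, Eq)]
pub enum PersistenceError {
    Conflict,
    /// Hypothetical variant, used here for a missing object.
    NotFound,
}

/// Local stand-in for gRPC status codes (numeric values from the gRPC spec).
#[derive(Debug, PartialEq, Eq)]
pub enum Code {
    NotFound = 5,
    Aborted = 10,
}

/// Map a persistence error to the status code a handler would return.
pub fn to_grpc_code(err: &PersistenceError) -> Code {
    match err {
        PersistenceError::Conflict => Code::Aborted,
        PersistenceError::NotFound => Code::NotFound,
    }
}

/// Hypothetical in-memory store: id -> (resource_version, value).
#[derive(Default)]
pub struct MemStore {
    rows: Mutex<HashMap<String, (u64, i64)>>,
}

impl MemStore {
    pub fn insert(&self, id: &str, value: i64) {
        self.rows.lock().unwrap().insert(id.to_string(), (1, value));
    }

    pub fn get(&self, id: &str) -> Option<(u64, i64)> {
        self.rows.lock().unwrap().get(id).copied()
    }

    /// Write only if the stored version still equals `expected`, then bump it.
    pub fn put_if_version(&self, id: &str, value: i64, expected: u64) -> Result<u64, PersistenceError> {
        let mut rows = self.rows.lock().unwrap();
        match rows.get_mut(id) {
            Some(row) if row.0 == expected => {
                row.0 += 1;
                row.1 = value;
                Ok(row.0)
            }
            _ => Err(PersistenceError::Conflict),
        }
    }
}

/// Attempt limit stated in the PR description.
pub const MAX_ATTEMPTS: usize = 5;

/// Read, mutate, and conditionally write, re-reading after every conflict.
pub fn update_cas(
    store: &MemStore,
    id: &str,
    mut mutate: impl FnMut(i64) -> i64,
) -> Result<u64, PersistenceError> {
    for _ in 0..MAX_ATTEMPTS {
        // Fresh read on each iteration so the mutation sees the latest value.
        let (version, value) = store.get(id).ok_or(PersistenceError::NotFound)?;
        match store.put_if_version(id, mutate(value), version) {
            Err(PersistenceError::Conflict) => continue,
            other => return other,
        }
    }
    // Attempts exhausted: surface the conflict so the caller can map it to ABORTED.
    Err(PersistenceError::Conflict)
}
```

Because the mutation closure runs again after each conflict, it should be free of side effects that must happen only once.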

Test coverage includes concurrent update scenarios and handler-level
resource_version round-trip tests.

Related Issue

Fixes #1255

Changes

Testing

  • [x] mise run pre-commit passes
  • [x] Unit tests added/updated
  • [x] E2E tests added/updated (if applicable)

Checklist

  • [x] Follows Conventional Commits
  • [x] Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

Switch from Docker Mount API to string-based binds API with :z labels
to enable SELinux-enforcing systems to access bind-mounted files.

The :z option applies a shared SELinux content label, allowing
containers to read supervisor binaries and TLS certificates.
Docker safely ignores :z on non-SELinux systems.

Signed-off-by: Derek Carr <decarr@redhat.com>
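
For readers unfamiliar with the string-based bind form mentioned in this commit message, a bind is written as `host-path:container-path[:options]`, with options comma-separated. The helper below is a hypothetical sketch of building such a string with the `z` label. The function name, the `read_only` flag, and the example paths are assumptions, not code from this PR.

```rust
/// Build a Docker bind string of the form `host:container:options`,
/// always including the shared SELinux label `z`.
/// `read_only` is an assumption for illustration; it adds the `ro` option.
pub fn selinux_bind(host: &str, container: &str, read_only: bool) -> String {
    let opts = if read_only { "ro,z" } else { "z" };
    format!("{host}:{container}:{opts}")
}
```

For example, `selinux_bind("/etc/example/tls", "/tls", true)` yields `/etc/example/tls:/tls:ro,z`.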
@derekwaynecarr derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners May 9, 2026 13:59
@copy-pr-bot

copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@derekwaynecarr derekwaynecarr changed the title from "Feat/db resource version cas" to "feat(db) resource version cas" May 9, 2026
Development

Successfully merging this pull request may close these issues.

feat(gateway): add DB-backed resource_version CAS for stored objects
