feat(gpu): honor device IDs in Docker and Podman#1253

Open
elezar wants to merge 3 commits into main from feat/docker-podman-gpu-device-id

Conversation

@elezar
Member

@elezar elezar commented May 7, 2026

Summary

Honor the existing GPU device ID field for Docker and Podman GPU sandboxes without changing the protobuf API shape. Add Docker and Podman GPU e2e coverage that runs per-device cases only for CDI device IDs discovered by the container runtime, so WSL2 hosts that expose only nvidia.com/gpu=all skip the index-specific cases.

Related Issue

None.

Changes

  • Added a shared helper that maps the existing GPU request fields to CDI device IDs for Docker and Podman.
  • Updated Docker device requests to pass explicit GPU device IDs through and keep the default all-GPU CDI request.
  • Updated Podman container devices with the same explicit GPU device ID handling.
  • Clarified CLI help so --gpu-device is documented as driver-specific: Docker and Podman use CDI IDs, while the VM driver uses PCI BDF or index.
  • Added Rust e2e coverage for Docker and Podman GPU device selection, including default GPU requests, discovered per-device CDI IDs, nvidia.com/gpu=all, and invalid device IDs.
  • Made GPU device-selection e2e compare OpenShell output against a plain Docker or Podman control container for the same CDI device request.
  • Read GPU CDI device IDs from runtime info --format json discovered device entries instead of synthesizing nvidia.com/gpu=<index> from nvidia-smi output.
  • Read the GPU probe image from the gateway stage in deploy/docker/Dockerfile.images, with OPENSHELL_E2E_GPU_PROBE_IMAGE available as an override.
  • Removed local host nvidia-smi assumptions from the Podman GPU wrapper and Docker GPU workflow preflight.
  • Switched the GPU CI workflow from the Python/k3s-oriented GPU suite to the Docker GPU e2e task.
  • Repointed e2e:gpu to Docker GPU coverage and kept the previous Python GPU task available as e2e:k3s:gpu.
  • Updated Docker, Podman, and testing documentation notes.
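The shared mapping helper described in the first bullet could look roughly like the sketch below. This is a hypothetical illustration based on the change description, not the PR's actual code: the function name `gpu_request_to_cdi_devices`, the `nvidia.com/gpu` vendor kind, and the fall-back-to-`all` default are all assumptions.

```rust
/// Hypothetical sketch: map requested GPU device IDs to fully qualified
/// CDI device names for Docker/Podman device requests. The real helper
/// in this PR may differ in name, signature, and behavior.
fn gpu_request_to_cdi_devices(device_ids: &[String]) -> Vec<String> {
    // Assumed vendor/class prefix for NVIDIA GPUs exposed via CDI.
    const KIND: &str = "nvidia.com/gpu";

    if device_ids.is_empty() {
        // No explicit selection: keep the default all-GPU CDI request.
        return vec![format!("{}=all", KIND)];
    }

    device_ids
        .iter()
        .map(|id| {
            if id.contains('=') {
                // Already a fully qualified CDI name, e.g. "nvidia.com/gpu=0".
                id.clone()
            } else {
                // Bare index or UUID: qualify it with the assumed kind.
                format!("{}={}", KIND, id)
            }
        })
        .collect()
}

fn main() {
    // Default request falls back to the all-GPU CDI device.
    assert_eq!(gpu_request_to_cdi_devices(&[]), vec!["nvidia.com/gpu=all"]);

    // Explicit IDs are passed through, qualified when necessary.
    let ids = vec!["0".to_string(), "nvidia.com/gpu=GPU-uuid".to_string()];
    assert_eq!(
        gpu_request_to_cdi_devices(&ids),
        vec!["nvidia.com/gpu=0", "nvidia.com/gpu=GPU-uuid"]
    );
    println!("ok");
}
```

The resulting strings match the fully qualified `<vendor>/<class>=<name>` form that CDI-aware runtimes accept as device requests, which is why the same helper can serve both the Docker and Podman paths.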

Testing

  • mise run pre-commit
  • rustfmt --edition 2024 --check crates/openshell-cli/src/main.rs e2e/rust/tests/gpu_device_selection.rs
  • env -u RUSTC_WRAPPER cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-gpu --test gpu_device_selection parse_cdi_gpu_device_ids -- --nocapture
  • mise run e2e:docker:gpu on a Docker CDI GPU host after the latest rebase
  • mise run e2e:podman:gpu on a Podman CDI GPU host after the latest rebase

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@elezar elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 7, 2026 21:24
@elezar elezar force-pushed the feat/docker-podman-gpu-device-id branch from dc3cae4 to 9735e15 Compare May 7, 2026 21:53
@elezar elezar added the test:e2e-gpu Requires GPU end-to-end coverage label May 7, 2026
@github-actions

github-actions Bot commented May 7, 2026

Label test:e2e-gpu applied for 293b1f0. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

@elezar elezar force-pushed the feat/docker-podman-gpu-device-id branch 2 times, most recently from 3517ac2 to 817a97a Compare May 8, 2026 08:49
Collaborator

@maxamillion maxamillion left a comment

I can't speak to the docker ComputeDriver changes, but this looks good to me from the podman side. 👍

@elezar elezar force-pushed the feat/docker-podman-gpu-device-id branch from 817a97a to b086c67 Compare May 8, 2026 14:34
@elezar
Member Author

elezar commented May 8, 2026

I can't speak to the docker ComputeDriver changes, but this looks good to me from the podman side. 👍

Thanks @maxamillion. I'm trying to spin up an instance with a reasonably recent Podman version to test locally. One question I also had: does Podman have anything similar to Docker's display of "discovered" CDI devices when running docker info? (This may be useful to add if not.)

elezar added 3 commits May 8, 2026 16:54
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/docker-podman-gpu-device-id branch from b086c67 to 17a9485 Compare May 8, 2026 15:02
@ericcurtin

docker side looks correct FWIW
