feat(gpu): honor device IDs in Docker and Podman #1253
maxamillion left a comment:
I can't speak to the docker ComputeDriver changes, but this looks good to me from the podman side. 👍
Thanks @maxamillion. I'm trying to spin up an instance with a reasonably recent podman version to test locally. One question I also had was whether there is something similar to the display of "discovered" CDI devices when running [...]
Signed-off-by: Evan Lezar <elezar@nvidia.com>
docker side looks correct FWIW
Summary
Honor the existing GPU device ID field for Docker and Podman GPU sandboxes without changing the protobuf API shape. Add Docker and Podman GPU e2e coverage that only runs per-device cases for CDI device IDs discovered by the container runtime, so WSL2 hosts that expose only nvidia.com/gpu=all skip index-specific cases.
Related Issue
None.
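For reviewers, the skip behavior described in the Summary can be sketched roughly as follows. This is an illustration only and is not taken from the PR: the helper name `per_device_gpu_ids` is hypothetical, and it assumes the runtime's discovered CDI device IDs have already been collected into a list of strings. Any `nvidia.com/gpu=<name>` entry other than `all` is treated as a per-device target here, which is an assumption about the intended filter.

```rust
/// Hypothetical helper (not from the PR): given the CDI device IDs that the
/// container runtime reports as discovered, return the NVIDIA GPU IDs that a
/// per-device test case could target. The aggregate `all` entry and any
/// non-GPU or non-NVIDIA entries are dropped.
fn per_device_gpu_ids<'a>(discovered: &[&'a str]) -> Vec<&'a str> {
    const PREFIX: &str = "nvidia.com/gpu=";
    discovered
        .iter()
        .copied()
        .filter(|id| match id.strip_prefix(PREFIX) {
            // Assumption: every named device except `all` is device-specific.
            Some(name) => !name.is_empty() && name != "all",
            None => false,
        })
        .collect()
}

fn main() {
    // A host that exposes only the aggregate device, as the Summary says
    // WSL2 hosts do: there is nothing to run per-device cases against.
    let wsl2_like = ["nvidia.com/gpu=all"];
    if per_device_gpu_ids(&wsl2_like).is_empty() {
        println!("skipping index-specific cases: only nvidia.com/gpu=all discovered");
    }

    // A host that also exposes indexed devices: one case per device ID.
    let indexed = ["nvidia.com/gpu=0", "nvidia.com/gpu=1", "nvidia.com/gpu=all"];
    for id in per_device_gpu_ids(&indexed) {
        println!("would run per-device case for {id}");
    }
}
```

Filtering on what the runtime reports, rather than deriving IDs from another tool, keeps the test matrix aligned with the device IDs the runtime will actually accept.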
Changes
- --gpu-device is documented as driver-specific: Docker and Podman use CDI IDs, while the VM driver uses PCI BDF or index.
- [...] nvidia.com/gpu=all, and invalid device IDs.
- [...] info --format json discovered device entries instead of synthesizing nvidia.com/gpu=<index> from nvidia-smi output.
- [...] gateway stage in deploy/docker/Dockerfile.images, with OPENSHELL_E2E_GPU_PROBE_IMAGE available as an override.
- [...] nvidia-smi assumptions from the Podman GPU wrapper and Docker GPU workflow preflight.
- [...] e2e:gpu to Docker GPU coverage and kept the previous Python GPU task available as e2e:k3s:gpu.
Testing
- mise run pre-commit
- rustfmt --edition 2024 --check crates/openshell-cli/src/main.rs e2e/rust/tests/gpu_device_selection.rs
- env -u RUSTC_WRAPPER cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-gpu --test gpu_device_selection parse_cdi_gpu_device_ids -- --nocapture
- mise run e2e:docker:gpu on a Docker CDI GPU host after the latest rebase
- mise run e2e:podman:gpu on a Podman CDI GPU host after the latest rebase
Checklist