
MAINT: GCG in AzureML fix & improved test coverage, remove mlflow #1705

Open: romanlutz wants to merge 22 commits into microsoft:main from romanlutz:gcg-refactor

@romanlutz (Contributor)

Description

This PR is groundwork to make the existing GCG implementation safe to refactor in follow-ups. It does not change GCG's public API or its algorithm — it adds real test coverage at multiple tiers, gets the Azure ML baseline running end-to-end again so we can validate behavior survives later changes, removes mlflow (which was breaking on Azure ML), and cleans up a few things that fell out of getting the baseline working.

Related: #960 (umbrella), #965 (fastchat), #963 (closed — GCG notebook hung on Azure).

What's in the PR

Tests added (the bulk of the PR)

| Tier | File | What it covers |
| --- | --- | --- |
| Unit | tests/unit/auxiliary_attacks/gcg/test_gcg_core.py | Algorithm invariants: gradient computation, candidate sampling, slice arithmetic, init validation. |
| Unit | tests/unit/auxiliary_attacks/gcg/test_data_and_config.py | get_goals_and_targets, config loading from YAML, worker-creation contracts. |
| Unit | tests/unit/auxiliary_attacks/gcg/test_lifecycle.py | Worker start/stop on success and on failure; caught a real worker-leak bug we then fixed. |
| Unit | tests/unit/auxiliary_attacks/gcg/test_attack_wiring.py | Constructs IndividualPromptAttack / ProgressiveMultiPromptAttack with the real GCG manager classes (no mocking) so a kwarg mismatch with MultiPromptAttack.__init__ actually surfaces. Catches the dead mpa_kwargs TypeError bug. |
| Integration | tests/integration/auxiliary_attacks/test_gcg_integration.py | GPT-2 on CPU exercising token gradients, GCGAttackPrompt, sample_control, embedding helpers. Includes a TestGCGAttackPromptNonLlamaTemplate class with vicuna xfail(strict=True) tests that document #965 (conv_template.system AttributeError on every non-llama-2 fastchat template); the marker will flip to "unexpectedly passed" when the fastchat replacement lands and remind us to remove it. |
| End-to-end | tests/end_to_end/auxiliary_attacks/test_gcg_aml_e2e.py | Real Azure ML pipeline test. Runs the AML notebook itself via runpy.run_path() (the jupytext percent format is valid Python), then polls the submitted job to a terminal state and asserts Completed (see the sketch after this table). Skipped unless RUN_ALL_TESTS=true and the AML/HF credentials are set. Cancels the submitted job in finally so an interrupted test never leaks paid compute. |
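
For orientation, here is a minimal sketch of the e2e pattern in that last row. It assumes the notebook's executed namespace exposes ml_client and returned_job (as the test description says); the polling interval and terminal-state set are illustrative, not the test's exact code.

```python
import os
import runpy
import time

import pytest

NOTEBOOK_PATH = "doc/code/auxiliary_attacks/1_gcg_azure_ml.py"  # jupytext percent format, valid Python
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


@pytest.mark.skipif(os.environ.get("RUN_ALL_TESTS") != "true", reason="submits real, paid Azure ML compute")
def test_gcg_aml_end_to_end():
    # Run the notebook as a script so its submission flow stays the single source of truth.
    namespace = runpy.run_path(NOTEBOOK_PATH)
    ml_client, job = namespace["ml_client"], namespace["returned_job"]
    try:
        status = ml_client.jobs.get(job.name).status
        while status not in TERMINAL_STATES:
            time.sleep(60)
            status = ml_client.jobs.get(job.name).status
        assert status == "Completed"
    finally:
        # Never leak paid compute if the test fails or is interrupted.
        ml_client.jobs.begin_cancel(job.name)
```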

Bugs found and fixed

These were uncovered by writing the tests above, then running the actual Azure ML baseline end-to-end:

  • Dead mpa_kwargs causing TypeError at MultiPromptAttack.__init__ (deterministic, lr, etc. weren't accepted — leftover from GBDA). Removed.
  • Dead gbda_deterministic parameter all the way through train.py / attack_manager.py. Removed.
  • mlflow / azureml-mlflow incompatibility caused Azure ML failures. Removed mlflow entirely (and dropped mlflow + azureml-mlflow from the gcg and all extras); replaced its uses with Python standard logging as a placeholder. Proper experiment logging will land in a follow-up (CentralMemory or Azure storage).
  • Worker leak on training failure (process pool not cleaned up if generate_suffix raised). Caught by the new lifecycle test, fixed in train.py; the shape of the fix is sketched after this list.
  • Phi-3 + every other non-llama fastchat template crashes in _update_ids because conv_template.system doesn't exist. Documented as #965; the AML baseline is switched to llama-2 and the new vicuna integration tests are xfail(strict=True, raises=AttributeError) so they alert when the fix lands.
  • Dockerfile missing uv in the AML image. Fixed (now uses NVIDIA CUDA base + Python 3.11 + uv + pip install -e ".[gcg]").
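
The worker-leak fix follows the standard try/finally shape. A minimal sketch with placeholder names (not the actual train.py code):

```python
# Illustrative shape of the worker-leak fix in train.py (names are placeholders):
# the process-pool workers must be stopped even when suffix generation raises.
def generate_suffix_with_cleanup(attack, workers):
    try:
        return attack.generate_suffix()
    finally:
        # Previously only reached on success, which leaked the workers on failure.
        for worker in workers:
            worker.stop()
```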

Azure ML notebook (doc/code/auxiliary_attacks/1_gcg_azure_ml.py/.ipynb) overhauled

  • Switched from phi_3_mini (broken, see FEAT replace fastchat in GCG #965) to llama_2 for the baseline.
  • Job command now invokes the runner as python -m pyrit.auxiliary_attacks.gcg.experiments.run --model_name llama_2 ... instead of going through a scripts/ launcher. python -m puts cwd at the front of sys.path so the uploaded code snapshot still wins over the Docker-installed package — the previous explicit sys.path.insert(0, ...) hack is no longer needed.
  • Build context for the AML environment moved to repo root so the Dockerfile can COPY pyproject.toml + pyrit/, then pip install -e ".[gcg]".
  • Notebook now declares outputs={"results": Output(type="uri_folder")} and passes --output_dir ${{outputs.results}} so AML uploads the result JSON as a named output (auto-capture of ./outputs/ is not available in SDK v2 command jobs — they require declared outputs).
  • New cell at the end polls the submitted job until completion, downloads the named output, parses the JSON, and prints the final loss + generated adversarial suffix so the tutorial actually shows users a result instead of stopping at "Job submitted, monitor in Studio." The output declaration and polling flow are sketched after this list.
  • Best-effort suppression of azure-ai-ml SDK telemetry / experimental warnings via logging + warnings.filterwarnings. The worst noise (ActivityCompleted: ListSecrets ... HowEnded=Failure blob) is fixed by bumping azure-ai-ml >= 1.32.0, whose changelog explicitly "Skip _list_secrets for identity-based datastores to prevent noisy telemetry traces."
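
Roughly, the submission command and the new poll-and-download cell look like the sketch below. ml_client and the environment object are assumed to be built earlier in the notebook; the compute name, download directory, and result keys are illustrative.

```python
from azure.ai.ml import Output, command

# ml_client and gcg_environment are assumed to be created earlier in the notebook.
job = command(
    code=".",  # repo-root build context / code snapshot
    command=(
        "python -m pyrit.auxiliary_attacks.gcg.experiments.run "
        "--model_name llama_2 --output_dir ${{outputs.results}}"
    ),
    outputs={"results": Output(type="uri_folder")},  # SDK v2 command jobs need declared outputs
    environment=gcg_environment,
    compute="gpu-cluster",  # illustrative compute name
)
returned_job = ml_client.jobs.create_or_update(job)

# Poll to a terminal state, then download the named output and read the result JSON.
import json
import time
from pathlib import Path

while (status := ml_client.jobs.get(returned_job.name).status) not in {"Completed", "Failed", "Canceled"}:
    time.sleep(60)

ml_client.jobs.download(name=returned_job.name, download_path="gcg_results", all=True)
result_file = next(Path("gcg_results/named-outputs/results").glob("*.json"))
result = json.loads(result_file.read_text())
print(result)  # final loss and generated adversarial suffix (exact keys depend on run.py)
```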

Other cleanups

  • pyrit/auxiliary_attacks/gcg/experiments/run.py is now self-locating (chdirs to its own directory in __main__) and accepts --output_dir, so it works as python -m ... from any cwd, including PyPI installs (sketched after this list). The scripts/ launchers (run_gcg_aml.py, submit_gcg_job.py) were redundant with the notebook plus this change, so they're deleted along with the now-empty scripts/ directory.
  • pyarrow>=22; python_version >= '3.14' pin moved from core deps into the gcg extra (only needed when the gcg extra is installed, since something in that extra constrains the resolver toward a pyarrow without cp314 wheels). Inline comment explains why.
  • azure-ai-ml >= 1.32.0 in both gcg and all extras (kept aligned).
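
A sketch of what the self-locating entry point amounts to. Only the --model_name and --output_dir arguments and the chdir-in-__main__ behavior come from the PR description; the rest is a simplification, not the real run.py.

```python
import argparse
import os
from pathlib import Path


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", required=True)
    parser.add_argument("--output_dir", default="outputs")  # AML passes ${{outputs.results}} here
    args = parser.parse_args()
    # ... load the config for args.model_name, run the trainer, write the result JSON under args.output_dir


if __name__ == "__main__":
    # Make the relative configs/ and results/ paths resolve regardless of the caller's cwd,
    # so `python -m pyrit.auxiliary_attacks.gcg.experiments.run` works from anywhere.
    os.chdir(Path(__file__).resolve().parent)
    main()
```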

Tests and Documentation

  • Unit tests (make unit-test / pytest tests/unit/auxiliary_attacks/gcg/): all new + existing GCG unit tests pass.
  • Integration tests (pytest tests/integration/auxiliary_attacks/test_gcg_integration.py): GPT-2 path passes; vicuna tests xfail strictly with AttributeError (FEAT replace fastchat in GCG #965).
  • E2E test (RUN_ALL_TESTS=true pytest tests/end_to_end/auxiliary_attacks/test_gcg_aml_e2e.py): verified during this PR by submitting real Llama-2 GCG jobs to Azure ML on Standard_NC24ads_A100_v4. Completed successfully end-to-end (loss decreased; generated suffix retrievable).
  • JupyText: 1_gcg_azure_ml.ipynb was regenerated from 1_gcg_azure_ml.py with jupytext --to ipynb --execute --set-kernel <venv> against real Azure ML — the executed notebook with cell outputs (workspace name, submitted job name + Studio URL, polled status transitions, final loss, generated suffix) is committed alongside the .py.

Notes for reviewers

  • The 22 commits split into clear themes: 4 test additions, 1 bug fix, 1 doc-execution refresh, 1 user-facing feature (the polling cell), and the rest small cleanups (deps, dead code removal, scripts directory). Reviewing commit-by-commit may be easier than reviewing the diff in aggregate.
  • fastchat is intentionally not added as a declared dependency despite being a hard import in attack_manager.py — that's tracked in FEAT replace fastchat in GCG #965 and goes away when the fastchat replacement lands, so adding it now would be churn we'd just undo.
  • Some Azure ML SDK telemetry / upload-progress noise still appears in the executed notebook's cell outputs even after the azure-ai-ml >= 1.32.0 bump and the suppression block. Those go through the SDK's own propagate=False stderr handler and aren't reachable via the standard logging config; we'd need sys.stderr redirection (hacky) to fully silence them. Left as-is for transparency about what the SDK actually emits at runtime. The best-effort suppression block is sketched below this list.
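
For reference, the suppression block amounts to something like the following sketch. The azure.ai.ml._telemetry logger name comes from the commit messages; the other logger names and the warning filter are assumptions.

```python
import logging
import warnings

# Best-effort only: the SDK's own stderr handlers use propagate=False, so
# upload progress bars and some telemetry lines still get through.
warnings.filterwarnings("ignore", message=".*experimental.*")
for logger_name in (
    "azure.ai.ml",
    "azure.ai.ml._telemetry",
    "azure.core.pipeline.policies.http_logging_policy",
):
    logging.getLogger(logger_name).setLevel(logging.ERROR)
```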

romanlutz and others added 22 commits May 4, 2026 04:58
Add 26 new unit tests covering:
- get_filtered_cands: filtering, clamping, padding behavior
- target_loss / control_loss: shape, finiteness, loss ordering
- sample_control: shape, vocab bounds, single-position changes, non-ASCII filtering
- _build_params: ConfigDict construction from kwargs
- _apply_target_augmentation: length preservation, modification, seed reproducibility
- _create_attack: transfer flag routing (Progressive vs Individual)
- Embedding helpers: error handling for unknown model types
- PromptManager init: validation of goals/targets
- EvaluateAttack init: worker count validation

Total GCG test count: 24 -> 50

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Data & config tests (test_data_and_config.py, 12 tests):
- YAML loading: valid files, list values, missing file error
- Real config validation: all 11 shipped configs parse, have required keys,
  individual vs transfer configs have correct settings
- get_goals_and_targets: seed reproducibility, different seeds differ,
  separate test data files, n_train_data limiting
- run_trainer validation: unsupported model names, missing HF token

Lifecycle tests (test_lifecycle.py, 7 tests):
- GPU memory: nvidia-smi parsing (single/multi GPU), MLflow logging, failure handling
- generate_suffix lifecycle: MLflow started before training, workers stopped
  after training, BUG CHARACTERIZATION: workers NOT stopped on failure (leak)

Total GCG test count: 24 -> 69

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 10 integration tests that exercise the GCG attack pipeline with a real
GPT-2 model on CPU, validating end-to-end correctness:

- token_gradients: gradient shape matches (n_control, vocab_size), values
  are finite and non-zero
- GCGAttackPrompt: initializes with valid non-overlapping slices, grad()
  returns correct shape, test_loss() returns finite positive float
- GCGPromptManager.sample_control: sampled candidates are decodable,
  correct batch size
- Embedding helpers: layer/matrix/embeddings work with GPT2LMHeadModel,
  get_nonascii_toks returns non-empty tensor

Uses llama-2 conversation template (has explicit handling in _update_ids).
Marked @run_only_if_all_tests (requires RUN_ALL_TESTS=true + torch/transformers).
Runs in ~18s on CPU.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
These tests only need optional Python packages (torch, transformers, fastchat),
not external services or credentials. The importorskip at the top already
handles skipping when deps are not installed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move class references to module level to fix N806 (variable naming)
- Add noqa: E402 for imports after importorskip guards
- Fix ruff format issues
- Remove outdated RUN_ALL_TESTS reference in docstring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove mlflow and azureml-mlflow from GCG dependencies entirely:
- Replace mlflow logging in log.py with Python standard logging
- Remove mlflow.start_run()/end_run() from train.py and attack_manager.py
- Remove mlflow and azureml-mlflow from gcg and all extras in pyproject.toml
- Update tests to not mock mlflow
- Fix Dockerfile: use nvidia/cuda base + Python 3.11 + uv + pip install -e .[gcg]
- Add pyarrow>=22 pin for Python 3.14 compatibility

The mlflow dependency caused Azure ML failures due to version incompatibility
between mlflow 3.x and azureml-mlflow. Proper experiment logging will be
added later via CentralMemory or Azure storage (tracked in plan).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	tests/unit/auxiliary_attacks/gcg/test_lifecycle.py
Remove gbda_deterministic/mpa_deterministic — dead code from GBDA attack
that was never consumed by any GCG class. Its presence caused a TypeError
in individual mode because MultiPromptAttack.__init__() doesn't accept it.
This was a pre-existing bug from the original llm-attacks research repo
(silently swallowed by **kwargs there, but our copy removed **kwargs).

Also adds scripts/run_gcg_aml.py (launcher with sys.path fix for Azure ML)
and scripts/submit_gcg_job.py (job submission reading from .env files).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All mpa_kwargs (deterministic, lr, batch_size, n_steps) were silently
absorbed by **kwargs in the original llm-attacks repo's MultiPromptAttack.
__init__() but never read. Our copy removed **kwargs, exposing the bug.

The original repo even has a typo: 'self.mpa_kewargs' in
IndividualPromptAttack (line 1114 of llm-attacks/attack_manager.py).

Verified: none of these kwargs are consumed by MultiPromptAttack in either
the original repo or our copy.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…crosoft#965)

Phi-3-mini hits 'Conversation has no attribute system' in _update_ids()
due to fastchat API change. Llama-2 has dedicated handling path that works.

GCG baseline VALIDATED on Azure ML:
- Model: meta-llama/Llama-2-7b-chat-hf
- Config: 5 prompts, 5 steps, batch_size 64
- Result: loss decreases across steps (1.9 -> 0.86 on best prompt)
- Runtime: ~6 min on Standard_NC24ads_A100_v4
- Job: silly_vinegar_82x7td6gpn (Completed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The existing TestCreateAttack tests mock the manager classes, so they never
exercise MultiPromptAttack.__init__() with real kwargs. That's why the dead
mpa_kwargs bug only surfaced on Azure (TypeError when MPA didn't accept
deterministic / lr / etc). This test constructs the real GCG manager classes
and verifies IndividualPromptAttack and ProgressiveMultiPromptAttack can
create an internal MultiPromptAttack without error.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The existing GPT-2 integration tests only use the llama-2 conversation
template path. Bugs in the else branch of AttackPrompt._update_ids -- like
the Phi-3 conv_template.system AttributeError we hit on Azure -- would
never be caught.

The two new tests construct GCGAttackPrompt with the vicuna template,
which exercises the same code path. They are marked xfail (strict=True)
because vicuna's fastchat conversation template lacks a .system attribute,
reproducing the same bug. The xfail marker references issue microsoft#965 and will
flip to 'unexpectedly passed' when the fastchat replacement lands, prompting
removal of the marker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
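
The strict-xfail pattern those tests use looks roughly like the sketch below. It illustrates the marker with a direct check of the missing attribute rather than the full GCGAttackPrompt construction, so the test name and body are illustrative, not the actual test code.

```python
import pytest


@pytest.mark.xfail(
    strict=True,
    raises=AttributeError,
    reason="conv_template.system missing on non-llama-2 fastchat templates (microsoft#965)",
)
def test_vicuna_template_lacks_system_attribute():
    conversation = pytest.importorskip("fastchat.conversation")
    template = conversation.get_conv_template("vicuna_v1.1")
    # The GCG code path reads conv_template.system, which newer fastchat replaced
    # with system_template/system_message; this access raises AttributeError (#965).
    _ = template.system
```

Because the marker is strict, the test flips to "unexpectedly passed" once the fastchat replacement lands, prompting removal of the marker.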
Updates the AML notebook to reflect the actual flow we ran during Phase 1c
baseline validation: llama-2 baseline (phi-3 has fastchat microsoft#965 bug),
run_gcg_aml.py launcher script (so the uploaded code snapshot wins over the
Docker-installed package), repo-root build context (Dockerfile needs to COPY
pyproject.toml + pyrit/ for pip install -e .[gcg]), and PyRIT-style env
file loading via _load_environment_files.

Adds tests/end_to_end/auxiliary_attacks/test_gcg_aml_e2e.py mirroring that
same flow as a real e2e test. Submits a small (5-step, 5-train, batch 64)
llama-2 GCG job, polls until terminal state, asserts Completed. Skipped
unless RUN_ALL_TESTS=true and AZURE_ML_* + HUGGINGFACE_TOKEN env vars are
set (since it submits real paid Azure ML compute). Always cancels the
submitted job on test failure or interruption to avoid leaking compute.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The scripts/ directory is not packaged for PyPI installs, so the AML
launcher there was inaccessible to anyone who pip-installed pyrit[gcg].
Move the entry-point cwd handling into pyrit/auxiliary_attacks/gcg/
experiments/run.py itself: when run as `__main__`, chdir into the file's
own directory so the relative `configs/` and `results/` paths resolve
regardless of where the script is invoked from.

AML jobs (notebook and e2e test) now run
  python -m pyrit.auxiliary_attacks.gcg.experiments.run --model_name ...
which also makes the previous sys.path hack unnecessary -- `python -m`
puts cwd at the front of sys.path, so the uploaded code snapshot still
wins over the Docker-installed package.

Deletes scripts/run_gcg_aml.py and scripts/submit_gcg_job.py (the latter
was a CLI duplicate of the notebook's submission flow).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The pyarrow>=22.0.0; python_version >= '3.14' pin was added in c98af28 to
the core dependencies, but pyrit core does not actually need it -- without
the gcg extra, the resolver picks a 3.14-compatible pyarrow on its own via
the transitive datasets -> pyarrow chain. The pin is only needed when the
gcg extra is installed because something in that extra constrains the
resolution toward an older pyarrow that lacks cp314 wheels and fails to
build from source on Python 3.14.

Moves the pin to the gcg extra and adds an inline comment explaining
why it is there, matching the existing precedent for the spacy cp314
wheel comment in the all extra.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The AML e2e test previously rebuilt the MLClient, environment, and command
from scratch -- a near-copy of the notebook's submission flow. Replace that
with `runpy.run_path()` of `doc/code/auxiliary_attacks/1_gcg_azure_ml.py`.
The notebook is jupytext percent format (the `# %%` markers are plain
comments) so the file is valid Python and runs as a script. The test then
pulls `returned_job` and `ml_client` out of the executed namespace and
polls the job to a terminal state.

Result: the notebook is the single source of truth for the submission flow,
and the test verifies that what we ask users to run actually works
end-to-end. Net diff is -27 lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ran jupytext --to ipynb --execute against the notebook .py to capture cell
outputs (workspace name, environment build status, submitted job name +
status + Studio URL) per PyRIT convention. The submitted job
('lucid_muscle_nt947p71s0') ran to completion on Azure ML, doubling as a
verification that the refactored notebook (which now invokes the GCG runner
via 'python -m pyrit.auxiliary_attacks.gcg.experiments.run' instead of the
old scripts/ launcher) still works end-to-end.

The captured stderr cells include some Azure ML SDK telemetry noise
('ActivityCompleted: ... HowEnded=Failure' for benign UserError conditions
like 'environment already at version N'). Will be cleaned up in a follow-up
by suppressing the azure.ai.ml._telemetry logger in the notebook source.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds two pieces to make the AML tutorial actually show users an end-to-end
result instead of stopping at "Job submitted, monitor in Studio":

1. ``run.py`` now accepts ``--output_dir``. The runner writes its result
   JSON under whatever path the caller passes, defaulting to ``outputs/``.
   The notebook's command declares ``outputs={"results": Output(uri_folder)}``
   and passes ``--output_dir ${{outputs.results}}`` so AML mounts a path,
   the runner writes there, and the contents are uploaded as a named output
   artifact (auto-capture of ``./outputs/`` is *not* available in SDK v2
   command jobs -- you have to declare named outputs explicitly).

2. A new poll-and-inspect cell at the end polls the submitted job, then
   downloads with ``all=True``, finds the result JSON under
   ``<download_dir>/named-outputs/results/``, and prints the final loss
   and generated adversarial suffix.

Also adds a (best-effort) logging suppression block early in the notebook
for azure.ai.ml SDK telemetry. It catches the python-logging warnings but
not the "ActivityCompleted: HowEnded=Failure" lines or the upload progress
bars -- those go through the SDK's own stderr handler with
propagate=False and are not reachable via standard logging config (see
azure-ai-ml _utils/_logger_utils.py). The remaining noise is benign
telemetry for expected UserError conditions like "environment already at
this version".

Notebook re-executed end-to-end against AML (job stoic_parcel_6clfs67hp9,
llama-2, 5 train data, 5 steps): completed successfully, suffix downloaded
and printed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…noise

The 1.32.0 release includes:

  Skip _list_secrets for identity-based datastores to prevent noisy
  telemetry traces.

That bullet is exactly the source of the ``ActivityCompleted:
Activity=Datastore.ListSecrets, HowEnded=Failure ... UserError ... No
secrets for credentials of type None`` blob that was showing up in our
Azure ML notebook's executed cell outputs and made it look like the env
build / job submission was failing when it actually wasn't.

After bumping, a quick smoke (build MLClient, list envs) drops from many
lines of telemetry noise to a single ``Class X is experimental`` info
message -- much more reasonable for a tutorial. Bumped both the ``gcg``
extra and the ``all`` extra so they stay aligned.

The upload progress bars and the experimental-class warning still show
up; those are separate noise sources that this SDK release does not
address.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CI's coverage check (>=90% on diff) was failing on the log_gpu_memory
try/except added in c98af28: lines 70-74 of log.py weren't reached.

Two issues:

1. The TestGpuMemoryLogging class lived in test_lifecycle.py, which
   does pytest.importorskip on the GCG train module. CI installs the
   'all' extra but not 'gcg', so ml_collections is missing and the
   train import fails, skipping the whole test_lifecycle.py module --
   including the GPU memory tests, even though they only need stdlib.
   Moved them into test_log.py (which only importorskips the log
   module, all stdlib) so they actually run in CI.

2. The new test_log_gpu_memory_swallows_nvidia_smi_failure exercises
   the except branch (lines 73-74) that the old success-only test
   never hit. log_gpu_memory must swallow nvidia-smi failures so the
   training loop never crashes when run on a host without nvidia-smi.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
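
The behavior that new test locks in is essentially "swallow any nvidia-smi failure". A sketch under assumed names; the real implementation in log.py may differ in detail.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def log_gpu_memory(step: int) -> None:
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            text=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Hosts without nvidia-smi (CPU-only CI, the unit-test runner) must not
        # crash the training loop, so the failure is swallowed and only logged.
        logger.debug("nvidia-smi unavailable; skipping GPU memory logging")
        return
    for gpu_index, used_mib in enumerate(out.strip().splitlines()):
        logger.info("step %s: GPU %s memory used: %s MiB", step, gpu_index, used_mib)
```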