AutoscalingListener stuck in recreate loop after duplicate EphemeralRunnerSet creation on fresh install (v0.14.0 / v0.14.1) #4489

Description

@Prateek-stepsecurity

Controller Version

0.14.0 and 0.14.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. `helm install` both `gha-runner-scale-set-controller` and `gha-runner-scale-set` v0.14.0 or v0.14.1 on a fresh EKS cluster.
2. Use any reasonable values file (we have reproduced this on both `containerMode: dind` and `containerMode: k8s`).
3. Observe `kubectl get pods -n <runner-namespace>` — the AutoscalingListener pod cycles through `ContainerCreating → Running → Error → Terminating → ContainerCreating ...` every few seconds and never stabilizes.
4. Submit any workflow targeting the scale set — it stays in `queued` indefinitely; no runner pods are created.

This is intermittent — most fresh installs are fine. We have reproduced it on multiple separate fresh EKS clusters with completely different configurations (one `dind` + Cilium, one `k8s` mode no Cilium), suggesting the trigger is timing in the controller's reconcile loop rather than any environment-specific factor.
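If an install hits the bug, the symptoms from steps 3 and 4 can be confirmed with the commands below (namespace is a placeholder; `ephemeralrunners.actions.github.com` is the CRD for the individual runner resources):

# Watch the listener pod cycle from step 3
kubectl get pods -n <runner-namespace> -w

# Step 4: no EphemeralRunners are ever created while the workflow sits queued
kubectl get ephemeralrunners.actions.github.com -n <runner-namespace>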

Describe the bug

On a fresh helm install of gha-runner-scale-set v0.14.0 / v0.14.1, the AutoscalingRunnerSet controller's reconciler creates multiple EphemeralRunnerSets in rapid succession (within the same second) and then deletes some of them, but the AutoscalingListener resource — and its config secret and RBAC Role — are populated with the name of an EphemeralRunnerSet that gets deleted. The listener can never recover, because every restart inherits the same stale EphemeralRunnerSet name from its config secret and RBAC Role.

Trace of the duplicate-create burst on a fresh install (k8s container mode, runnerScaleSetID 93):

16:48:00Z Created EphemeralRunnerSet k8smode-sidecar-eks-scaleset-hpns6
16:48:00Z Created EphemeralRunnerSet k8smode-sidecar-eks-scaleset-k7phr ← duplicate
16:48:00Z Deleted EphemeralRunnerSet k8smode-sidecar-eks-scaleset-hpns6
16:48:00Z Deleted EphemeralRunnerSet k8smode-sidecar-eks-scaleset-k7phr ← both originals gone
16:48:00Z Created EphemeralRunnerSet k8smode-sidecar-eks-scaleset-8g45d ← third created

After the burst, cluster state (the kubectl sketch after this list shows one way to confirm it):

  • Surviving EphemeralRunnerSet: k8smode-sidecar-eks-scaleset-8g45d
  • AutoscalingListener.spec.ephemeralRunnerSetName: k8smode-sidecar-eks-scaleset-k7phr (deleted)
  • Listener RBAC Role.rules[0].resourceNames: ["k8smode-sidecar-eks-scaleset-k7phr"] (deleted)
  • Listener config secret: contains the same deleted name
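One way to capture the state above; the Role name is whatever name the controller generated for the listener in your install, and the Role is assumed to live in the runner namespace alongside the EphemeralRunnerSet:

# Surviving EphemeralRunnerSet(s)
kubectl get ephemeralrunnersets.actions.github.com -n <runner-namespace>

# EphemeralRunnerSet name the listener is bound to
kubectl get autoscalinglisteners.actions.github.com -A \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ephemeralRunnerSetName}{"\n"}{end}'

# resourceNames on the listener's RBAC Role
kubectl get role <listener-role-name> -n <runner-namespace> \
  -o jsonpath='{.rules[0].resourceNames}'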

The listener pod boots, authenticates with api.github.com, opens a broker session against runnerScaleSetID 93, receives the initial scale instruction (replicas=2), and then attempts to PATCH ephemeralrunnersets/k7phr against the kube-apiserver — which correctly responds 404 because k7phr was deleted at 16:48:00. The listener exits with Error.

Listener log (consistent across every restart of the recreate loop):

msg="Starting listener"
msg="Handling initial session statistics"  totalAssignedJobs=0
msg="Calculated target runner count"  min=2 max=3 currentRunnerCount=2
msg="Preparing EphemeralRunnerSet update"  json="{\"spec\":{\"patchID\":0,\"replicas\":2}}"
Application returned an error: handling initial message failed:
  could not patch ephemeral runner set,
  patch JSON: {"spec":{"patchID":0,"replicas":2}},
  error: ephemeralrunnersets.actions.github.com "k8smode-sidecar-eks-scaleset-k7phr" not found

The arc-gha-rs-controller observes the exited listener pod, deletes it, and creates a new one. The new pod mounts the same config secret (still pointing at k7phr), follows the same sequence, hits the same 404, and exits. This loop repeats every ~5 seconds indefinitely. No EphemeralRunner / runner pods are ever created, and any workflow targeting this scale set sits in queued state until its job timeout fires.
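Because every replacement pod mounts the same config secret, the stale reference survives restarts. A rough way to see this (secret and pod names are placeholders, and the secret's key layout may differ between chart versions):

# Decode the listener config secret; the deleted EphemeralRunnerSet name shows up
# in the payload on every restart
kubectl get secret <listener-config-secret> -n <listener-namespace> \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{": "}}{{$v | base64decode}}{{"\n"}}{{end}}'

# Log of the previously exited listener container (same 404 every time)
kubectl logs <listener-pod> -n <listener-namespace> --previous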

EKS audit log for the same window confirms both create calls and the delete calls came from the same arc-gha-rs-controller ServiceAccount and same controller pod IP — ruling out parallel helm install, multiple controller replicas, or any external trigger. The duplicate creation is generated by the AutoscalingRunnerSet reconciler itself.
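For reference, the audit check was run against the EKS control-plane audit log in CloudWatch Logs. A query along these lines surfaces the create/delete burst and the calling identity; the log-group name follows the standard EKS convention, the time window is a placeholder, and audit logging must be enabled on the cluster:

# Returns a query ID; fetch results with `aws logs get-query-results --query-id <id>`
aws logs start-query \
  --log-group-name /aws/eks/<cluster-name>/cluster \
  --start-time <epoch-start> --end-time <epoch-end> \
  --query-string 'fields @timestamp, verb, objectRef.name, user.username, sourceIPs.0
    | filter objectRef.resource = "ephemeralrunnersets" and verb in ["create", "delete"]
    | sort @timestamp asc'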

Describe the expected behavior

Exactly one EphemeralRunnerSet should be created per AutoscalingRunnerSet on initial install. The AutoscalingListener (and its config secret and RBAC Role) should be created bound to the surviving EphemeralRunnerSet, and the cluster should reach a stable, working state on first install without operator intervention.

Additional Context

runner_scale_set_values.yaml passed to helm install:

containerMode:
  type: "k8s"
minRunners: 2
maxRunners: 3

Install command (token redacted):

helm install <release> \
  --namespace <ns> --create-namespace \
  --version 0.14.1 \
  --set githubConfigUrl=https://github.com/<org> \
  --set githubConfigSecret.github_token=<REDACTED> \
  -f runner_scale_set_values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

Environment:
Kubernetes: AWS EKS
Container: k8s mode
Chart: gha-runner-scale-set 0.14.0 and 0.14.1 (both versions reproduce)
Chart: gha-runner-scale-set-controller 0.14.0 and 0.14.1
Cluster: fresh, short-lived test cluster

This is intermittent. Most fresh helm installs reach a healthy state on the
first try. Occasionally the AutoscalingRunnerSet reconciler creates multiple
EphemeralRunnerSets in the same second and the AutoscalingListener ends up
bound to one of the deleted duplicates, producing the failure above.

Reports describing the same family of stuck-runner symptom on older versions
(0.9.3, 0.10.1, etc.) include #3821 — closed without a root-cause fix.
This issue may share an underlying cause but the duplicate-EphemeralRunnerSet
creation pattern on the AutoscalingRunnerSet's first reconcile appears to be
more reliably reproducible starting in 0.14.0.

Controller Logs

https://gist.github.com/Prateek-stepsecurity/da333ca3486c74829facfe8eef9230d0

Runner Pod Logs

No runner pods are ever created — the listener cannot complete its initial scale instruction.
The listener pod log below is identical across every restart attempt (same error, same deleted ERS reference):

https://gist.github.com/Prateek-stepsecurity/d68b719d6ccfa84ceb9a6c9230f473e7

Metadata

    Labels

    bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
