AutoscalingListener stuck in recreate loop after duplicate EphemeralRunnerSet creation on fresh install (v0.14.0 / v0.14.1) #4489

Description

@Prateek-stepsecurity

Controller Version

0.14.0 and 0.14.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. `helm install` both `gha-runner-scale-set-controller` and `gha-runner-scale-set` v0.14.0 or v0.14.1 on a fresh EKS cluster.
2. Use any reasonable values file (we have reproduced this on both `containerMode: dind` and `containerMode: k8s`).
3. Observe `kubectl get pods -n <runner-namespace>` — the AutoscalingListener pod cycles through `ContainerCreating → Running → Error → Terminating → ContainerCreating ...` every few seconds and never stabilizes.
4. Submit any workflow targeting the scale set — it stays in `queued` indefinitely; no runner pods are created.

This is intermittent — most fresh installs are fine. We have reproduced it on multiple separate fresh EKS clusters with completely different configurations (one `dind` + Cilium, one `k8s` mode no Cilium), suggesting the trigger is timing in the controller's reconcile loop rather than any environment-specific factor.
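If an install hits the bug, the symptoms from steps 3 and 4 can be confirmed with the commands below (namespace is a placeholder; `ephemeralrunners.actions.github.com` is the CRD for the individual runner resources):

# Watch the listener pod cycle from step 3
kubectl get pods -n <runner-namespace> -w

# Step 4: no EphemeralRunners are ever created while the workflow sits queued
kubectl get ephemeralrunners.actions.github.com -n <runner-namespace>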

Describe the bug

On a fresh helm install of gha-runner-scale-set v0.14.0 / v0.14.1, the AutoscalingRunnerSet controller's reconciler creates multiple EphemeralRunnerSets in rapid succession (within the same second) and then deletes some of them, but the AutoscalingListener resource — and its config secret and RBAC Role — are populated with the name of an EphemeralRunnerSet that gets deleted. The listener can never recover, because every restart inherits the same stale EphemeralRunnerSet name from its config secret and RBAC Role.

Trace of the duplicate-create burst on a fresh install (k8s container mode, runnerScaleSetID 93):

16:48:00Z Created EphemeralRunnerSet k8smode-sidecar-eks-scaleset-hpns6
16:48:00Z Created EphemeralRunnerSet k8smode-sidecar-eks-scaleset-k7phr ← duplicate
16:48:00Z Deleted EphemeralRunnerSet k8smode-sidecar-eks-scaleset-hpns6
16:48:00Z Deleted EphemeralRunnerSet k8smode-sidecar-eks-scaleset-k7phr ← both originals gone
16:48:00Z Created EphemeralRunnerSet k8smode-sidecar-eks-scaleset-8g45d ← third created

After the burst, cluster state (the kubectl sketch after this list shows one way to confirm it):

  • Surviving EphemeralRunnerSet: k8smode-sidecar-eks-scaleset-8g45d
  • AutoscalingListener.spec.ephemeralRunnerSetName: k8smode-sidecar-eks-scaleset-k7phr (deleted)
  • Listener RBAC Role.rules[0].resourceNames: ["k8smode-sidecar-eks-scaleset-k7phr"] (deleted)
  • Listener config secret: contains the same deleted name
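One way to capture the state above; the Role name is whatever name the controller generated for the listener in your install, and the Role is assumed to live in the runner namespace alongside the EphemeralRunnerSet:

# Surviving EphemeralRunnerSet(s)
kubectl get ephemeralrunnersets.actions.github.com -n <runner-namespace>

# EphemeralRunnerSet name the listener is bound to
kubectl get autoscalinglisteners.actions.github.com -A \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ephemeralRunnerSetName}{"\n"}{end}'

# resourceNames on the listener's RBAC Role
kubectl get role <listener-role-name> -n <runner-namespace> \
  -o jsonpath='{.rules[0].resourceNames}'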

The listener pod boots, authenticates with api.github.com, opens a broker session against runnerScaleSetID 93, receives the initial scale instruction (replicas=2), and then attempts to PATCH ephemeralrunnersets/k7phr against the kube-apiserver — which correctly responds 404 because k7phr was deleted at 16:48:00. The listener exits with Error.

Listener log (consistent across every restart of the recreate loop):

msg="Starting listener"
msg="Handling initial session statistics"  totalAssignedJobs=0
msg="Calculated target runner count"  min=2 max=3 currentRunnerCount=2
msg="Preparing EphemeralRunnerSet update"  json="{\"spec\":{\"patchID\":0,\"replicas\":2}}"
Application returned an error: handling initial message failed:
  could not patch ephemeral runner set,
  patch JSON: {"spec":{"patchID":0,"replicas":2}},
  error: ephemeralrunnersets.actions.github.com "k8smode-sidecar-eks-scaleset-k7phr" not found

The arc-gha-rs-controller observes the exited listener pod, deletes it, and creates a new one. The new pod mounts the same config secret (still pointing at k7phr), follows the same sequence, hits the same 404, and exits. This loop repeats every ~5 seconds indefinitely. No EphemeralRunner / runner pods are ever created, and any workflow targeting this scale set sits in queued state until its job timeout fires.
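Because every replacement pod mounts the same config secret, the stale reference survives restarts. A rough way to see this (secret and pod names are placeholders, and the secret's key layout may differ between chart versions):

# Decode the listener config secret; the deleted EphemeralRunnerSet name shows up
# in the payload on every restart
kubectl get secret <listener-config-secret> -n <listener-namespace> \
  -o go-template='{{range $k, $v := .data}}{{$k}}{{": "}}{{$v | base64decode}}{{"\n"}}{{end}}'

# Log of the previously exited listener container (same 404 every time)
kubectl logs <listener-pod> -n <listener-namespace> --previous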

EKS audit log for the same window confirms both create calls and the delete calls came from the same arc-gha-rs-controller ServiceAccount and same controller pod IP — ruling out parallel helm install, multiple controller replicas, or any external trigger. The duplicate creation is generated by the AutoscalingRunnerSet reconciler itself.
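For reference, the audit check was run against the EKS control-plane audit log in CloudWatch Logs. A query along these lines surfaces the create/delete burst and the calling identity; the log-group name follows the standard EKS convention, the time window is a placeholder, and audit logging must be enabled on the cluster:

# Returns a query ID; fetch results with `aws logs get-query-results --query-id <id>`
aws logs start-query \
  --log-group-name /aws/eks/<cluster-name>/cluster \
  --start-time <epoch-start> --end-time <epoch-end> \
  --query-string 'fields @timestamp, verb, objectRef.name, user.username, sourceIPs.0
    | filter objectRef.resource = "ephemeralrunnersets" and verb in ["create", "delete"]
    | sort @timestamp asc'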

Describe the expected behavior

Exactly one EphemeralRunnerSet should be created per AutoscalingRunnerSet on initial install. The AutoscalingListener (and its config secret and RBAC Role) should be created bound to the surviving EphemeralRunnerSet, and the cluster should reach a stable, working state on first install without operator intervention.

Additional Context

runner_scale_set_values.yaml passed to helm install:

containerMode:
  type: "k8s"
minRunners: 2
maxRunners: 3

Install command (token redacted):

helm install <release> \
  --namespace <ns> --create-namespace \
  --version 0.14.1 \
  --set githubConfigUrl=https://github.com/<org> \
  --set githubConfigSecret.github_token=<REDACTED> \
  -f runner_scale_set_values.yaml \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

Environment:
Kubernetes: AWS EKS
Container: k8s mode
Chart: gha-runner-scale-set 0.14.0 and 0.14.1 (both versions reproduce)
Chart: gha-runner-scale-set-controller 0.14.0 and 0.14.1
Cluster: fresh, short-lived test cluster

This is intermittent. Most fresh helm installs reach a healthy state on the
first try. Occasionally the AutoscalingRunnerSet reconciler creates multiple
EphemeralRunnerSets in the same second and the AutoscalingListener ends up
bound to one of the deleted duplicates, producing the failure above.

Reports describing the same family of stuck-runner symptom on older versions
(0.9.3, 0.10.1, etc.) include #3821 — closed without a root-cause fix.
This issue may share an underlying cause but the duplicate-EphemeralRunnerSet
creation pattern on the AutoscalingRunnerSet's first reconcile appears to be
more reliably reproducible starting in 0.14.0.

Controller Logs

https://gist.github.com/Prateek-stepsecurity/da333ca3486c74829facfe8eef9230d0

Runner Pod Logs

No runner pods are ever created — the listener cannot complete its initial scale instruction.
The listener pod log below is identical across every restart attempt (same error, same deleted ERS reference):

https://gist.github.com/Prateek-stepsecurity/d68b719d6ccfa84ceb9a6c9230f473e7

Metadata

    Labels

    bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
