feat(observability): add gateway OTLP traces and Helm monitoring surface #1270
Draft
TaylorMutch wants to merge 4 commits into main from
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Adds a label-gated GitHub Actions workflow that exercises the Helm chart end-to-end against the Rust e2e suite via `mise run e2e:helm`.

Pipeline:
- pr_metadata gates on the `test:e2e-helm` label via the pr-gate action.
- build-gateway / build-supervisor build and push Docker images using the reusable docker-build.yml workflow.
- helm-e2e (bare runner): apt-installs z3 build deps so cargo can compile the openshell-policy crate's z3-sys backend, creates a kind cluster via helm/kind-action, materializes the kind kubeconfig at the path mise's [env] block expects, side-loads the freshly built gateway/supervisor images, applies deploy/kube/manifests/agent-sandbox.yaml so the sandboxes.agents.x-k8s.io CRD and reconciling StatefulSet are in place, and finally runs `mise run e2e:helm`.

Also expands the `e2e:helm` task to run the full Rust e2e suite (matching `e2e:podman`) instead of only the smoke test, with OPENSHELL_E2E_KUBE_TEST as an opt-in single-test override for local debugging.

Extends the e2e-label-help workflow so applying `test:e2e-helm` posts the next-step hint pointing at this workflow.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
Adds opt-in OpenTelemetry trace export and a Prometheus ServiceMonitor to
the gateway Helm chart. The exporter and chart toggles are independent
from the existing /metrics surface and the OCSF sandbox log fan-out.
- Gateway: append a tracing-opentelemetry layer to TracingLogBus when an
OTLP/gRPC endpoint is configured; flush spans on shutdown. CLI gains
--otlp-endpoint; standard OTEL_* env vars drive sampling and resource
attributes.
- Helm: monitoring.serviceMonitor.* renders a Prometheus-Operator
ServiceMonitor; monitoring.tracing.* projects OTEL_* env vars onto the
gateway container. Both default off.
- Tooling: observability:k8s:{setup,teardown,port-forward} mise tasks
install kube-prometheus-stack + Jaeger all-in-one for local dev.
- Docs: new docs/kubernetes/monitoring.mdx; cross-links from observability
overview and architecture/gateway.md; helm-dev-environment and
debug-openshell-cluster skills updated.
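
For illustration, the endpoint selection mentioned in the Gateway bullet could look roughly like the sketch below. It assumes the order given in the PR description (`OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`, then `OTEL_EXPORTER_OTLP_ENDPOINT`, then `--otlp-endpoint`) runs from highest to lowest precedence, and that blank values count as unset. `pick_endpoint` and `endpoint_from_env` are hypothetical names, not the PR's `OtlpTracingConfig::resolve`.

```rust
use std::env;

/// Hypothetical helper: returns the first non-blank candidate, scanning
/// from the assumed highest precedence to the lowest.
fn pick_endpoint(
    traces_env: Option<&str>,
    generic_env: Option<&str>,
    cli_flag: Option<&str>,
) -> Option<String> {
    traces_env
        .into_iter()
        .chain(generic_env)
        .chain(cli_flag)
        .map(str::trim)
        // Assumption: a blank value is treated the same as an unset one.
        .find(|s| !s.is_empty())
        .map(str::to_owned)
}

/// Reads the two standard OTel env vars and falls back to the CLI value.
/// `None` means no endpoint is configured, so tracing export stays off.
fn endpoint_from_env(cli_flag: Option<&str>) -> Option<String> {
    let traces = env::var("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT").ok();
    let generic = env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok();
    pick_endpoint(traces.as_deref(), generic.as_deref(), cli_flag)
}
```

Keeping the precedence logic in a pure function, separate from the env lookup, makes it testable without mutating process-wide environment variables.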
…files

The kube-prometheus-stack and Jaeger releases were configured via long chains of `--set` flags, which obscure the configuration and make the script hard to extend. Extract them into two checked-in values files the setup script consumes via `--values`.

- tasks/scripts/observability-prometheus-values.yaml: slim chart config plus Grafana auto-provisioning of a Jaeger datasource (stable uid so dashboards can reference it).
- tasks/scripts/observability-jaeger-values.yaml: all-in-one Jaeger.
- PROMSTACK_VALUES and JAEGER_VALUES env vars allow pointing at custom files for local experimentation.
a551804 to
c6463bf
🌿 Preview your docs: https://nvidia-preview-pr-1270.docs.buildwithfern.com/openshell
Summary
Adds opt-in OpenTelemetry trace export to the gateway and a Prometheus `ServiceMonitor` to the Helm chart. Both surfaces are independent of the existing `/metrics` endpoint and the OCSF sandbox log fan-out, are off by default, and are configured via standard `OTEL_*` env vars or chart values.
Changes
Gateway (`crates/openshell-server`)
- 0.29 / `tracing-opentelemetry 0.30` (the latest set compatible with the workspace's `tonic 0.12` + `prost 0.13`).
- `TracingLogBus::install_subscriber` now optionally appends a `tracing-opentelemetry` layer when an OTLP endpoint is configured. The existing `tower_http::trace::TraceLayer` per-request span automatically becomes the OTLP root; no `#[instrument]` rewrites required.
- `OtlpTracingConfig::resolve` honors `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` → `OTEL_EXPORTER_OTLP_ENDPOINT` → `--otlp-endpoint` precedence.
- `OTEL_TRACES_SAMPLER` / `OTEL_TRACES_SAMPLER_ARG`; default `parent_based_traceidratio(1.0)`.
- `shutdown()` flushes the `BatchSpanProcessor` from the gateway shutdown path on `SIGTERM`.
Helm chart
- `monitoring.serviceMonitor.*` and `monitoring.tracing.*` blocks in `values.yaml` (off by default).
- `templates/servicemonitor.yaml` (gated, scrapes the existing named `metrics` port).
- `OTEL_*` env vars when tracing is enabled, including merged `OTEL_RESOURCE_ATTRIBUTES`.
- `ci/values-monitoring.yaml` overlay and commented-in `kube-prometheus-stack` + `jaeger` Helm releases in `skaffold.yaml`.
- `deploy/helm/openshell/README.md`.
Tooling
- `tasks/observability.toml` exposing `observability:k8s:setup`, `observability:k8s:teardown`, and `observability:port-forward`.
- `tasks/scripts/` mirroring the existing `keycloak-k8s-setup.sh` shape: install slim `kube-prometheus-stack` + Jaeger all-in-one, idempotent re-runs.
Docs / agent skills
- `docs/kubernetes/monitoring.mdx` (operator + local-dev guide).
- `docs/observability/overview.mdx` and a new "Observability surface" subsection in `architecture/gateway.md`.
- `helm-dev-environment` and `debug-openshell-cluster` skills updated.
Testing
- `mise run pre-commit` passes (lint, format, license headers, clippy, helm-lint matrix, full workspace tests).
- `OtlpTracingConfig::resolve` and `sampler_from_env`.
- `observability:k8s:setup`, deployed gateway with `ci/values-monitoring.yaml`, drove 5 `ListSandboxes` + 3 `Health` gRPC calls. Verified:
  - `up{job="openshell"} == 1`; `openshell_server_grpc_requests_total` totals match driven traffic (8).
  - `openshell-gateway` service; 8 `request` spans with correct `method`, `path`, `request_id` attributes; resource attributes include `service.namespace=openshell`, `service.version=0.0.0`, `deployment.environment=dev`, `telemetry.sdk.version=0.29.0`.
Out of scope (follow-ups)
- `protocol: grpc`.
- `#[tracing::instrument]` annotations on gRPC handlers.
Checklist