Troubleshooting
Common install + runtime issues and their fixes. Clu's chat is the fastest troubleshooting surface for most of these — the agent already knows how to diagnose its own errors — but the common symptoms are collected here as a pre-chat reference.
Install
Pod stuck in Pending after helm install
- Most common cause: the ServiceAccount annotation doesn't match a real IAM role. Clu won't start if IRSA is misconfigured in production mode. Fix:
  kubectl annotate sa -n clu-ops clu-ops-agent \
    eks.amazonaws.com/role-arn=arn:aws:iam::...:role/clu-irsa --overwrite
  kubectl rollout restart -n clu-ops deploy/clu-ops-agent
- Less common: resource requests don't fit the node. Check kubectl describe pod -n clu-ops; the scheduler's FailedScheduling event names the exact constraint that failed.
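To isolate the scheduler's verdict from a noisy event stream, here is a minimal Python sketch. It assumes the events were exported with kubectl get events -n clu-ops -o json and relies only on the standard reason and message fields of Kubernetes Event objects. The helper name failed_scheduling_messages and the sample data are illustrative, not part of Clu.

```python
import json

def failed_scheduling_messages(events_json: str) -> list[str]:
    """Return the message of every FailedScheduling event in an EventList JSON dump."""
    items = json.loads(events_json).get("items", [])
    return [e.get("message", "") for e in items if e.get("reason") == "FailedScheduling"]

# Hypothetical sample shaped like `kubectl get events -o json` output.
sample = json.dumps({"items": [
    {"reason": "Scheduled", "message": "Successfully assigned clu-ops/clu-ops-agent to node-1"},
    {"reason": "FailedScheduling", "message": "0/3 nodes are available: 3 Insufficient memory."},
]})

for message in failed_scheduling_messages(sample):
    print(message)
```

Each printed message names the constraint the scheduler could not satisfy, which is the same text kubectl describe pod shows under Events.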
Pod in CrashLoopBackOff
kubectl logs -n clu-ops deploy/clu-ops-agent -c backend
- pydantic_core._pydantic_core.ValidationError on startup: a Helm value doesn't match the expected type. The traceback names the field.
- NoCredentialsError from botocore: IRSA isn't wired. Verify the SA annotation (above) + the role's trust policy (see Install Clu on AWS EKS).
Chat returns errors
error_kind: permission_denied in chat
The agent hit a 403 — K8s RBAC, Prometheus auth, or AWS IAM. Clu surfaces the exact fix inline. Example:
error_kind: permission_denied
The agent was denied 'list' on Secret (namespace 'prod') — forbidden.
Fix:
Add this rule to the reader ClusterRole (...):
  - apiGroups: [""]
    resources:
      - secrets
    verbs: ['get', 'list', 'watch']
After editing the chart, re-run `bash scripts/dev-deploy.sh` (or the
equivalent helm upgrade) to roll the new RBAC into the cluster.
The Fix: block is paste-ready — apply it to the policy named in
the hint. Same shape for:
- K8s RBAC → paste into your customer-side fork of the reader ClusterRole (the rules are inlined in Permissions).
- Prometheus 401/403 → set integrations.prometheus.url to the unauthenticated in-cluster Service, or open a NetworkPolicy.
- AWS IAM → paste into the Cloud IAM policy (the JSON is inlined in IAM setup).
tool execution error: no scripted response for expr=...
You're running Clu without Prometheus configured (or auto-discovery didn't find it). Two fixes:
- Install a kube-prometheus-stack release. Clu auto-detects Services whose name contains prometheus with port 9090.
- Set the URL explicitly if your Prometheus runs under a non-standard name:
  integrations:
    prometheus:
      url: http://my-prometheus.monitoring.svc:9090
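To check ahead of time whether one of your Services would satisfy the auto-detection rule described above, the following sketch applies that rule (name contains prometheus, exposes port 9090) to Service objects as they appear in kubectl get svc -A -o json. It mirrors the rule as stated here, not Clu's actual discovery code. The function name and the sample Services are hypothetical.

```python
def looks_like_prometheus(service: dict) -> bool:
    """True if the Service name contains 'prometheus' and one of its ports is 9090."""
    name = service.get("metadata", {}).get("name", "")
    ports = service.get("spec", {}).get("ports", [])
    return "prometheus" in name and any(p.get("port") == 9090 for p in ports)

# Hypothetical Services, trimmed to the fields the rule needs.
services = [
    {"metadata": {"name": "kube-prometheus-stack-prometheus"}, "spec": {"ports": [{"port": 9090}]}},
    {"metadata": {"name": "prometheus-node-exporter"}, "spec": {"ports": [{"port": 9100}]}},
    {"metadata": {"name": "my-metrics"}, "spec": {"ports": [{"port": 9090}]}},
]

matches = [s["metadata"]["name"] for s in services if looks_like_prometheus(s)]
print(matches)
```

If nothing in your cluster matches, setting integrations.prometheus.url explicitly is the safer route.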
Bedrock returns AccessDeniedException
Two distinct failure modes:
- IAM action missing: the IRSA role doesn't have bedrock:InvokeModel. Fix via Terraform or IAM console.
- Model access not requested: Bedrock in a fresh AWS account blocks every model by default until you opt in. Go to the Bedrock console → Model access → enable Claude Sonnet + Haiku in the target region.
The error message distinguishes them: "is not authorized" → IAM action missing; "You don't have access to the model" → model access not requested.
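If you are triaging these from logs, the distinction above can be expressed as a small classifier. This is a sketch that assumes you already have the AccessDeniedException message as a string. The function name, the returned labels, and the abbreviated sample messages are illustrative. Only the two quoted phrases come from this page.

```python
def classify_bedrock_access_denied(message: str) -> str:
    """Map an AccessDeniedException message to one of the two failure modes."""
    if "is not authorized" in message:
        return "iam_action_missing"
    if "You don't have access to the model" in message:
        return "model_access_not_requested"
    return "unknown"

# Abbreviated, illustrative messages containing the phrases quoted above.
print(classify_bedrock_access_denied("User ... is not authorized to perform: bedrock:InvokeModel"))
print(classify_bedrock_access_denied("You don't have access to the model with the specified model ID."))
```

An unknown result means the message matched neither phrase, so read the full error rather than guessing at the cause.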
Reports + approvals
Reports view is empty
First check takes up to 30 minutes if the pod was restarted
immediately after install (the first scheduled tick hasn't fired
yet). Force one from the UI (Run now) or from chat ("run a health
check now"). The endpoint behind both is
POST /api/reports/run-now.
Approvals view is empty overnight
By default reports.recommenderMode=true — findings carry
recommendations but the operator has to manually stage them. For
"approvals populate overnight" behavior, set:
reports:
  recommenderMode: false
  maxAutoStagedPerTick: 10  # cap per scheduler tick
And bump the TTL if a weekend might pass between stage + approve:
approvals:
  ttlSeconds: 172800  # 48h
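To size ttlSeconds for your own review cadence, compute the gap between when a recommendation is staged and the latest time someone is expected to approve it. The sketch below does that arithmetic. The helper name and the timestamps are hypothetical, and it assumes both times are in the same timezone.

```python
from datetime import datetime

def ttl_seconds_for_gap(staged_at: datetime, approved_by: datetime) -> int:
    """Seconds between staging and the latest expected approval time."""
    gap = approved_by - staged_at
    if gap.total_seconds() < 0:
        raise ValueError("approved_by must not be earlier than staged_at")
    return int(gap.total_seconds())

# Hypothetical example: staged Friday 18:00, reviewed Monday 09:00.
print(ttl_seconds_for_gap(datetime(2024, 1, 5, 18, 0), datetime(2024, 1, 8, 9, 0)))  # 226800 (63h)
```

Under these assumptions a Friday-evening stage with a Monday-morning review needs more than the 48h shown above, so pick a value that covers your longest expected gap.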
Approved a recommendation but nothing happened in the cluster
Check the Approvals view for an "Approved, but the write didn't land" banner, which means the dispatch failed post-approval. The audit log has the specific error:
kubectl get cm -n clu-ops clu-ops-audit -o jsonpath='{.data.audit\.json}' | jq '.[-5:]'
Common causes:
- Core Plus entitlement lapsed between stage + approve. Re-check the Modules view.
- RBAC changed. Re-look at the writer ClusterRole diff.
- The target workload was deleted between stage + approve (the dry-run from 12 hours ago was against a now-gone resource).
UI
UI shows "Failed to load approvals" / "Failed to load reports"
The ConfigMap client can't read the store. Verify the state
Role exists + is bound:
kubectl get role -n clu-ops | grep state
kubectl get rolebinding -n clu-ops | grep state
Missing → re-render the chart (the role-state.yaml template is part
of the default install). Present but the API server still 403s → look
for custom NetworkPolicy or AdmissionController rules blocking
the agent's SA.
SSE chat stream disconnects after 60 seconds
Ingress timeout. Configure long-lived timeouts for /api/query:
- nginx: proxy-read-timeout: 3600 on the Ingress annotation.
- ALB: idle_timeout.timeout_seconds=3600 on the load balancer attributes.
See UI Access for the full examples.
Performance
First chat response is slow
The first conversation turn builds the system prompt, which reads the
scan. If the scanner hasn't run yet, the prompt has no cluster
context and the agent calls more tools to compensate. Wait for
the first scheduled knowledge refresh (or force a scan via "run a cluster scan"); chat responds faster after that.
Reporter scan blocks for minutes
Large clusters (10k+ pods) can spend significant time listing resources. Reduce the scanner's surface via Helm:
agent:
  namespaces:
    include: [prod, staging]  # Only these namespaces
or
agent:
  namespaces:
    exclude: [kube-system, monitoring, istio-system, argo, argocd]
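To preview how much either setting would shrink the scan, the sketch below applies an include or exclude list to a list of namespace names. The precedence rule (a non-empty include list wins, otherwise the exclude list is subtracted) is an assumption made for illustration and is not confirmed by this page. The function name is hypothetical too.

```python
def namespaces_to_scan(all_namespaces, include=None, exclude=None):
    """Return the namespaces left after applying include or exclude (assumed semantics)."""
    if include:
        wanted = set(include)
        return [ns for ns in all_namespaces if ns in wanted]
    skipped = set(exclude or [])
    return [ns for ns in all_namespaces if ns not in skipped]

# Hypothetical cluster namespaces, e.g. from `kubectl get ns -o name`.
cluster = ["prod", "staging", "kube-system", "monitoring", "argocd"]
print(namespaces_to_scan(cluster, include=["prod", "staging"]))
print(namespaces_to_scan(cluster, exclude=["kube-system", "monitoring", "argocd"]))
```

Both calls leave only prod and staging in this sample, which is why the two Helm forms are shown as alternatives.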
Bedrock latency feels high
Switch the interactive model tier to fast (Haiku) if the workload is
tool-call-heavy rather than reasoning-heavy:
agent:
  model:
    interactive: fast
Haiku is ~3-5x faster than Sonnet; the tradeoff is weaker multi-step reasoning. Fine for "fetch logs + describe pod" flows; worse for "debug this cross-cluster connectivity puzzle."
Logs + debugging
# Backend logs (streaming)
kubectl logs -n clu-ops deploy/clu-ops-agent -c backend -f
# Frontend / ingress logs
kubectl logs -n clu-ops deploy/clu-ops-agent -c frontend
# State ConfigMaps
kubectl get cm -n clu-ops -o name | grep clu-ops-
# → clu-ops-reports
# → clu-ops-approvals
# → clu-ops-snoozes
# → clu-ops-audit
# → clu-ops-scan
# Inspect the audit log
kubectl get cm -n clu-ops clu-ops-audit \
  -o jsonpath='{.data.audit\.json}' | jq '.[-20:]'
Still stuck
- Check Getting Started for the happy-path walkthrough.
- Re-read Install Clu on AWS EKS if anything AWS-shaped is failing.
- File an issue with the commit SHA (kubectl get deploy -n clu-ops clu-ops-agent -o jsonpath='{.metadata.annotations}'), the pod logs, and the relevant kubectl describe output.