You're viewing docs for v0.1.x.

Troubleshooting

Common install + runtime issues and their fixes. Clu's chat is the fastest troubleshooting surface for most of these — the agent already knows how to diagnose its own errors — but the common symptoms are collected here as a pre-chat reference.

Install

Pod stuck in Pending after helm install

  • Most common cause: the ServiceAccount annotation doesn't match a real IAM role. Clu won't start if IRSA is misconfigured in production mode. Fix:

    kubectl annotate sa -n clu-ops clu-ops-agent \
      eks.amazonaws.com/role-arn=arn:aws:iam::...:role/clu-irsa --overwrite
    kubectl rollout restart -n clu-ops deploy/clu-ops-agent
    
  • Less common: resource requests don't fit the node. Check kubectl describe pod -n clu-ops — the scheduler's FailedScheduling event names the exact constraint that failed.
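
Both causes can be checked from the command line before changing anything. A minimal sketch, assuming the namespace and ServiceAccount names used above; the function names are illustrative and not part of Clu:

```shell
# Print the IAM role ARN currently annotated on the agent's ServiceAccount.
# Empty output means the annotation is missing.
show_irsa_annotation() {
  kubectl get sa -n clu-ops clu-ops-agent \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# List only the scheduler's FailedScheduling events in the namespace.
show_scheduling_failures() {
  kubectl get events -n clu-ops --field-selector reason=FailedScheduling
}
```

Run show_irsa_annotation first: if the ARN it prints does not match a role that exists in your account, the annotate command above is the fix. If the ARN looks right, show_scheduling_failures narrows the output to the scheduler's own explanation.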

Pod in CrashLoopBackOff

Start with the backend container's logs and match the error against the signatures below:

    kubectl logs -n clu-ops deploy/clu-ops-agent -c backend

  • pydantic_core._pydantic_core.ValidationError on startup — a Helm value doesn't match the expected type. The traceback names the field.
  • NoCredentialsError from botocore — IRSA isn't wired. Verify the SA annotation (above) + the role's trust policy (see Install Clu on AWS EKS).
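
The two signatures above can be matched mechanically. A small sketch, assuming the error text appears verbatim in the logs; triage_crashloop and its output labels are hypothetical names chosen for this example:

```shell
# Map backend log text (read from stdin) to one of the causes listed above.
triage_crashloop() {
  local logs
  logs="$(cat)"
  case "$logs" in
    *pydantic_core._pydantic_core.ValidationError*)
      echo "helm-value-type-mismatch" ;;
    *NoCredentialsError*)
      echo "irsa-not-wired" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Pipe the logs into it, for example with kubectl logs -n clu-ops deploy/clu-ops-agent -c backend --previous | triage_crashloop. The --previous flag reads the last terminated container instance, which is usually the one holding the traceback when the pod is crash-looping.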

Chat returns errors

error_kind: permission_denied in chat

The agent hit a 403 — K8s RBAC, Prometheus auth, or AWS IAM. Clu surfaces the exact fix inline. Example:

error_kind: permission_denied
The agent was denied 'list' on Secret (namespace 'prod') — forbidden.

Fix:
Add this rule to the reader ClusterRole (...):
- apiGroups: [""]
  resources:
    - secrets
  verbs: ['get', 'list', 'watch']

After editing the chart, re-run `bash scripts/dev-deploy.sh` (or the
equivalent helm upgrade) to roll the new RBAC into the cluster.

The Fix: block is paste-ready: apply it to the policy named in the hint. The same pattern applies to each source of the 403:

  • K8s RBAC → paste into your customer-side fork of the reader ClusterRole (the rules are inlined in Permissions).
  • Prometheus 401/403 → set integrations.prometheus.url to the unauthenticated in-cluster Service, or open a NetworkPolicy.
  • AWS IAM → paste into the Cloud IAM policy (the JSON is inlined in IAM setup).

tool execution error: no scripted response for expr=...

You're running Clu without Prometheus configured (or auto-discovery didn't find it). Two fixes:

  1. Install a kube-prometheus-stack release — Clu auto-detects Services whose name contains prometheus with port 9090.

  2. Set the URL explicitly — if your Prometheus runs under a non-standard name:

    integrations:
      prometheus:
        url: http://my-prometheus.monitoring.svc:9090
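
To see which Services would plausibly match the auto-detection rule described in step 1, the Service list can be filtered by name and port. This is only an approximation of that rule as stated here (name contains prometheus, port 9090); Clu's actual discovery logic may differ, and find_prometheus_services is a hypothetical helper:

```shell
# Filter `kubectl get svc -A --no-headers` output (stdin) down to Services
# whose name contains "prometheus" and that expose port 9090.
# Columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
find_prometheus_services() {
  awk '$2 ~ /prometheus/ && $6 ~ "(^|,)9090[:/]" { print $1 "/" $2 }'
}
```

Usage: kubectl get svc -A --no-headers | find_prometheus_services. Empty output suggests auto-discovery has nothing to find, in which case setting the URL explicitly is the more direct fix.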
    

Bedrock returns AccessDeniedException

Two distinct failure modes:

  • IAM action missing — the IRSA role doesn't have bedrock:InvokeModel. Fix via Terraform or IAM console.
  • Model access not requested — Bedrock in a fresh AWS account blocks every model by default until you opt in. Go to the Bedrock console → Model access → enable Claude Sonnet + Haiku in the target region.

The error message distinguishes them: "is not authorized" → IAM action missing; "You don't have access to the model" → model access not requested.
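
The same distinction as a shell sketch, useful when scanning many log lines. The function name and labels are illustrative; only the two quoted phrases come from the text above:

```shell
# Classify a Bedrock AccessDeniedException message by the two phrases above.
classify_bedrock_denial() {
  case "$1" in
    *"is not authorized"*)
      echo "iam-action-missing" ;;
    *"You don't have access to the model"*)
      echo "model-access-not-requested" ;;
    *)
      echo "unrecognized" ;;
  esac
}
```

Pass the error message as the first argument. An "unrecognized" result means the message matches neither phrase and needs to be read in full.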

Reports + approvals

Reports view is empty

The first check can take up to 30 minutes to appear if the pod was restarted immediately after install, because the first scheduled tick hasn't fired yet. Force one from the UI (Run now) or from chat ("run a health check now"). Both call POST /api/reports/run-now.
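
The endpoint can also be called directly. A sketch only: the base URL is an assumption (wherever the Clu backend is reachable from your machine, for example through your ingress or a port-forward), and any authentication your deployment requires is not shown:

```shell
# POST to the run-now endpoint named above.
# BASE_URL is a placeholder for your own deployment's address.
run_health_check_now() {
  local base_url="${1:?usage: run_health_check_now BASE_URL}"
  # ${base_url%/} drops a trailing slash so the path is not doubled.
  curl -fsS -X POST "${base_url%/}/api/reports/run-now"
}
```

For example, run_health_check_now http://localhost:8080, where the host and port are placeholders rather than Clu defaults.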

Approvals view is empty overnight

By default reports.recommenderMode=true — findings carry recommendations but the operator has to manually stage them. For "approvals populate overnight" behavior, set:

reports:
  recommenderMode: false
  maxAutoStagedPerTick: 10  # cap per scheduler tick

And bump the TTL if a weekend might pass between stage + approve:

approvals:
  ttlSeconds: 172800  # 48h
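
The value is hours multiplied by 3600. A small helper for picking a different window; the 72-hour figure is only an illustration, not a recommended setting:

```shell
# Convert a TTL in hours to the seconds value approvals.ttlSeconds expects.
ttl_seconds() {
  echo $(( $1 * 3600 ))
}

ttl_seconds 48   # prints 172800
ttl_seconds 72   # prints 259200
```

As a sanity check on the window: a recommendation staged Friday at 18:00 and approved Monday at 09:00 sits for 63 hours, so size the TTL to the longest gap you expect.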

Approved a recommendation but nothing happened in the cluster

Check the Approvals view for an "Approved, but the write didn't land" banner, which means the dispatch failed after approval. The audit log has the specific error:

kubectl get cm -n clu-ops clu-ops-audit -o jsonpath='{.data.audit\.json}' | jq '.[-5:]'

Common causes:

  • Core Plus entitlement lapsed between stage + approve. Re-check the Modules view.
  • RBAC changed. Review the writer ClusterRole diff.
  • The target workload was deleted between stage + approve (the dry-run from 12 hours ago was against a now-gone resource).

UI

UI shows "Failed to load approvals" / "Failed to load reports"

The ConfigMap client can't read the store. Verify the state Role exists + is bound:

kubectl get role -n clu-ops | grep state
kubectl get rolebinding -n clu-ops | grep state

  • Missing: re-render the chart (the role-state.yaml template is part of the default install).
  • Present, but the API server still returns 403: look for custom NetworkPolicy or AdmissionController rules blocking the agent's SA.
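
To separate an RBAC problem from something outside RBAC, ask the API server directly what the agent's ServiceAccount is allowed to do. A sketch, assuming the ServiceAccount is clu-ops-agent (the name used in the Install section) and that your own user is permitted to impersonate it; can_agent is a hypothetical helper:

```shell
# Ask the API server whether the agent's ServiceAccount may perform VERB on
# ConfigMaps in the clu-ops namespace. Prints "yes" or "no".
can_agent() {
  local verb="${1:?usage: can_agent VERB}"
  kubectl auth can-i "$verb" configmaps -n clu-ops \
    --as=system:serviceaccount:clu-ops:clu-ops-agent
}
```

Try can_agent get and can_agent update; which verbs the state Role actually grants is not listed here, so treat those two as a starting guess. kubectl auth can-i evaluates authorization only, so a "yes" alongside continuing failures points away from RBAC and toward admission or network rules.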

SSE chat stream disconnects after 60 seconds

Ingress timeout. Configure long-lived timeouts for /api/query:

  • nginx: set proxy-read-timeout: 3600 as an annotation on the Ingress.
  • ALB: idle_timeout.timeout_seconds=3600 on the load balancer attributes.

See UI Access for the full examples.

Performance

First chat response is slow

The first conversation turn builds the system prompt from the latest scan. If the scanner hasn't run yet, the prompt has no cluster context and the agent calls more tools to compensate. Wait for the first scheduled knowledge refresh (or force a scan via "run a cluster scan"); responses are faster after that.

Reporter scan blocks for minutes

On large clusters (10k+ pods), the scanner can spend significant time listing resources. Reduce its surface via Helm:

agent:
  namespaces:
    include: [prod, staging]      # Only these namespaces

or

agent:
  namespaces:
    exclude: [kube-system, monitoring, istio-system, argo, argocd]

Bedrock latency feels high

Switch the interactive model tier to fast (Haiku) if the workload is tool-call-heavy rather than reasoning-heavy:

agent:
  model:
    interactive: fast

Haiku is ~3-5x faster than Sonnet; the tradeoff is weaker multi-step reasoning. Fine for "fetch logs + describe pod" flows; worse for "debug this cross-cluster connectivity puzzle."

Logs + debugging

# Backend logs (streaming)
kubectl logs -n clu-ops deploy/clu-ops-agent -c backend -f

# Frontend / ingress logs
kubectl logs -n clu-ops deploy/clu-ops-agent -c frontend

# State ConfigMaps
kubectl get cm -n clu-ops -o name | grep clu-ops-
# → clu-ops-reports
# → clu-ops-approvals
# → clu-ops-snoozes
# → clu-ops-audit
# → clu-ops-scan

# Inspect the audit log
kubectl get cm -n clu-ops clu-ops-audit \
  -o jsonpath='{.data.audit\.json}' | jq '.[-20:]'

Still stuck

  • Check Getting Started for the happy-path walkthrough.
  • Re-read Install Clu on AWS EKS if anything AWS-shaped is failing.
  • File an issue with the commit SHA (kubectl get deploy -n clu-ops clu-ops-agent -o jsonpath='{.metadata.annotations}'), the pod logs, and the relevant kubectl describe output.