Verify your install
Walk through this after helm install against a fresh EKS cluster.
Items are ordered by how quickly they should respond. If any of
them fails, the Troubleshooting doc covers
the common causes.
0. Pre-flight
| Check | Command | Expected |
|---|---|---|
| Cluster reachable | kubectl get nodes | At least 1 node Ready |
| Pod running | kubectl get pod -n clu-ops | 1 pod Running 2/2 |
| Backend logs clean | kubectl logs -n clu-ops deploy/clu-ops-agent -c backend \| tail -20 | Startup banner + feature flags resolved, no tracebacks |
| IRSA bound | kubectl exec -n clu-ops deploy/clu-ops-agent -c backend -- env \| grep AWS | AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE set |
| UI reachable | kubectl port-forward -n clu-ops svc/clu-ops-agent 8080:8080 → http://localhost:8080 | Dashboard renders with the capability-status panel |
If any of those fail, stop — fix before continuing. Most failures at this stage are IRSA / kubeconfig / Bedrock model-access issues.
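
The pod and IRSA rows above can be scripted. This is a minimal sketch, not part of the product: the helper names `pods_ready` and `irsa_env_set` are hypothetical, and both only inspect text piped in from the commands already listed in the table.

```shell
# pods_ready: reads `kubectl get pod` output (with its header line) on stdin.
# Succeeds only if at least one pod is listed and every pod shows
# READY 2/2 and STATUS Running.
pods_ready() {
  awk 'NR > 1 { n++; if ($2 != "2/2" || $3 != "Running") bad++ }
       END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# irsa_env_set: reads `env` output on stdin. Succeeds only if both IRSA
# variables are present with a non-empty value.
irsa_env_set() {
  _env=$(cat)
  printf '%s\n' "$_env" | grep -q '^AWS_ROLE_ARN=..*' &&
    printf '%s\n' "$_env" | grep -q '^AWS_WEB_IDENTITY_TOKEN_FILE=..*'
}

# Usage against a live cluster:
#   kubectl get pod -n clu-ops | pods_ready && echo "pod OK"
#   kubectl exec -n clu-ops deploy/clu-ops-agent -c backend -- env \
#     | irsa_env_set && echo "IRSA OK"
```

Do not pass `--no-headers` to `kubectl get pod` here; `pods_ready` skips the first line as the header.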
1. Core (always on)
The always-on baseline for every subscribed customer.
| Test | How | Expected |
|---|---|---|
| Status endpoint | curl localhost:8080/api/status | JSON with cloud_provider: "aws", capabilities + integrations listed |
| First scan completes | UI → Dashboard | Cluster snapshot populated (namespaces, deployments, pod counts) |
| Chat: broad health | "What's the health of my cluster?" | Calls reports_run_now or reports_latest, returns severity breakdown |
| Chat: specific pod | "Why is the kube-system CoreDNS pod configured this way?" | Calls k8s_describe, returns pod-level detail |
| Chat: logs | "Show me the last 50 log lines from the CoreDNS pod" | Calls k8s_logs, returns log content |
| Chat: events | "What events have fired in kube-system in the last hour?" | Calls k8s_events, returns event list |
| Chat: knowledge | "What conventions have you detected?" | Calls knowledge_conventions, lists detected patterns |
| Chat: rollout | Deploy a workload that fails to roll out, then ask "Why did the deploy fail?" | k8s_rollout_status surfaces ReplicaSet + event info |
| Prometheus (if installed) | Chat "are any alerts firing?" | prometheus_alerts or an actionable "not configured" message |
| Prometheus auto-discovery | Install kube-prometheus-stack, wait 1h or restart pod | Prometheus integration flips to active in the UI |
| Report compare | Wait for 2 scans, open Reports, select two, Compare | Four buckets + severity-count summary strip |
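
The status endpoint row can be checked from the shell while the port-forward from Pre-flight is running. A rough sketch: the helper name `status_is_aws` is hypothetical, and it does a plain text match on the `cloud_provider` field rather than parsing the JSON.

```shell
# status_is_aws: reads the /api/status response body on stdin and succeeds
# if it contains "cloud_provider": "aws" (whitespace around the colon is
# tolerated). This is a text match, not a JSON parse.
status_is_aws() {
  grep -Eq '"cloud_provider"[[:space:]]*:[[:space:]]*"aws"'
}

# Usage (requires the port-forward on localhost:8080):
#   curl -s localhost:8080/api/status | status_is_aws && echo "status OK"
```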
2. Cloud
Requires modules.cloud.enabled=true in Helm values + the Cloud
Agent IAM policy attached to the IRSA role
(see IAM setup).
| Test | How | Expected |
|---|---|---|
| IAM role listing | Chat "list IAM roles in the account with the clu- prefix" | aws_iam_roles returns role names + attached policies |
| IRSA cross-reference | Chat "verify the IRSA binding for the clu-ops-agent SA" | aws_irsa_mapping reports trusts_sa: true, lists attached policies |
| Managed DB | Chat "what RDS instances does this account have?" (if any) | aws_rds returns list (or empty) |
| S3 listing | Chat "what S3 buckets are in this account?" | aws_s3 returns names + created dates (never content) |
| Secrets listing | Chat "list Secrets Manager entries" | aws_secrets returns names + ARNs only |
| VPC topology | Chat "summarize the VPC topology" | aws_vpc_networking returns VPC + subnets + SGs |
| Cost context | Chat "show me the cost breakdown for the last 7 days" | aws_cost_summary returns per-service totals |
| Missing IAM action | Remove iam:ListRoles from the policy, retry | Chat surfaces error_kind: permission_denied with paste-ready fragment |
3. Core Plus
Requires modules.corePlus.enabled=true + modules.corePlus.writeOperations.enabled=true
in Helm values. Uses the clu-ops-agent-writer ClusterRole
(see Permissions).
Create a test workload first:
kubectl create ns idp-test
kubectl -n idp-test create deployment demo --image=nginx:1.27 --replicas=1
| Test | How | Expected |
|---|---|---|
| Manifest generation | Chat "generate a standard web service Deployment for api in idp-test" | manifest_generate returns YAML with app label + requests |
| Dry-run validation | Chat "validate this manifest: <paste>" | manifest_validate reports OK or names the violation |
| Apply with approval | Chat "scale the demo deployment to 3 replicas" | Tool stages a change; the inline approval card appears in chat |
| Approval → write | Click Approve | Pod count reaches 3; the Audit log gets a new entry |
| Protected namespace block | Chat "restart the coredns deployment in kube-system" | Tool rejects with "protected namespace" — no approval card created |
| Rejection stops retry | Stage + reject an apply | Agent acknowledges, asks what you'd prefer, does NOT re-stage |
| Audit log persists | Approve something, kubectl rollout restart the pod, reopen Audit view | History survives restart |
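
After clicking Approve, the "Pod count reaches 3" expectation can be polled instead of eyeballed. This is a sketch under assumptions: `wait_until` and `demo_has_three_ready` are hypothetical helpers, and the check reads `.status.readyReplicas` from the `demo` Deployment created above.

```shell
# wait_until TRIES DELAY CMD...: rerun CMD until it succeeds or TRIES attempts
# have been used, sleeping DELAY seconds between attempts.
wait_until() {
  _tries=$1; _delay=$2; shift 2
  while [ "$_tries" -gt 0 ]; do
    "$@" && return 0
    _tries=$((_tries - 1))
    [ "$_tries" -gt 0 ] && sleep "$_delay"
  done
  return 1
}

# demo_has_three_ready: succeeds once the demo Deployment reports 3 ready replicas.
demo_has_three_ready() {
  [ "$(kubectl -n idp-test get deployment demo \
        -o jsonpath='{.status.readyReplicas}')" = "3" ]
}

# Usage, after approving the scale-up in chat:
#   wait_until 30 2 demo_has_three_ready && echo "scaled to 3"
```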
4. Persistence
Smoke test with a rollout restart:
kubectl rollout restart -n clu-ops deploy/clu-ops-agent
kubectl rollout status -n clu-ops deploy/clu-ops-agent
| After restart | Expected |
|---|---|
| Reports list | Previous scheduled reports still there |
| Approvals queue | Any pending (not yet expired) approvals still queued |
| Snoozes | Active snoozes still active |
| Audit log | Every past write still visible |
| Scan cache | Dashboard has cluster context immediately, doesn't wait for first scan |
Verify each store's ConfigMap directly:
kubectl get cm -n clu-ops | grep clu-ops-
kubectl get cm -n clu-ops clu-ops-approvals \
-o jsonpath='{.data.approvals\.json}' | jq '[.[] | {id, tool, state}]'
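
To confirm that no store disappeared across the restart, one option is to snapshot the ConfigMap names before and after and compare them. A minimal sketch: `missing_after` is a hypothetical helper and the file names are placeholders.

```shell
# missing_after BEFORE AFTER: print every line of file BEFORE that does not
# appear as an exact line in file AFTER. No output means nothing went missing.
missing_after() {
  grep -Fxv -f "$2" "$1"
}

# Usage:
#   kubectl get cm -n clu-ops -o name | grep clu-ops- > before.txt
#   kubectl rollout restart -n clu-ops deploy/clu-ops-agent
#   kubectl rollout status -n clu-ops deploy/clu-ops-agent
#   kubectl get cm -n clu-ops -o name | grep clu-ops- > after.txt
#   missing_after before.txt after.txt   # prints any ConfigMap that vanished
```

This only checks that the ConfigMaps still exist; their contents still need the spot checks in the table above.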
5. Error surfacing
The persona's narration rules + the structured error taxonomy. Deliberately break things:
| Break | Expected chat response |
|---|---|
| Remove get on Pods from the reader ClusterRole, try kubectl get pod equivalent | error_kind: permission_denied with paste-ready get/list/watch rule |
| Point integrations.prometheus.url at a 401-returning proxy | Prometheus 401 → "Prometheus at ... returned HTTP 401. Clu's default deployment targets the unauthenticated in-cluster service…" |
| Remove iam:ListRoles from the Cloud policy, call aws_iam_roles | error_kind: permission_denied with iam:ListRoles named in the fix hint |
| Stage a write, approve, watch the dispatch happen | Narration uses "Applied." / "Done." (past-tense) ONLY after the real apply, not at stage time |
| Stage a write, reject it | Narration: "The change was rejected — I won't re-apply it. What would you like instead?" (not an immediate retry) |
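
Before judging the chat response to the RBAC break, it helps to confirm the break actually took effect. The sketch below uses `kubectl auth can-i` with impersonation. It assumes the agent runs as the `clu-ops-agent` ServiceAccount in the `clu-ops` namespace and that your own user is allowed to impersonate it. `sa_subject` and `is_denied` are hypothetical helpers.

```shell
# sa_subject NAMESPACE NAME: print the RBAC user name of a ServiceAccount.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# is_denied: reads `kubectl auth can-i` output on stdin and succeeds
# when the answer is "no".
is_denied() {
  [ "$(cat)" = "no" ]
}

# Usage, after removing get on Pods from the reader ClusterRole
# (namespace and ServiceAccount name are assumptions):
#   kubectl auth can-i get pods -n kube-system \
#     --as="$(sa_subject clu-ops clu-ops-agent)" | is_denied \
#     && echo "RBAC break is in effect"
```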
6. Multi-arch image
Quick check that the arm64 image works. If the cluster node is amd64
(e.g. c6i.large), pull the arm64 image separately:
docker pull --platform=linux/arm64 \
709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4
docker inspect --format '{{.Architecture}}' \
709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4
# → arm64
Or run the backend locally against the cluster from a Graviton node by tainting the nodegroup.
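
If pulling the image is inconvenient, the published architectures can also be read from the registry manifest. A rough sketch, assuming the tag is published as a multi-arch manifest list and you are already logged in to the registry. `manifest_archs` is a hypothetical helper that text-matches the JSON rather than parsing it.

```shell
# manifest_archs: reads `docker manifest inspect` JSON on stdin and prints
# each distinct "architecture" value, one per line, sorted.
manifest_archs() {
  grep -Eo '"architecture"[[:space:]]*:[[:space:]]*"[^"]+"' |
    sed -E 's/.*"([^"]+)"$/\1/' |
    sort -u
}

# Usage:
#   docker manifest inspect \
#     709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4 \
#     | manifest_archs
```

The output should include arm64. Some build pipelines also attach attestation entries, which may show up as an extra architecture value.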
Known validation gaps
These work in principle but aren't easy to validate without additional setup:
- Marketplace entitlement: requires a real Marketplace subscription. Dev installs use license.key=dev, which bypasses the entitlement check.
- Slack / Teams webhooks: outbound only; validate by wiring a webhook and watching for a test report.
What a "pass" looks like
For a green signoff, every item in sections 0–5 should respond as described. Section 6 (multi-arch) is the last check before declaring the install good.
If something regresses on a subsequent build, the audit log + backend logs + the structured error output from the chat should be enough to triangulate. File findings against the version shown on the Dashboard's About panel.