Verify your install

Walk through this checklist after running `helm install` against a fresh EKS cluster. Items are ordered by how quickly they should respond. If any item fails, the Troubleshooting doc covers the common causes.

0. Pre-flight

| Check | Command | Expected |
|---|---|---|
| Cluster reachable | `kubectl get nodes` | At least 1 node `Ready` |
| Pod running | `kubectl get pod -n clu-ops` | 1 pod `Running` 2/2 |
| Backend logs clean | `kubectl logs -n clu-ops deploy/clu-ops-agent -c backend \| tail -20` | Startup banner + feature flags resolved, no tracebacks |
| IRSA bound | `kubectl exec -n clu-ops deploy/clu-ops-agent -c backend -- env \| grep AWS` | `AWS_ROLE_ARN` + `AWS_WEB_IDENTITY_TOKEN_FILE` set |
| UI reachable | `kubectl port-forward -n clu-ops svc/clu-ops-agent 8080:8080` → http://localhost:8080 | Dashboard renders with the capability-status panel |

If any of those fail, stop — fix before continuing. Most failures at this stage are IRSA / kubeconfig / Bedrock model-access issues.
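
The pod and IRSA rows can be scripted so they pass or fail without eyeballing the output. A minimal sketch, assuming the default `kubectl get pod` column layout (NAME, READY, STATUS, ...); the helper names `check_pods` and `check_irsa_env` are hypothetical and not part of the product:

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; both read command output on stdin.

# Succeeds when at least one pod is listed and every pod is Running with 2/2 ready.
# Expects `kubectl get pod --no-headers` output (columns: NAME READY STATUS ...).
check_pods() {
  awk 'NF >= 3 { n++; if ($2 != "2/2" || $3 != "Running") bad++ }
       END { exit !(n > 0 && bad == 0) }'
}

# Succeeds when both IRSA variables appear in `env` output.
check_irsa_env() {
  local env_dump
  env_dump=$(cat)
  printf '%s\n' "$env_dump" | grep -q '^AWS_ROLE_ARN=' &&
    printf '%s\n' "$env_dump" | grep -q '^AWS_WEB_IDENTITY_TOKEN_FILE='
}

# Usage against a live cluster:
#   kubectl get pod -n clu-ops --no-headers | check_pods && echo "pods ok"
#   kubectl exec -n clu-ops deploy/clu-ops-agent -c backend -- env | check_irsa_env && echo "IRSA ok"
```

Both helpers only inspect text, so they can be tried against saved output before pointing them at a cluster.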

1. Core (always on)

The always-on baseline for every subscribed customer.

| Test | How | Expected |
|---|---|---|
| Status endpoint | `curl localhost:8080/api/status` | JSON with `cloud_provider: "aws"`, capabilities + integrations listed |
| First scan completes | UI → Dashboard | Cluster snapshot populated (namespaces, deployments, pod counts) |
| Chat: broad health | "What's the health of my cluster?" | Calls `reports_run_now` or `reports_latest`, returns severity breakdown |
| Chat: specific pod | "Why is the kube-system CoreDNS pod configured this way?" | Calls `k8s_describe`, returns pod-level detail |
| Chat: logs | "Show me the last 50 log lines from the CoreDNS pod" | Calls `k8s_logs`, returns log content |
| Chat: events | "What events have fired in kube-system in the last hour?" | Calls `k8s_events`, returns event list |
| Chat: knowledge | "What conventions have you detected?" | Calls `knowledge_conventions`, lists detected patterns |
| Chat: rollout | Deploy something that fails, then ask "Why did the deploy fail?" | `k8s_rollout_status` surfaces ReplicaSet + event info |
| Prometheus (if installed) | Chat "are any alerts firing?" | `prometheus_alerts` or an actionable "not configured" message |
| Prometheus auto-discovery | Install kube-prometheus-stack, wait 1h or restart pod | Prometheus integration flips to active in the UI |
| Report compare | Wait for 2 scans, open Reports, select two, Compare | Four buckets + severity-count summary strip |
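
The status-endpoint row can be turned into a pass/fail check. A minimal sketch: the helper name `check_status` is hypothetical, and the top-level key names `capabilities` and `integrations` are assumptions (the table only says they are "listed"), so adjust the patterns to the real payload:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the /api/status response body on stdin.
# Assumption: the payload has top-level keys named "capabilities" and "integrations".
check_status() {
  local body
  body=$(cat)
  printf '%s\n' "$body" | grep -Eq '"cloud_provider"[[:space:]]*:[[:space:]]*"aws"' || return 1
  printf '%s\n' "$body" | grep -q '"capabilities"' || return 1
  printf '%s\n' "$body" | grep -q '"integrations"' || return 1
}

# Usage, with the port-forward from the pre-flight section still running:
#   curl -fsS localhost:8080/api/status | check_status && echo "status ok"
```

Plain `grep` is used so the check has no dependency beyond `curl`; a `jq` filter would be stricter if the payload shape is known.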

2. Cloud

Requires `modules.cloud.enabled=true` in Helm values and the Cloud Agent IAM policy attached to the IRSA role (see IAM setup).

| Test | How | Expected |
|---|---|---|
| IAM role listing | Chat "list IAM roles in the account with the `clu-` prefix" | `aws_iam_roles` returns role names + attached policies |
| IRSA cross-reference | Chat "verify the IRSA binding for the clu-ops-agent SA" | `aws_irsa_mapping` reports `trusts_sa: true`, lists attached policies |
| Managed DB | Chat "what RDS instances does this account have?" (if any) | `aws_rds` returns list (or empty) |
| S3 listing | Chat "what S3 buckets are in this account?" | `aws_s3` returns names + created dates (never content) |
| Secrets listing | Chat "list Secrets Manager entries" | `aws_secrets` returns names + ARNs only |
| VPC topology | Chat "summarize the VPC topology" | `aws_vpc_networking` returns VPC + subnets + SGs |
| Cost context | Chat "show me the cost breakdown for the last 7 days" | `aws_cost_summary` returns per-service totals |
| Missing IAM action | Remove `iam:ListRoles` from the policy, retry | Chat surfaces `error_kind: permission_denied` with paste-ready fragment |

3. Core Plus

Requires `modules.corePlus.enabled=true` and `modules.corePlus.writeOperations.enabled=true` in Helm values. Uses the `clu-ops-agent-writer` ClusterRole (see Permissions).

Create a test workload first:

kubectl create ns idp-test
kubectl -n idp-test create deployment demo --image=nginx:1.27 --replicas=1

| Test | How | Expected |
|---|---|---|
| Manifest generation | Chat "generate a standard web service Deployment for `api` in `idp-test`" | `manifest_generate` returns YAML with app label + requests |
| Dry-run validation | Chat "validate this manifest: `<paste>`" | `manifest_validate` reports OK or names the violation |
| Apply with approval | Chat "scale the `demo` deployment to 3 replicas" | Tool stages a change; the inline approval card appears in chat |
| Approval → write | Click Approve | Pod count reaches 3; the Audit log gets a new entry |
| Protected namespace block | Chat "restart the coredns deployment in kube-system" | Tool rejects with "protected namespace"; no approval card is created |
| Rejection stops retry | Stage + reject an apply | Agent acknowledges, asks what you'd prefer, does NOT re-stage |
| Audit log persists | Approve something, `kubectl rollout restart` the agent Deployment, reopen the Audit view | History survives restart |
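
For the "Approval → write" row, the replica count can be polled instead of watched by hand. A minimal sketch: `wait_for_value` is a hypothetical helper, and reading `.status.readyReplicas` from the `demo` Deployment is one reasonable way to observe "pod count reaches 3", not necessarily how the product itself measures it:

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait_for_value EXPECTED TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until its stdout equals EXPECTED, or gives up after TRIES attempts.
wait_for_value() {
  local expected="$1" tries="$2" delay="$3" i=0 actual
  shift 3
  while [ "$i" -lt "$tries" ]; do
    # A failing command counts as "no value yet" rather than aborting the wait.
    actual=$("$@" 2>/dev/null) || actual=""
    [ "$actual" = "$expected" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage after clicking Approve (up to 30 tries, 2 seconds apart):
#   wait_for_value 3 30 2 \
#     kubectl -n idp-test get deploy demo -o jsonpath='{.status.readyReplicas}' \
#     && echo "demo scaled to 3"
```

A bounded retry is used rather than `kubectl rollout status` so the same helper can wait on any single-value query.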

4. Persistence

Smoke test with a rollout restart:

kubectl rollout restart -n clu-ops deploy/clu-ops-agent
kubectl rollout status  -n clu-ops deploy/clu-ops-agent

| After restart | Expected |
|---|---|
| Reports list | Previous scheduled reports still there |
| Approvals queue | Any pending (not yet expired) approvals still queued |
| Snoozes | Active snoozes still active |
| Audit log | Every past write still visible |
| Scan cache | Dashboard has cluster context immediately, doesn't wait for first scan |

Verify each store's ConfigMap directly:

kubectl get cm -n clu-ops | grep clu-ops-
kubectl get cm -n clu-ops clu-ops-approvals \
  -o jsonpath='{.data.approvals\.json}' | jq '[.[] | {id, tool, state}]'
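
To confirm that no store disappeared across the restart, the ConfigMap names can be snapshotted before and after. A minimal sketch: `missing_after` is a hypothetical helper, and it only compares names, so it does not prove the contents are intact:

```shell
#!/usr/bin/env bash
# Hypothetical helper: missing_after BEFORE_FILE AFTER_FILE
# Prints every line of BEFORE_FILE that has no exact match in AFTER_FILE.
missing_after() {
  # grep exits non-zero when nothing is printed; that is the success case here.
  grep -Fxv -f "$2" "$1" || true
}

# Usage around the rollout restart:
#   kubectl get cm -n clu-ops -o name | grep clu-ops- > before.txt
#   kubectl rollout restart -n clu-ops deploy/clu-ops-agent
#   kubectl rollout status  -n clu-ops deploy/clu-ops-agent
#   kubectl get cm -n clu-ops -o name | grep clu-ops- > after.txt
#   missing_after before.txt after.txt   # no output means every ConfigMap survived
```

For content-level checks, the `jsonpath` + `jq` query on `clu-ops-approvals` remains the more direct test.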

5. Error surfacing

This section exercises the persona's narration rules and the structured error taxonomy. Deliberately break things and check the chat response:

| Break | Expected chat response |
|---|---|
| Remove `get` on Pods from the reader ClusterRole, then ask chat for the equivalent of `kubectl get pod` | `error_kind: permission_denied` with paste-ready `get/list/watch` rule |
| Point `integrations.prometheus.url` at a 401-returning proxy | Prometheus 401 → "Prometheus at ... returned HTTP 401. Clu's default deployment targets the unauthenticated in-cluster service…" |
| Remove `iam:ListRoles` from the Cloud policy, call `aws_iam_roles` | `error_kind: permission_denied` with `iam:ListRoles` named in the fix hint |
| Stage a write, approve, watch the dispatch happen | Narration uses "Applied." / "Done." (past-tense) ONLY after the real apply, not at stage time |
| Stage a write, reject it | Narration: "The change was rejected — I won't re-apply it. What would you like instead?" (not an immediate retry) |
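
If chat responses are saved to a file while working through the table, the `error_kind` rows can be checked mechanically. A minimal sketch: `has_error_kind` is a hypothetical helper, and the exact rendering of the field in the chat output is an assumption, so the pattern accepts both a plain and a JSON-style form:

```shell
#!/usr/bin/env bash
# Hypothetical helper: has_error_kind KIND
# Reads a saved chat response on stdin and succeeds if it names the given error_kind.
# Accepts `error_kind: permission_denied` and `"error_kind": "permission_denied"`,
# since the exact rendering in the chat output is an assumption.
has_error_kind() {
  grep -Eq "error_kind\"?[[:space:]]*:[[:space:]]*\"?$1([^[:alnum:]_]|\$)"
}

# Usage with a response pasted into response.txt:
#   has_error_kind permission_denied < response.txt && echo "taxonomy ok"
```

The trailing group stops a longer kind that merely starts with the expected name from counting as a match.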

6. Multi-arch image

Quick check that the arm64 image works. If the cluster node is amd64 (e.g. c6i.large), pull the arm64 image separately:

docker pull --platform=linux/arm64 \
  709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4

docker inspect --format '{{.Architecture}}' \
  709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4
# → arm64

Alternatively, run the backend on a Graviton (arm64) node in the cluster, using taints on the nodegroup to steer the pod there.
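
To decide which of the two paths applies, list the architectures the cluster's nodes report. A minimal sketch: `has_arm64_node` is a hypothetical helper that reads the `status.nodeInfo.architecture` values the Kubernetes API exposes for each node:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a whitespace-separated list of node architectures on stdin
# and succeeds when at least one of them is exactly "arm64".
has_arm64_node() {
  tr -s '[:space:]' '\n' | grep -qx 'arm64'
}

# Usage:
#   kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}' | has_arm64_node \
#     && echo "arm64 node present" \
#     || echo "amd64 only: use the docker pull check"
```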

Known validation gaps

These work in principle but aren't easy to validate without additional setup:

Marketplace entitlement: requires a real Marketplace subscription. Dev installs use `license.key=dev`, which bypasses the entitlement check.
Slack / Teams webhooks: outbound only; validate by wiring up a webhook and watching for a test report.

What a "pass" looks like

For a green signoff, every item in sections 0–5 should respond as described. Section 6 (multi-arch) is the last check before declaring the install good.

If something regresses on a subsequent build, the audit log + backend logs + the structured error output from the chat should be enough to triangulate. File findings against the version shown on the Dashboard's About panel.