You're viewing docs for v0.1.x.

Verify your install

Walk through this checklist after running helm install against a fresh EKS cluster. Items are ordered by how quickly they should respond. If any of them fails, the Troubleshooting doc covers the common causes.

0. Pre-flight

| Check | Command | Expected |
| --- | --- | --- |
| Cluster reachable | `kubectl get nodes` | At least 1 node Ready |
| Pod running | `kubectl get pod -n clu-ops` | 1 pod Running 2/2 |
| Backend logs clean | `kubectl logs -n clu-ops deploy/clu-ops-agent -c backend | tail -20` | Startup banner + feature flags resolved, no tracebacks |
| IRSA bound | `kubectl exec -n clu-ops deploy/clu-ops-agent -c backend -- env | grep AWS` | AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE set |
| UI reachable | `kubectl port-forward -n clu-ops svc/clu-ops-agent 8080:8080`, then open http://localhost:8080 | Dashboard renders with the capability-status panel |

If any of these checks fails, stop and fix it before continuing. Most failures at this stage are IRSA, kubeconfig, or Bedrock model-access issues.
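
The pod and IRSA checks in this section lend themselves to scripting. The following is a minimal sketch, assuming the default kubectl column layout (NAME, READY, STATUS, ...). The helper names pod_ready and irsa_env_ok are hypothetical and not part of the product.

```shell
# pod_ready: reads `kubectl get pod --no-headers` output on stdin.
# Succeeds only if exactly one pod is listed, READY is 2/2 and STATUS is Running.
pod_ready() {
  awk '{ n++; if ($2 != "2/2" || $3 != "Running") bad = 1 }
       END { if (n == 1 && !bad) exit 0; exit 1 }'
}

# irsa_env_ok: reads `env` output on stdin.
# Succeeds only if both IRSA variables are present and non-empty.
irsa_env_ok() {
  env_dump=$(cat)
  printf '%s\n' "$env_dump" | grep -q '^AWS_ROLE_ARN=.' &&
    printf '%s\n' "$env_dump" | grep -q '^AWS_WEB_IDENTITY_TOKEN_FILE=.'
}

# Usage against a live cluster (not run here):
#   kubectl get pod -n clu-ops --no-headers | pod_ready && echo "pod OK"
#   kubectl exec -n clu-ops deploy/clu-ops-agent -c backend -- env | irsa_env_ok && echo "IRSA OK"
```

Both helpers only read stdin, so they can be tried on saved command output before pointing them at a cluster.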

1. Core (always on)

The always-on baseline for every subscribed customer.

| Test | How | Expected |
| --- | --- | --- |
| Status endpoint | `curl localhost:8080/api/status` | JSON with cloud_provider: "aws", capabilities + integrations listed |
| First scan completes | UI → Dashboard | Cluster snapshot populated (namespaces, deployments, pod counts) |
| Chat: broad health | "What's the health of my cluster?" | Calls reports_run_now or reports_latest, returns severity breakdown |
| Chat: specific pod | "Why is the kube-system CoreDNS pod configured this way?" | Calls k8s_describe, returns pod-level detail |
| Chat: logs | "Show me the last 50 log lines from the CoreDNS pod" | Calls k8s_logs, returns log content |
| Chat: events | "What events have fired in kube-system in the last hour?" | Calls k8s_events, returns event list |
| Chat: knowledge | "What conventions have you detected?" | Calls knowledge_conventions, lists detected patterns |
| Chat: rollout | Deploy something, then ask "Why did the deploy fail?" | k8s_rollout_status surfaces ReplicaSet + event info |
| Prometheus (if installed) | Chat "are any alerts firing?" | prometheus_alerts or an actionable "not configured" message |
| Prometheus auto-discovery | Install kube-prometheus-stack, wait 1h or restart pod | Prometheus integration flips to active in the UI |
| Report compare | Wait for 2 scans, open Reports, select two, click Compare | Four buckets + severity-count summary strip |
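
The status endpoint row can be checked without reading the JSON by eye. This is a rough sketch that assumes the response carries top-level keys literally named capabilities and integrations (the table only says they are "listed", so the exact key names are an assumption). The helper name status_ok is hypothetical.

```shell
# status_ok: reads the /api/status response body on stdin.
# Succeeds if cloud_provider is "aws" and both assumed keys appear.
status_ok() {
  body=$(cat)
  printf '%s\n' "$body" | grep -Eq '"cloud_provider"[[:space:]]*:[[:space:]]*"aws"' &&
    printf '%s\n' "$body" | grep -q '"capabilities"' &&
    printf '%s\n' "$body" | grep -q '"integrations"'
}

# Usage with the pre-flight port-forward still running (not run here):
#   curl -fsS localhost:8080/api/status | status_ok && echo "status OK"
```

The check uses grep rather than jq so that it works on a bare workstation. If jq is available, a structural check is stricter.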

2. Cloud

Requires modules.cloud.enabled=true in Helm values + the Cloud Agent IAM policy attached to the IRSA role (see IAM setup).

| Test | How | Expected |
| --- | --- | --- |
| IAM role listing | Chat "list IAM roles in the account with the clu- prefix" | aws_iam_roles returns role names + attached policies |
| IRSA cross-reference | Chat "verify the IRSA binding for the clu-ops-agent SA" | aws_irsa_mapping reports trusts_sa: true, lists attached policies |
| Managed DB | Chat "what RDS instances does this account have?" (if any) | aws_rds returns list (or empty) |
| S3 listing | Chat "what S3 buckets are in this account?" | aws_s3 returns names + created dates (never content) |
| Secrets listing | Chat "list Secrets Manager entries" | aws_secrets returns names + ARNs only |
| VPC topology | Chat "summarize the VPC topology" | aws_vpc_networking returns VPC + subnets + SGs |
| Cost context | Chat "show me the cost breakdown for the last 7 days" | aws_cost_summary returns per-service totals |
| Missing IAM action | Remove iam:ListRoles from the policy, retry | Chat surfaces error_kind: permission_denied with paste-ready fragment |

3. Core Plus

Requires modules.corePlus.enabled=true + modules.corePlus.writeOperations.enabled=true in Helm values. Uses the clu-ops-agent-writer ClusterRole (see Permissions).
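
The dotted keys named in sections 2 and 3 map onto nested Helm values. A minimal sketch of writing them to an overlay file follows. The helper name write_module_values and the file name are hypothetical, and RELEASE and CHART are placeholders for your own install. Drop the cloud block if you are only validating Core Plus.

```shell
# write_module_values FILE: writes a values overlay that enables the Cloud and
# Core Plus modules, mirroring the dotted keys named in sections 2 and 3.
write_module_values() {
  cat > "$1" <<'EOF'
modules:
  cloud:
    enabled: true
  corePlus:
    enabled: true
    writeOperations:
      enabled: true
EOF
}

# Usage (not run here; RELEASE and CHART are placeholders):
#   write_module_values values-modules.yaml
#   helm upgrade RELEASE CHART -n clu-ops --reuse-values -f values-modules.yaml
```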

Create a test workload first:

kubectl create ns idp-test
kubectl -n idp-test create deployment demo --image=nginx:1.27 --replicas=1

| Test | How | Expected |
| --- | --- | --- |
| Manifest generation | Chat "generate a standard web service Deployment for api in idp-test" | manifest_generate returns YAML with app label + requests |
| Dry-run validation | Chat "validate this manifest: <paste>" | manifest_validate reports OK or names the violation |
| Apply with approval | Chat "scale the demo deployment to 3 replicas" | Tool stages a change; the inline approval card appears in chat |
| Approval → write | Click Approve | Pod count reaches 3; the Audit log gets a new entry |
| Protected namespace block | Chat "restart the coredns deployment in kube-system" | Tool rejects with "protected namespace"; no approval card is created |
| Rejection stops retry | Stage + reject an apply | Agent acknowledges, asks what you'd prefer, does NOT re-stage |
| Audit log persists | Approve something, restart the agent with kubectl rollout restart, reopen the Audit view | History survives restart |
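
After approving the scale change, the replica count takes a moment to converge. This is a small polling sketch, assuming the demo deployment in idp-test created earlier in this section. The helper wait_for_value and the POLL_INTERVAL variable are hypothetical names, not product features.

```shell
# wait_for_value EXPECTED TRIES CMD...: re-runs CMD until its output equals
# EXPECTED, sleeping POLL_INTERVAL seconds (default 2) between attempts.
wait_for_value() {
  expected=$1
  tries=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  return 1
}

# Usage (not run here): wait up to about a minute for 3 ready replicas.
#   wait_for_value 3 30 kubectl -n idp-test get deploy demo \
#     -o jsonpath='{.status.readyReplicas}' && echo "scaled to 3"
```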

4. Persistence

Smoke test with a rollout restart:

kubectl rollout restart -n clu-ops deploy/clu-ops-agent
kubectl rollout status  -n clu-ops deploy/clu-ops-agent

| After restart | Expected |
| --- | --- |
| Reports list | Previous scheduled reports still there |
| Approvals queue | Any pending (not yet expired) approvals still queued |
| Snoozes | Active snoozes still active |
| Audit log | Every past write still visible |
| Scan cache | Dashboard has cluster context immediately, doesn't wait for first scan |

Verify each store's ConfigMap directly:

kubectl get cm -n clu-ops | grep clu-ops-
kubectl get cm -n clu-ops clu-ops-approvals \
  -o jsonpath='{.data.approvals\.json}' | jq '[.[] | {id, tool, state}]'
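
To confirm that no store vanished across the restart, the ConfigMap listing can be captured before and after and then compared. This is a sketch only: the helper name missing_after and the file names are hypothetical.

```shell
# missing_after BEFORE AFTER: prints lines present in BEFORE but absent from
# AFTER. Empty output means every entry survived.
missing_after() {
  sort "$1" > "$1.sorted"
  sort "$2" > "$2.sorted"
  comm -23 "$1.sorted" "$2.sorted"
  rm -f "$1.sorted" "$2.sorted"
}

# Usage (not run here):
#   kubectl get cm -n clu-ops -o name | grep clu-ops- > before.txt
#   kubectl rollout restart -n clu-ops deploy/clu-ops-agent
#   kubectl rollout status  -n clu-ops deploy/clu-ops-agent
#   kubectl get cm -n clu-ops -o name | grep clu-ops- > after.txt
#   missing_after before.txt after.txt
```

This only proves the ConfigMaps still exist. The table in this section remains the check that their contents are intact.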

5. Error surfacing

This section exercises the persona's narration rules and the structured error taxonomy. Deliberately break things and compare the chat response:

| Break | Expected chat response |
| --- | --- |
| Remove get on Pods from the reader ClusterRole, then try the chat equivalent of kubectl get pod | error_kind: permission_denied with paste-ready get/list/watch rule |
| Point integrations.prometheus.url at a 401-returning proxy | Prometheus 401 → "Prometheus at ... returned HTTP 401. Clu's default deployment targets the unauthenticated in-cluster service…" |
| Remove iam:ListRoles from the Cloud policy, then call aws_iam_roles | error_kind: permission_denied with iam:ListRoles named in the fix hint |
| Stage a write, approve, watch the dispatch happen | Narration uses "Applied." / "Done." (past tense) ONLY after the real apply, not at stage time |
| Stage a write, reject it | Narration: "The change was rejected — I won't re-apply it. What would you like instead?" (not an immediate retry) |

6. Multi-arch image

Quick check that the arm64 image works. If the cluster node is amd64 (e.g. c6i.large), pull the arm64 image separately:

docker pull --platform=linux/arm64 \
  709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4

docker inspect --format '{{.Architecture}}' \
  709825985650.dkr.ecr.us-east-1.amazonaws.com/cloudology/clu-ops-agent-backend:0.1.4
# → arm64
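
It can also help to confirm that the tag advertises both architectures before pulling anything. The sketch below assumes the tag is published as a multi-arch manifest list, which the section title suggests but the page does not state. The helper name has_arch is hypothetical, and IMAGE stands for the full image reference used in this section.

```shell
# has_arch ARCH: reads `docker manifest inspect IMAGE` JSON on stdin and
# succeeds if any manifest entry declares the given architecture.
has_arch() {
  grep -Eq "\"architecture\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage (not run here; IMAGE is the full reference from this section):
#   manifest=$(docker manifest inspect IMAGE)
#   for arch in amd64 arm64; do
#     printf '%s\n' "$manifest" | has_arch "$arch" || echo "missing $arch"
#   done
```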

Or run the backend locally against the cluster from a Graviton node by tainting the nodegroup.

Known validation gaps

These work in principle but aren't easy to validate without additional setup:

  • Marketplace entitlement: requires a real Marketplace subscription. Dev installs use license.key=dev, which bypasses the entitlement check.
  • Slack / Teams webhooks: outbound only. Validate by wiring up a webhook and watching for a test report.

What a "pass" looks like

For a green signoff, every item in sections 0–5 should respond as described. Section 6 (multi-arch) is the last check before declaring the install good.

If something regresses on a subsequent build, the audit log, the backend logs, and the structured error output from the chat should be enough to triangulate the cause. File findings against the version shown on the Dashboard's About panel.