Architecture
A high-level look at how Clu fits into your cluster. This page describes the deployment shape and runtime behavior at the level operators need to plan around — not the internal code organization.
Where it runs
Clu installs as a single pod in its own namespace (clu-ops). The pod
has two containers — a small frontend that serves the in-cluster UI and
an agent backend that orchestrates everything else. They communicate
only over localhost; there's no cross-container networking outside the
pod.
The pod talks to three external surfaces, and only when it needs to:
- The Kubernetes API — read access by default, plus optional write access when the Core Plus tier is active (gated behind explicit approvals; see Write operations).
- Amazon Bedrock — for inference, called from inside your AWS account via IRSA. Your prompts and tool results never leave your account.
- Optional integrations — Prometheus, CloudWatch, AWS service inventory APIs (RDS, S3, IAM…). All read-only, all opt-in via Helm values, all scoped per capability.
That's the entire network footprint. There's no SaaS dependency, no phone-home telemetry, no shared backend — every Clu pod runs self-contained in its own customer cluster.
The agent loop
When an operator asks Clu a question, the backend builds a prompt that combines the operations persona, the active capability flags, and a fresh snapshot of cluster state. It sends that to Bedrock with a list of tools the model is allowed to call.
The model returns a stream of text deltas (which the UI renders live) and structured tool-call requests. Each tool call is dispatched through a registry that:
- Enforces feature gating — capabilities the customer hasn't enabled return a structured "not active" response, not silent failure.
- Catches permission errors from Kubernetes, Prometheus, or AWS and rewrites them into operator-friendly fix hints (e.g., "the ServiceAccount needs events:list on this namespace").
- Records every write attempt to an append-only audit log before it ever reaches the cluster.
The loop continues until the model produces a final answer with no more tool calls. Chat sessions run concurrently on asyncio, and each session processes one request at a time.
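As a rough illustration of the dispatch step and the loop around it, here is a minimal sketch. All names in it (`Tool`, `ToolRegistry`, `PermissionDenied`, `run_agent_loop`, the `model_step` callable, and the response shapes) are hypothetical and are not Clu's actual code. It also assumes that a write is recorded only after it passes feature gating, which the description above does not specify.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class PermissionDenied(Exception):
    """Hypothetical error a tool raises when the backing API refuses the call."""

    def __init__(self, verb: str, resource: str, namespace: str):
        super().__init__(f"forbidden: {verb} {resource} in {namespace}")
        self.verb, self.resource, self.namespace = verb, resource, namespace


@dataclass
class Tool:
    name: str
    capability: str            # feature flag that must be enabled
    is_write: bool
    run: Callable[..., Any]


@dataclass
class ToolRegistry:
    enabled: set[str]
    tools: dict[str, Tool] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)  # only ever appended to

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def dispatch(self, name: str, **args: Any) -> dict:
        tool = self.tools[name]
        # Disabled capabilities answer explicitly instead of failing silently.
        if tool.capability not in self.enabled:
            return {"status": "not_active", "capability": tool.capability}
        # Assumption: the write attempt is logged before the tool body runs.
        if tool.is_write:
            self.audit_log.append({"tool": name, "args": args})
        try:
            return {"status": "ok", "result": tool.run(**args)}
        except PermissionDenied as err:
            hint = (f"the ServiceAccount needs {err.resource}:{err.verb} "
                    f"on namespace {err.namespace}")
            return {"status": "permission_denied", "hint": hint}


def run_agent_loop(model_step: Callable[[list], dict],
                   registry: ToolRegistry, messages: list) -> str:
    """Call the model until it answers without requesting any tool."""
    while True:
        reply = model_step(messages)
        if not reply["tool_calls"]:
            return reply["text"]
        for call in reply["tool_calls"]:
            result = registry.dispatch(call["name"], **call["args"])
            messages.append({"role": "tool", "name": call["name"], "content": result})
```

Returning a structured result for every outcome, including gated and forbidden calls, gives the model something concrete to explain to the operator rather than an opaque failure.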
What's in the cluster, what's outside
| Stays in your cluster | Leaves your cluster |
|---|---|
| Cluster scan results, conventions, topology | Inference requests to Bedrock (your account) |
| Pending and decided write approvals | (nothing else) |
| Audit log of every write Clu has applied | |
| Health-check findings and snoozes | |
| Operator chat history | |
State persists to Kubernetes ConfigMaps in the Clu namespace, so a pod restart resumes with everything intact. Approvals, audit entries, and chat history all survive — operators don't lose context when AWS rolls the addon to a new version.
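To make the persistence idea concrete, the sketch below round-trips a state dictionary through a ConfigMap-shaped manifest. The function names, the ConfigMap name, and the state keys are illustrative assumptions; only the `clu-ops` namespace comes from this page. ConfigMap `data` values must be strings, so each entry is JSON-encoded.

```python
import json


def state_to_configmap(name: str, state: dict, namespace: str = "clu-ops") -> dict:
    """Build a ConfigMap manifest holding one JSON string per state key."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap data values are strings, so structured values are serialized.
        "data": {key: json.dumps(value, sort_keys=True) for key, value in state.items()},
    }


def state_from_configmap(manifest: dict) -> dict:
    """Recover the state dictionary after a restart."""
    return {key: json.loads(raw) for key, raw in manifest.get("data", {}).items()}
```

A restarted process would read the manifest back and call `state_from_configmap` to resume with the same approvals and history it had before.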
Safety model for writes
Writes are off by default. When a customer enables the Core Plus tier, the backend layers three independent gates on every write tool:
- Dry-run first — the tool plans the change, returns the diff, and pauses. Nothing reaches the API server yet.
- Operator approval — the diff lands in the Approvals view in the UI; a human clicks Approve or Reject. The agent has no path to approve its own writes.
- Protected-namespace block —
kube-system, kube-public, kube-node-lease, and clu-ops itself reject writes regardless of approval. The block lives in the writer's ClusterRole, not just in application code, so it can't be bypassed by a misbehaving agent.
Every approved write — and every rejected one — is recorded in the audit log with operator id, approval id, payload hash, and the result the cluster returned. See Write operations for the full safety model.
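The three gates and the audit record can be sketched as follows. This is an application-level illustration only: `WriteGate`, `plan`, `decide`, the `apr-N` id format, and the use of SHA-256 over canonical JSON for the payload hash are all assumptions, and the sketch does not model the RBAC-level block described above.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable

PROTECTED_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease", "clu-ops"}


def payload_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so equal payloads hash equally."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


@dataclass
class WriteGate:
    pending: dict[str, dict] = field(default_factory=dict)
    audit_log: list[dict] = field(default_factory=list)
    next_id: int = 1

    def plan(self, namespace: str, payload: dict) -> dict:
        """Dry run plus protected-namespace check. Nothing is applied here."""
        if namespace in PROTECTED_NAMESPACES:
            return {"status": "blocked", "reason": f"{namespace} is protected"}
        approval_id = f"apr-{self.next_id}"
        self.next_id += 1
        self.pending[approval_id] = {"namespace": namespace, "payload": payload}
        return {"status": "pending_approval", "approval_id": approval_id, "diff": payload}

    def decide(self, approval_id: str, operator_id: str, approve: bool,
               apply: Callable[[str, dict], str]) -> dict:
        """Operator decision. `apply` runs only when a human approved."""
        request = self.pending.pop(approval_id)
        if approve:
            result = apply(request["namespace"], request["payload"])
        else:
            result = "rejected"
        entry = {
            "operator_id": operator_id,
            "approval_id": approval_id,
            "payload_hash": payload_hash(request["payload"]),
            "result": result,
        }
        self.audit_log.append(entry)  # approvals and rejections are both recorded
        return entry
```

Keeping `plan` and `decide` as separate calls mirrors the pause between the dry run and the human decision: the agent can reach `plan`, but only the UI path would call `decide` with an operator id.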
Scheduled work
Clu runs three background jobs against your cluster, each on a tunable cadence:
- Health check (default 30 minutes) — runs the rule set, files findings, fires notifications when the critical-finding set changes.
- Knowledge refresh (default 1 hour) — re-scans the cluster for new resources and integrations. Newly-installed Argo or Kyverno is picked up without restarting the Clu pod.
- Entitlement re-verification (default 1 hour) — re-checks the AWS Marketplace entitlement so cancellations propagate without a restart.
Every job is wrapped in error handling that logs failures at ERROR
level and keeps the schedule firing.
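One way such a wrapper could look with asyncio is sketched below. The function name, the logger name, the job names in `DEFAULT_INTERVALS`, and the `max_runs` parameter (added only so the loop can be exercised in a test) are assumptions for illustration; the intervals are the defaults listed above.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

log = logging.getLogger("clu.scheduler")  # illustrative logger name

# Default cadences from the list above, in seconds. Key names are hypothetical.
DEFAULT_INTERVALS = {
    "health_check": 30 * 60,
    "knowledge_refresh": 60 * 60,
    "entitlement_reverify": 60 * 60,
}


async def run_periodically(
    name: str,
    interval_seconds: float,
    job: Callable[[], Awaitable[None]],
    max_runs: Optional[int] = None,
) -> int:
    """Run `job` on a fixed cadence. A failing run is logged and the loop goes on."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await job()
        except Exception:
            # Logger.exception records at ERROR level and includes the traceback.
            log.exception("scheduled job %s failed", name)
        runs += 1
        await asyncio.sleep(interval_seconds)
    return runs
```

Catching `Exception` rather than `BaseException` leaves task cancellation untouched, so the job can still be stopped cleanly on shutdown.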
Bring your own model
Bedrock is the default, but the LLM client is provider-neutral. The deployment can point at any OpenAI-compatible endpoint via Helm values — Azure OpenAI, OpenAI, or self-hosted Ollama all work for customers who prefer a non-Bedrock model. The tool surface and persona are identical regardless of provider; only the inference call changes.
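A provider-neutral client of this kind might be structured like the sketch below. The `InferenceClient` protocol, both client classes, and the `llm.provider`, `llm.modelId`, `llm.baseUrl`, and `llm.model` value keys are hypothetical and are not the chart's documented values. The stubs make no network calls and only echo the request shape.

```python
from dataclasses import dataclass
from typing import Protocol


class InferenceClient(Protocol):
    def complete(self, persona: str, messages: list[dict], tools: list[dict]) -> dict: ...


@dataclass
class BedrockClient:
    model_id: str

    def complete(self, persona: str, messages: list[dict], tools: list[dict]) -> dict:
        # A real client would call Bedrock here. This stub returns a summary only.
        return {"provider": "bedrock", "model": self.model_id, "tool_count": len(tools)}


@dataclass
class OpenAICompatibleClient:
    base_url: str
    model: str

    def complete(self, persona: str, messages: list[dict], tools: list[dict]) -> dict:
        # A real client would POST to an OpenAI-compatible endpoint at base_url.
        return {"provider": "openai-compatible", "model": self.model, "tool_count": len(tools)}


def build_client(values: dict) -> InferenceClient:
    """Select a client from a hypothetical `llm` section of the Helm values."""
    llm = values.get("llm", {})
    if llm.get("provider", "bedrock") == "bedrock":
        return BedrockClient(model_id=llm.get("modelId", "example-model"))
    return OpenAICompatibleClient(base_url=llm["baseUrl"], model=llm["model"])
```

Because both clients accept the same persona, messages, and tool list, the rest of the backend can stay unaware of which provider is configured.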
Packaging
Clu ships as multi-arch images (linux/amd64 + linux/arm64). The chart runs both containers as non-root UIDs so clusters with runAsNonRoot admission policies install cleanly. Both images and the Helm chart are distributed exclusively through AWS Marketplace ECR during the preview; see Install for the full flow.
Related:
- Capabilities overview — what each tier does and how they layer.
- IAM setup — every IAM policy needed, inline.
- Write operations — the approval-gated safety model in detail.
- Configuration reference — every Helm value with defaults.