Write Operations

Core Plus is the only tier that can change cluster state. Every write goes through an approval gate; Clu never silently applies a change.

The safety model

Non-negotiable invariants (hard-coded, not configurable):

  1. Dry-run always runs first. The tool handler renders the change and sends it to the K8s API with dryRun=All before asking for approval. If the dry-run fails (invalid manifest, admission webhook rejection, etc.), the request never reaches the approval queue.
  2. Operator approval is required. The tool's first invocation creates an ApprovalRequest and returns status: pending_approval. The second invocation (with approval_id) proceeds only if the operator clicked Approve.
  3. Protected namespaces are off-limits. kube-system, kube-public, kube-node-lease, and clu-ops reject writes at the gate regardless of RBAC or approval state.
  4. Narrow write ClusterRole. clu-ops-agent-writer has create/update/patch on workload resources. No delete. No cluster-admin. No secret-content reads.
  5. Every approved action is audit-logged. The ConfigMap-backed audit store captures every write with timestamp, tool name, module, duration, error state, args summary, and result excerpt.

These invariants are tested in backend/tests/unit/test_idp_base.py, test_write_safety.py, and test_permission_errors.py.
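The checks that run before an approval is queued (invariants 1 and 3) can be sketched as follows. This is a minimal illustration, not Clu's actual handler: `pre_approval_gate`, `GateResult`, and the `dry_run` callable are hypothetical names, and running the namespace check before the dry-run is an assumption, since the text above does not specify the relative order of the two.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Namespaces named in invariant 3.
PROTECTED_NAMESPACES = frozenset(
    {"kube-system", "kube-public", "kube-node-lease", "clu-ops"}
)


@dataclass
class GateResult:
    status: str  # "rejected", "dry_run_failed", or "pending_approval"
    reason: Optional[str] = None


def pre_approval_gate(
    namespace: str,
    manifest: dict,
    dry_run: Callable[[dict], Optional[str]],
) -> GateResult:
    """Run the checks that must pass before an approval request is queued.

    `dry_run` is a hypothetical callable that submits the manifest with
    dryRun=All and returns an error string, or None on success.
    """
    # Assumed order: protected namespaces are refused before any API call.
    if namespace in PROTECTED_NAMESPACES:
        return GateResult("rejected", f"namespace {namespace!r} is protected")

    # A failed dry-run means the request never reaches the approval queue.
    error = dry_run(manifest)
    if error is not None:
        return GateResult("dry_run_failed", error)

    return GateResult("pending_approval")
```

Only a request that clears both checks gets the `pending_approval` status; the operator never sees a card for a manifest the API server would refuse.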

The two-source approval flow

Approvals come from two places:

Source: chat

An operator asks Clu to do something. The agent calls a write tool. The tool dry-runs the change, creates an ApprovalRequest, and returns a pending status. The UI shows an inline approval card in the chat with the rendered manifest and Approve / Reject buttons.

Operator clicks Approve → the backend flips the approval state to approved. The next agent turn re-invokes the same tool with approval_id, the tool's gate sees the approved state, and the write actually runs.
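The two-invocation pattern described above might look roughly like this. It is a sketch under stated assumptions: the in-memory `approvals` dict stands in for the real approval store, and `write_tool`, `operator_approve`, and the `blocked` / `applied` status strings are hypothetical (only `pending_approval` and `approval_id` come from the text).

```python
import uuid
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the approval store: id -> state.
approvals: Dict[str, str] = {}


def write_tool(manifest: dict, approval_id: Optional[str] = None) -> dict:
    """First call stages an approval; a later call with approval_id applies."""
    if approval_id is None:
        # First invocation: stage the request and hand back its id.
        new_id = uuid.uuid4().hex
        approvals[new_id] = "pending"
        return {"status": "pending_approval", "approval_id": new_id}

    # Second invocation: proceed only if the operator approved.
    state = approvals.get(approval_id)
    if state != "approved":
        return {"status": "blocked", "approval_state": state}

    # The real tool would perform the cluster write here.
    return {"status": "applied"}


def operator_approve(approval_id: str) -> None:
    """What clicking Approve does to the stored state in this sketch."""
    if approvals.get(approval_id) == "pending":
        approvals[approval_id] = "approved"
```

Re-invoking the tool before the operator acts returns `blocked`, which is why the agent cannot apply a change on its own. Whether an approval is single-use is not modeled here.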

Source: report

A scheduled health-check rule emits a Recommendation attached to a finding. When reports.recommenderMode=false (Helm value), the Reporter auto-creates ApprovalRequests at cron time — the Approvals view populates overnight without any chat interaction.

The morning oncall opens the UI, sees "3 proposed fixes from last night's scan", reviews each, clicks Approve. The approve endpoint itself dispatches the write (there's no agent turn waiting to re-invoke the tool in this flow). Every dispatch lands in the audit log identically to chat-sourced approvals.

Advisory-only recommendations (no dispatchable tool payload — "raise memory limit" has no safe generic value) render an Acknowledge button instead of Approve/Reject. The operator clicks to dismiss; Clu applies nothing. The intent is honest UX: "we flagged the fix direction; size and apply manually."

The 24-hour TTL

Approvals expire 24 hours after creation by default. Before v0.0.1 the TTL was 5 minutes, which only made sense for actively watched chat approvals. Report-sourced approvals are staged overnight and reviewed the next morning, so they need longer.

Teams that expect a weekend to pass between the cron run and the approval can raise the TTL to 48 hours (172800 seconds):

approvals:
  ttlSeconds: 172800

Expired approvals flip to expired state on the next read; they no longer appear in the Pending list but remain in the audit trail for forensic traceability.
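"Flip to expired on the next read" describes lazy expiry: no background sweeper is needed, because the state is corrected whenever someone looks at it. A minimal sketch, assuming a hypothetical `Approval` record and `read_state` helper, and assuming that only pending approvals expire:

```python
from dataclasses import dataclass

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # 86400, the 24-hour default


@dataclass
class Approval:
    created_at: float  # seconds since epoch
    state: str = "pending"


def read_state(
    approval: Approval,
    now: float,
    ttl_seconds: int = DEFAULT_TTL_SECONDS,
) -> str:
    """Return the approval state, expiring it lazily if the TTL has passed."""
    # Assumption: only pending approvals are subject to expiry.
    if approval.state == "pending" and now - approval.created_at >= ttl_seconds:
        approval.state = "expired"
    return approval.state
```

With `ttl_seconds=172800`, an approval staged on Friday night is still pending 47 hours later, whereas the default would have expired it after 24.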

De-dupe

A rule firing every 30 minutes against the same finding doesn't stack duplicate approvals. Before creating a new approval, the Reporter checks for existing pending or approved approvals tagged with the same source_finding_id and skips if any exist.

Rejected / expired approvals don't block re-staging — the operator said no (or ignored it long enough to expire), so when the rule fires again, a fresh approval gets created. Design choice: the rule's job is to keep flagging what it sees; the operator's job is to decide.
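The de-dupe rule reduces to a single predicate. The sketch below assumes approvals are available as mappings with `source_finding_id` and `state` keys; `should_stage` and `BLOCKING_STATES` are hypothetical names.

```python
from typing import Iterable, Mapping

# Only these states suppress a new approval for the same finding.
BLOCKING_STATES = frozenset({"pending", "approved"})


def should_stage(finding_id: str, existing: Iterable[Mapping[str, str]]) -> bool:
    """True if no pending or approved approval already covers this finding."""
    return not any(
        a["source_finding_id"] == finding_id and a["state"] in BLOCKING_STATES
        for a in existing
    )
```

Because `rejected` and `expired` are absent from `BLOCKING_STATES`, a finding the operator declined is staged again the next time the rule fires.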

Per-tick cap

reports.maxAutoStagedPerTick (default 10) caps auto-staged approvals per scheduler tick so a rule that fires across every deployment in the cluster doesn't produce 50 simultaneous approval cards.

The cap is severity-ordered: critical findings stage first, warnings next, infos last. If the cap kicks in, the low-severity tail is dropped rather than starving the high-severity head.
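Severity-ordered capping amounts to a sort followed by a slice. In this sketch the finding shape (a dict with a `severity` key) and the names `SEVERITY_RANK` and `select_for_staging` are assumptions made for illustration.

```python
from typing import Dict, List

# Lower rank stages first.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


def select_for_staging(findings: List[Dict], max_per_tick: int = 10) -> List[Dict]:
    """Keep at most max_per_tick findings, most severe first."""
    # sorted() is stable, so findings of equal severity keep their
    # original relative order.
    ordered = sorted(findings, key=lambda f: SEVERITY_RANK[f["severity"]])
    return ordered[:max_per_tick]
```

Anything past the slice is simply not staged on this tick, so the items dropped are always the least severe ones.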

Dispatch failures

When the operator clicks Approve on a report-sourced approval, the dispatched write can still fail: RBAC was denied after staging, the target workload was deleted, or the admission webhook rejected the patch. The approval state still flips to approved (the operator's decision succeeded), but nothing changes in the cluster.

The UI surfaces a warning banner with the dispatch error on the next poll cycle. The audit log records both the approval and the failed dispatch attempt. The operator is expected to re-stage the fix from the finding or apply the write manually via kubectl.
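The separation between "the decision succeeded" and "the write succeeded" can be illustrated with a small sketch. The `approve_and_dispatch` function, the `dispatch` callable, and the `dispatch_error` field are hypothetical; the real approve endpoint may record the failure differently.

```python
from typing import Callable


def approve_and_dispatch(approval: dict, dispatch: Callable[[dict], None]) -> dict:
    """Record the operator's decision, then attempt the write."""
    # The decision is recorded first and is not rolled back on failure.
    approval["state"] = "approved"
    try:
        dispatch(approval["payload"])
        approval["dispatch_error"] = None
    except Exception as exc:  # e.g. RBAC denial, missing target, webhook rejection
        # Keep the error so a UI could show it on the next poll.
        approval["dispatch_error"] = str(exc)
    return approval
```

A non-empty `dispatch_error` alongside `state == "approved"` is the condition that would drive the warning banner in this model.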

Audit log

Every write tool dispatch (from either source) produces an audit entry with:

  • timestamp
  • kind — tool_call for writes (reads aren't audited)
  • name — tool name (k8s_apply, etc.)
  • module — cluster | cloud | idp
  • duration_ms
  • is_error
  • args_summary — compact summary of the tool input (capped at 280 chars).
  • result_excerpt — compact summary of the tool output (capped at 600 chars).

The audit log is persisted to a dedicated ConfigMap so a pod restart doesn't erase forensic history. Retention is capped at 500 entries by default (persistence.auditMaxRetained) — enough for a busy week while staying well under the K8s 1 MiB ConfigMap limit.
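The field caps and the retention limit can be sketched as below. The field names and limits come from the list above; `make_entry`, hard truncation by slicing, and the use of a bounded `deque` are assumptions for illustration (the real store serializes to a ConfigMap rather than holding entries in memory).

```python
from collections import deque
from datetime import datetime, timezone

ARGS_SUMMARY_MAX = 280     # cap on args_summary
RESULT_EXCERPT_MAX = 600   # cap on result_excerpt
AUDIT_MAX_RETAINED = 500   # default for persistence.auditMaxRetained


def make_entry(name: str, module: str, duration_ms: int,
               is_error: bool, args: str, result: str) -> dict:
    """Build one audit entry, truncating the free-text fields."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "kind": "tool_call",
        "name": name,
        "module": module,
        "duration_ms": duration_ms,
        "is_error": is_error,
        "args_summary": args[:ARGS_SUMMARY_MAX],
        "result_excerpt": result[:RESULT_EXCERPT_MAX],
    }


# A deque with maxlen silently discards the oldest entry once full.
audit_log = deque(maxlen=AUDIT_MAX_RETAINED)
```

Capping both the per-entry text and the entry count is what keeps the serialized log bounded regardless of how verbose a tool's output is.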

View the audit log in the Audit tab of the UI, or directly:

kubectl get cm -n clu-ops clu-ops-audit -o jsonpath='{.data.audit\.json}' | jq .

Golden paths

golden_path_apply renders a pre-blessed workload template and applies it as a single approval bundle. The catalog ships with:

  • standard-web-service — Deployment + Service + HPA + PDB.
  • scheduled-job — CronJob with sensible defaults.
  • internal-api — same as standard-web-service plus NetworkPolicy.
  • worker — Deployment sized for async workload patterns.

Each path auto-detects matching examples in the current cluster so the operator can see "this path matches the 5 services you already run in prod" before committing.