Write Operations
Core Plus is the only tier that can change cluster state. Every write goes through an approval gate; Clu never silently applies a change.
The safety model
Non-negotiable invariants (hard-coded, not configurable):
- Dry-run always runs first. The tool handler renders the change and sends it to the K8s API with dryRun=All before asking for approval. If the dry-run fails (invalid manifest, admission webhook rejection, etc.), the request never reaches the approval queue.
- Operator approval is required. The tool's first invocation creates an ApprovalRequest and returns status: pending_approval. The second invocation (with approval_id) proceeds only if the operator clicked Approve.
- Protected namespaces are off-limits. kube-system, kube-public, kube-node-lease, and clu-ops reject writes at the gate regardless of RBAC or approval state.
- Narrow write ClusterRole. clu-ops-agent-writer has create/update/patch on workload resources. No delete. No cluster-admin. No secret-content reads.
- Every approved action is audit-logged. The ConfigMap-backed audit store captures every write with timestamp, tool name, module, duration, error state, args summary, and result excerpt.
These invariants are tested in backend/tests/unit/test_idp_base.py,
test_write_safety.py, and test_permission_errors.py.
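The protected-namespace invariant can be pictured as a check that consults nothing but the target namespace. The following is a minimal sketch, not Clu's actual code: `PROTECTED_NAMESPACES`, `ProtectedNamespaceError`, and `check_namespace` are hypothetical names chosen for illustration.

```python
# Namespaces named in the invariants above. The set is hard-coded on purpose:
# the documented behavior is "not configurable".
PROTECTED_NAMESPACES = frozenset(
    {"kube-system", "kube-public", "kube-node-lease", "clu-ops"}
)


class ProtectedNamespaceError(Exception):
    """Hypothetical error type for a write aimed at a protected namespace."""


def check_namespace(namespace: str) -> None:
    """Raise if the namespace is protected.

    The function takes no RBAC or approval input, mirroring the rule that
    neither can override the rejection.
    """
    if namespace in PROTECTED_NAMESPACES:
        raise ProtectedNamespaceError(
            f"writes to namespace {namespace!r} are rejected at the gate"
        )
```

Keeping the check free of any approval or RBAC arguments makes it hard to add a bypass by accident.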
The two-source approval flow
Approvals come from two places:
Source: chat
An operator asks Clu to do something. The agent calls a write tool.
The tool dry-runs, creates an ApprovalRequest, returns a pending
status. The UI shows an inline approval card in the chat with the
rendered manifest + Approve / Reject buttons.
Operator clicks Approve → the backend flips the approval state to
approved. The next agent turn re-invokes the same tool with
approval_id, the tool's gate sees the approved state, and the write
actually runs.
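The two-invocation handshake described above can be sketched as follows. This is an illustration under stated assumptions, not the real tool handler: the in-memory `APPROVALS` store, the `write_tool` signature, the injected `dry_run` and `apply` callables, and the `not_approved` and `applied` status strings are all hypothetical. Only `pending_approval` and `approval_id` come from the text.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ApprovalRequest:
    manifest: dict
    state: str = "pending"  # pending -> approved | rejected
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


# Hypothetical in-memory store standing in for the real approval queue.
APPROVALS: Dict[str, ApprovalRequest] = {}


def write_tool(
    manifest: dict,
    dry_run: Callable[[dict], None],
    apply: Callable[[dict], None],
    approval_id: Optional[str] = None,
) -> dict:
    if approval_id is None:
        # First invocation: dry-run before queueing. If dry_run raises
        # (invalid manifest, webhook rejection), nothing is queued.
        dry_run(manifest)
        request = ApprovalRequest(manifest=manifest)
        APPROVALS[request.id] = request
        return {"status": "pending_approval", "approval_id": request.id}

    # Second invocation: the write runs only for an approved request.
    request = APPROVALS.get(approval_id)
    if request is None or request.state != "approved":
        return {"status": "not_approved", "approval_id": approval_id}
    apply(request.manifest)
    return {"status": "applied", "approval_id": approval_id}
```

In this sketch the second call applies the manifest stored with the approval rather than the one passed in again, so the operator approves exactly what gets written. Whether Clu does the same is not stated in the source.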
Source: report
A scheduled health-check rule emits a Recommendation attached to a
finding. When reports.recommenderMode=false (Helm value), the
Reporter auto-creates ApprovalRequests at cron time — the Approvals
view populates overnight without any chat interaction.
The morning oncall opens the UI, sees "3 proposed fixes from last night's scan", reviews each, clicks Approve. The approve endpoint itself dispatches the write (there's no agent turn waiting to re-invoke the tool in this flow). Every dispatch lands in the audit log identically to chat-sourced approvals.
Advisory-only recommendations (no dispatchable tool payload — "raise memory limit" has no safe generic value) render an Acknowledge button instead of Approve/Reject. The operator clicks to dismiss; Clu applies nothing. The intent is honest UX: "we flagged the fix direction; size and apply manually."
The 24-hour TTL
Approvals expire 24 hours after creation by default. This was 5 minutes pre-v0.0.1 — that only made sense for actively-watched chat approvals. Report-sourced approvals obviously need longer.
Teams who expect weekends between cron + approval should bump to 48h:
approvals:
  ttlSeconds: 172800
Expired approvals flip to expired state on the next read; they no
longer appear in the Pending list but remain in the audit trail for
forensic traceability.
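Expiry on read can be sketched as a lazy check at lookup time, with no background sweeper. The names `Approval` and `read_state` are hypothetical, and the sketch assumes that only pending approvals expire, which the text implies but does not state outright.

```python
import time
from dataclasses import dataclass
from typing import Optional

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # the documented 24-hour default


@dataclass
class Approval:
    created_at: float  # Unix timestamp of creation
    state: str = "pending"


def read_state(
    approval: Approval,
    ttl_seconds: int = DEFAULT_TTL_SECONDS,
    now: Optional[float] = None,
) -> str:
    """Return the approval's state, flipping stale pending ones to expired."""
    now = time.time() if now is None else now
    # Assumption: approved and rejected approvals are left untouched.
    if approval.state == "pending" and now - approval.created_at > ttl_seconds:
        approval.state = "expired"
    return approval.state
```

The record is mutated rather than deleted, which matches the expired entry staying available for the audit trail.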
De-dupe
A rule firing every 30 minutes against the same finding doesn't stack
duplicate approvals. Before creating a new approval, the Reporter
checks for existing pending or approved approvals tagged with the
same source_finding_id and skips if any exist.
Rejected / expired approvals don't block re-staging — the operator said no (or ignored it long enough to expire), so when the rule fires again, a fresh approval gets created. Design choice: the rule's job is to keep flagging what it sees; the operator's job is to decide.
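The de-dupe rule reduces to one predicate over existing approvals. This is a minimal sketch: `StagedApproval` and `should_stage` are hypothetical names, while `source_finding_id` and the four states come from the text.

```python
from dataclasses import dataclass
from typing import List

# Only these states block re-staging; rejected and expired do not.
BLOCKING_STATES = frozenset({"pending", "approved"})


@dataclass
class StagedApproval:
    source_finding_id: str
    state: str  # pending | approved | rejected | expired


def should_stage(finding_id: str, existing: List[StagedApproval]) -> bool:
    """True when no pending or approved approval exists for this finding."""
    return not any(
        a.source_finding_id == finding_id and a.state in BLOCKING_STATES
        for a in existing
    )
```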
Per-tick cap
reports.maxAutoStagedPerTick (default 10) caps auto-staged approvals
per scheduler tick so a rule that fires across every deployment in the
cluster doesn't produce 50 simultaneous approval cards.
The cap is severity-ordered: critical findings stage first, warnings next, infos last. If the cap kicks in, the low-severity tail is dropped rather than starving the high-severity head.
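Severity-ordered capping amounts to a stable sort followed by a slice. In the sketch below, `Finding`, `SEVERITY_RANK`, and `select_for_staging` are hypothetical names, and the severity strings `critical`, `warning`, and `info` are assumed spellings of the levels mentioned above.

```python
from dataclasses import dataclass
from typing import List

# Lower rank stages first.
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}


@dataclass
class Finding:
    id: str
    severity: str


def select_for_staging(findings: List[Finding], max_per_tick: int = 10) -> List[Finding]:
    """Keep the highest-severity findings, dropping the low-severity tail."""
    # sorted() is stable, so findings of equal severity keep their input order.
    ordered = sorted(findings, key=lambda f: SEVERITY_RANK[f.severity])
    return ordered[:max_per_tick]
```

Findings dropped by the cap are not lost in this sketch if the rule fires again on a later tick, since the de-dupe check only blocks findings that already have a pending or approved approval.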
Dispatch failures
When the operator clicks Approve on a report-sourced approval, the
dispatched write can still fail — RBAC denied after staging, the
target workload was deleted, the admission webhook rejected the
patch. The approval state flipped to approved regardless (the
operator's decision succeeded), but nothing changed in the cluster.
The UI surfaces a warning banner with the dispatch error on the next poll cycle. The audit log records both the approval and the failed dispatch attempt. The operator is expected to re-stage the fix from the finding or apply the write manually via kubectl.
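The separation between the operator's decision and the dispatch outcome can be sketched like this. `ReportApproval`, `approve`, and the `dispatch_error` field are hypothetical; the source only says that the state stays approved and that the error is surfaced afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReportApproval:
    payload: dict
    state: str = "pending"
    dispatch_error: Optional[str] = None  # hypothetical field for the warning banner


def approve(approval: ReportApproval, dispatch: Callable[[dict], None]) -> ReportApproval:
    # The decision is recorded first and is not rolled back on failure.
    approval.state = "approved"
    try:
        dispatch(approval.payload)
    except Exception as exc:  # e.g. RBAC denial, deleted target, webhook rejection
        approval.dispatch_error = str(exc)
    return approval
```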
Audit log
Every write tool dispatch (from either source) produces an audit entry with:
- timestamp
- kind: tool_call for writes (reads aren't audited)
- name: tool name (k8s_apply, etc.)
- module: cluster | cloud | idp
- duration_ms
- is_error
- args_summary: compact summary of the tool input (capped at 280 chars)
- result_excerpt: compact summary of the tool output (capped at 600 chars)
The audit log is persisted to a dedicated ConfigMap so a pod
restart doesn't erase forensic history. Retention is capped at 500
entries by default (persistence.auditMaxRetained) — enough for a
busy week while staying well under the K8s 1 MiB ConfigMap limit.
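The field caps and the retention limit can be sketched with a dataclass and a bounded deque. The field names and the 280/600/500 limits come from the text. The `AuditEntry` class itself, truncation by plain slicing, and dropping the oldest entry first are assumptions about how the limits are enforced.

```python
from collections import deque
from dataclasses import dataclass

ARGS_SUMMARY_MAX = 280
RESULT_EXCERPT_MAX = 600
AUDIT_MAX_RETAINED = 500  # default of persistence.auditMaxRetained


@dataclass
class AuditEntry:
    timestamp: str
    kind: str
    name: str
    module: str
    duration_ms: int
    is_error: bool
    args_summary: str
    result_excerpt: str

    def __post_init__(self) -> None:
        # Assumption: over-long summaries are cut by simple truncation.
        self.args_summary = self.args_summary[:ARGS_SUMMARY_MAX]
        self.result_excerpt = self.result_excerpt[:RESULT_EXCERPT_MAX]


# A deque with maxlen discards the oldest entry once the limit is reached.
audit_log = deque(maxlen=AUDIT_MAX_RETAINED)
```

At these caps, 500 entries carry at most about 440 KB of summary and excerpt text (500 × 880 characters, assuming single-byte characters) before JSON overhead, which is consistent with staying under the 1 MiB ConfigMap limit.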
View the audit log in the Audit tab of the UI, or directly:
kubectl get cm -n clu-ops clu-ops-audit -o jsonpath='{.data.audit\.json}' | jq .
Golden paths
golden_path_apply renders a pre-blessed workload template + applies
it as a single approval bundle. The catalog ships with:
- standard-web-service: Deployment + Service + HPA + PDB.
- scheduled-job: CronJob with sensible defaults.
- internal-api: same as standard-web-service plus NetworkPolicy.
- worker: Deployment sized for async workload patterns.
Each path auto-detects matching examples in the current cluster so
the operator can see "this path matches the 5 services you already
run in prod" before committing.