Automation & Playbooks

If-this-then-that automation: when DQ fails / anomalies fire / SLA breaches / freshness goes stale / schema drifts, run a remediation playbook.

Overview

A playbook is a named, ordered sequence of actions that run automatically when a trigger fires. Built-in triggers come from the rest of the platform — DQ failures, anomalies, SLA breaches, freshness staleness, schema drift, RTBF completion, clone failures — and built-in actions cover the obvious remediation paths plus an arbitrary "run SQL" / "run notebook" / "POST webhook" escape hatch.

Source: src/playbooks.py · /api/playbooks · UI at /automation/playbooks.

Trigger types

| Trigger | Fires when |
| --- | --- |
| dq_failure | A DQ rule or DQX check returns a passed=false row |
| anomaly | The anomaly detection engine emits a high-severity event |
| sla_breach | SLA monitor's freshness/availability/quality threshold is crossed |
| freshness_stale | A monitored table's last_updated_at is older than its SLA |
| schema_drift | Schema drift detector finds a breaking column change |

Trigger filters narrow the rule:

trigger_type: dq_failure
filter:
  table_pattern: "prod.fraud.*"
  severity: critical
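
As a rough illustration of how such a filter might narrow a trigger, the sketch below checks an event against the filter keys. The matching semantics (shell-style wildcards via `fnmatch`, exact equality on severity) and the event field names `table_fqn` and `severity` are assumptions for illustration, not confirmed behaviour of src/playbooks.py.

```python
from fnmatch import fnmatchcase


def event_matches(filter_spec: dict, event: dict) -> bool:
    """Return True when the event satisfies every key present in the filter."""
    pattern = filter_spec.get("table_pattern")
    # Assumption: table_pattern is a glob matched against the table's full name.
    if pattern is not None and not fnmatchcase(event.get("table_fqn", ""), pattern):
        return False
    severity = filter_spec.get("severity")
    # Assumption: severity must match exactly when the filter specifies it.
    if severity is not None and event.get("severity") != severity:
        return False
    return True


filter_spec = {"table_pattern": "prod.fraud.*", "severity": "critical"}
hit = event_matches(
    filter_spec, {"table_fqn": "prod.fraud.feature_txn", "severity": "critical"}
)
miss = event_matches(
    filter_spec, {"table_fqn": "prod.sales.orders", "severity": "critical"}
)
```

Under these assumptions an empty filter matches every event of the trigger type, and each added key narrows the match further.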

Action types

Every playbook can compose any of these actions:

| Action | What it does |
| --- | --- |
| notify_slack | Post a formatted message to a Slack channel |
| notify_pagerduty | Trigger a PD incident |
| notify_email | Send an email to a list / mailing list |
| quarantine_table | Move the offending table to a quarantine catalog and mark it do_not_consume |
| run_sql | Execute arbitrary SQL on a chosen warehouse |
| run_notebook | Run a Databricks notebook with parameters |
| post_webhook | POST a JSON payload to a URL |
| roll_back_clone | Revert the most recent clone via Delta RESTORE |
| pause_pipeline | Pause a downstream DLT pipeline |
| disable_serving_endpoint | Take an inference endpoint out of rotation |
| create_jira_issue | Open a tracked ticket in Jira |
| attach_dq_event_to_incident | Bundle the event into an existing incident for triage |

Author a playbook

name: fraud-table-quarantine
description: Move a fraud-detection feature table out of the way when DQ critical fails
trigger_type: dq_failure
filter:
  table_pattern: "prod.fraud.feature_*"
  severity: critical
max_executions_per_hour: 10
enabled: true
actions:
  - type: notify_pagerduty
    params:
      service: fraud-oncall
      summary: "Quarantining {{table_fqn}} due to {{rule_name}}"
  - type: quarantine_table
    params:
      target_catalog: prod_quarantine
  - type: pause_pipeline
    params:
      pipeline_id: fraud_scoring_pipeline
  - type: create_jira_issue
    params:
      project: DATA
      assignee: fraud-oncall
      labels: ["fraud", "auto-quarantine"]

To create the playbook, POST the definition above (converted from YAML to JSON) to /api/playbooks/, or author it interactively with the form at /automation/playbooks.
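
A minimal sketch of the API route, using only the Python standard library. The host in `BASE_URL` is a placeholder, the playbook is abbreviated to two actions, and the request carries no authentication; all of these are assumptions to adapt to your deployment. The sketch builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://your-workspace.example.com"  # hypothetical host

# The playbook definition expressed as a Python dict (abbreviated).
playbook = {
    "name": "fraud-table-quarantine",
    "trigger_type": "dq_failure",
    "filter": {"table_pattern": "prod.fraud.feature_*", "severity": "critical"},
    "max_executions_per_hour": 10,
    "enabled": True,
    "actions": [
        {
            "type": "notify_pagerduty",
            "params": {
                "service": "fraud-oncall",
                "summary": "Quarantining {{table_fqn}} due to {{rule_name}}",
            },
        },
        {"type": "quarantine_table", "params": {"target_catalog": "prod_quarantine"}},
    ],
}


def build_request(base_url: str, body: dict) -> urllib.request.Request:
    """Serialize the playbook to JSON and wrap it in a POST request."""
    return urllib.request.Request(
        url=f"{base_url}/api/playbooks/",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(BASE_URL, playbook)

# To send it, add whatever authentication your deployment requires, then:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```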

Rate limiting

max_executions_per_hour (default 10) prevents a storm of failures from triggering an unbounded number of playbook runs. Once the limit is exceeded, the trigger's run count is still incremented but no actions fire. Each rejection is logged to clone_audit.governance.playbook_executions so you can see what was suppressed.
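
The behaviour described here can be pictured with a small limiter. This is a sketch only: the class name `HourlyRateLimiter`, the rolling 60-minute window, and the in-memory `suppressed` list (standing in for the rows written to the executions table) are assumptions. The platform may use a fixed clock-hour window instead.

```python
import time
from collections import deque
from typing import Optional


class HourlyRateLimiter:
    """Allow at most `max_executions_per_hour` runs in any rolling hour."""

    def __init__(self, max_executions_per_hour: int = 10) -> None:
        self.limit = max_executions_per_hour
        self.fired = deque()    # timestamps of runs whose actions fired
        self.run_count = 0      # incremented for every trigger, even if suppressed
        self.suppressed = []    # stand-in for logged rejections

    def try_run(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        self.run_count += 1
        # Drop timestamps that have aged out of the one-hour window.
        while self.fired and now - self.fired[0] >= 3600:
            self.fired.popleft()
        if len(self.fired) >= self.limit:
            self.suppressed.append(now)  # record the rejection, fire nothing
            return False
        self.fired.append(now)
        return True


limiter = HourlyRateLimiter(max_executions_per_hour=10)
# Twelve triggers one second apart: ten fire, two are suppressed.
results = [limiter.try_run(now=1000.0 + i) for i in range(12)]
```

In this sketch the run count reaches 12 while only ten runs fire, which mirrors the distinction between counting a trigger and executing its actions.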

Templates

Built-in templates seed the playbooks list from /automation/playbooks:

  • PII leak triage: when PII detection finds untagged PII, notify the owner, tag the column, and schedule a re-scan in 24h
  • Freshness rescue: when freshness goes stale, ping the pipeline owner, rerun the upstream Job, and alert if the table is still stale after one hour
  • Schema drift block: when breaking schema drift is detected on a contract-bound table, pause the consumer pipeline and page the producer
  • Anomaly investigation: when an anomaly correlates with an upstream root cause, page the upstream owner with the correlation group attached

Execution history

Every playbook run produces a row in clone_audit.governance.playbook_executions:

execution_id, playbook_id, triggered_by, trigger_payload, started_at, completed_at, status, action_results[]

Query the table directly or browse runs via the Executions tab in the UI. A failed action doesn't kill the run: the playbook continues past per-action failures and surfaces them as degraded, so operators can fix the root cause and rerun only the failed action.
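
The continue-past-failure behaviour can be sketched as a loop that records each action's outcome instead of aborting. The function `run_playbook`, the per-action status strings `succeeded` and `failed`, the `failed_actions` field, and the choice to mark the whole run as `degraded` are assumptions made for illustration. Only the `status` and `action_results` field names come from the table schema above.

```python
from typing import Callable


def run_playbook(actions: list[tuple[str, Callable[[], None]]]) -> dict:
    """Run every action in order, recording failures without stopping."""
    action_results = []
    for name, action in actions:
        try:
            action()
            action_results.append({"action": name, "status": "succeeded"})
        except Exception as exc:
            # A failing action is recorded and the loop moves on.
            action_results.append(
                {"action": name, "status": "failed", "error": str(exc)}
            )
    failed = [r["action"] for r in action_results if r["status"] == "failed"]
    return {
        "status": "degraded" if failed else "succeeded",
        "action_results": action_results,
        "failed_actions": failed,  # candidates for a targeted rerun
    }


def pause_pipeline() -> None:
    # Simulated failure for the example.
    raise RuntimeError("pipeline not found")


run = run_playbook(
    [
        ("notify_pagerduty", lambda: None),
        ("pause_pipeline", pause_pipeline),
        ("create_jira_issue", lambda: None),
    ]
)
```

Here the Jira action still runs after the pipeline pause fails, and `failed_actions` identifies the single action to rerun once the root cause is fixed.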