Data Quality (DQX)

The DQX Quality Engine provides comprehensive data quality management for Databricks Unity Catalog — profiling, rule execution, anomaly detection, scheduling, and governance across your entire data estate.

Overview

DQX integrates with the open-source databricks-labs-dqx library and extends it with scheduling, audit logging, failure sampling, segmented checks, coverage reporting, quality gates, impact analysis, cross-table checks, root cause correlation, and profile drift detection.

Key capabilities

Capability	Description
57+ Check Functions	Null, range, format, temporal, geospatial, PII, custom SQL, aggregation
Table Profiling	Auto-discover data patterns and generate DQ checks
Anomaly Detection	Z-score based anomaly detection with configurable thresholds
Monitoring Scheduler	Background metric collection (row_count, null_rate, distinct_count, min, max, mean)
Expectation Suites	Group checks into reusable suites
DQ Coverage Report	See which tables have checks vs. which are uncovered
Failed Row Sampling	Concrete examples of failing rows for quick root cause analysis
Segmented Checks	Per-region, per-date quality breakdowns
DQ Gate	Block clone/sync if data quality fails a threshold
Cross-Table Checks	Referential integrity, aggregate match, row count comparison
Check Audit Log	Track who changed what check and when
DQ Schedules	Cron-based recurring DQ check runs
Root Cause Analysis	Correlate anomalies with upstream events
Profile Drift	Detect when data patterns change and recommend new checks

Profiling & auto-generated checks

Profile a table to discover data patterns and auto-generate DQX checks:

# Via API
curl -X POST http://localhost:8080/api/governance/dqx/profile \
  -H "Content-Type: application/json" \
  -d '{"table_fqn": "catalog.schema.my_table"}'

# Generate checks from profiles
curl -X POST http://localhost:8080/api/governance/dqx/generate \
  -H "Content-Type: application/json" \
  -d '{"table_fqn": "catalog.schema.my_table"}'

Profile drift detection

When data evolves, re-profile a table to detect drift and get recommendations:

curl -X POST http://localhost:8080/api/governance/dqx/profile-drift \
  -H "Content-Type: application/json" \
  -d '{"table_fqn": "catalog.schema.my_table"}'

Response:

{
  "table_fqn": "catalog.schema.my_table",
  "new_columns": ["new_col_a", "new_col_b"],
  "removed_columns": ["old_col"],
  "recommendations": [
    {
      "type": "new_column",
      "severity": "info",
      "column": "new_col_a",
      "description": "New column 'new_col_a' detected - consider adding check: is_not_null",
      "suggested_check": { "name": "Auto: is_not_null on new_col_a", "check_function": "is_not_null" }
    },
    {
      "type": "removed_column",
      "severity": "warning",
      "column": "old_col",
      "description": "Column 'old_col' no longer exists but has 2 active check(s) - consider removing",
      "orphan_check_ids": ["abc123", "def456"]
    }
  ],
  "total_recommendations": 3
}

Recommendation types:

new_column — New column discovered, suggest adding a check
removed_column — Column removed but still has active checks (orphans)
changed_pattern — Column's data pattern changed (e.g., was is_not_null, now is_in_list)
uncovered — Column has a profile but no corresponding check

DQ check management

Create a check

curl -X POST http://localhost:8080/api/governance/dqx/checks \
  -H "Content-Type: application/json" \
  -d '{
    "table_fqn": "catalog.schema.orders",
    "name": "Order amount must be positive",
    "check_function": "is_not_less_than",
    "arguments": {"column": "amount", "limit": 0},
    "criticality": "error"
  }'

Run checks

# Run all enabled checks for a table
curl -X POST http://localhost:8080/api/governance/dqx/run \
  -H "Content-Type: application/json" \
  -d '{"table_fqn": "catalog.schema.orders"}'

Response:

{
  "run_id": "a1b2c3d4",
  "table_fqn": "catalog.schema.orders",
  "total_rows": 50000,
  "valid_rows": 49850,
  "invalid_rows": 150,
  "pass_rate": 99.7,
  "checks_applied": 5,
  "execution_time_ms": 1234,
  "status": "completed"
}

Check audit log

Every check modification (create, update, enable/disable, delete) is tracked:

# Get audit log for a specific check
curl "http://localhost:8080/api/governance/dqx/checks/audit-log?check_id=abc123"

# Get audit log for all checks on a table
curl "http://localhost:8080/api/governance/dqx/checks/audit-log?table_fqn=catalog.schema.orders"

Response:

[
  {
    "audit_id": "x1y2z3",
    "check_id": "abc123",
    "table_fqn": "catalog.schema.orders",
    "action": "update",
    "changes": "{\"criticality\": \"warning\"}",
    "performed_by": "user@company.com",
    "performed_at": "2026-04-06T10:30:00Z"
  }
]

Tracked actions: create, update, delete, enable, disable, bulk_delete

Failed row sampling

When a DQ check fails, up to 10 sample failing rows are automatically persisted for drill-down analysis:

# Get failure samples for a run
curl "http://localhost:8080/api/data-quality/failure-samples?run_id=a1b2c3d4"

# Get failure samples for a table (latest runs)
curl "http://localhost:8080/api/data-quality/failure-samples?table_fqn=catalog.schema.orders"

Response:

[
  {
    "run_id": "a1b2c3d4",
    "table_fqn": "catalog.schema.orders",
    "sample_index": 0,
    "row_data": {"order_id": "ORD-999", "amount": -5.00, "customer": "John Doe"},
    "failed_checks": {"_is_not_less_than_amount": "FAILED"}
  }
]

tip

Failed row samples show you the actual data that failed, not just counts. This dramatically speeds up root cause analysis — instead of guessing why 150 rows failed, you can see the exact values.

Segmented checks

Run DQ checks segmented by a dimension column to catch localized quality problems:

curl -X POST http://localhost:8080/api/data-quality/segmented-run \
  -H "Content-Type: application/json" \
  -d '{
    "table_fqn": "catalog.schema.orders",
    "segment_column": "region"
  }'

Response:

{
  "run_id": "s1e2g3",
  "table_fqn": "catalog.schema.orders",
  "segment_column": "region",
  "segments": 4,
  "results": [
    {"segment_value": "APAC", "total_rows": 12000, "pass_rate": 72.5},
    {"segment_value": "EMEA", "total_rows": 15000, "pass_rate": 98.1},
    {"segment_value": "LATAM", "total_rows": 8000, "pass_rate": 99.2},
    {"segment_value": "NA", "total_rows": 15000, "pass_rate": 99.8}
  ],
  "status": "completed"
}

caution

In the example above, the overall table might pass at 95%+, but APAC is at 72.5% — a localized problem invisible to aggregate metrics.

DQ coverage report

See how much of your catalog is monitored:

curl "http://localhost:8080/api/data-quality/coverage/my_catalog"

Response:

{
  "catalog": "my_catalog",
  "total_tables": 92,
  "covered": 34,
  "uncovered": 58,
  "coverage_pct": 37.0,
  "covered_tables": [
    {"table_fqn": "my_catalog.sales.orders", "check_count": 5, "enabled_count": 5}
  ],
  "uncovered_tables": [
    {"table_fqn": "my_catalog.staging.raw_events"}
  ]
}

DQ check schedules

Schedule recurring DQ check runs using cron expressions:

# Create a schedule to run all checks for a table every hour
curl -X POST http://localhost:8080/api/data-quality/schedules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Hourly orders check",
    "schedule_type": "table",
    "table_fqn": "catalog.schema.orders",
    "cron": "0 * * * *"
  }'

# Create a schedule to run an expectation suite daily at midnight
curl -X POST http://localhost:8080/api/data-quality/schedules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Nightly full suite",
    "schedule_type": "suite",
    "suite_id": "suite_abc",
    "cron": "0 0 * * *"
  }'

# Run all enabled checks across all tables every 6 hours
curl -X POST http://localhost:8080/api/data-quality/schedules \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Full DQ sweep",
    "schedule_type": "all",
    "cron": "0 */6 * * *"
  }'

Schedule types

Type	Description
`table`	Run all enabled checks for a specific table
`suite`	Run an expectation suite
`checks`	Run specific check IDs
`all`	Run all enabled checks across all tables

Schedule management

# List all schedules
curl http://localhost:8080/api/data-quality/schedules

# Pause a schedule
curl -X POST http://localhost:8080/api/data-quality/schedules/{id}/pause

# Resume a schedule
curl -X POST http://localhost:8080/api/data-quality/schedules/{id}/resume

# Run a schedule immediately
curl -X POST http://localhost:8080/api/data-quality/schedules/{id}/run

# Delete a schedule
curl -X DELETE http://localhost:8080/api/data-quality/schedules/{id}

DQ gate for clone/sync

Block clone operations when data quality falls below a threshold:

Configuration

# clone_config.yaml
dq_gate:
  enabled: true
  table_fqn: "production.sales.orders"   # Check this table before clone
  # suite_id: "suite_abc"                # Or check an expectation suite
  min_pass_rate: 95.0                    # Block if below 95%

When dq_gate.enabled is true, the clone process runs the specified DQ checks before proceeding. If the pass rate is below min_pass_rate, the clone is blocked with an explanation.

Test the gate without cloning

curl -X POST http://localhost:8080/api/data-quality/gate/evaluate \
  -H "Content-Type: application/json" \
  -d '{
    "table_fqn": "production.sales.orders",
    "min_pass_rate": 95.0
  }'

Response (blocked):

{
  "passed": false,
  "pass_rate": 87.3,
  "min_pass_rate": 95.0,
  "total_rows": 50000,
  "invalid_rows": 6350,
  "checks_applied": 5,
  "reason": "DQ gate failed: pass rate 87.3% < threshold 95.0% (6350 invalid rows out of 50000)"
}

Cross-table consistency checks

Check data consistency across multiple tables:

Aggregate match

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
  -H "Content-Type: application/json" \
  -d '{
    "check_type": "aggregate_match",
    "params": {
      "source_table": "catalog.sales.orders",
      "dest_table": "catalog.sales.daily_summary",
      "source_expr": "sum(amount)",
      "dest_expr": "sum(total_amount)",
      "group_by": "order_date"
    }
  }'

Referential integrity

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
  -H "Content-Type: application/json" \
  -d '{
    "check_type": "referential_integrity",
    "params": {
      "child_table": "catalog.sales.orders",
      "child_column": "customer_id",
      "parent_table": "catalog.sales.customers",
      "parent_column": "id"
    }
  }'

Row count match

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
  -H "Content-Type: application/json" \
  -d '{
    "check_type": "row_count_match",
    "params": {
      "source_table": "catalog.bronze.events",
      "dest_table": "catalog.silver.events_cleaned",
      "tolerance_pct": 5
    }
  }'

Custom SQL

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
  -H "Content-Type: application/json" \
  -d '{
    "check_type": "custom_sql",
    "params": {
      "sql": "SELECT (SELECT count(*) FROM catalog.a.t1) = (SELECT count(*) FROM catalog.b.t2) AS passed, '\''counts match'\'' AS message"
    }
  }'

Impact analysis

When a DQ check fails, see which downstream tables, views, and jobs are affected:

curl "http://localhost:8080/api/data-quality/impact/catalog.schema.source_table"

Response:

{
  "table_fqn": "catalog.schema.source_table",
  "downstream_tables": [
    "catalog.silver.enriched_orders",
    "catalog.gold.revenue_dashboard"
  ],
  "downstream_views": [
    {"catalog": "catalog", "schema": "reports", "view": "monthly_revenue"}
  ],
  "referencing_jobs": [
    {"job_name": "nightly_etl", "job_id": "12345"}
  ],
  "total_affected": 4,
  "risk_level": "medium"
}

Root cause analysis

When an anomaly is detected, find correlated events that may explain it:

curl "http://localhost:8080/api/data-quality/root-cause/catalog.schema.orders?hours=24"

Response:

{
  "table_fqn": "catalog.schema.orders",
  "time_window_hours": 24,
  "probable_causes": [
    {
      "type": "correlated_anomalies",
      "confidence": "high",
      "description": "Multiple metrics anomalous on the same table: null_rate, row_count"
    },
    {
      "type": "upstream_anomaly",
      "confidence": "high",
      "description": "Anomalies detected on 2 upstream table(s)"
    },
    {
      "type": "freshness_gap",
      "confidence": "medium",
      "description": "Table is stale - last updated 36h ago"
    },
    {
      "type": "schema_change",
      "confidence": "medium",
      "description": "2 recent schema/metadata change(s) on this table"
    }
  ],
  "total_causes_found": 4
}

Cause types ranked by confidence:

correlated_anomalies — Multiple metrics spiking on the same table
upstream_anomaly — Anomalies on tables that feed this one (via lineage)
freshness_gap — Table went stale (not updated within threshold)
schema_change — Recent schema or metadata changes

Delta tables

DQX stores all data in Delta tables within the governance schema:

Table	Purpose
`dqx_profiles`	Profiled rules and patterns
`dqx_checks`	Check definitions
`dqx_run_results`	Check execution results
`dqx_check_definitions`	Check definitions from YAML/Delta sources
`dqx_check_audit_log`	Audit trail for all check modifications
`dqx_failure_samples`	Sample failing rows from runs
`dqx_segment_results`	Per-segment check results
`dq_check_schedules`	Scheduled DQ check run definitions

API reference

DQX checks (under `/api/governance`)

Method	Endpoint	Description
`POST`	`/dqx/checks`	Create a DQX check
`GET`	`/dqx/checks`	List checks (filter by `table_fqn`)
`PUT`	`/dqx/checks/{id}`	Update a check
`DELETE`	`/dqx/checks/{id}`	Delete a check
`POST`	`/dqx/checks/{id}/toggle`	Enable/disable a check
`POST`	`/dqx/checks/delete-bulk`	Bulk delete checks
`GET`	`/dqx/checks/audit-log`	Check modification audit log
`GET`	`/dqx/checks/export`	Export checks as YAML
`POST`	`/dqx/checks/import`	Import checks from YAML
`POST`	`/dqx/checks/save-to-delta`	Save checks to a Delta table
`POST`	`/dqx/run`	Execute DQX checks
`POST`	`/dqx/profile`	Profile a table
`POST`	`/dqx/profile-drift`	Detect profile drift & recommendations
`GET`	`/dqx/profiles`	List profiles
`GET`	`/dqx/dashboard`	DQX dashboard summary
`GET`	`/dqx/check-functions`	List available check functions
`POST`	`/dq/cross-table-check`	Cross-table consistency check

Data Quality endpoints (under `/api/data-quality`)

Method	Endpoint	Description
`GET`	`/coverage/{catalog}`	DQ coverage report
`GET`	`/failure-samples`	Failed row samples
`POST`	`/segmented-run`	Run checks by segment
`GET`	`/segment-results`	Segmented check results
`POST`	`/gate/evaluate`	Evaluate DQ quality gate
`GET`	`/impact/{table_fqn}`	Lineage-aware failure impact
`GET`	`/root-cause/{table_fqn}`	Root cause correlation
`GET`	`/schedules`	List DQ check schedules
`POST`	`/schedules`	Create a DQ check schedule
`DELETE`	`/schedules/{id}`	Delete a schedule
`POST`	`/schedules/{id}/pause`	Pause a schedule
`POST`	`/schedules/{id}/resume`	Resume a schedule
`POST`	`/schedules/{id}/run`	Run a schedule now
`GET`	`/health-score/{catalog}`	Aggregate DQ health score
`GET`	`/scorecard/{table_fqn}`	Per-table DQ scorecard
`GET`	`/freshness/{catalog}`	Table freshness check
`GET`	`/anomalies`	Recent anomalies
`GET`	`/incidents`	Unified incident feed

Configuration

# clone_config.yaml

# Anomaly detection thresholds
anomaly_detection:
  baseline_window: 30          # Number of measurements for baseline
  warning_threshold: 2.0       # z-score for warning
  critical_threshold: 3.0      # z-score for critical
  system_table_sources:
    billing: true
    compute: true
    query_history: true
    storage: true

# DQX engine settings
dqx:
  auto_save_to_delta: true
  default_target_table: "catalog.schema.dqx_rules"

# DQ gate for clone operations
dq_gate:
  enabled: false
  table_fqn: ""               # Table to check before clone
  suite_id: ""                # Or expectation suite to run
  min_pass_rate: 95.0         # Block if below this threshold

Next steps

Data Observability — unified health dashboard
Governance — SLA rules, data contracts, RBAC
PII Detection — detect and protect sensitive data

Overview​

Key capabilities​

Profiling & auto-generated checks​

Profile drift detection​

DQ check management​

Create a check​

Run checks​

Check audit log​

Failed row sampling​

Segmented checks​

DQ coverage report​

DQ check schedules​

Schedule types​

Schedule management​

DQ gate for clone/sync​

Configuration​

Test the gate without cloning​

Cross-table consistency checks​

Aggregate match​

Referential integrity​

Row count match​

Custom SQL​

Impact analysis​

Root cause analysis​

Delta tables​

API reference​

DQX checks (under /api/governance)​

Data Quality endpoints (under /api/data-quality)​

Configuration​

Next steps​

Overview

Key capabilities

Profiling & auto-generated checks

Profile drift detection

DQ check management

Create a check

Run checks

Check audit log

Failed row sampling

Segmented checks

DQ coverage report

DQ check schedules

Schedule types

Schedule management

DQ gate for clone/sync

Configuration

Test the gate without cloning

Cross-table consistency checks

Aggregate match

Referential integrity

Row count match

Custom SQL

Impact analysis

Root cause analysis

Delta tables

API reference

DQX checks (under `/api/governance`)

Data Quality endpoints (under `/api/data-quality`)

Configuration

Next steps