Skip to main content

Data Quality (DQX)

The DQX Quality Engine provides comprehensive data quality management for Databricks Unity Catalog — profiling, rule execution, anomaly detection, scheduling, and governance across your entire data estate.

Overview

DQX integrates with the open-source databricks-labs-dqx library and extends it with scheduling, audit logging, failure sampling, segmented checks, coverage reporting, quality gates, impact analysis, cross-table checks, root cause correlation, and profile drift detection.

Key capabilities

CapabilityDescription
57+ Check FunctionsNull, range, format, temporal, geospatial, PII, custom SQL, aggregation
Table ProfilingAuto-discover data patterns and generate DQ checks
Anomaly DetectionZ-score based anomaly detection with configurable thresholds
Monitoring SchedulerBackground metric collection (row_count, null_rate, distinct_count, min, max, mean)
Expectation SuitesGroup checks into reusable suites
DQ Coverage ReportSee which tables have checks vs. which are uncovered
Failed Row SamplingConcrete examples of failing rows for quick root cause analysis
Segmented ChecksPer-region, per-date quality breakdowns
DQ GateBlock clone/sync if data quality fails a threshold
Cross-Table ChecksReferential integrity, aggregate match, row count comparison
Check Audit LogTrack who changed what check and when
DQ SchedulesCron-based recurring DQ check runs
Root Cause AnalysisCorrelate anomalies with upstream events
Profile DriftDetect when data patterns change and recommend new checks

Profiling & auto-generated checks

Profile a table to discover data patterns and auto-generate DQX checks:

# Via API
curl -X POST http://localhost:8080/api/governance/dqx/profile \
-H "Content-Type: application/json" \
-d '{"table_fqn": "catalog.schema.my_table"}'

# Generate checks from profiles
curl -X POST http://localhost:8080/api/governance/dqx/generate \
-H "Content-Type: application/json" \
-d '{"table_fqn": "catalog.schema.my_table"}'

Profile drift detection

When data evolves, re-profile a table to detect drift and get recommendations:

curl -X POST http://localhost:8080/api/governance/dqx/profile-drift \
-H "Content-Type: application/json" \
-d '{"table_fqn": "catalog.schema.my_table"}'

Response:

{
"table_fqn": "catalog.schema.my_table",
"new_columns": ["new_col_a", "new_col_b"],
"removed_columns": ["old_col"],
"recommendations": [
{
"type": "new_column",
"severity": "info",
"column": "new_col_a",
"description": "New column 'new_col_a' detected - consider adding check: is_not_null",
"suggested_check": { "name": "Auto: is_not_null on new_col_a", "check_function": "is_not_null" }
},
{
"type": "removed_column",
"severity": "warning",
"column": "old_col",
"description": "Column 'old_col' no longer exists but has 2 active check(s) - consider removing",
"orphan_check_ids": ["abc123", "def456"]
}
],
"total_recommendations": 3
}

Recommendation types:

  • new_column — New column discovered, suggest adding a check
  • removed_column — Column removed but still has active checks (orphans)
  • changed_pattern — Column's data pattern changed (e.g., was is_not_null, now is_in_list)
  • uncovered — Column has a profile but no corresponding check

DQ check management

Create a check

curl -X POST http://localhost:8080/api/governance/dqx/checks \
-H "Content-Type: application/json" \
-d '{
"table_fqn": "catalog.schema.orders",
"name": "Order amount must be positive",
"check_function": "is_not_less_than",
"arguments": {"column": "amount", "limit": 0},
"criticality": "error"
}'

Run checks

# Run all enabled checks for a table
curl -X POST http://localhost:8080/api/governance/dqx/run \
-H "Content-Type: application/json" \
-d '{"table_fqn": "catalog.schema.orders"}'

Response:

{
"run_id": "a1b2c3d4",
"table_fqn": "catalog.schema.orders",
"total_rows": 50000,
"valid_rows": 49850,
"invalid_rows": 150,
"pass_rate": 99.7,
"checks_applied": 5,
"execution_time_ms": 1234,
"status": "completed"
}

Check audit log

Every check modification (create, update, enable/disable, delete) is tracked:

# Get audit log for a specific check
curl "http://localhost:8080/api/governance/dqx/checks/audit-log?check_id=abc123"

# Get audit log for all checks on a table
curl "http://localhost:8080/api/governance/dqx/checks/audit-log?table_fqn=catalog.schema.orders"

Response:

[
{
"audit_id": "x1y2z3",
"check_id": "abc123",
"table_fqn": "catalog.schema.orders",
"action": "update",
"changes": "{\"criticality\": \"warning\"}",
"performed_by": "user@company.com",
"performed_at": "2026-04-06T10:30:00Z"
}
]

Tracked actions: create, update, delete, enable, disable, bulk_delete


Failed row sampling

When a DQ check fails, up to 10 sample failing rows are automatically persisted for drill-down analysis:

# Get failure samples for a run
curl "http://localhost:8080/api/data-quality/failure-samples?run_id=a1b2c3d4"

# Get failure samples for a table (latest runs)
curl "http://localhost:8080/api/data-quality/failure-samples?table_fqn=catalog.schema.orders"

Response:

[
{
"run_id": "a1b2c3d4",
"table_fqn": "catalog.schema.orders",
"sample_index": 0,
"row_data": {"order_id": "ORD-999", "amount": -5.00, "customer": "John Doe"},
"failed_checks": {"_is_not_less_than_amount": "FAILED"}
}
]
tip

Failed row samples show you the actual data that failed, not just counts. This dramatically speeds up root cause analysis — instead of guessing why 150 rows failed, you can see the exact values.


Segmented checks

Run DQ checks segmented by a dimension column to catch localized quality problems:

curl -X POST http://localhost:8080/api/data-quality/segmented-run \
-H "Content-Type: application/json" \
-d '{
"table_fqn": "catalog.schema.orders",
"segment_column": "region"
}'

Response:

{
"run_id": "s1e2g3",
"table_fqn": "catalog.schema.orders",
"segment_column": "region",
"segments": 4,
"results": [
{"segment_value": "APAC", "total_rows": 12000, "pass_rate": 72.5},
{"segment_value": "EMEA", "total_rows": 15000, "pass_rate": 98.1},
{"segment_value": "LATAM", "total_rows": 8000, "pass_rate": 99.2},
{"segment_value": "NA", "total_rows": 15000, "pass_rate": 99.8}
],
"status": "completed"
}
caution

In the example above, the overall table might pass at 95%+, but APAC is at 72.5% — a localized problem invisible to aggregate metrics.


DQ coverage report

See how much of your catalog is monitored:

curl "http://localhost:8080/api/data-quality/coverage/my_catalog"

Response:

{
"catalog": "my_catalog",
"total_tables": 92,
"covered": 34,
"uncovered": 58,
"coverage_pct": 37.0,
"covered_tables": [
{"table_fqn": "my_catalog.sales.orders", "check_count": 5, "enabled_count": 5}
],
"uncovered_tables": [
{"table_fqn": "my_catalog.staging.raw_events"}
]
}

DQ check schedules

Schedule recurring DQ check runs using cron expressions:

# Create a schedule to run all checks for a table every hour
curl -X POST http://localhost:8080/api/data-quality/schedules \
-H "Content-Type: application/json" \
-d '{
"name": "Hourly orders check",
"schedule_type": "table",
"table_fqn": "catalog.schema.orders",
"cron": "0 * * * *"
}'

# Create a schedule to run an expectation suite daily at midnight
curl -X POST http://localhost:8080/api/data-quality/schedules \
-H "Content-Type: application/json" \
-d '{
"name": "Nightly full suite",
"schedule_type": "suite",
"suite_id": "suite_abc",
"cron": "0 0 * * *"
}'

# Run all enabled checks across all tables every 6 hours
curl -X POST http://localhost:8080/api/data-quality/schedules \
-H "Content-Type: application/json" \
-d '{
"name": "Full DQ sweep",
"schedule_type": "all",
"cron": "0 */6 * * *"
}'

Schedule types

TypeDescription
tableRun all enabled checks for a specific table
suiteRun an expectation suite
checksRun specific check IDs
allRun all enabled checks across all tables

Schedule management

# List all schedules
curl http://localhost:8080/api/data-quality/schedules

# Pause a schedule
curl -X POST http://localhost:8080/api/data-quality/schedules/{id}/pause

# Resume a schedule
curl -X POST http://localhost:8080/api/data-quality/schedules/{id}/resume

# Run a schedule immediately
curl -X POST http://localhost:8080/api/data-quality/schedules/{id}/run

# Delete a schedule
curl -X DELETE http://localhost:8080/api/data-quality/schedules/{id}

DQ gate for clone/sync

Block clone operations when data quality falls below a threshold:

Configuration

# clone_config.yaml
dq_gate:
enabled: true
table_fqn: "production.sales.orders" # Check this table before clone
# suite_id: "suite_abc" # Or check an expectation suite
min_pass_rate: 95.0 # Block if below 95%

When dq_gate.enabled is true, the clone process runs the specified DQ checks before proceeding. If the pass rate is below min_pass_rate, the clone is blocked with an explanation.

Test the gate without cloning

curl -X POST http://localhost:8080/api/data-quality/gate/evaluate \
-H "Content-Type: application/json" \
-d '{
"table_fqn": "production.sales.orders",
"min_pass_rate": 95.0
}'

Response (blocked):

{
"passed": false,
"pass_rate": 87.3,
"min_pass_rate": 95.0,
"total_rows": 50000,
"invalid_rows": 6350,
"checks_applied": 5,
"reason": "DQ gate failed: pass rate 87.3% < threshold 95.0% (6350 invalid rows out of 50000)"
}

Cross-table consistency checks

Check data consistency across multiple tables:

Aggregate match

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
-H "Content-Type: application/json" \
-d '{
"check_type": "aggregate_match",
"params": {
"source_table": "catalog.sales.orders",
"dest_table": "catalog.sales.daily_summary",
"source_expr": "sum(amount)",
"dest_expr": "sum(total_amount)",
"group_by": "order_date"
}
}'

Referential integrity

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
-H "Content-Type: application/json" \
-d '{
"check_type": "referential_integrity",
"params": {
"child_table": "catalog.sales.orders",
"child_column": "customer_id",
"parent_table": "catalog.sales.customers",
"parent_column": "id"
}
}'

Row count match

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
-H "Content-Type: application/json" \
-d '{
"check_type": "row_count_match",
"params": {
"source_table": "catalog.bronze.events",
"dest_table": "catalog.silver.events_cleaned",
"tolerance_pct": 5
}
}'

Custom SQL

curl -X POST http://localhost:8080/api/governance/dq/cross-table-check \
-H "Content-Type: application/json" \
-d '{
"check_type": "custom_sql",
"params": {
"sql": "SELECT (SELECT count(*) FROM catalog.a.t1) = (SELECT count(*) FROM catalog.b.t2) AS passed, '\''counts match'\'' AS message"
}
}'

Impact analysis

When a DQ check fails, see which downstream tables, views, and jobs are affected:

curl "http://localhost:8080/api/data-quality/impact/catalog.schema.source_table"

Response:

{
"table_fqn": "catalog.schema.source_table",
"downstream_tables": [
"catalog.silver.enriched_orders",
"catalog.gold.revenue_dashboard"
],
"downstream_views": [
{"catalog": "catalog", "schema": "reports", "view": "monthly_revenue"}
],
"referencing_jobs": [
{"job_name": "nightly_etl", "job_id": "12345"}
],
"total_affected": 4,
"risk_level": "medium"
}

Root cause analysis

When an anomaly is detected, find correlated events that may explain it:

curl "http://localhost:8080/api/data-quality/root-cause/catalog.schema.orders?hours=24"

Response:

{
"table_fqn": "catalog.schema.orders",
"time_window_hours": 24,
"probable_causes": [
{
"type": "correlated_anomalies",
"confidence": "high",
"description": "Multiple metrics anomalous on the same table: null_rate, row_count"
},
{
"type": "upstream_anomaly",
"confidence": "high",
"description": "Anomalies detected on 2 upstream table(s)"
},
{
"type": "freshness_gap",
"confidence": "medium",
"description": "Table is stale - last updated 36h ago"
},
{
"type": "schema_change",
"confidence": "medium",
"description": "2 recent schema/metadata change(s) on this table"
}
],
"total_causes_found": 4
}

Cause types ranked by confidence:

  • correlated_anomalies — Multiple metrics spiking on the same table
  • upstream_anomaly — Anomalies on tables that feed this one (via lineage)
  • freshness_gap — Table went stale (not updated within threshold)
  • schema_change — Recent schema or metadata changes

Delta tables

DQX stores all data in Delta tables within the governance schema:

TablePurpose
dqx_profilesProfiled rules and patterns
dqx_checksCheck definitions
dqx_run_resultsCheck execution results
dqx_check_definitionsCheck definitions from YAML/Delta sources
dqx_check_audit_logAudit trail for all check modifications
dqx_failure_samplesSample failing rows from runs
dqx_segment_resultsPer-segment check results
dq_check_schedulesScheduled DQ check run definitions

API reference

DQX checks (under /api/governance)

MethodEndpointDescription
POST/dqx/checksCreate a DQX check
GET/dqx/checksList checks (filter by table_fqn)
PUT/dqx/checks/{id}Update a check
DELETE/dqx/checks/{id}Delete a check
POST/dqx/checks/{id}/toggleEnable/disable a check
POST/dqx/checks/delete-bulkBulk delete checks
GET/dqx/checks/audit-logCheck modification audit log
GET/dqx/checks/exportExport checks as YAML
POST/dqx/checks/importImport checks from YAML
POST/dqx/checks/save-to-deltaSave checks to a Delta table
POST/dqx/runExecute DQX checks
POST/dqx/profileProfile a table
POST/dqx/profile-driftDetect profile drift & recommendations
GET/dqx/profilesList profiles
GET/dqx/dashboardDQX dashboard summary
GET/dqx/check-functionsList available check functions
POST/dq/cross-table-checkCross-table consistency check

Data Quality endpoints (under /api/data-quality)

MethodEndpointDescription
GET/coverage/{catalog}DQ coverage report
GET/failure-samplesFailed row samples
POST/segmented-runRun checks by segment
GET/segment-resultsSegmented check results
POST/gate/evaluateEvaluate DQ quality gate
GET/impact/{table_fqn}Lineage-aware failure impact
GET/root-cause/{table_fqn}Root cause correlation
GET/schedulesList DQ check schedules
POST/schedulesCreate a DQ check schedule
DELETE/schedules/{id}Delete a schedule
POST/schedules/{id}/pausePause a schedule
POST/schedules/{id}/resumeResume a schedule
POST/schedules/{id}/runRun a schedule now
GET/health-score/{catalog}Aggregate DQ health score
GET/scorecard/{table_fqn}Per-table DQ scorecard
GET/freshness/{catalog}Table freshness check
GET/anomaliesRecent anomalies
GET/incidentsUnified incident feed

Configuration

# clone_config.yaml

# Anomaly detection thresholds
anomaly_detection:
baseline_window: 30 # Number of measurements for baseline
warning_threshold: 2.0 # z-score for warning
critical_threshold: 3.0 # z-score for critical
system_table_sources:
billing: true
compute: true
query_history: true
storage: true

# DQX engine settings
dqx:
auto_save_to_delta: true
default_target_table: "catalog.schema.dqx_rules"

# DQ gate for clone operations
dq_gate:
enabled: false
table_fqn: "" # Table to check before clone
suite_id: "" # Or expectation suite to run
min_pass_rate: 95.0 # Block if below this threshold

Next steps