Column Profiling

The Profiling page at /data-quality/profiling runs statistical profiles on every column of a target table: distributions, null rates, distinct counts, type inference, and pattern detection. Use it to understand a new dataset, establish baseline expectations, or spot quality issues without writing any rules.

What's profiled

Per column, Clone-Xs computes:

| Stat | Notes |
| --- | --- |
| Type | Inferred + declared |
| Null count / rate | Across the full table |
| Distinct count | Approximate via HLL for big tables |
| Min / max | For numeric and date columns |
| Mean / stddev | Numeric only |
| Quantiles | P25, P50, P75, P95, P99 |
| Top values | Top 20 by frequency |
| Patterns | Regex inference (e.g. emails, UUIDs, phone numbers) |
| Cardinality class | Categorical / continuous / unique |
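
To make the stats listed above concrete, here is a minimal in-memory sketch in Python of how a few of them could be computed for one column. It is illustrative only and is not Clone-Xs's implementation: the function name `profile_column`, the loose email regex, and the choice of sample standard deviation are all assumptions, and the distinct count is exact here rather than HLL-based.

```python
import re
import statistics
from collections import Counter

# Deliberately loose, illustrative pattern; real email validation is stricter.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values, top_n=20):
    """Compute a handful of per-column stats; None is treated as null."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    null_count = total - len(non_null)
    profile = {
        "null_count": null_count,
        "null_rate": null_count / total if total else 0.0,
        "distinct_count": len(counts),  # exact count, not an HLL estimate
        "top_values": counts.most_common(top_n),
    }

    # Numeric stats only when every non-null value is an int or float.
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(non_null):
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.fmean(numeric)
        if len(numeric) >= 2:
            # Sample standard deviation (assumption; the docs do not specify).
            profile["stddev"] = statistics.stdev(numeric)
            # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
            cuts = statistics.quantiles(numeric, n=100, method="inclusive")
            profile["quantiles"] = {
                f"P{p}": cuts[p - 1] for p in (25, 50, 75, 95, 99)
            }

    # Pattern badge only when every non-null string matches.
    strings = [v for v in non_null if isinstance(v, str)]
    if strings and len(strings) == len(non_null):
        if all(EMAIL_RE.match(s) for s in strings):
            profile["patterns"] = ["Email-like"]
    return profile
```

Calling `profile_column([1, 2, None, 4, 2])` returns a dict with the null, distinct, top-value, and numeric stats for that column; a column of email strings gets an `Email-like` pattern entry instead of numeric stats.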

Running a profile

Pick a table and click Profile. The page calls:

POST /data-quality/profiling
{ "table_fqn": "prod_warehouse.sales.customers" }
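
The same call can be made from a script. The sketch below builds the request with Python's standard library; the host `clone-xs.example.com` is a placeholder, and authentication and the response shape are not covered here because this page does not describe them.

```python
import json
import urllib.request

# Hypothetical host; replace with your own Clone-Xs deployment.
BASE_URL = "https://clone-xs.example.com"


def build_profile_request(table_fqn, base_url=BASE_URL):
    """Build (but do not send) the POST request that starts a profile run."""
    body = json.dumps({"table_fqn": table_fqn}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/data-quality/profiling",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_profile_request("prod_warehouse.sales.customers")
# To send it: urllib.request.urlopen(req)
# Any auth headers your deployment needs would have to be added first.
```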

For tables ≥ 1 GB, the profiler samples by default, taking min(1M rows, 5% of rows), that is, whichever is smaller. To override the default:

POST /data-quality/profiling
{ "table_fqn": "...", "sample": "full" } # or { "rows": 5000000 }
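
As a worked example of the default sampling rule, the sketch below computes the sample size for a given table. Several details are assumptions that this page does not state: that 1 GB means 10^9 bytes, that tables under the threshold are profiled in full, and that the 5% figure is rounded down.

```python
ONE_GB = 1_000_000_000  # assumption: decimal GB; the docs do not say GB vs GiB
MAX_SAMPLE_ROWS = 1_000_000
SAMPLE_PERCENT = 5


def default_sample_rows(row_count, size_bytes):
    """Rows profiled by default under the min(1M rows, 5%) rule."""
    if size_bytes < ONE_GB:
        # Assumption: tables below the threshold are not sampled.
        return row_count
    # Integer arithmetic rounds the 5% figure down (assumption).
    return min(MAX_SAMPLE_ROWS, row_count * SAMPLE_PERCENT // 100)
```

Under these assumptions, a 2 GB table with 10M rows is sampled at 500,000 rows (5% is the smaller value), while a 5 GB table with 100M rows is capped at 1M rows.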

Result UI

Per-column cards show:

  • Histogram (numeric) or top-values bar chart (categorical)
  • Stat table (the metrics above)
  • Pattern badges (e.g. Email-like, UUID-like)
  • Expand-to-detail with sample values

Saving baselines

Profiles can be saved as a baseline for later comparison (Schema Drift uses the saved baseline to flag changes):

POST /data-quality/profiling/baseline
{ "table_fqn": "...", "name": "2026-04-30_baseline" }
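
If baselines are saved from a scheduled job, a date-stamped name keeps them easy to tell apart. The sketch below builds such a payload; the `YYYY-MM-DD_baseline` naming pattern is inferred from the example above and is not a documented requirement, and `baseline_payload` is a hypothetical helper.

```python
from datetime import date


def baseline_payload(table_fqn, on=None):
    """Build the JSON body for saving a baseline, named after a date."""
    day = on or date.today()
    return {"table_fqn": table_fqn, "name": f"{day.isoformat()}_baseline"}
```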

Generating rules from profiles

The Generate Rules button on each column scaffolds DQX rules from the profile:

  • Null-rate thresholds from observed null %
  • Range checks from observed min/max
  • Pattern checks from inferred regex
  • Distinct-count expectations from cardinality

You review and edit the generated rules before saving; nothing is saved automatically.
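
The mapping from profile stats to rules can be pictured roughly as follows. This is a sketch only: the rule dictionaries, check names, profile keys, and the `null_margin` headroom are all hypothetical and do not represent the DQX rule format or the actual output of Generate Rules.

```python
def scaffold_rules(column, profile, null_margin=0.01):
    """Derive draft rule dicts from one column's profile (hypothetical format)."""
    rules = []

    # Null-rate threshold: observed rate plus a small headroom, capped at 1.0.
    if "null_rate" in profile:
        threshold = round(min(1.0, profile["null_rate"] + null_margin), 4)
        rules.append(
            {"column": column, "check": "null_rate_at_most", "threshold": threshold}
        )

    # Range check from observed min/max.
    if "min" in profile and "max" in profile:
        rules.append(
            {"column": column, "check": "in_range",
             "min": profile["min"], "max": profile["max"]}
        )

    # Pattern check from the inferred regex, if one was found.
    if profile.get("pattern_regex"):
        rules.append(
            {"column": column, "check": "matches_regex",
             "regex": profile["pattern_regex"]}
        )

    # Distinct-count expectation driven by the cardinality class.
    cardinality = profile.get("cardinality_class")
    if cardinality == "unique":
        rules.append({"column": column, "check": "is_unique"})
    elif cardinality == "categorical" and "distinct_count" in profile:
        rules.append(
            {"column": column, "check": "distinct_count_at_most",
             "limit": profile["distinct_count"]}
        )
    return rules
```

Because the thresholds come from a single observed profile, they describe the data as it was at profiling time, which is why the review step before saving matters.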

API

POST /data-quality/profiling
GET /data-quality/profiling/baselines?table_fqn=...
POST /data-quality/profiling/baseline
GET /data-quality/profiling/{run_id}
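
For the two GET endpoints, the main thing to get right from a script is URL encoding of the query string and path segment. The sketch below shows one way to build those URLs; the base URL is a placeholder, the helper names are hypothetical, and how a `run_id` is obtained is not described on this page.

```python
from urllib.parse import quote, urlencode


def baselines_url(base_url, table_fqn):
    """URL for listing the saved baselines of one table."""
    query = urlencode({"table_fqn": table_fqn})
    return f"{base_url}/data-quality/profiling/baselines?{query}"


def run_url(base_url, run_id):
    """URL for fetching one profiling run; the id is percent-encoded."""
    return f"{base_url}/data-quality/profiling/{quote(str(run_id), safe='')}"
```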