Advanced Features
Status note: Two of the features below (Continuous Sync, Streaming/MV Data Clone) are preview only in v0.11.0 — the endpoints return a runnable plan but Clone-Xs does not auto-execute it. Full execution ships in v0.12.0.
This page covers four production-ready features and two preview scaffolds for capabilities that are one release away. Each has a dedicated endpoint; most are also surfaced in the UI.
Schema evolution
Source: src/schema_evolution.py · api/routers/schema_evolution.py
When to use: the source catalog picked up new columns since your last clone and you want to update the destination in place rather than re-clone the whole table. Additive changes (ADD COLUMN, compatible type widening) apply cleanly; destructive changes (DROP COLUMN) require explicit opt-in.
How it works
- POST /api/schema-evolution/detect: runs a column-level diff via information_schema.columns for one table. Returns added_columns, removed_columns, changed_columns, and an is_compatible flag.
- POST /api/schema-evolution/apply: generates ALTER TABLE … ADD COLUMN, DROP COLUMN (opt-in), and ALTER COLUMN … SET DATA TYPE statements from the detect response. Pass dry_run: true to preview the SQL without executing.
- POST /api/schema-evolution/evolve-catalog: detects and applies across every table in the catalog in parallel. Use this after an incremental sync to bring the destination structure back into alignment.
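As a rough illustration of the detect step, the sketch below diffs two column maps of the kind you could read from information_schema.columns. It is not the Clone-Xs implementation: the function name, the entry shapes (name/type, from_type/to_type), the SAFE_WIDENINGS table, and the rule used for is_compatible are all assumptions made for this example.

```python
from typing import Dict

# Assumed widening table for illustration only; the set Delta actually
# accepts depends on the runtime and is not taken from Clone-Xs.
SAFE_WIDENINGS = {("int", "bigint"), ("float", "double")}


def diff_columns(source: Dict[str, str], dest: Dict[str, str]) -> dict:
    """Compare {column_name: data_type} maps for a single table."""
    added = [{"name": c, "type": t} for c, t in source.items() if c not in dest]
    removed = [{"name": c, "type": t} for c, t in dest.items() if c not in source]
    changed = [
        {"name": c, "from_type": dest[c], "to_type": t}
        for c, t in source.items()
        if c in dest and dest[c] != t
    ]
    # Assumed rule: compatible means nothing would be dropped and every
    # type change is a known widening.
    is_compatible = not removed and all(
        (ch["from_type"], ch["to_type"]) in SAFE_WIDENINGS for ch in changed
    )
    return {
        "added_columns": added,
        "removed_columns": removed,
        "changed_columns": changed,
        "is_compatible": is_compatible,
    }
```

Keeping the diff a pure function of two dictionaries makes it easy to test without a warehouse connection; the catalog queries can be layered on separately.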
Usage
# Detect drift on a single table
curl -X POST $CLXS_HOST/api/schema-evolution/detect \
-d '{"source_catalog":"prod","destination_catalog":"staging","schema_name":"orders","table_name":"line_items"}'
# Apply the returned changes (dry run)
curl -X POST $CLXS_HOST/api/schema-evolution/apply \
-d '{
"destination_catalog":"staging",
"schema_name":"orders",
"table_name":"line_items",
"changes": { ... from detect ... },
"dry_run": true
}'
# Or fire-and-forget across the whole catalog
curl -X POST $CLXS_HOST/api/schema-evolution/evolve-catalog \
-d '{"source_catalog":"prod","destination_catalog":"staging","dry_run":false}'
Limits
- ALTER COLUMN … SET DATA TYPE is limited by Delta: only certain widenings are supported (e.g. int → bigint, not string → int). Failures are logged and reported in the errors array of the response; the table is skipped, not corrupted.
- Column drops are off by default. Pass drop_removed: true only when you're certain the destination shouldn't carry the old column.
- Not a replacement for Incremental Sync: schema evolution aligns structure, Incremental Sync aligns data.
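The opt-in drop and dry-run behaviour described above can be sketched as follows. This is a minimal illustration, not the Clone-Xs code: build_statements, apply_changes, the execute callback, and the shape of the changes entries are hypothetical, and identifier quoting is left out for brevity. The statement shapes mirror the ones named in this section.

```python
from typing import Callable, List


def build_statements(table: str, changes: dict, drop_removed: bool = False) -> List[str]:
    """Turn a detect-style diff into ALTER TABLE statements."""
    stmts = [
        f"ALTER TABLE {table} ADD COLUMN {c['name']} {c['type']}"
        for c in changes.get("added_columns", [])
    ]
    stmts += [
        f"ALTER TABLE {table} ALTER COLUMN {c['name']} SET DATA TYPE {c['to_type']}"
        for c in changes.get("changed_columns", [])
    ]
    if drop_removed:  # destructive, so strictly opt-in
        stmts += [
            f"ALTER TABLE {table} DROP COLUMN {c['name']}"
            for c in changes.get("removed_columns", [])
        ]
    return stmts


def apply_changes(
    table: str,
    changes: dict,
    execute: Callable[[str], None],
    dry_run: bool = True,
    drop_removed: bool = False,
) -> dict:
    """Preview the SQL, or run it and collect failures instead of raising."""
    statements = build_statements(table, changes, drop_removed)
    errors = []
    if not dry_run:
        for sql in statements:
            try:
                execute(sql)
            except Exception as exc:  # e.g. an unsupported type change
                errors.append({"sql": sql, "error": str(exc)})
                break  # stop working on this table after the first failure
    return {"statements": statements, "executed": not dry_run, "errors": errors}
```

Defaulting dry_run to True in the sketch means a caller has to ask explicitly before anything is executed, which matches the cautious stance of the limits listed above.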
Cross-metastore reconciliation
Source: src/cross_metastore_recon.py · api/routers/cross_metastore_recon.py
When to use: after a cross-workspace migration you want to verify every table on the destination matches the source. The built-in validation endpoint is same-workspace only — this one spans two metastores via two WorkspaceClients.
How it works
Two layers of check, both running in parallel across tables:
- Row counts: SELECT COUNT(*) on each side. Cheap, and catches the common "some tables didn't clone" failure mode.
- Optional SHA-256 checksums: sha2(cast(sum(xxhash64(concat_ws('|', columns...))) as string), 256) summed over hashable columns (excludes ARRAY / MAP / STRUCT). Catches silent data drift. Slower, because it reads the full table.
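The row-count layer can be pictured as a thread pool fanning out over tables. The sketch below assumes two caller-supplied functions that return a row count for a (schema, table) pair, one per workspace; those callbacks, and the names compare_table and reconcile, are hypothetical stand-ins rather than the Clone-Xs API. The status field is left out because its exact rules are not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

CountFn = Callable[[str, str], int]  # (schema, table) -> row count


def compare_table(schema: str, table: str, source_count: CountFn, target_count: CountFn) -> dict:
    """Count one table on both sides; report errors instead of raising."""
    try:
        s = source_count(schema, table)
        t = target_count(schema, table)
    except Exception as exc:
        return {"schema": schema, "table": table, "error": str(exc), "match": False}
    return {"schema": schema, "table": table,
            "source_count": s, "target_count": t, "match": s == t}


def reconcile(tables: List[Tuple[str, str]], source_count: CountFn,
              target_count: CountFn, max_workers: int = 4) -> dict:
    """Compare all tables in parallel and summarise the outcome."""
    def work(item: Tuple[str, str]) -> dict:
        return compare_table(item[0], item[1], source_count, target_count)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        details = list(pool.map(work, tables))  # map() keeps input order

    errors = sum(1 for d in details if "error" in d)
    matched = sum(1 for d in details if d["match"])
    return {
        "table_count": len(details),
        "matched": matched,
        "mismatched": len(details) - matched - errors,
        "errors": errors,
        "details": details,
    }
```

Threads are a reasonable fit because each task spends its time waiting on a warehouse query rather than on local CPU.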
Response shape:
{
"status": "match | partial | mismatch | failed",
"table_count": 611,
"matched": 609,
"mismatched": 2,
"errors": 0,
"use_checksum": false,
"details": [
{"schema":"bronze","table":"orders","source_count":123456,"target_count":123456,"match":true}
]
}
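When consuming this response from a script, a small helper can pull out the tables that need attention. The sketch below relies only on the fields shown in the response shape above; the function name is made up for this example.

```python
from typing import List


def failing_tables(report: dict) -> List[str]:
    """Return schema.table names for every detail row that did not match."""
    return [
        f"{d['schema']}.{d['table']}"
        for d in report.get("details", [])
        if not d.get("match")
    ]
```

A CI gate could, for example, fail the job whenever status is anything other than match and print this list as the reason.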
Usage
curl -X POST $CLXS_HOST/api/reconciliation/cross-metastore \
-d '{
"source_catalog": "prod",
"destination_catalog": "prod_dr",
"target_workspace": {
"host": "https://adb-target.azuredatabricks.net",
"auth_method": "pat",
"token": "dapi...",
"warehouse_id": "abc123"
},
"use_checksum": false,
"max_workers": 4
}'
Use checksum mode (use_checksum: true) for spot checks rather than scheduled jobs, because it reads every row. For scheduled jobs, stick to row counts.
Clone signing / provenance
Source: src/clone_provenance.py · api/routers/clone_provenance.py
When to use: compliance workflows where you need to later prove "this catalog prod_audit_2026_04 is the tamper-evident result of Clone-Xs run X with config Y at time T". HMAC-SHA256 signs a canonical manifest of the clone's config + result summary; tampering with any field breaks the signature.
Enabling
export CLONE_XS_SIGNING_SECRET="<base64-encoded-random-bytes>"
Without the env var set, sign endpoints return {"signed": false, "reason": "..."} — no crypto failure, just a clear message.
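One way to produce a value for the variable is Python's secrets module. The 32-byte length is an assumption chosen for this example (it matches the SHA-256 output size), not a documented Clone-Xs requirement.

```python
import base64
import secrets


def new_signing_secret(num_bytes: int = 32) -> str:
    """Return cryptographically random bytes, base64-encoded as ASCII text."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Prints a ready-to-paste shell line; keep the value out of version control.
    print(f'export CLONE_XS_SIGNING_SECRET="{new_signing_secret()}"')
```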
How it works
- Canonicalize: drop sensitive fields (token, client_secret, target_workspace, etc.) and runtime-nondeterministic fields (logs, run_url), then json.dumps with sort_keys=True, separators=(',',':'). Two independent signings of the same logical clone agree on a byte-identical canonical form.
- Sign: HMAC-SHA256(secret, canonical).hexdigest(). Envelope returned: {signed, algorithm, signature, manifest}.
- Verify: re-canonicalize the manifest, recompute the HMAC, and compare in constant time (hmac.compare_digest).
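The three steps map directly onto Python's standard library. The sketch below is an illustration under stated assumptions, not the contents of clone_provenance.py: the dropped-field set covers only the names listed above (the "etc." is not filled in), stripping nested dictionaries recursively is a guess, and the secret is passed in as raw bytes because how Clone-Xs decodes the environment variable is not specified here.

```python
import hashlib
import hmac
import json

# Only the fields named in this section; the real module may drop more.
DROPPED_FIELDS = {"token", "client_secret", "target_workspace", "logs", "run_url"}


def _strip(value):
    """Remove dropped fields at every nesting level (assumed behaviour)."""
    if isinstance(value, dict):
        return {k: _strip(v) for k, v in value.items() if k not in DROPPED_FIELDS}
    if isinstance(value, list):
        return [_strip(v) for v in value]
    return value


def canonicalize(manifest: dict) -> bytes:
    """Deterministic bytes: sorted keys, no insignificant whitespace."""
    return json.dumps(_strip(manifest), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(manifest: dict, secret: bytes) -> dict:
    if not secret:
        # Mirrors the "no secret configured" response described earlier.
        return {"signed": False, "reason": "signing secret not configured"}
    signature = hmac.new(secret, canonicalize(manifest), hashlib.sha256).hexdigest()
    return {
        "signed": True,
        "algorithm": "HMAC-SHA256",
        "signature": signature,
        "manifest": _strip(manifest),
    }


def verify(envelope: dict, secret: bytes) -> bool:
    expected = hmac.new(secret, canonicalize(envelope["manifest"]), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, envelope.get("signature", ""))
```

Because the envelope stores the already-stripped manifest, verification re-canonicalizes exactly what was signed; changing any remaining field changes the bytes and therefore the HMAC.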
Usage
# Sign a completed job
curl -X POST $CLXS_HOST/api/provenance/sign/<job_id>
# Or sign an externally-constructed manifest
curl -X POST $CLXS_HOST/api/provenance/sign \
-d '{"source_catalog":"prod","destination_catalog":"prod_audit","config":{...},"result":{...}}'
# Verify later
curl -X POST $CLXS_HOST/api/provenance/verify \
-d '<the signed envelope from above>'
This is tamper-evidence, not proof against an attacker with the secret. The HMAC secret is as sensitive as a database password — rotate if exposed, and don't check it into the repo. For stronger guarantees, sign the envelope with an external KMS and put its key ID in the manifest.
AI-suggested config (Clone Builder)
Source: src/ai_service.py · api/routers/ai.py · ui/src/components/CloneBuilder.tsx
When to use: you know what you want in English but don't remember which combination of flags matches. The Clone Builder translates "I need a 14-day dev copy of prod without PII" into a ready-to-run CloneRequest config.
How it works
The existing /api/ai/clone-builder endpoint forwards your natural-language query plus the list of available catalogs to the configured AI backend: Databricks Model Serving or the Anthropic API. Select the backend with the X-Databricks-Model header on the request, or with the DATABRICKS_MODEL / ANTHROPIC_API_KEY environment variables on the server. The service returns a structured config JSON plus a short explanation of which flags it picked and why.
Usage
UI: the Clone Builder modal in the header bar — ⌘K → "Clone Builder", or the sparkles icon.
API:
curl -X POST $CLXS_HOST/api/ai/clone-builder \
-H "X-Databricks-Model: databricks-dbrx-instruct" \
-d '{
"query": "14-day dev copy of retail_prod, exclude PII columns, skip unused tables",
"available_catalogs": ["retail_prod","retail_dev","retail_staging"]
}'
Response:
{
"config": {
"source_catalog": "retail_prod",
"destination_catalog": "retail_dev",
"clone_type": "SHALLOW",
"ttl": "14d",
"skip_unused": true,
"masking": {...}
},
"explanation": "Using SHALLOW for low-cost dev; 14d TTL auto-expires; skip_unused cuts scope; masking applied to columns matching pii_patterns."
}
The AI can't know your column semantics — always review the suggested masking rules and filter config before executing.
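Part of that review can be automated before a human looks at the masking rules. The sketch below runs a few sanity checks on a suggested config using only the fields shown in the example response; the function name and the specific checks are assumptions for illustration, and passing them does not replace a manual review.

```python
from typing import List


def review_suggestion(config: dict, available_catalogs: List[str]) -> List[str]:
    """Return human-readable warnings for a suggested clone config."""
    warnings = []
    for key in ("source_catalog", "destination_catalog"):
        if config.get(key) not in available_catalogs:
            warnings.append(f"{key} {config.get(key)!r} is not in the available catalogs")
    if config.get("source_catalog") == config.get("destination_catalog"):
        warnings.append("source and destination are the same catalog")
    if not config.get("masking"):
        warnings.append("no masking rules suggested; confirm PII handling manually")
    return warnings
```

An empty list only means the obvious mistakes are absent; whether the masking rules cover the right columns still has to be checked by someone who knows the data.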
Continuous sync (preview)
Source: src/continuous_sync.py · api/routers/continuous_sync.py
This endpoint returns a runnable streaming job plan but Clone-Xs does not auto-submit it. Paste the plan into your scheduler, or wait for the v0.12.0 execution engine.
When it'll be useful: DR setups where minutes-fresh replicas are required. Unlike the scheduled Incremental Sync, which is polling-based, continuous sync uses Structured Streaming against the source's Change Data Feed, giving seconds-level lag and checkpoint-based recovery.
What the plan contains
POST /api/continuous-sync/plan returns:
- Declarative job spec: run_name, task list, a placeholder new_cluster for the caller to fill in
- Inlined Python body: a readStream … writeStream template with CDF options preset, plus a clear TODO block where the v0.12.0 engine will inject the per-table MERGE logic
- Prerequisites list: e.g. "CDF enabled on every source table in scope"
- Checkpoint root: defaults to /Volumes/<dest>/_sys/continuous_sync if not provided
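To give a feel for what a readStream … writeStream template of this kind looks like, here is a hedged PySpark sketch. It is not the body Clone-Xs generates: the helper names are hypothetical, reading <dest> as the destination catalog is an assumption, and the stream simply appends raw change rows (including the CDF metadata columns) instead of merging them, which is the gap the TODO block is reserved for. The spark argument is an existing SparkSession supplied by the caller.

```python
from typing import Optional


def checkpoint_root(destination_catalog: str, override: Optional[str] = None) -> str:
    """Fall back to the documented default when no root is provided."""
    return override or f"/Volumes/{destination_catalog}/_sys/continuous_sync"


def trigger_interval(trigger_ms: int) -> str:
    """Express the trigger in the interval-string form Spark accepts."""
    return f"{trigger_ms} milliseconds"


def start_table_stream(spark, source_table: str, dest_table: str,
                       checkpoint: str, trigger_ms: int = 30000):
    """Stream one table's change feed into the destination (append only)."""
    changes = (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")  # requires CDF enabled on the source
        .table(source_table)
    )
    # TODO: replace the plain append with per-table MERGE logic.
    return (
        changes.writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(processingTime=trigger_interval(trigger_ms))
        .toTable(dest_table)
    )
```

Giving each table its own sub-directory under the checkpoint root lets one stream be restarted without disturbing the others; the exact layout is left to the caller here.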
Usage
curl -X POST $CLXS_HOST/api/continuous-sync/plan \
-d '{
"source_catalog": "prod",
"destination_catalog": "prod_dr",
"schema_name": "orders",
"trigger_ms": 30000
}'
Save the response JSON and submit it to a scheduler of your choice (Databricks Jobs CLI, Terraform, Airflow). The generated Python is valid and runs, but it does not yet include Clone-Xs's full MERGE-with-PK logic; that lands in v0.12.0.
Streaming / MV data clone (preview)
Source: src/streaming_clone_generator.py · api/routers/streaming_clone_generator.py
This endpoint generates the DLT pipeline spec and notebook SQL. It does not auto-create the pipeline: the caller POSTs the spec to /api/2.0/pipelines, or waits for v0.12.0.
When it'll be useful: the existing Advanced Tables Clone flow migrates MV + streaming-table definitions but does not repopulate the data — because data in MVs and streaming tables can only be built by running a DLT pipeline. This module closes the loop by generating that pipeline.
What the plan contains
POST /api/streaming-clone/generate returns:
- pipeline_spec: consumable by client.pipelines.create() as-is. Lands in development mode for safety.
- notebook_sql: one CREATE OR REFRESH MATERIALIZED VIEW … or CREATE OR REFRESH STREAMING TABLE … statement per advanced table, with source SQL bodies rewritten from the input
- notebook_path: where to paste the SQL before creating the pipeline
- next_steps: 3-step caller instructions
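The notebook_sql field can be pictured as the output of a small renderer over the advanced_tables input. The sketch below is a guess at the general shape, not the generator's actual code: the function name is hypothetical, and rewriting the source catalog prefix to the destination catalog with a regular expression is an assumed interpretation of "rewritten from the input" that would miss quoted or aliased references.

```python
import re
from typing import List

KEYWORDS = {
    "MATERIALIZED_VIEW": "MATERIALIZED VIEW",
    "STREAMING_TABLE": "STREAMING TABLE",
}


def render_notebook_sql(advanced_tables: List[dict], source_catalog: str,
                        destination_catalog: str) -> str:
    """Emit one CREATE OR REFRESH statement per advanced table."""
    # Match the catalog only as a whole identifier followed by a dot.
    prefix = re.compile(rf"\b{re.escape(source_catalog)}\.")
    statements = []
    for table in advanced_tables:
        keyword = KEYWORDS[table["table_type"]]
        body = prefix.sub(lambda _m: f"{destination_catalog}.", table["source_sql"])
        statements.append(f"CREATE OR REFRESH {keyword} {table['name']} AS\n{body};")
    return "\n\n".join(statements)
```

An unknown table_type raises KeyError rather than producing a malformed statement, which keeps bad input from reaching the pipeline.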
Usage
curl -X POST $CLXS_HOST/api/streaming-clone/generate \
-d '{
"source_catalog": "prod",
"destination_catalog": "prod_dr",
"schema_name": "analytics",
"advanced_tables": [
{"name": "daily_sales", "table_type": "MATERIALIZED_VIEW", "source_sql": "SELECT ... FROM prod.bronze.sales ..."},
{"name": "live_orders", "table_type": "STREAMING_TABLE", "source_sql": "SELECT ... FROM STREAM(prod.bronze.orders)"}
]
}'
The response includes the full notebook SQL. Paste it into a notebook at the notebook_path, then POST the pipeline_spec to /api/2.0/pipelines and trigger a full-refresh update. The new pipeline populates both the MVs and the streaming tables on the destination.