
Cloning

The clone command replicates an entire Unity Catalog catalog — schemas, tables, views, functions, and volumes — to a new destination catalog.

Docs: Unity Catalog | CREATE TABLE CLONE

Basic usage

# Minimal — uses config file defaults
clxs clone

# Override source and destination from CLI
clxs clone --source production --dest sandbox

# With all the bells and whistles
clxs clone \
--source production --dest sandbox \
--clone-type DEEP \
--validate --enable-rollback --report --progress \
-v --log-file clone_sandbox.log

Config (config/clone_config.yaml)

source_catalog: "production"
destination_catalog: "sandbox"
clone_type: "DEEP"
sql_warehouse_id: "abc123def456"
max_workers: 4
copy_permissions: true
copy_ownership: true
copy_tags: true
load_type: "FULL"
exclude_schemas:
- "information_schema"
- "default"

How the clone engine works

Source: src/clone_catalog.py, clone_tables.py, clone_views.py, clone_functions.py, clone_volumes.py

When you'll reach for this: debugging a clone that behaved unexpectedly, deciding which stages to toggle for performance, or understanding why grants didn't transfer. See Use Cases for scenario-driven feature combinations.

A clone runs in five stages: catalog → schemas → tables → views/functions/volumes → metadata. Each stage is its own SQL batch and can be toggled via config. Schemas run in parallel up to max_workers (default 4); within a schema, tables clone in parallel up to parallel_tables (default 1, i.e. sequential).

Stage 1 — Catalog

CREATE CATALOG IF NOT EXISTS <dest> [MANAGED LOCATION '<location>']

If location (or catalog_location) is set, it's applied as the managed storage root. After creation, catalog-level grants are replayed (SHOW GRANTS ON CATALOG <src> → GRANT … ON CATALOG <dest> TO …) when copy_permissions=true. Ownership (ALTER CATALOG … OWNER TO …) and catalog tags are copied when their flags are on.

Stage 2 — Schemas

CREATE SCHEMA IF NOT EXISTS <dest>.<schema>

get_schemas filters the source by include_schemas / exclude_schemas. Each schema is submitted to a ThreadPoolExecutor(max_workers=max_workers) so schemas clone concurrently. Grants / ownership / tags replay at the schema level right after creation, before any tables are cloned into it.
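The filter-then-fan-out step can be sketched as follows. This is a minimal illustration, not the code in src/clone_catalog.py: select_schemas and clone_schemas are hypothetical names, the rule that a non-empty include list takes precedence over the exclude list is an assumption, and recording a per-schema exception instead of aborting is a choice made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def select_schemas(all_schemas, include_schemas=None, exclude_schemas=()):
    """Assumed rule: a non-empty include list wins; otherwise drop excluded names."""
    if include_schemas:
        wanted = set(include_schemas)
        return [s for s in all_schemas if s in wanted]
    blocked = set(exclude_schemas)
    return [s for s in all_schemas if s not in blocked]


def clone_schemas(schemas, clone_one_schema, max_workers=4):
    """Run clone_one_schema(name) for each schema, at most max_workers at once."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(clone_one_schema, name): name for name in schemas}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # sketch choice: record the error, keep going
                results[name] = exc
    return results
```

In a real run, clone_one_schema would issue the CREATE SCHEMA statement and then clone the schema's tables; here it can be any callable, which keeps the sketch testable without a warehouse.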

Stage 3 — Tables

The core of the engine. For every source table:

CREATE TABLE IF NOT EXISTS <dest>.<schema>.<table>
DEEP CLONE <src>.<schema>.<table>
[TIMESTAMP AS OF '2026-01-15 00:00:00' | VERSION AS OF 42]
  • DEEP CLONE copies every data file into the destination's storage — the destination becomes fully independent.
  • SHALLOW CLONE writes only a metadata pointer; the destination reads from the source's files until you DEEP CLONE again. Files deleted on source break the shallow clone.

Behavior modifiers:

Condition | SQL change
where_clauses matches the table | Switches to CREATE TABLE dest AS SELECT * FROM src WHERE …; loses Delta history (DEEP only)
force_reclone=true + table exists | DROP TABLE dest first, then CREATE TABLE CLONE
as_of_timestamp / as_of_version set | Appended to the CLONE statement (time travel)
Table fails include_tables_regex or matches exclude_tables_regex | Skipped with a log line; reason recorded in the job summary

Within each schema, parallel_tables (default 1) controls how many tables clone simultaneously. Set to 4–8 for catalogs with many small tables; keep at 1 for catalogs dominated by large tables to avoid saturating the warehouse.
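To make the modifiers concrete, here is a rough sketch of how one table's statement list could be assembled. plan_table_statements is a hypothetical helper, not the function in src/clone_tables.py; using DROP TABLE IF EXISTS and rejecting a version and a timestamp given together are assumptions of this sketch.

```python
def plan_table_statements(src, dest, clone_type="DEEP", where_clause=None,
                          as_of_timestamp=None, as_of_version=None,
                          force_reclone=False, dest_exists=False):
    """Return the SQL statements for one table, in execution order."""
    if as_of_timestamp and as_of_version is not None:
        raise ValueError("set as_of_timestamp or as_of_version, not both")

    statements = []
    if force_reclone and dest_exists:
        # Assumption: IF EXISTS keeps the drop harmless if the table vanished.
        statements.append(f"DROP TABLE IF EXISTS {dest}")

    if where_clause:
        # Filtered copy: a CTAS, so the target carries no Delta history.
        statements.append(
            f"CREATE TABLE {dest} AS SELECT * FROM {src} WHERE {where_clause}"
        )
        return statements

    sql = f"CREATE TABLE IF NOT EXISTS {dest} {clone_type} CLONE {src}"
    if as_of_version is not None:
        sql += f" VERSION AS OF {int(as_of_version)}"
    elif as_of_timestamp:
        sql += f" TIMESTAMP AS OF '{as_of_timestamp}'"
    statements.append(sql)
    return statements
```

Returning a list rather than executing directly means the same plan can be printed in dry-run mode or handed to an executor.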

What gets cloned vs skipped

Clone-Xs only runs CREATE TABLE … CLONE source against tables whose table_type is MANAGED or EXTERNAL. Anything else is logged + counted as skipped, with a clear reason in the run output:

Source table_type | What happens | Why
MANAGED | ✓ Cloned | Standard Unity Catalog managed Delta; fully cloneable.
EXTERNAL | ✓ Cloned | External table (Delta / Parquet / Iceberg) registered in UC.
STREAMING_TABLE | △ Skipped | Owned by a Lakeflow / DLT pipeline. The table is the output of a streaming pipeline definition; cloning the data files would produce a static snapshot with no way to refresh, silently breaking the user's mental model. Recreate by running the pipeline against the new schema.
MATERIALIZED_VIEW | △ Skipped | Same shape as streaming tables: pipeline-owned, not standalone.
VIEW | (handled by clone_views.py, not here) | Logical view → re-emitted as CREATE OR REPLACE VIEW against the new schema.
Any other / unknown | △ Skipped | Defensive default: better to surface "unknown type, skipping" than fire a CLONE that produces a cryptic Databricks error. Open an issue if your environment uses a table_type that should be supported.

The skip line in the run log names the type explicitly so operators can tell at a glance what was filtered:

△ Skipping non-clonable table type STREAMING_TABLE: iot.bronze_pos_terminal
(streaming / materialized-view tables are pipeline-owned and must be
recreated by re-running their pipeline against the new schema)

If your run summary shows Tables: 0 success, 0 failed, 0 skipped but you expected tables to clone, check each table's table_type via DESCRIBE TABLE EXTENDED. The usual cause is a streaming table: releases before the v0.8.x fix ignored these silently, while v0.8.x and later surface the skip in the run output.
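The routing rules in this section amount to a small lookup. The sketch below shows one way to express them; classify_table, summarize, and the action labels are illustrative names, not identifiers from the Clone-Xs source.

```python
from collections import Counter

CLONEABLE = {"MANAGED", "EXTERNAL"}
PIPELINE_OWNED = {"STREAMING_TABLE", "MATERIALIZED_VIEW"}


def classify_table(table_type):
    """Map a Unity Catalog table_type to an (action, reason) pair."""
    t = (table_type or "").upper()
    if t in CLONEABLE:
        return "clone", ""
    if t == "VIEW":
        return "view_stage", "handled by the view stage"
    if t in PIPELINE_OWNED:
        return "skip", "pipeline-owned; re-run the pipeline against the new schema"
    # Defensive default: name the type instead of firing a CLONE that may fail.
    return "skip", f"unknown table type {t or '<empty>'}"


def summarize(tables):
    """tables: iterable of (name, table_type). Returns a Counter of actions."""
    return Counter(classify_table(table_type)[0] for _, table_type in tables)
```

Keeping the reason next to the action makes it straightforward to emit an explicit skip line per table, so a summary never reports zeros without an explanation.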

Mixed-format sources (Delta, Parquet, Iceberg)

The CLONE statement is format-agnostic. The same CREATE TABLE … DEEP CLONE source syntax works whether the source is Delta, Parquet, or Iceberg — provided the source is registered in Unity Catalog. The destination always lands as Delta, regardless of source format. This means a single Clone-Xs run can migrate a catalog that mixes formats (typical mid-migration state), and the run summary breaks the result down by source format:

Source formats:  DELTA: 26   PARQUET: 2   ICEBERG: 1
Bytes Copied: 480 GB Files Copied: 12,840

Some format-specific gotchas are inherited from Databricks CLONE. Phase B of the Iceberg work (released alongside target_format) added two safety nets, so most of them no longer fail loudly:

  • Iceberg + partition evolution — Clone-Xs auto-retries as CREATE TABLE … AS SELECT * FROM source (CTAS) when it sees this error class. The recovered target lands as Delta but starts at version 0, so the source's snapshot history is not carried over. A WARN line in the run log makes the fallback explicit.
  • Iceberg with truncated decimal partitions — same auto-CTAS recovery as above. Truncated partitions on string / long / int columns work natively on DBR 13.3+; the CTAS fallback covers older runtimes.
  • Iceberg with hidden partitioning (bucket(N, col), truncate(N, col), years(col), months(col), days(col), hours(col)) — refused at preflight, before any DDL runs. Hidden partition transforms have no Delta equivalent, and silently dropping them would break partition pruning on the target. Use the Convert table format endpoint to rewrite the source in place, then re-clone — or write a manual CTAS that materialises the transform as a Delta generated column.
  • Partitioned Parquet referenced by path — clone fails. Register the table to UC by name first.
  • Glob/wildcard paths — not supported by Databricks CLONE for any format.

See the Databricks Parquet/Iceberg CLONE reference for the canonical limitations list.

Target format — target_format: ICEBERG (UniForm)

By default a clone lands as Delta. Set target_format: ICEBERG (or pick ICEBERG in the wizard's Target Format toggle) to additionally enable Delta UniForm on the destination so external Iceberg engines (Snowflake, Trino, Athena, Iceberg-aware Spark, etc.) can read the table without a separate copy.

What it does, mechanically: after each successful Delta DEEP CLONE, Clone-Xs runs three statements in order (the order is mandatory — Databricks' IcebergCompatV2 validator rejects any other sequence):

-- 1. Disable deletion vectors. Modern DBR has them on by default; UniForm
-- can't coexist with DVs so we turn them off first.
ALTER TABLE `dst`.`schema`.`table`
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false');

-- 2. Bake any existing deletion-marker files into rewritten data files.
-- No-op (but cheap scan) if the freshly-cloned table had no DVs.
REORG TABLE `dst`.`schema`.`table` APPLY (PURGE);

-- 3. Enable UniForm. Now allowed because the table has no DVs.
ALTER TABLE `dst`.`schema`.`table` SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.enableIcebergCompatV2' = 'true',
'delta.universalFormat.enabledFormats' = 'iceberg'
);

If you skip steps 1+2, step 3 fails with DELTA_ICEBERG_COMPAT_VIOLATION.DELETION_VECTORS_SHOULD_BE_DISABLED. Earlier Clone-Xs releases emitted only step 3, which is why fresh clones from DV-enabled sources logged a UniForm warning even though the CLONE itself succeeded — fixed in v0.8.x.
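Because the order matters, it helps to generate the three statements from one place. The sketch below builds them in the sequence described above; uniform_statements, quote_name, and enable_uniform are hypothetical helper names, and execute stands in for whatever SQL executor the run uses.

```python
def quote_name(catalog, schema, table):
    """Backtick-quote a three-part name, doubling any embedded backticks."""
    return ".".join(f"`{part.replace('`', '``')}`" for part in (catalog, schema, table))


def uniform_statements(catalog, schema, table):
    """Return the UniForm statements in the order the validator requires."""
    fqn = quote_name(catalog, schema, table)
    return [
        # 1. Deletion vectors must be off before UniForm can be enabled.
        f"ALTER TABLE {fqn} SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')",
        # 2. Rewrite data files so no deletion-vector files remain.
        f"REORG TABLE {fqn} APPLY (PURGE)",
        # 3. Enable UniForm (Iceberg metadata alongside the Delta log).
        f"ALTER TABLE {fqn} SET TBLPROPERTIES ("
        "'delta.columnMapping.mode' = 'name', "
        "'delta.enableIcebergCompatV2' = 'true', "
        "'delta.universalFormat.enabledFormats' = 'iceberg')",
    ]


def enable_uniform(execute, catalog, schema, table):
    """Run the statements in order; an exception from execute stops the sequence."""
    for statement in uniform_statements(catalog, schema, table):
        execute(statement)
```

Stopping at the first failure is deliberate in this sketch: running step 3 after a failed step 1 or 2 would only reproduce the compatibility violation.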

Constraints worth knowing:

  • Delta source only. Non-Delta sources in the same job (Parquet, Iceberg) clone normally but UniForm is skipped for those tables — a WARN line is logged and the rest of the run continues.
  • Destination is still Delta. UniForm publishes Iceberg-compatible metadata alongside the Delta log; it doesn't physically rewrite to Iceberg. If you need actual Iceberg storage / file format semantics, that's the Phase B explicit-conversion path (currently scoped, not shipped).
  • One-way. Disabling UniForm later is ALTER TABLE … UNSET TBLPROPERTIES. The Delta history isn't affected.
  • Dry-run. No ALTER is emitted in dry-run mode — same discipline as the rest of the clone path.
  • Cross-workspace clones (Delta Sharing path) honour target_format: ICEBERG too; UniForm is enabled on the target after each successful DEEP CLONE through the share.

iceberg_physical: true — physical Iceberg target

UniForm makes a Delta table readable by Iceberg engines but UC still reports Data source: Delta. If you need UC itself to recognise the table as Iceberg (so it shows Data source: Iceberg and behaves like a managed Iceberg table for downstream Iceberg-native tooling), set iceberg_physical: true (or check the Physical Iceberg target box in the wizard, visible only when ICEBERG is selected).

What changes mechanically: instead of DEEP CLONE … then ALTER TBLPROPERTIES (UniForm), Clone-Xs emits

CREATE TABLE IF NOT EXISTS `dst`.`schema`.`table`
USING iceberg
AS SELECT * FROM `src`.`schema`.`table`;

Trade-offs you accept by enabling this:

Lost | Why
Delta history | Target starts at version 0; CTAS doesn't carry the source Delta log.
Time-travel arguments | TIMESTAMP AS OF / VERSION AS OF are ignored with a WARN log line. Iceberg has its own snapshot model, which is not addressable via Delta time-travel syntax.
Delta-only features on the target | Deletion vectors, change feed, row tracking: none of these can be applied to an Iceberg table.
Per-CLONE metrics | The Databricks per-statement counters Clone-Xs surfaces for DEEP CLONE aren't emitted by CTAS.

What you gain: UC sees the table as Iceberg, downstream Iceberg-native tooling (snapshot pruning, Iceberg-spec-compliant compaction, etc.) works directly without the metadata bridge.

Verify in your workspace before committing to this for a full catalog clone: Iceberg-managed-table support requires DBR 15+ and is not enabled in every region/billing tier. Try it on one table first:

CREATE TABLE my_test_iceberg USING iceberg AS SELECT * FROM source.delta_table LIMIT 1000;
DESCRIBE EXTENDED my_test_iceberg; -- confirm `Provider: iceberg`

If your workspace doesn't support this, the clone will fail with a clear CREATE TABLE … USING iceberg error. Drop back to the default UniForm path (iceberg_physical: false) — it works on every Databricks SQL warehouse.

Iceberg source preflight (Phase B)

When the source is Iceberg, Clone-Xs runs DESCRIBE TABLE EXTENDED before the CLONE statement and refuses tables that use hidden-partition transforms. The refusal is deliberate — see src/clone_iceberg.py for the full check. The error message names the offending transform and points at the workaround:

Source Iceberg table `src`.`s`.`t` uses hidden partitioning
(bucket(16, user_id)) which has no Delta equivalent. Clone-Xs refuses
this clone rather than silently change the partitioning semantics.
Workarounds:
1) Materialise the transform as a regular column on the source and re-clone, OR
2) Run a manual CTAS that replicates the transform via Delta generated columns, OR
3) Use CONVERT TO DELTA on the source (in-place; destructive) and then clone normally.

Option 3 has a dedicated endpoint and UI page — see the Convert table format guide.
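As a rough illustration of the preflight idea, the sketch below scans a partition description for the hidden transforms listed earlier. It is not the check in src/clone_iceberg.py: find_hidden_transforms is a hypothetical name, and the assumption that the partition spec is available as a single text string (for example, extracted from DESCRIBE TABLE EXTENDED output) may not match how the real check reads it.

```python
import re

# bucket(N, col) and truncate(N, col) take a width; the time transforms take only a column.
HIDDEN_TRANSFORM = re.compile(
    r"\b(?:bucket|truncate)\s*\(\s*\d+\s*,\s*\w+\s*\)"
    r"|\b(?:years|months|days|hours)\s*\(\s*\w+\s*\)",
    re.IGNORECASE,
)


def find_hidden_transforms(partition_spec):
    """Return every hidden-partition transform found in the spec text."""
    return [match.group(0) for match in HIDDEN_TRANSFORM.finditer(partition_spec)]


def preflight(table_name, partition_spec):
    """Raise before any DDL runs if the source uses hidden partitioning."""
    found = find_hidden_transforms(partition_spec)
    if found:
        raise ValueError(
            f"Source Iceberg table {table_name} uses hidden partitioning "
            f"({', '.join(found)}) which has no Delta equivalent."
        )
```

Plain identity partitions such as region or dt pass through untouched, so only the transforms named above trigger the refusal.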

Type-level differences (time, uuid, fixed(L), timestamptz) are not refusal cases — they map through CLONE with documented losses (uuid → string, fixed → binary, etc.). See ICEBERG_TYPE_NOTES in src/clone_iceberg.py for the full table.

When the source is Iceberg, the runner emits an informational log line listing the same caveats inline so operators see them in the run output (not just the docs):

INFO Iceberg source `src`.`s`.`t` — type-mapping caveats may apply: time:
no Delta equivalent — Delta has only date and timestamp, uuid: lands as
Delta string (lossy but reversible), fixed: lands as Delta binary, fixed
length is dropped, timestamptz: lands as Delta timestamp (UTC stored,
zone metadata dropped). Spot-check affected columns on the target if your
downstream consumers depend on length / zone / format-specific semantics.

This is a log, not a runtime detector: UC surfaces Iceberg-native types as their already-Sparkified equivalents (uuid already shows as STRING via DESCRIBE TABLE), so a programmatic schema scan can't reliably identify them. The log is the honest "here are the things to watch for" surface — operators can spot-check the columns they care about.

Stage 4 — Views, functions, volumes

Run after tables because views and functions reference them. For each:

Object | Read source DDL | Rewrite | Write on destination
View | SHOW CREATE TABLE <src>.<schema>.<view> | Regex-rewrite <src>. → <dest>. in qualified names (backticked + bare, case-insensitive) | CREATE OR REPLACE VIEW <dest>…
Function | DESCRIBE FUNCTION EXTENDED <src>.<schema>.<fn> → extract the DDL body, strip embedded Spark config lines | Same catalog-ref rewriter | CREATE OR REPLACE FUNCTION <dest>…
Volume | client.volumes.list() via SDK, read volume_type + storage_location | (none) | CREATE VOLUME IF NOT EXISTS (managed) or CREATE EXTERNAL VOLUME … LOCATION '<url>'

Views that reference catalogs outside the migration scope will fail to materialize and get logged as errors — they don't block the rest of the clone. The DDL rewriter is regex-based, so Python UDFs that embed catalog names as string literals in their body are not rewritten automatically.
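A regex-based catalog rewrite along these lines might look like the sketch below. It is an approximation written for illustration, not the rewriter in src/clone_views.py, and rewrite_catalog_refs is a hypothetical name. Because a regex cannot tell identifiers from string literals, a pattern like this can both miss references and rewrite text it should not, which is worth keeping in mind when a view or function fails after cloning.

```python
import re


def rewrite_catalog_refs(ddl, src, dest):
    """Replace the leading catalog of qualified names: src. or `src`. becomes dest."""
    pattern = re.compile(
        # Not preceded by a word char, dot, or backtick, so only the FIRST part
        # of a dotted name matches; the optional backtick must close again (\1).
        r"(?<![\w.`])(`?)" + re.escape(src) + r"\1\.",
        re.IGNORECASE,
    )
    # A function replacement avoids backslash handling in the dest string.
    return pattern.sub(lambda m: f"{m.group(1)}{dest}{m.group(1)}.", ddl)
```

For example, rewrite_catalog_refs("SELECT * FROM production.sales.orders", "production", "sandbox") returns "SELECT * FROM sandbox.sales.orders", while a name such as myproduction.t is left alone.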

Stage 5 — Metadata replay

Per object (catalog, schema, table, view, function, volume):

  • Grants: SHOW GRANTS ON <object> → GRANT <privilege> ON <dest-object> TO <principal>. Principals that don't exist on the destination metastore are skipped with a debug log; the count lands in grants_skipped.
  • Ownership: SDK tables.update / schemas.update for UC-managed ownership, or SQL ALTER … OWNER TO <principal>.
  • Tags: read from system.information_schema.table_tags (when available) → ALTER TABLE <dest> SET TAGS ('k' = 'v').

Metadata replay is best-effort — a single failing GRANT never aborts the clone.
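The best-effort behaviour can be sketched as a loop that counts outcomes instead of raising. replay_grants, the known_principals argument, and the grants_applied / grants_failed counters are hypothetical; only grants_skipped is named in the text above.

```python
def replay_grants(rows, dest_object, execute, known_principals=None):
    """Replay (principal, privilege) pairs onto dest_object without ever aborting.

    rows: pairs read from SHOW GRANTS on the source object.
    dest_object: e.g. "TABLE sandbox.sales.orders".
    known_principals: optional set of principals present on the destination.
    """
    stats = {"grants_applied": 0, "grants_skipped": 0, "grants_failed": 0}
    for principal, privilege in rows:
        if known_principals is not None and principal not in known_principals:
            stats["grants_skipped"] += 1  # principal missing on destination
            continue
        try:
            execute(f"GRANT {privilege} ON {dest_object} TO `{principal}`")
            stats["grants_applied"] += 1
        except Exception:
            stats["grants_failed"] += 1  # record and continue with the next grant
    return stats
```

Returning the counters lets the caller fold them into the run summary, so skipped and failed grants stay visible even though they never stop the clone.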

Serverless execution

Source: src/serverless.py

When to use: one-off clones, CI/CD pipelines that spin up + tear down test catalogs per PR, and scheduled jobs where keeping a SQL warehouse hot between runs would be wasteful. Skip for clones that complete in under a minute — the cold-start cost outweighs the savings.

When serverless: true and volume: /Volumes/… are set, the clone doesn't run in the local process. Instead Clone-Xs:

  1. Packages itself as a wheel (dist/clone_xs-*.whl) and client.files.upload()s it to the provided UC volume.
  2. Generates a 3-cell notebook at /Shared/.clxs/run_clone via client.workspace.import_():
    • Cell 1: %pip install /Volumes/.../clone_xs-*.whl --quiet
    • Cell 2: dbutils.library.restartPython()
    • Cell 3: wires spark.sql() as the SQL executor via set_sql_executor(spark_sql_executor), then invokes clone_full_catalog(config) with the clone config passed in as a notebook widget (JSON-encoded).
  3. Submits the notebook as client.jobs.submit(SubmitTask(notebook_task=NotebookTask(…))) on serverless compute (no cluster config). The SDK's run.result() polls until the notebook exits.
  4. Every CREATE TABLE CLONE that would normally go through a SQL warehouse now runs as spark.sql(…) on the serverless job's executor — the warehouse isn't used at all. Progress + logs stream back via the job's stderr and surface in the Clone-Xs UI log panel.

Cold start costs roughly 20-40s to pull the wheel and restart the Python runtime, which is why serverless mode is not worth it for clones that take less than a minute.


Deep vs shallow clone

Docs: Deep & Shallow Clone | CREATE TABLE CLONE

When to use:

  • Deep clone: You need a fully independent copy of the data (e.g., for a QA environment that runs destructive tests).
  • Shallow clone: You need a fast, low-cost copy that references the source data (e.g., a dev environment for running read-only queries).

Real-world scenario: Your QA team needs an isolated copy of production to run integration tests that may INSERT, UPDATE, or DELETE rows. Meanwhile, data scientists need a quick dev copy to explore data without modifying it.

# Deep clone for QA (full data copy — takes longer, uses storage)
clxs clone --source production --dest qa_env --clone-type DEEP

# Shallow clone for dev (fast, near-zero storage cost)
clxs clone --source production --dest dev_env --clone-type SHALLOW

Criterion | Deep Clone | Shallow Clone
Data independence | Fully independent | References source files
Storage cost | 2x (duplicates data) | Near zero
Clone speed | Slow (copies data) | Fast (metadata only)
Write operations on clone | Safe | Do not modify the source, but the clone breaks if the source files it references are deleted
Use case | QA, staging, DR | Dev, exploration, demos

Full vs incremental vs selective load

Docs: Delta Clone

When to use:

  • Full: First-time clone or when you want a complete refresh.
  • Incremental: Subsequent runs where you only want to add new objects that don't exist in the destination yet.
  • Selective: Re-clone only tables whose source state has drifted from target — leaves in-sync tables alone. Runtime is proportional to drift size, not catalog size.

Real-world scenario: You do a full clone every Sunday night. On weekdays, you run incremental loads to pick up new tables added during the week — without re-cloning existing tables. Mid-week, an upstream batch job rewrites three fact tables; you run a selective re-clone instead of a full refresh, which touches only those three tables.

# Sunday: full refresh
clxs clone --source production --dest staging --load-type FULL

# Mon-Sat: only clone new objects
clxs clone --source production --dest staging --load-type INCREMENTAL

# Mid-week drift fix: re-clone only tables whose source diverged from target
clxs clone --source production --dest staging --load-type SELECTIVE

source_catalog: "production"
destination_catalog: "staging"
clone_type: "DEEP"
load_type: "INCREMENTAL" # Only add new tables/views/functions
sql_warehouse_id: "abc123"

Selective re-clone (load_type: SELECTIVE)

Selective re-clone is a third mode (alongside FULL and INCREMENTAL) for keeping a previously-cloned catalog fresh without re-transferring static data. On every run, Clone-Xs:

  1. Lists tables on both source and target via the Catalog SDK.
  2. For each common table, compares the current Delta version on source vs target via DESCRIBE HISTORY.
  3. Builds a "drift list" of tables to re-clone:
    • never_cloned — present on source, missing from target. Cloned in.
    • version_drift — source.version > target.version. Re-cloned with force_reclone=true (DROP target, then CREATE TABLE … DEEP CLONE).
    • unable_to_compare — DESCRIBE HISTORY returned nothing on either side (non-Delta source like Parquet/Iceberg, or transient SDK errors). Treated as drifted to be safe — cheaper than missing real drift.
  4. Runs the existing per-table CLONE machinery (so all metrics capture, TBLPROPERTIES overrides, mask handling, ownership/tags/permissions replay still apply) on the drift list only.
  5. Schemas with zero drift log a one-line "in sync" entry and contribute nothing to the run summary.
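The drift classification in step 3 can be sketched as a comparison of two version maps. classify_drift is a hypothetical function; it assumes the latest Delta version of each table has already been read (None where DESCRIBE HISTORY returned nothing). Treating a target that is ahead of the source as in sync is an assumption of this sketch, since the text only covers the source-ahead case.

```python
def classify_drift(source_versions, target_versions):
    """Return {table: reason} for every table that should be re-cloned.

    Both arguments map table name -> latest Delta version, or None if unknown.
    A table absent from target_versions does not exist on the target.
    """
    drift = {}
    for table, src_version in source_versions.items():
        if table not in target_versions:
            drift[table] = "never_cloned"
            continue
        tgt_version = target_versions[table]
        if src_version is None or tgt_version is None:
            # Could not compare: treat as drifted rather than miss real drift.
            drift[table] = "unable_to_compare"
        elif src_version > tgt_version:
            drift[table] = "version_drift"
        # Equal versions (or target ahead, by assumption) count as in sync.
    return drift
```

Tables that exist only on the target never appear in the result, which matches the additive-only behaviour of selective mode.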

What's NOT touched:

  • Tables on target but not on source — selective is additive only, never destructive. Use a separate compare/cleanup if you need to drop orphans on target.
  • Tables where source.version == target.version — assumed in sync, skipped.
  • Views, functions, volumes — these aren't versioned the same way. Selective only re-clones tables; combine with a separate FULL or INCREMENTAL run if non-table objects need refreshing.

Trade-offs vs INCREMENTAL:

  • INCREMENTAL skips every table that already exists on target, whether or not it has drifted.
  • SELECTIVE skips only tables whose Delta version matches the target; tables that exist but have drifted are re-cloned.

So if you ran INCREMENTAL daily, you'd never catch updates to existing tables; if you run SELECTIVE, you do — but at the cost of issuing two extra DESCRIBE HISTORY queries per source table.


Time travel

Docs: Delta Time Travel | Query table history

When to use: Clone tables as they were at a specific point in time. Useful for recovering data, auditing, or creating point-in-time snapshots.

Real-world scenario: A data pipeline had a bug on March 5th that corrupted the orders table. You want to clone the catalog as it was on March 4th (before the bug) to create a clean recovery copy.

# Clone from a specific timestamp
clxs clone \
--source production --dest recovery \
--as-of-timestamp "2026-03-04T23:59:59"

# Clone from a specific Delta version
clxs clone \
--source production --dest recovery_v42 \
--as-of-version 42

The tool appends TIMESTAMP AS OF '...' or VERSION AS OF N to every CREATE TABLE ... CLONE statement, leveraging Delta Lake's built-in time travel.


Schema filtering

Docs: Information Schema

When to use: You only need to clone specific schemas, not the entire catalog. Or you need to exclude certain schemas from cloning.

Real-world scenario: Your production catalog has 50 schemas, but you only need sales and marketing in the dev environment. Or you want to exclude staging_temp and backfill_scratch from cloning.

# Only clone specific schemas
clxs clone --include-schemas sales marketing analytics

# Include / exclude schemas via config
# Only clone these schemas (if set, overrides exclude)
include_schemas:
- "sales"
- "marketing"
- "analytics"

# Always exclude these
exclude_schemas:
- "information_schema"
- "default"
- "staging_temp"
- "backfill_scratch"

Regex table filtering

Docs: Information Schema TABLES

When to use: You need fine-grained control over which tables to clone — for example, only fact and dimension tables, or excluding temporary and backup tables.

Real-world scenario: Your analytics schema contains 200 tables, but you only need the star schema tables (prefixed with fact_ and dim_) in the reporting environment.

# Only clone fact and dimension tables
clxs clone --include-tables-regex "^fact_|^dim_"

# Exclude temp and backup tables
clxs clone --exclude-tables-regex "_tmp$|_backup$|_old$"

# Combine both
clxs clone \
--include-tables-regex "^fact_|^dim_" \
--exclude-tables-regex "_v1$"
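How the two patterns combine can be sketched in a few lines. should_clone_table is a hypothetical helper, and the use of unanchored re.search semantics is an assumption (it is consistent with the examples above relying on explicit ^ and $ anchors).

```python
import re


def should_clone_table(name, include_regex=None, exclude_regex=None):
    """A table is cloned if it passes the include pattern and misses the exclude pattern."""
    if include_regex and not re.search(include_regex, name):
        return False  # an include pattern is set and the name does not match it
    if exclude_regex and re.search(exclude_regex, name):
        return False  # explicitly excluded
    return True
```

With the combined example above, fact_sales is cloned, fact_sales_v1 is excluded by the _v1$ pattern, and stg_orders is dropped because it matches neither ^fact_ nor ^dim_.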

Scope Picker — partial-catalog clones

Docs: include_objects config reference | POST /api/clone schema

When to use: You don't want the whole catalog. You want a specific set of schemas, a handful of tables from one schema, a view and two functions — or any combination across several schemas. Typing regex for that gets painful fast.

Real-world scenario: Your engineering team wants to clone prod to a dev workspace, but only the three schemas their service owns — plus one shared reference table from the warehouse schema. You open the Clone page, pick prod as source, flip the Scope toggle from "Entire catalog" to "Select schemas + objects", expand the four schemas and check what you need. Clone-Xs translates the selection into include_schemas + an anchored table regex and submits.

UI:

Step 1 on the Clone page has a Scope section with two buttons:

Option | Behavior
Entire catalog | Default: clone every schema except those in exclude_schemas (existing behavior)
Select schemas + objects | Lazy-load schemas from the source catalog. Expand any schema to see its tables, views, functions, and volumes with individual checkboxes. A per-schema "all / none" shortcut sits next to the schema name.

Running totals (schemas / tables / views / functions / volumes) display above the tree so you can verify the scope at a glance. The Next: Options button stays disabled until at least one object is checked.

API usage:

The selection travels as include_objects on the existing POST /api/clone body:

{
"source_catalog": "prod",
"destination_catalog": "prod_dev",
"include_objects": [
{ "schema": "orders", "name": "line_items", "type": "table" },
{ "schema": "orders", "name": "customers", "type": "table" },
{ "schema": "marketing", "name": "v_campaigns", "type": "view" },
{ "schema": "analytics", "name": "calc_discount", "type": "function" }
]
}

The router translates the list into:

  • include_schemas = ["analytics", "marketing", "orders"]
  • include_tables_regex = "^(calc_discount|customers|line_items|v_campaigns)$"

Both orchestrators (same-workspace and cross-workspace) honor those filters, so no additional config is needed.
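The translation can be reproduced with a short function. This is a sketch of the mapping described above rather than the router's code; translate_include_objects is a hypothetical name, and escaping each object name with re.escape is an assumption.

```python
import re


def translate_include_objects(include_objects):
    """Turn an include_objects list into (include_schemas, include_tables_regex)."""
    schemas = sorted({obj["schema"] for obj in include_objects})
    names = sorted({obj["name"] for obj in include_objects})
    if not names:
        return schemas, None
    # Anchor the alternation so "customers" does not also match "customers_old".
    regex = "^(" + "|".join(re.escape(name) for name in names) + ")$"
    return schemas, regex


selection = [
    {"schema": "orders", "name": "line_items", "type": "table"},
    {"schema": "orders", "name": "customers", "type": "table"},
    {"schema": "marketing", "name": "v_campaigns", "type": "view"},
    {"schema": "analytics", "name": "calc_discount", "type": "function"},
]
include_schemas, include_tables_regex = translate_include_objects(selection)
```

For the request body shown above, this yields the same include_schemas list and anchored regex. Note that the regex is applied by name only, so an object with the same name in another selected schema would also match.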

Volume selection

Volumes are enumerated per-schema and don't honor the table regex today. If you pick a specific volume from a schema, the whole schema's volumes will come along. Selecting nothing from a schema excludes volumes correctly.

Composes with other filters

include_objects composes with include_schemas, exclude_schemas, and the include/exclude regex fields — whichever filter is more restrictive wins. Safe to use alongside an existing YAML config.


Tag-based filtering

Docs: Unity Catalog Tags

When to use: Your organization uses Unity Catalog tags to classify schemas and tables. You want to clone only objects tagged with specific metadata.

Real-world scenario: Only schemas tagged pii_level: none should be cloned to the sandbox environment — schemas with PII data should be excluded automatically.

filter_by_tags:
  pii_level: "none"
  environment: "shareable"

This will only clone schemas that have both tags matching.
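The "both tags must match" rule is an all-of check. A minimal sketch, where schema_matches_tags is a hypothetical name and the schema's tags are assumed to be available as a plain dict:

```python
def schema_matches_tags(schema_tags, filter_by_tags):
    """True only if every filter key is present on the schema with the same value."""
    return all(schema_tags.get(key) == value for key, value in filter_by_tags.items())


wanted = {"pii_level": "none", "environment": "shareable"}
candidates = {
    "sales": {"pii_level": "none", "environment": "shareable", "owner": "bi"},
    "hr": {"pii_level": "high", "environment": "shareable"},
    "scratch": {"pii_level": "none"},
}
selected = [name for name, tags in candidates.items() if schema_matches_tags(tags, wanted)]
```

Here only sales is selected: hr fails on pii_level, and scratch lacks the environment tag entirely. Extra tags on a schema do not affect the match.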


Parallel processing

Docs: SQL Statement Execution API

When to use: You have a large catalog and want to reduce total clone time by processing multiple schemas and tables concurrently.

Real-world scenario: Your warehouse catalog has 30 schemas and 2,000 tables. Sequential cloning takes 4 hours. With 8 parallel schema workers and 4 parallel table workers per schema, it completes in under 1 hour.

# 8 schemas in parallel, 4 tables in parallel within each schema
clxs clone --max-workers 8 --parallel-tables 4

max_workers: 8        # Parallel schema processing
parallel_tables: 4 # Parallel table cloning within each schema

Sizing guidance

Catalog Size | max_workers | parallel_tables
Small (< 10 schemas, < 100 tables) | 2-4 | 1
Medium (10-50 schemas, 100-1000 tables) | 4-8 | 2-4
Large (50+ schemas, 1000+ tables) | 8-16 | 4-8

Monitor your warehouse's query queue — if queries start queuing, reduce parallelism.


Table size ordering

Docs: DESCRIBE DETAIL

Real-world scenario:

  • Smallest first: Clone small reference tables first so downstream views and reports can start working sooner while large fact tables are still cloning.
  • Largest first: Start the biggest tables first to maximize wall-clock parallelism; small tables fill in the gaps.

# Clone smallest tables first
clxs clone --order-by-size asc

# Clone largest tables first (better for total time with parallel workers)
clxs clone --order-by-size desc
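Why largest-first tends to finish sooner with parallel workers can be seen in a toy simulation. This is not Clone-Xs code: makespan is a hypothetical helper, and it assumes clone time is proportional to table size and that each table goes to whichever worker frees up first.

```python
import heapq


def makespan(sizes, workers):
    """Total wall-clock time when tables are handed out in the given order."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for size in sizes:
        earliest = heapq.heappop(finish_times)  # the worker that frees up first
        heapq.heappush(finish_times, earliest + size)
    return max(finish_times)


sizes = [100] + [10] * 10  # one large table and ten small ones
ascending = makespan(sorted(sizes), workers=2)                 # 150
descending = makespan(sorted(sizes, reverse=True), workers=2)  # 100
```

With ascending order the large table starts last and runs alone at the end; with descending order it runs alongside the small ones. Smallest-first is still the better choice when getting reference tables available early matters more than total runtime.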

Rate limiting

Docs: SQL Statement Execution API rate limits

When to use: You're cloning during business hours or sharing a SQL warehouse with other teams, and you don't want the clone job to monopolize the warehouse.

Real-world scenario: Your shared serverless warehouse has a concurrency limit. By capping the clone at 5 SQL requests per second, other team members' queries continue to run smoothly.

clxs clone --max-rps 5

max_rps: 5   # Max 5 SQL statements per second (0 = unlimited)
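One simple way to enforce a requests-per-second cap is to space calls at least 1/max_rps seconds apart. The sketch below illustrates that idea; RateLimiter is a hypothetical class, not the limiter Clone-Xs ships, and the injectable clock and sleep arguments exist only to make it testable.

```python
import threading
import time


class RateLimiter:
    """Spaces calls at least 1/max_rps seconds apart; max_rps=0 disables limiting."""

    def __init__(self, max_rps, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_rps if max_rps > 0 else 0.0
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0
        self.lock = threading.Lock()

    def wait(self):
        """Block until the next statement may be sent."""
        if self.interval == 0.0:
            return
        with self.lock:  # serialise callers coming from parallel workers
            now = self.clock()
            if now < self.next_allowed:
                self.sleep(self.next_allowed - now)
                now = self.next_allowed
            self.next_allowed = now + self.interval
```

Each worker would call wait() immediately before submitting a statement. Holding the lock while sleeping is intentional here: it keeps parallel workers from bursting past the cap.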

Dry run

Docs: SQL Statement Execution API

When to use: Before running a clone against a production environment, preview every SQL statement that would be executed — without actually running any writes.

Real-world scenario: You're setting up a new clone config and want to verify it will clone the right schemas and tables before executing against the production warehouse.

# Preview all operations
clxs clone --dry-run -v

# Output shows:
# [DRY RUN] Would execute: CREATE CATALOG IF NOT EXISTS `staging`
# [DRY RUN] Would execute: CREATE SCHEMA IF NOT EXISTS `staging`.`sales`
# [DRY RUN] Would execute: CREATE TABLE IF NOT EXISTS ... DEEP CLONE ...

All read operations (listing schemas, tables) still execute so you get an accurate preview. Only write operations are skipped.
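The read/write split can be sketched as a thin wrapper around the SQL executor. make_executor and the prefix list are hypothetical; the real tool may classify statements differently.

```python
READ_PREFIXES = ("SELECT", "SHOW", "DESCRIBE")


def make_executor(run_sql, dry_run=False, log=print):
    """Wrap run_sql so that, in dry-run mode, only read statements are executed."""
    def execute(sql):
        is_read = sql.lstrip().upper().startswith(READ_PREFIXES)
        if dry_run and not is_read:
            log(f"[DRY RUN] Would execute: {sql}")
            return None
        return run_sql(sql)
    return execute
```

Because listing statements still run, the preview reflects the real set of schemas and tables; only statements that would change the destination are replaced by a log line.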


Pre-clone source quiesce

When to use: The source catalog has live writers (ingestion jobs, ad-hoc analyst writes) and you need a clone that's content-consistent — not a mix of "table A at 09:30, table B at 09:31, table A again at 09:32 because someone INSERTED mid-clone."

Real-world scenario: You're cloning production to production_dr for a DR drill. Production has a half-dozen Spark jobs that INSERT into bronze tables continuously. Without quiesce, the clone could produce a target where bronze.events has 1.2M rows but bronze.users only has the rows that existed before the Spark job for users started its commit during the clone — silent partial-time-travel divergence between tables.

clxs clone --source production --dest production_dr --quiesce-source

quiesce_source: true   # OFF by default

POST /api/clone
{ "source_catalog": "production", "destination_catalog": "production_dr", "quiesce_source": true }

How it works

When quiesce_source: true, Clone-Xs:

  1. Snapshots grants on each source schema via client.grants.get(SecurableType.SCHEMA, …). Captures principal → privileges for every grant that touches writes: MODIFY, WRITE_VOLUME, CREATE_TABLE, CREATE_VOLUME, CREATE_FUNCTION, CREATE_MATERIALIZED_VIEW, CREATE_MODEL, APPLY_TAG.
  2. Revokes those write privileges via PermissionsChange(remove=[…], principal=p). Concurrent INSERT / UPDATE / DELETE / MERGE / new-object creation now fail with PERMISSION_DENIED until the clone completes.
  3. Runs the clone under the now-read-only source.
  4. Restores grants in a finally block — runs whether the clone succeeded, partially failed, or was aborted by a runtime budget. Idempotent on retry (re-granting an already-held privilege is a no-op).
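The snapshot / revoke / restore flow can be modelled with an in-memory grants table and a context manager. This sketch does not call the Databricks SDK: quiesced is a hypothetical name, and the nested dict stands in for whatever grants.get and grants.update would read and write.

```python
from contextlib import contextmanager

WRITE_PRIVILEGES = {
    "MODIFY", "WRITE_VOLUME", "CREATE_TABLE", "CREATE_VOLUME",
    "CREATE_FUNCTION", "CREATE_MATERIALIZED_VIEW", "CREATE_MODEL", "APPLY_TAG",
}


@contextmanager
def quiesced(grants):
    """grants: {schema: {principal: set(privileges)}}, modified in place."""
    snapshot = {}
    for schema, by_principal in grants.items():
        for principal, privileges in by_principal.items():
            revoked = privileges & WRITE_PRIVILEGES
            if revoked:
                snapshot[(schema, principal)] = set(revoked)
                privileges -= revoked  # revoke write privileges only
    try:
        yield snapshot  # the clone runs here, against a read-only source
    finally:
        # Runs on success, failure, or abort; re-adding a held privilege is a no-op.
        for (schema, principal), revoked in snapshot.items():
            grants[schema].setdefault(principal, set()).update(revoked)
```

Used as `with quiesced(grants): run_clone()`, the restore step executes even if run_clone raises, which is the property the finally block is there to guarantee. Read privileges such as SELECT are never part of the snapshot, so they are untouched throughout.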

What stays writable

  • SELECT, USE_SCHEMA, READ_VOLUME, EXECUTE — never touched. The clone itself reads source via these privileges.
  • Any principal Databricks marks as the schema owner retains its inherent privileges (UC owner can always write regardless of grants). If you have ingestion service principals that you cannot afford to block, set them as schema/table owners on the source — they'll keep writing during the clone.
  • Schema-level privileges in scope are revoked; table-level grants outside the schema's grant graph are not touched.

Failure semantics

  • Per-principal revoke fails (e.g. principal deleted between grants.get and grants.update) → logged, that principal is NOT added to the snapshot, so restore won't try to re-grant. Other principals on the same schema are unaffected.
  • grants.get fails for a schema (auth issue, schema deleted) → logged, that schema is left writable. Better partial quiesce than abort.
  • Per-principal restore fails (e.g. principal deleted between revoke and restore) → logged, restore continues with the next principal. The finally block must always complete or admins lose track of revoked grants.

If a principal can't be restored, the warning log line tells you exactly which schema and which privileges need manual re-granting:

WARNING Restore: could not re-grant ['MODIFY'] to deleted-user@example.com on prod.bronze: PRINCIPAL_NOT_FOUND. Manual intervention may be needed.

Trade-offs

  • Cost: 2 additional grants.get + 2 × grants.update calls per source schema. Negligible compared to clone runtime.
  • Risk: a quiesced source rejects writes. If your audit / monitoring pages on PERMISSION_DENIED, expect noisy alerts during a clone. Consider running the quiesce only inside an explicit clone window.
  • Cross-workspace: works the same way as same-workspace — quiesce runs on the source workspace's client. Cross-workspace clones are typically longer-running (Delta Sharing + DEEP CLONE across regions), so they benefit most.

Permissions and ownership

Docs: Manage privileges | Object ownership

Real-world scenario: Your production catalog has fine-grained grants: the analysts group can SELECT from sales but not hr. When you clone to staging, those same grants should be applied so staging mirrors production's access model.

# Clone with all permissions and ownership
clxs clone --source production --dest staging

# Skip permissions (useful for dev environments with different access model)
clxs clone --source production --dest dev --no-permissions --no-ownership

copy_permissions: true   # Replicate GRANT statements
copy_ownership: true     # Transfer object ownership

What gets copied

  • Catalog-level grants
  • Schema-level grants
  • Table, view, volume, and function-level grants
  • Object ownership (catalog, schema, table, etc.)

Tags and properties

Docs: Tags | TBLPROPERTIES

Real-world scenario: Tables in production are tagged with data_classification: confidential and have TBLPROPERTIES like delta.autoOptimize.optimizeWrite = true. You need these replicated to staging for accurate testing.

# Clone with tags and properties
clxs clone

# Skip tags and properties (faster clone)
clxs clone --no-tags --no-properties

copy_tags: true         # Catalog, schema, table, column-level tags
copy_properties: true   # TBLPROPERTIES (excludes internal Delta properties)

Security policies

Docs: Row filters & column masks

Real-world scenario: The customers table has a row filter that restricts users to seeing only their region's data, and the ssn column has a masking function applied. These policies must be cloned to staging so QA tests reflect the same security model.

# Clone with security policies
clxs clone

# Skip security (useful when destination uses different policies)
clxs clone --no-security

copy_security: true   # Row filters and column masks

Constraints and comments

Docs: CHECK constraints | COMMENT ON

Real-world scenario: Your orders table has a CHECK constraint amount > 0 and column comments documenting each field. These should be preserved in the cloned copy for developer reference.

# Clone with constraints and comments
clxs clone

# Skip them
clxs clone --no-constraints --no-comments

copy_constraints: true   # CHECK constraints
copy_comments: true      # Table and column-level comments

Data masking

Docs: Column masks

When to use: You're cloning production data to a dev/test environment and need to mask sensitive columns (PII, financial data) so developers can work with realistic but safe data.

Real-world scenario: Your customers table has email, phone, and ssn columns. You want to clone the data but mask these fields so the dev environment doesn't contain real PII.

masking_rules:
  # Mask email addresses: john@company.com -> j***@company.com
  - column: "email"
    strategy: "email_mask"
    match_type: "exact"

  # Redact SSN and phone across all tables
  - column: "ssn|phone|social_security"
    strategy: "redact"
    match_type: "regex"

  # Hash credit card numbers (consistent hash for join integrity)
  - column: "credit_card_number"
    strategy: "hash"
    match_type: "exact"

  # Null out date of birth
  - column: "date_of_birth"
    strategy: "null"
    match_type: "exact"

  # Partial mask: show last 4 chars
  - column: "account_number"
    strategy: "partial"
    match_type: "exact"

Available strategies

| Strategy | Example input | Example output |
| --- | --- | --- |
| hash | john@example.com | a1b2c3d4e5f6... (MD5) |
| redact | 555-123-4567 | [REDACTED] |
| null | 1990-01-15 | NULL |
| email_mask | john.doe@company.com | j***@company.com |
| partial | ACCT-12345678 | ***5678 |
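To make the strategy table concrete, here is a small Python sketch that reproduces the example outputs on plain strings. It is an illustration only: `mask_value` is a hypothetical helper, and Clone-Xs applies masking in SQL on the destination tables rather than row by row in Python.

```python
import hashlib


def mask_value(value, strategy):
    """Apply one masking strategy to a single string value (illustrative only)."""
    if value is None or strategy == "null":
        return None
    if strategy == "hash":
        # Deterministic, so the same input always maps to the same token (joins still work).
        return hashlib.md5(value.encode("utf-8")).hexdigest()
    if strategy == "redact":
        return "[REDACTED]"
    if strategy == "email_mask":
        local, sep, domain = value.partition("@")
        # Keep the first character of the local part and the full domain.
        return f"{local[:1]}***{sep}{domain}" if sep else "***"
    if strategy == "partial":
        return "***" + value[-4:]   # show the last 4 characters
    raise ValueError(f"unknown masking strategy: {strategy}")


masked_email = mask_value("john.doe@company.com", "email_mask")
masked_account = mask_value("ACCT-12345678", "partial")
```

The `hash` branch returns a 32-character hex digest; because it is deterministic, two tables masked with the same rule can still be joined on the hashed column.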

Pre/post hooks

Docs: OPTIMIZE | ANALYZE TABLE

When to use: You need to run custom SQL before or after the clone — health checks, OPTIMIZE, ANALYZE, cache warming, or cleanup.

Real-world scenario: After cloning large fact tables, you want to run OPTIMIZE to compact small files, and ANALYZE TABLE to update statistics for the query optimizer.

# Run before cloning starts
pre_clone_hooks:
  - sql: "SELECT COUNT(*) FROM ${source_catalog}.sales.orders"
    description: "Verify source table is accessible"
    on_error: "fail"   # fail | warn | ignore

# Run after all schemas are done
post_clone_hooks:
  - sql: "OPTIMIZE ${dest_catalog}.sales.orders"
    description: "Compact files in orders table"
    on_error: "warn"
  - sql: "ANALYZE TABLE ${dest_catalog}.sales.orders COMPUTE STATISTICS"
    description: "Update table statistics"
    on_error: "ignore"

# Run after each schema completes
# (ANALYZE TABLE does not accept a wildcard; ANALYZE TABLES IN covers a whole schema)
post_schema_hooks:
  - sql: "ANALYZE TABLES IN ${dest_catalog}.${schema} COMPUTE STATISTICS"
    description: "Compute per-schema stats"
    on_error: "ignore"

Variables available

| Variable | Replaced with |
| --- | --- |
| ${source_catalog} | Source catalog name |
| ${dest_catalog} | Destination catalog name |
| ${schema} | Current schema name (schema hooks only) |

Error handling

| on_error | Behavior |
| --- | --- |
| fail | Stop the entire clone operation |
| warn | Log a warning and continue |
| ignore | Silently continue |
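A hook runner that combines variable substitution with the three `on_error` modes might look like the sketch below. The function names and the `execute` callable are hypothetical; the only thing taken from the docs is the `${...}` variable syntax and the fail / warn / ignore behaviour. Python's `string.Template` happens to use the same `${name}` syntax, which keeps the substitution step short.

```python
import logging
from string import Template

log = logging.getLogger("hooks_sketch")


def render_hook_sql(sql, source_catalog, dest_catalog, schema=None):
    """Substitute ${source_catalog}, ${dest_catalog} and (schema hooks only) ${schema}."""
    variables = {"source_catalog": source_catalog, "dest_catalog": dest_catalog}
    if schema is not None:
        variables["schema"] = schema
    # substitute() raises KeyError if ${schema} appears in a non-schema hook.
    return Template(sql).substitute(variables)


def run_hook(hook, execute, **variables):
    """Run one hook; return True on success, False if a failure was tolerated."""
    sql = render_hook_sql(hook["sql"], **variables)
    try:
        execute(sql)
        return True
    except Exception as exc:
        mode = hook.get("on_error", "fail")
        if mode == "fail":
            raise                      # stop the entire clone operation
        if mode == "warn":
            log.warning("Hook %r failed: %s", hook.get("description", sql), exc)
        return False                   # "warn" and "ignore" both continue


rendered = render_hook_sql(
    "OPTIMIZE ${dest_catalog}.sales.orders",
    source_catalog="production",
    dest_catalog="staging",
)
```

One caveat of this approach: `Template.substitute` rejects a bare `$` followed by a non-identifier character, so SQL containing literal dollar signs would need `$$` escaping in this sketch.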

Managed location

If your workspace uses Default Storage, you may need to specify a storage location when creating the destination catalog:

clxs clone \
--source production --dest staging \
--location "abfss://catalog@storage.dfs.core.windows.net/staging"

Cross-workspace & cross-cloud migration

Docs: Delta Sharing | CREATE TABLE CLONE | Unity Catalog sharing identifier

Clone-Xs can migrate a full catalog across Databricks workspaces — including across clouds (AWS ↔ Azure ↔ GCP) — using Delta Sharing as the wire protocol and DEEP CLONE to physically land data in the target's storage. Unlike the same-workspace clone, the target is truly independent after migration: the share is torn down and the destination catalog lives entirely in the target cloud.

When to use:

  • DR replica — keep a hot standby catalog in a different region or cloud
  • Cross-cloud migration — move production from one cloud to another
  • Workspace consolidation — pull catalogs from several source workspaces into one
  • Compliance isolation — materialize a sanitized copy in a locked-down workspace

Real-world scenario: Your production catalog retail_prod lives in an AWS workspace, but a new compliance requirement forces you to run disaster recovery in Azure. The two workspaces are on different metastores. You pick the source catalog in Clone-Xs, enable "Clone to a different workspace," enter the Azure workspace URL + a PAT + a warehouse ID, and click run. Clone-Xs creates a Delta Share on AWS, points a recipient at the Azure metastore's global sharing id, provisions the share on Azure, then DEEP CLONEs every table into Azure-managed storage. Views, SQL functions, volumes, grants, tags, and ownership replay automatically.

How it works

The backend orchestrator (src/clone_cross_workspace.py) runs these steps:

  1. Introspect source — list schemas, tables, views, functions, volumes via the source WorkspaceClient.
  2. Provision share on source: CREATE SHARE, ALTER SHARE ADD TABLE for every table, CREATE RECIPIENT USING ID '<target_metastore_sharing_id>', GRANT SELECT ON SHARE.
  3. Consume share on target — poll target for the source-metastore provider, then CREATE CATALOG … USING SHARE <provider>.<share>.
  4. Materialize data — for each table: CREATE TABLE <dest>.<schema>.<table> DEEP CLONE <shared>.<schema>.<table>. Data physically lands in target cloud storage.
  5. Replay metadata — views + SQL functions (DDL replay with catalog-reference rewrite), volumes + files (Databricks Files API), grants + tags + ownership (best-effort).
  6. Teardown — drop the shared catalog on target and the share + recipient on source, unless keep_share: true.
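Steps 2 to 4 boil down to a short, ordered list of SQL statements per side. The sketch below generates that list for inspection; it does not execute anything. The share, recipient, provider, and shared-catalog names are placeholders chosen for the example (the real orchestrator derives its own deterministic names), and creation of the destination catalog and schemas is omitted.

```python
def share_plan(source_catalog, dest_catalog, tables, target_sharing_id,
               share="example_share", recipient="example_recipient",
               provider="example_provider", shared_catalog="example_shared"):
    """Return the SQL for steps 2-4, split by the workspace it runs on.

    tables is a list of (schema, table) pairs. All object names passed as
    keyword defaults are hypothetical placeholders.
    """
    source_sql = [f"CREATE SHARE IF NOT EXISTS {share}"]
    source_sql += [
        f"ALTER SHARE {share} ADD TABLE {source_catalog}.{schema}.{table}"
        for schema, table in tables
    ]
    source_sql += [
        f"CREATE RECIPIENT IF NOT EXISTS {recipient} USING ID '{target_sharing_id}'",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]

    # The target mounts the share as a catalog, then materializes each table.
    target_sql = [f"CREATE CATALOG IF NOT EXISTS {shared_catalog} USING SHARE {provider}.{share}"]
    target_sql += [
        f"CREATE TABLE {dest_catalog}.{schema}.{table} "
        f"DEEP CLONE {shared_catalog}.{schema}.{table}"
        for schema, table in tables
    ]
    return {"source": source_sql, "target": target_sql}


plan = share_plan(
    "retail_prod", "retail_prod_dr",
    [("sales", "orders"), ("sales", "customers")],
    target_sharing_id="azure:eastus:uuid",
)
```

Printing `plan["source"]` followed by `plan["target"]` gives a dry-run view of what a migration would issue, which is handy when checking privileges ahead of time.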

UI walkthrough

Configure target workspaces once in Settings. Open /settings → Target Workspaces → + Add target and fill in:

| Field | Purpose |
| --- | --- |
| Name | Slug used to reference this connection from /clone (e.g. prod-azure) |
| Target Host | Full workspace URL (e.g. https://adb-1234.azuredatabricks.net) |
| Auth Method | Personal Access Token, Service Principal, or CLI Profile |
| Token / Client ID + Secret / Profile | Credentials for the chosen method |
| Target SQL Warehouse | Runs DDL + DEEP CLONE SQL on the target side. The dropdown auto-populates after Browse |
| Default data sync mode | Used when this target is picked on /clone (see below) |
| Auto-handle column masks & row filters | See Column masks and row filters below |
| Keep migration share | Leave the Delta Share in place after migration (debug / audit) |

Saved connections live in browser localStorage (clxs_target_connections), not on the server. PATs and client secrets never persist to disk — each clone request sends them inline, sourced from the picked entry. Each saved connection card auto-shows ✓ Logged in as <user> (resolved via the lightweight POST /target/whoami endpoint) so you can spot stale or wrong-identity tokens at a glance.

On /clone, just pick the saved target. Step 1 ("Source & Destination") has a "Clone to a different workspace" checkbox. Tick it, and a compact picker appears:

☑ Clone to a different workspace
─────────────────────────────────────────────────────────
Target connection: [ prod-azure ▼ ] [ Test ] Manage in Settings →
https://adb-7405….azuredatabricks.net · PAT · WH e83992177db8bdd5 · snapshot_once

If no targets are saved yet, the picker shows + Configure target in Settings → instead. Test runs the same checks as the saved-connection card (auth + metastore sharing + warehouse existence + non-blocking warehouse start if STOPPED).

When the box is ticked, the Destination Catalog dropdown switches its data source — it now lists catalogs that exist in the target workspace (with (from target 'prod-azure') shown next to the label). You pick an existing target catalog or + Create New Catalog to provision a fresh one.

Same-metastore guard

If you're attempting a cross-workspace clone between two workspaces that happen to share the same Unity Catalog metastore, Clone-Xs fails fast in 1–2 seconds before any Delta Sharing objects are created:

Source and target workspaces are in the same Unity Catalog metastore
(<your-metastore-uuid>). Delta Sharing requires distinct metastores —
you cannot share to yourself.

Fix: on /clone, untick 'Clone to a different workspace' and run a normal
in-metastore clone instead. Same metastore = same UC = no Delta Sharing required.

This is the most common pitfall when teams add a second workspace to an existing UC metastore. CREATE RECIPIENT IF NOT EXISTS against your own metastore silently no-ops in Databricks, so without this preflight you'd get a confusing "phantom recipient" error 30 seconds in. The check compares source and target global_metastore_id returned by client.metastores.summary().
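The preflight itself is a plain equality check on two identifiers. A minimal sketch, assuming the two `global_metastore_id` values have already been fetched (for example from `client.metastores.summary()` on each side); `SameMetastoreError` and `assert_distinct_metastores` are hypothetical names:

```python
class SameMetastoreError(ValueError):
    """Raised when source and target resolve to the same Unity Catalog metastore."""


def assert_distinct_metastores(source_global_id, target_global_id):
    """Fail fast before any share or recipient is created."""
    if source_global_id == target_global_id:
        raise SameMetastoreError(
            "Source and target workspaces are in the same Unity Catalog "
            f"metastore ({source_global_id}). Delta Sharing requires distinct "
            "metastores."
        )


# Different clouds / regions: passes silently.
assert_distinct_metastores("aws:us-east-1:1111", "azure:eastus:2222")

# Same metastore: rejected up front.
try:
    assert_distinct_metastores("aws:us-east-1:1111", "aws:us-east-1:1111")
    rejected = False
except SameMetastoreError:
    rejected = True
```

Because the check needs only two metadata reads, it can run before any Delta Sharing object exists, which is what keeps the failure cheap.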

Data sync modes

When you re-run a cross-workspace clone for the same source → target pair, the deterministic share/recipient names mean the Delta Sharing handshake is skipped and only table data is reconciled. How that reconciliation happens is controlled by data_sync_mode:

| Mode | SQL emitted per table | Re-run behaviour | When to use |
| --- | --- | --- | --- |
| snapshot_once (default) | CREATE TABLE IF NOT EXISTS dst DEEP CLONE src | No-op on existing tables; only newly-added tables in source get cloned. | One-time hydration. The target is meant to drift independently after the initial copy. |
| incremental | CREATE OR REPLACE TABLE dst DEEP CLONE src | Reads both Delta logs and copies only files added since the last clone. ⚠ Overwrites any target-side writes to cloned tables. | Source is the system of record and the target is a read-replica/mirror. |
| force_full | DROP TABLE IF EXISTS dst; CREATE TABLE dst DEEP CLONE src | Full re-clone every run. Slowest, most predictable. | Recovery from corruption, or after a schema change you want to apply cleanly. |

incremental and force_full log a WARNING at the start of the run because of the data-loss implication. DEEP CLONE is a one-way mirror — Databricks doesn't expose MERGE semantics for clone, so any row inserted on the target after a previous clone is lost on re-run in those modes.
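The three modes map one-to-one onto SQL templates, so the dispatch can be sketched as a small function. `clone_statements` and `DESTRUCTIVE_MODES` are hypothetical names; the SQL strings are the ones listed for each mode.

```python
# Modes that can discard target-side writes and therefore warrant a warning.
DESTRUCTIVE_MODES = {"incremental", "force_full"}


def clone_statements(data_sync_mode, dst, src):
    """Return the per-table SQL statements for a given data_sync_mode."""
    if data_sync_mode == "snapshot_once":
        return [f"CREATE TABLE IF NOT EXISTS {dst} DEEP CLONE {src}"]
    if data_sync_mode == "incremental":
        return [f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}"]
    if data_sync_mode == "force_full":
        return [
            f"DROP TABLE IF EXISTS {dst}",
            f"CREATE TABLE {dst} DEEP CLONE {src}",
        ]
    raise ValueError(f"unknown data_sync_mode: {data_sync_mode}")


statements = clone_statements("force_full", "dr.sales.orders", "shared.sales.orders")
needs_warning = "force_full" in DESTRUCTIVE_MODES
```

Only `force_full` emits two statements, which is why it is the slowest mode: the drop discards the existing files before the full copy starts.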

Column masks and row filters

Delta Sharing refuses to share any table that has a column mask or row filter applied — the cross-workspace clone will fail at ALTER SHARE ADD TABLE for those tables, and any view that joins them will then fail with TABLE_OR_VIEW_NOT_FOUND on the target.

Set auto_handle_masks: true on target_workspace to let Clone-Xs handle this automatically. The flow becomes:

  1. Before adding each table to the share, Clone-Xs runs DESCRIBE EXTENDED on it and parses out any # Column Masks and # Row Filter entries.
  2. For tables with masks/filters: drops them on the source (ALTER TABLE ... ALTER COLUMN ... DROP MASK and ALTER TABLE ... DROP ROW FILTER).
  3. Adds the table to the share — now succeeds.
  4. The clone runs through (DEEP CLONE → views → functions → etc.). The mask/filter UDFs themselves get migrated by the existing function-migration step.
  5. After functions migration, re-applies the masks/filters on the target tables, rewriting the function FQN from source catalog to destination catalog.
  6. Finally: restoration on source depends on data_sync_mode:
    • snapshot_once / force_full → restore the masks on source. The clone is a one-shot operation; the share isn't being read continuously.
    • incremental → leave the source masks dropped. Re-applying them would break ongoing Delta Sharing reads (Databricks invalidates the share when masks reappear). A WARNING is logged; you'll need to drop and re-apply manually after you stop syncing if you need source-side protection back.

If auto_handle_masks is left false (the default), masked tables are skipped (with a warning at ALTER SHARE ADD TABLE) and any downstream views that depend on them fail. Enable auto_handle_masks only when you have demo data or a non-production source where you can tolerate brief mask-removal windows.
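The statements involved in the drop and re-apply steps, plus the restore decision from step 6, can be sketched as follows. All function names are hypothetical, and the parsing of DESCRIBE EXTENDED output is left out; the sketch starts from already-parsed mask and filter metadata.

```python
def drop_policy_sql(table, masked_columns, has_row_filter):
    """Statements that strip masks / row filter from a source table (step 2)."""
    stmts = [f"ALTER TABLE {table} ALTER COLUMN {col} DROP MASK" for col in masked_columns]
    if has_row_filter:
        stmts.append(f"ALTER TABLE {table} DROP ROW FILTER")
    return stmts


def retarget_fqn(function_fqn, source_catalog, dest_catalog):
    """Rewrite catalog.schema.func so it points at the destination catalog."""
    catalog, _, rest = function_fqn.partition(".")
    if rest and catalog.strip("`").lower() == source_catalog.lower():
        return f"{dest_catalog}.{rest}"
    return function_fqn


def apply_policy_sql(table, column_masks, row_filter=None):
    """Statements that (re-)apply masks / row filter (step 5).

    column_masks: {column: function_fqn}
    row_filter:   (function_fqn, [columns]) or None
    """
    stmts = [
        f"ALTER TABLE {table} ALTER COLUMN {col} SET MASK {fn}"
        for col, fn in column_masks.items()
    ]
    if row_filter:
        fn, cols = row_filter
        stmts.append(f"ALTER TABLE {table} SET ROW FILTER {fn} ON ({', '.join(cols)})")
    return stmts


def restore_on_source(data_sync_mode):
    """Step 6: masks go back on the source unless the share is read continuously."""
    return data_sync_mode in {"snapshot_once", "force_full"}


target_sql = apply_policy_sql(
    "staging.crm.customers",
    {"ssn": retarget_fqn("prod.sec.mask_ssn", "prod", "staging")},
    row_filter=(retarget_fqn("prod.sec.region_filter", "prod", "staging"), ["region"]),
)
```

Between the drop and the restore, the source table is readable without its policies, which is the mask-removal window the surrounding text warns about.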

API usage

Pre-flight the target:

curl -X POST $CLXS_HOST/api/target/validate \
-H "Content-Type: application/json" \
-d '{
"host": "https://adb-target.azuredatabricks.net",
"auth_method": "pat",
"token": "dapi...",
"warehouse_id": "abc123"
}'
# { "ok": true, "catalog_count": 14, "metastore_sharing_id": "azure:eastus:uuid" }

Kick off the migration — same POST /api/clone endpoint, just supply target_workspace:

curl -X POST $CLXS_HOST/api/clone \
-H "Content-Type: application/json" \
-d '{
"source_catalog": "retail_prod",
"destination_catalog": "retail_prod_dr",
"target_workspace": {
"host": "https://adb-target.azuredatabricks.net",
"auth_method": "pat",
"token": "dapi...",
"warehouse_id": "abc123",
"keep_share": false
}
}'

See the API reference for the full schema.

Config (YAML)

source_catalog: retail_prod
destination_catalog: retail_prod_dr

target_workspace:
  host: "https://adb-target.azuredatabricks.net"
  auth_method: "pat"   # "pat" | "service_principal" | "profile"
  token: ""            # for PAT
  client_id: ""        # for service_principal
  client_secret: ""    # for service_principal
  profile: ""          # for profile (~/.databrickscfg)
  warehouse_id: ""     # target warehouse; DDL + DEEP CLONE run here
  keep_share: false

# Toggle which object types migrate (all default true)
clone_views: true
clone_functions: true
clone_volumes: true
volume_max_file_mb: 500 # per-file cap for volume copies

# These also apply to cross-workspace migrations
copy_permissions: true # GRANTs replayed via SHOW GRANTS
copy_ownership: true # ALTER … OWNER TO … on target
copy_tags: true # replayed from system.information_schema

Full reference in Configuration.

What gets migrated

| Object | How | Known limits |
| --- | --- | --- |
| Catalog | CREATE CATALOG on target (optional MANAGED LOCATION) | Target name must not already exist |
| Schemas | CREATE SCHEMA IF NOT EXISTS per source schema | |
| Tables (managed + external) | CREATE TABLE … DEEP CLONE from the shared catalog | Streaming tables not migrated in this pipeline |
| Mixed-format sources (Delta, Parquet, Iceberg) | Same CREATE TABLE … DEEP CLONE syntax; Databricks materialises the clone as Delta on the target regardless of source format | Iceberg with partition evolution / decimal-truncated partitions and partitioned Parquet referenced by path are unsupported by Databricks CLONE (Clone-Xs surfaces an actionable error per Databricks Parquet/Iceberg CLONE limits) |
| Views + materialized views | SHOW CREATE TABLE → catalog-reference rewrite → CREATE OR REPLACE VIEW | Views referencing catalogs outside the migration scope will fail and be logged |
| SQL functions | SHOW CREATE FUNCTION → rewrite → CREATE OR REPLACE FUNCTION | Python UDFs that contain literal catalog names in string bodies are not rewritten |
| Volumes (managed + external) | CREATE VOLUME + file-by-file copy via the Databricks Files API | Per-file cap (volume_max_file_mb, default 500 MB); external volumes skipped if no storage_location |
| Grants | SHOW GRANTS on source → GRANT on target | Principals that don't exist on the target metastore are counted as grants_skipped |
| Ownership | DESCRIBE … EXTENDED → ALTER … OWNER TO | Same principal-resolution caveat as grants |
| Tags | system.information_schema.table_tags → ALTER TABLE … SET TAGS | Tables only; column-level tags migration is a future enhancement |

Cross-cloud caveats

  • Egress: DEEP CLONE reads source data through the Delta Sharing endpoint. Cross-region / cross-cloud reads incur standard egress. Plan migration windows accordingly for TB-scale catalogs.
  • File-copy cap: volume files larger than volume_max_file_mb are skipped with a warning. The Files API streams through the Clone-Xs process, so extremely large blobs need a different transport (submit a Databricks job on target that reads from a jointly-reachable storage credential).
  • Principal resolution: user / group / service principal names must match on both metastores (SCIM-synced AD groups generally do). Missing principals log a debug line and increment grants_skipped / ownership_skipped.
  • DDL rewriter is regex-based: catalog references in view + function DDL are rewritten by pattern match (source_catalog. → dest_catalog., both backticked and bare, case-insensitive). SQL-in-strings inside UDF bodies and dynamically constructed identifiers are not rewritten.
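For intuition on the regex-based rewrite in the last caveat, here is one way such a pattern could look. This is a sketch written for this page, not the pattern Clone-Xs ships; `rewrite_catalog_refs` is a hypothetical name. Unlike the behaviour described above, this simple version makes no attempt to skip string literals.

```python
import re


def rewrite_catalog_refs(ddl, source_catalog, dest_catalog):
    """Rewrite `source.` / source. prefixes to the destination catalog.

    Matches the catalog only at the start of a dotted name: the lookbehind
    rejects matches preceded by a word character, a dot, or a backtick, so
    `myprod.x` and `other.prod.t` are left alone.
    """
    name = re.escape(source_catalog)
    pattern = re.compile(rf"(?<![\w.`])(`?){name}\1\.", re.IGNORECASE)
    # group(1) is either "" or "`", so the original quoting style is preserved.
    return pattern.sub(lambda m: f"{m.group(1)}{dest_catalog}{m.group(1)}.", ddl)


rewritten = rewrite_catalog_refs(
    "CREATE VIEW prod.sales.v AS SELECT * FROM `PROD`.sales.orders",
    "prod",
    "staging",
)
```

The same limitation noted in the caveat applies in spirit: identifiers built dynamically at runtime never appear in the DDL text, so no pattern match can reach them.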

Prerequisites

  • Delta Sharing enabled on both metastores (Databricks-to-Databricks sharing).
  • Source user needs CREATE SHARE + CREATE RECIPIENT privileges on the source metastore.
  • Target user needs CREATE CATALOG + CREATE PROVIDER privileges on the target metastore.
  • Target warehouse must be running (or have auto-start enabled); all target-side DDL and DEEP CLONE statements run on it.

Debugging failed migrations

Set keep_share: true (or tick the checkbox in the UI). Clone-Xs will leave the Delta Share, recipient, and shared catalog in place after the job completes or fails — you can inspect what the target actually saw via SHOW TABLES IN clone_xs_shared_<suffix> and re-issue the DEEP CLONE manually. Run a second migration with keep_share: false to clean up when you're done.

Multi-target fanout (target_workspaces)

When to use: DR replication or "data lake landing zone" pattern where one source catalog fans out to N target workspaces — typically across regions (us / eu / apac) or environments (prod / staging / dev). Sequential clones to N targets take N × clone-duration; fanout runs them in parallel.

Real-world scenario: Production data lives in prod-us (us-east-1). The DR plan requires hot-warm copies in prod-eu (west-europe) and prod-apac (ap-southeast-2), refreshed nightly. Without fanout you'd run three sequential clones — ~1 hour × 3 = ~3 hours nightly. With fanout it's ~1 hour total (the slowest target dominates).

POST /api/clone
{
"source_catalog": "production",
"destination_catalog": "production_dr",
"target_workspaces": [
{ "host": "https://eu.cloud.databricks.com", "auth_method": "pat", "token": "...", "warehouse_id": "wh-eu" },
{ "host": "https://us.cloud.databricks.com", "auth_method": "pat", "token": "...", "warehouse_id": "wh-us" },
{ "host": "https://apac.cloud.databricks.com", "auth_method": "pat", "token": "...", "warehouse_id": "wh-apac" }
],
"fanout_max_parallel": 5
}

The router routes plural-target_workspaces to the fanout orchestrator (src/clone_fanout.py), which spawns N parallel run_cross_workspace_clone calls, one per target. Each target gets its own deterministic share / recipient / shared-catalog (per the Recipient-uniqueness rule — one recipient per target metastore from a given source). Source-side state is independent: a failure on target B doesn't touch target A's share or recipient.

Result aggregation:

{
"mode": "fanout",
"status": "partial",
"target_count": 3,
"succeeded_targets": 2,
"failed_targets": 1,
"bytes_copied": 480000000000,
"tables_cloned": 78,
"per_target": [
{ "target_host": "https://eu...", "target_status": "success", "bytes_copied": 240000000000, "tables_cloned": 39 },
{ "target_host": "https://us...", "target_status": "success", "bytes_copied": 240000000000, "tables_cloned": 39 },
{ "target_host": "https://apac...", "target_status": "failed", "error": "DEEP CLONE failed on table users: ..." }
]
}

Aggregate status semantics:

  • success — every target finished without raising.
  • partial — at least one target succeeded AND at least one failed.
  • failed — no target succeeded.
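The aggregation rules can be expressed in a few lines. The sketch below recomputes the top-level fields from a `per_target` list shaped like the sample response; `aggregate_fanout` is a hypothetical name, and an empty target list is treated as `failed` here because no target succeeded.

```python
def aggregate_fanout(per_target):
    """Roll per-target results up into the aggregate fanout fields."""
    succeeded = sum(1 for t in per_target if t["target_status"] == "success")
    failed = len(per_target) - succeeded

    if succeeded == 0:
        status = "failed"      # no target succeeded
    elif failed == 0:
        status = "success"     # every target finished without raising
    else:
        status = "partial"     # at least one of each

    return {
        "mode": "fanout",
        "status": status,
        "target_count": len(per_target),
        "succeeded_targets": succeeded,
        "failed_targets": failed,
        "bytes_copied": sum(t.get("bytes_copied", 0) for t in per_target),
        "tables_cloned": sum(t.get("tables_cloned", 0) for t in per_target),
    }


summary = aggregate_fanout([
    {"target_host": "https://eu...", "target_status": "success", "bytes_copied": 240000000000, "tables_cloned": 39},
    {"target_host": "https://us...", "target_status": "success", "bytes_copied": 240000000000, "tables_cloned": 39},
    {"target_host": "https://apac...", "target_status": "failed", "error": "DEEP CLONE failed on table users: ..."},
])
```

Failed targets contribute nothing to `bytes_copied` or `tables_cloned` in this sketch, which matches the totals in the sample response (two successful targets at 39 tables each).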

fanout_max_parallel caps how many target clones run simultaneously (default 5). Higher values increase source-side egress bandwidth pressure (each parallel target reads from the same source share endpoint); lower values push the run toward sequential execution. For the typical 3-region fanout, the default is fine. For 10+ targets, keep the cap in the 3 to 5 range rather than raising it, to avoid saturating the source warehouse.

Mutual exclusivity with target_workspace: the singular field (one cross-workspace clone) and the plural field (fanout to N) are mutually exclusive. Setting both is a 422 — pick one. The router decides dispatch by which field is set:

| Request fields | Routed to |
| --- | --- |
| Neither | Same-workspace clone (clone_catalog) |
| target_workspace (singular) | Single cross-workspace (run_cross_workspace_clone) |
| target_workspaces (plural) | Fanout (run_cross_workspace_fanout) |
| Both | 422 Validation Error |

What if one target is in the same metastore as source? The same-metastore preflight runs inside run_cross_workspace_clone, so it fires per-target. The offending target raises and is marked failed in the per_target list; the other targets run normally. Net result: aggregate partial, with a clear error string on the rejected target.


Serverless compute

When to use: You want to run a clone without provisioning or paying for a SQL warehouse — ideal for one-off clones, CI/CD pipelines, and scheduled jobs.

Real-world scenario: Your CI pipeline creates a cloned catalog for every pull request. Instead of keeping a warehouse running 24/7, you use serverless compute — Clone-Xs packages itself, uploads to a UC Volume, and submits a serverless job that auto-scales and shuts down when done.

# Serverless clone
clxs clone \
--source production --dest staging \
--serverless \
--volume /Volumes/my_catalog/my_schema/libs

# With full options
clxs clone \
--source production --dest staging \
--serverless \
--volume /Volumes/my_catalog/my_schema/libs \
--validate --report

# config/clone_config.yaml
serverless: true
volume: "/Volumes/my_catalog/my_schema/libs"

For full details on how serverless works, volume requirements, and incremental sync support, see Notebooks & Serverless.


Cost & time estimate

Before you run a clone, the Preview step (step 3 on the Clone page) can surface a pre-flight estimate:

  • Table count: how many tables are in scope
  • Total size: sum of sizeInBytes from DESCRIBE DETAIL on each source table
  • Estimated duration: heuristic of ~500 MB/s for a DEEP clone on a medium warehouse
  • Storage cost: total_gb × price_per_gb per month / year (default $0.023/GB/month, override in config)

Click Estimate in the Preview panel. Under the hood it calls POST /api/estimate, which runs DESCRIBE DETAIL sequentially — expect ~1 second per table, so ~1 minute per 100 tables. SHALLOW clones skip the duration/cost estimate since they don't copy data files.

# Via the API (curl):
curl -X POST $CLXS_HOST/api/estimate \
-H "Content-Type: application/json" \
-d '{"source_catalog": "prod", "price_per_gb": 0.023}'
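The arithmetic behind the estimate is simple enough to reproduce by hand. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and uses the 500 MB/s heuristic and the default price from the list above. `estimate_clone`, `estimated_duration_s`, and `yearly_cost_usd` are hypothetical names; only `total_gb` and `monthly_cost_usd` appear in the documented response.

```python
def estimate_clone(total_bytes, price_per_gb=0.023, throughput_mb_s=500.0):
    """Back-of-the-envelope duration and storage cost for a DEEP clone."""
    total_gb = total_bytes / 1e9                            # assumption: decimal GB
    duration_s = (total_bytes / 1e6) / throughput_mb_s      # size in MB / (MB per second)
    monthly = total_gb * price_per_gb
    return {
        "total_gb": round(total_gb, 2),
        "estimated_duration_s": round(duration_s),
        "monthly_cost_usd": round(monthly, 2),
        "yearly_cost_usd": round(monthly * 12, 2),
    }


# 240 GB catalog at the default price
result = estimate_clone(240_000_000_000)
```

For a 240 GB catalog this gives 480 seconds (8 minutes) of copy time and $5.52 per month of storage, before any per-table overhead.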

Full clone vs selective re-clone comparison

When you pass destination_catalog to /api/estimate AND that target catalog already exists, the response carries an extra selective block — the size + cost a SELECTIVE re-clone (drifted tables only) would incur, alongside the FULL numbers. The /clone preview tile renders both side-by-side with a "Recommended: SELECTIVE" or "Recommended: FULL" badge based on a 50% savings threshold:

{
"total_gb": 240,
"monthly_cost_usd": 5.52,
"selective": {
"target_exists": true,
"size_gb": 12,
"monthly_cost_usd": 0.28,
"tables_to_clone": 3,
"tables_in_sync": 47,
"savings_pct": 95.0,
"recommended": true,
"drift_breakdown": {
"never_cloned": 0,
"version_drift": 3,
"unable_to_compare": 0
}
}
}

The recommendation kicks in at savings ≥ 50% — below that, the per-table DESCRIBE HISTORY overhead and operational complexity outweigh the bandwidth savings. The block is omitted entirely when the target catalog doesn't exist (only a full clone is possible) and on cross-workspace previews (the source client can't read target Delta versions across the workspace boundary).

# Compare full vs selective when target exists
curl -X POST $CLXS_HOST/api/estimate \
-H "Content-Type: application/json" \
-d '{"source_catalog": "prod", "destination_catalog": "prod_dr"}'
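The savings figure and the badge follow from two numbers in the response. A minimal sketch, assuming savings are computed from size (cost scales linearly with size at a fixed price, so a cost-based ratio would give the same percentage); `selective_recommendation` is a hypothetical name.

```python
def selective_recommendation(full_gb, selective_gb, threshold_pct=50.0):
    """Compute savings of a selective re-clone and whether to recommend it."""
    if full_gb <= 0:
        savings = 0.0
    else:
        savings = (1 - selective_gb / full_gb) * 100
    savings = round(savings, 1)
    return {"savings_pct": savings, "recommended": savings >= threshold_pct}


# Numbers from the sample response: 240 GB full vs 12 GB selective
badge = selective_recommendation(240, 12)
```

With 12 GB of drifted data out of 240 GB, savings are 95%, well above the 50% threshold, so the preview would show the SELECTIVE badge.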

Runtime guardrails

Hard limits that abort the job in flight — a safety net against runaway scope changes or unexpectedly large catalogs:

max_duration_min: 60     # Abort after 60 minutes wall clock
max_tables: 500 # Abort after 500 tables touched (any outcome)

Enforced in the orchestrator after each schema completes. When tripped, remaining schemas are cancelled and the job's summary gets aborted: true + abort_reason: "max_duration_min" | "max_tables". Already-cloned tables stay in place; use Rollback to undo them.

When to use: scheduled / CI clones where an unexpectedly long run is worse than a failed run. Not for interactive work.

caution

Guardrails only check between schemas, not during. A single schema with 2,000 tables won't be interrupted mid-schema even if max_tables=100 is set — set parallel_tables higher and max_workers lower to shorten the check interval.
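The between-schemas behaviour is easiest to see in code. The sketch below is a simplified, sequential model (the real orchestrator runs schemas in parallel); `run_with_guardrails` and the `clone_schema` callable are hypothetical, and the `>=` comparisons are an assumption about where exactly the limits trip.

```python
import time


def run_with_guardrails(schemas, clone_schema, max_duration_min=None,
                        max_tables=None, clock=time.monotonic):
    """Clone schemas one by one, checking limits only after each schema completes.

    clone_schema(schema) returns the number of tables it touched (any outcome).
    """
    start = clock()
    tables_touched = 0
    completed = []
    for index, schema in enumerate(schemas):
        tables_touched += clone_schema(schema)   # never interrupted mid-schema
        completed.append(schema)

        reason = None
        if max_duration_min is not None and clock() - start >= max_duration_min * 60:
            reason = "max_duration_min"
        elif max_tables is not None and tables_touched >= max_tables:
            reason = "max_tables"

        remaining = list(schemas[index + 1:])
        if reason and remaining:
            return {"aborted": True, "abort_reason": reason,
                    "completed_schemas": completed, "cancelled_schemas": remaining,
                    "tables_touched": tables_touched}
    return {"aborted": False, "abort_reason": None,
            "completed_schemas": completed, "cancelled_schemas": [],
            "tables_touched": tables_touched}


# A 2,000-table schema runs to completion even though max_tables is 100.
sizes = {"big": 2000, "small": 10}
outcome = run_with_guardrails(list(sizes), sizes.__getitem__, max_tables=100)
```

In this run all 2,000 tables of `big` are touched before the limit is noticed, and only `small` is cancelled, which is the overshoot the caution describes.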


Cloning from a named snapshot

You can tag a catalog's current state as a named snapshot (fork point) and later clone from that snapshot instead of the current state. Useful for pre-migration baselines, month-end captures, and repeatable dev refresh.

# 1. Take a snapshot
curl -X POST $CLXS_HOST/api/clone-snapshots \
-H "Content-Type: application/json" \
-d '{"source_catalog": "prod", "name": "pre-migration"}'
# returns { "snapshot_id": "7f3a4b5c-...", ... }

# 2. Clone from it later
curl -X POST $CLXS_HOST/api/clone \
-H "Content-Type: application/json" \
-d '{
"source_catalog": "prod",
"destination_catalog": "prod_audit",
"source_snapshot_id": "7f3a4b5c-..."
}'

The snapshot's captured_at timestamp becomes the default as_of_timestamp for every table in the clone. See the dedicated Snapshots guide for create/list/delete, UI flow, and limitations.


Reading the clone log

Every clone emits a consistent progression of log lines — the same stream surfaces in the Clone-Xs UI's Execution panel, in stdout/stderr for the CLI, and in the Databricks run view when a serverless job runs the clone.

Startup summary. After schemas are discovered, the tool pre-counts tables and emits one line with the full denominator so you know the scope up front:

[INFO] ◈ Found 50 schemas to clone: bronze, silver, gold, …
[INFO] ◈ Starting clone: 611 tables across 50 schemas → edp_01

Live Schemas + Tables progress. A single progress bar tracks both levels — the primary counter is schemas (how many schemas are done), and the suffix shows the catalog-level table count (updates live as each table finishes, not just at schema boundaries):

Schemas |██░░░░░░░░░░░░░░░░░░░░░░░░░░░░| 5/50 (10%) [5ok/0fail/0skip] ETA: 2m · Tables 120/611 [115ok/2fail/3skip]

The [Nok/Nfail/Nskip] breakdown reflects table-level outcomes (skipped = matched a filter, excluded, or already-cloned under incremental / resume).

Per-schema roll-up. As each schema finishes, one summary line is emitted:

[INFO] ◈ Schema bronze complete: 42/45 tables cloned (2 failed, 1 skipped) in 18s

Schemas with no tables (metadata-only) stay silent — keeps the log clean for catalogs where most schemas are empty.
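If you ship these logs to an aggregator, the per-schema roll-up line is regular enough to parse. The sketch below is based solely on the sample line shown above; it assumes the duration is always rendered as whole seconds (`18s`), which may not hold for longer schemas. `parse_rollup` is a hypothetical helper.

```python
import re

# Pattern derived from the sample roll-up line; the leading "[INFO] ◈" is ignored.
ROLLUP = re.compile(
    r"Schema (?P<schema>\S+) complete: "
    r"(?P<cloned>\d+)/(?P<total>\d+) tables cloned "
    r"\((?P<failed>\d+) failed, (?P<skipped>\d+) skipped\) "
    r"in (?P<seconds>\d+)s"
)


def parse_rollup(line):
    """Return the roll-up fields as a dict, or None if the line doesn't match."""
    match = ROLLUP.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    parsed = {"schema": fields.pop("schema")}
    parsed.update({key: int(value) for key, value in fields.items()})
    return parsed


row = parse_rollup(
    "[INFO] ◈ Schema bronze complete: 42/45 tables cloned (2 failed, 1 skipped) in 18s"
)
```

Lines that are not roll-ups (progress bars, per-table events) simply return `None`, so the parser can be applied to the whole stream.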

Per-table events. The granular ✓ Cloned table: … / △ Dropped table for re-clone: … / ✗ Failed … lines continue to fire for every object — the new summary lines sit alongside them, they don't replace them.

tip

Turn off show_progress (or pass --no-progress on the CLI) to suppress the Schemas progress bar when piping logs to a file or a log aggregator. The startup summary and per-schema roll-up are regular [INFO] logs and are not affected.


Resume from failure

When to use: A clone operation failed partway through (e.g., network timeout, warehouse stopped). You want to resume from where it left off instead of restarting from scratch.

Real-world scenario: Your clone of 2,000 tables failed at table #1,500. Instead of re-cloning all 2,000 tables, you resume from the rollback log — the tool skips the 1,500 already-cloned tables and continues with the remaining 500.

# Original clone with rollback enabled
clxs clone --enable-rollback
# ... fails at some point

# Resume from the rollback log
clxs clone --resume rollback_logs/rollback_staging_20260310_143022.json

Auto-mask PII (auto_mask_pii: true)

When to use: You're cloning to a lower environment (staging, dev, QA, UAT) and the source has PII columns tagged in Unity Catalog. You want the destination to land already masked — no separate post-clone step where unmasked rows briefly exist on the target.

What it does: At clone time, Clone-Xs queries information_schema.column_tags once per catalog and builds a list of PII columns (any column with a tag whose name matches the PII tag set Unity Catalog defines: EMAIL, SSN, CREDIT_CARD, PHONE, IBAN, etc.). After each schema's tables clone successfully, the existing src/masking.py pipeline runs an UPDATE against the destination using the per-column strategy from pii_detection.SUGGESTED_MASKING: email_mask for EMAIL, hash for SSN / CREDIT_CARD, partial for PHONE, etc.

The unmasked-data exposure window is bounded by the clone job itself: no external reader sees the table before the UPDATE commits, so there's no observable moment when unmasked PII is on the target.

auto_mask_pii: true   # default false

POST /api/clone
{ "source_catalog": "production", "destination_catalog": "staging", "auto_mask_pii": true }

The masking column-tag query is cached per clone job — querying column_tags once per catalog rather than once per schema. Manual rules supplied via masking_rules still apply alongside auto-detected ones.


Auto-retry transient failures (enable_retry: true)

When to use: Long-running clones (hours, sometimes overnight) hit transient failures — Databricks throttles a high-volume warehouse with HTTP 429, a network blip drops a connection, a 5xx returns from the SQL execution endpoint. Without retry, the entire clone fails and the operator has to restart from the rollback log.

What it does: Wraps clone-job execution in RetryPolicy (src/retry.py). On a transient error class — TimeoutError, ConnectionError, requests-style network errors, HTTP 429 / 502 / 503 / 504, Databricks ThrottledRequest — Clone-Xs sleeps with exponential backoff and retries the same per-table CLONE statement. Logical errors never retry: schema-mismatch, permission-denied, missing-catalog, validation failures, bad config — the next attempt would just fail with the same error and mask the real problem.

enable_retry: true    # default true
max_retries: 3        # cap on attempts for each per-table CLONE

The retry count is surfaced in the clone job status response (GET /api/clone/{job_id}) so operators can spot upstream flakiness over time — a sudden uptick in retries usually means the source warehouse is throttling under unrelated load.
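
A minimal sketch of the retry behaviour described in this section, written in Python. It is not the RetryPolicy class from src/retry.py: the HttpError type, the run_with_retry helper, the one-second base delay, and the reading of max_retries as "retries after the first attempt" are assumptions for illustration.

```python
import time

# HTTP statuses this section lists as transient.
TRANSIENT_STATUS = {429, 502, 503, 504}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(exc):
    """Return True only for error classes worth retrying."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    return isinstance(exc, HttpError) and exc.status in TRANSIENT_STATUS


def run_with_retry(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); retry transient failures with exponential backoff.

    Returns (result, retries_used). Non-transient errors propagate at once.
    """
    retries = 0
    while True:
        try:
            return fn(), retries
        except Exception as exc:
            if not is_transient(exc) or retries >= max_retries:
                raise
            sleep(base_delay * (2 ** retries))  # 1s, 2s, 4s, ...
            retries += 1


# Usage: a fake CLONE call that is throttled twice, then succeeds.
calls = {"n": 0}


def flaky_clone():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HttpError(429)
    return "cloned"


delays = []  # record the backoff delays instead of actually sleeping
result, retries = run_with_retry(flaky_clone, sleep=delays.append)
```

Returning the retry count alongside the result is one way to make it available for the job status response mentioned above.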


Compare DQ after clone (compare_dq_after_clone: true)

When to use: Catching silent data corruption mid-clone before the bad target becomes the new source of truth. Cross-environment promotions ("staging → production") where wrong data on the target is worse than no clone.

What it does: After each schema finishes cloning, Clone-Xs runs a per-table column-level comparison via the existing dqx_engine: row count plus per-column NULL counts on source vs target. The result is a per-table drift score (% of columns where the count delta exceeds 0.5%). When the max drift across any cloned table exceeds dq_drift_rollback_pct AND auto_rollback_on_failure is True, the existing rollback path (Delta RESTORE) reverts the destination.

compare_dq_after_clone: true        # default false
dq_drift_rollback_pct: 5.0 # 0–100, default 5%
auto_rollback_on_failure: true # required for the rollback to fire

Adds one extra warehouse round-trip per cloned table — expect a few seconds added per 100 tables. The default 5% threshold matches the existing row-count rollback_threshold so operators have one mental model for "acceptable drift."
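
The drift score and rollback decision can be sketched in Python as below. The real calculation lives in dqx_engine and may differ. This sketch assumes the per-column delta is the relative difference in NULL counts against the source, and it leaves out the row-count comparison, which could be added as one more entry in each dictionary. The function names are hypothetical.

```python
def column_delta_pct(src_count, tgt_count):
    """Relative difference between two counts, as a percentage of the source."""
    if src_count == 0:
        return 0.0 if tgt_count == 0 else 100.0
    return abs(src_count - tgt_count) / src_count * 100.0


def table_drift_pct(src_nulls, tgt_nulls, column_threshold_pct=0.5):
    """Percentage of columns whose NULL-count delta exceeds the threshold.

    src_nulls / tgt_nulls: {column_name: null_count} for source and target.
    """
    if not src_nulls:
        return 0.0
    drifted = sum(
        1
        for col, src in src_nulls.items()
        if column_delta_pct(src, tgt_nulls.get(col, 0)) > column_threshold_pct
    )
    return drifted / len(src_nulls) * 100.0


def should_rollback(drift_by_table, dq_drift_rollback_pct=5.0,
                    auto_rollback_on_failure=True):
    """Roll back when the worst table exceeds the threshold and rollback is on."""
    worst = max(drift_by_table.values(), default=0.0)
    return auto_rollback_on_failure and worst > dq_drift_rollback_pct


src = {"id": 0, "email": 200, "phone": 1000, "city": 50}
tgt = {"id": 0, "email": 200, "phone": 1004, "city": 80}
drift = table_drift_pct(src, tgt)  # only "city" drifts: 1 of 4 columns
```

In this example the phone column differs by 0.4%, which stays under the 0.5% per-column threshold, so only one of four columns counts as drifted and the table scores 25%.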


WHERE-clause filtered clone (where_clauses: {…})

When to use: You only want a slice of a table on the destination: the most recent year for analyst sandboxes, a specific customer's rows for a DSAR export, or a 1% sample for dev-environment fixtures. This differs from include_tables_regex, which selects whole tables, and from data_filters, which is more limited.

What it does: For each (schema, table) key in the where_clauses dict, the per-table CLONE swaps to a CTAS path: CREATE TABLE IF NOT EXISTS dst AS SELECT * FROM src WHERE <clause>. This path loses the source's Delta history (the target lands at version 0) because CTAS doesn't carry over the source's _delta_log. Time-travel arguments still work: the WHERE is applied to the time-travelled view.

where_clauses:
  "bronze.events": "date >= '2026-01-01'"    # last year only
  "bronze.users": "country IN ('GB', 'IE')"  # GDPR scope
  "*": "is_deleted = false"                  # wildcard: applies to every table

POST /api/clone
{
  "source_catalog": "production",
  "destination_catalog": "analyst_sandbox",
  "where_clauses": { "bronze.events": "date >= '2026-01-01'" }
}

Only effective with clone_type: DEEP — SHALLOW clones can't take a WHERE filter (they're metadata pointers, not row copies). A WHERE on a SHALLOW request is ignored with a WARN log line.
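
The where_clauses lookup and the swap to CTAS can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes that an exact "schema.table" key takes precedence over the "*" wildcard; this section does not say how the two combine, so the real engine may behave differently.

```python
import logging

log = logging.getLogger("clone_sketch")


def resolve_where(where_clauses, schema, table):
    """Return the clause for schema.table, falling back to the "*" wildcard."""
    return where_clauses.get(f"{schema}.{table}", where_clauses.get("*"))


def render_clone_sql(src_catalog, dst_catalog, schema, table,
                     clone_type, where_clauses):
    """Build the per-table statement: CTAS when filtered, CLONE otherwise."""
    src = f"`{src_catalog}`.`{schema}`.`{table}`"
    dst = f"`{dst_catalog}`.`{schema}`.`{table}`"
    clause = resolve_where(where_clauses, schema, table)
    if clause and clone_type == "SHALLOW":
        # SHALLOW clones are metadata pointers, so the filter cannot apply.
        log.warning("WHERE clause ignored for SHALLOW clone of %s", src)
        clause = None
    if clause:
        # Filtered path: CTAS copies rows but not the source Delta history.
        return (f"CREATE TABLE IF NOT EXISTS {dst} "
                f"AS SELECT * FROM {src} WHERE {clause}")
    return f"CREATE TABLE IF NOT EXISTS {dst} {clone_type} CLONE {src}"


filters = {"bronze.events": "date >= '2026-01-01'", "*": "is_deleted = false"}
events_sql = render_clone_sql(
    "production", "analyst_sandbox", "bronze", "events", "DEEP", filters
)
```

The clause is interpolated as-is here because it is operator-supplied SQL by design; anything that accepts it from a less trusted caller would need its own validation.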


Inline TBLPROPERTIES override (clone_tbl_properties: {…})

When to use: You need a property that has to be set on the first commit — applying it post-clone via ALTER TABLE is too late for the property to take effect.

The canonical case is archival retention: delta.logRetentionDuration = '3650 days' controls how long Delta keeps history. If you set it via ALTER TABLE after the clone, the first commit has already happened with the default 30-day retention, and the longer window only applies to future commits. To extend retention on the existing clone commit, the property has to be inline on the CLONE statement itself.

clone_tbl_properties:
  delta.logRetentionDuration: "3650 days"
  delta.deletedFileRetentionDuration: "3650 days"

-- Clone-Xs renders this onto every per-table CLONE:
CREATE TABLE IF NOT EXISTS `dst`.`schema`.`table`
DEEP CLONE `src`.`schema`.`table`
TBLPROPERTIES (
  'delta.logRetentionDuration' = '3650 days',
  'delta.deletedFileRetentionDuration' = '3650 days'
)

The override applies to every table in the clone — there's no per-table syntax. For per-table property overrides, use a post-clone hook or ALTER TABLE follow-up SQL. Single quotes in property values are SQL-escaped by doubling.
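
The rendering and quote-doubling described above can be sketched in Python. The function names are hypothetical and the layout of the generated SQL is only meant to resemble the statement shown earlier; the actual renderer in Clone-Xs may format it differently.

```python
def sql_quote(value):
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def render_tblproperties(props):
    """Render a dict as a TBLPROPERTIES (...) clause; empty input yields ''."""
    if not props:
        return ""
    pairs = ",\n".join(
        f"{sql_quote(key)} = {sql_quote(val)}" for key, val in props.items()
    )
    return f"TBLPROPERTIES (\n{pairs}\n)"


def render_deep_clone(dst, src, props):
    """Build a per-table DEEP CLONE statement with inline properties."""
    sql = f"CREATE TABLE IF NOT EXISTS {dst}\nDEEP CLONE {src}"
    clause = render_tblproperties(props)
    return f"{sql}\n{clause}" if clause else sql


statement = render_deep_clone(
    "`dst`.`schema`.`table`",
    "`src`.`schema`.`table`",
    {"delta.logRetentionDuration": "3650 days"},
)
```

Since the same dictionary is applied to every table, the clause only needs to be rendered once per clone job and can then be appended to each statement.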


Wizard control reference

Every control on the /clone wizard maps to a documented section. Use this as the canonical "did we cover X" reference.

| Wizard control | Backend field | Doc home |
| --- | --- | --- |
| Clone Type (DEEP / SHALLOW) | clone_type | Deep vs shallow clone |
| Load Type (FULL / INCREMENTAL) | load_type | Full vs incremental vs selective load |
| Target Format (DELTA / ICEBERG) | target_format | Target format — UniForm |
| Physical Iceberg target | iceberg_physical | Physical Iceberg target |
| Dry-run | dry_run | Dry run |
| Use Serverless Compute | serverless + volume | Serverless execution |
| Schema-only mode | schema_only | See "Schema-only" pattern in demo-data.md and environments.md |
| Force re-clone | force_reclone | Stage 3 — Tables (table) |
| WHERE clause | where_clauses | WHERE-clause filtered clone |
| Time travel (timestamp / version) | as_of_timestamp, as_of_version | Time travel |
| Schema include / exclude | include_schemas, exclude_schemas | Schema filtering |
| Tables include / exclude regex | include_tables_regex, exclude_tables_regex | Regex table filtering |
| Tag-based filter | required_schema_tags | Tag-based filtering |
| Scope picker | include_objects | Scope Picker — partial-catalog clones |
| Parallel tables / workers | parallel_tables, max_workers, max_parallel_queries | Parallel processing |
| Order by size | order_by_size | Table size ordering |
| Rate limit (max RPS / throttle) | max_rps, throttle | Rate limiting |
| Runtime guardrails | max_duration_min, max_tables | (CLI / API only; see field comments in api/models/clone.py) |
| Snapshot ID | source_snapshot_id | Snapshots |
| Pre-clone quiesce source | quiesce_source | Pre-clone source quiesce |
| Auto-mask PII | auto_mask_pii | Auto-mask PII |
| Auto-retry transient failures | enable_retry | Auto-retry transient failures |
| Compare DQ after clone | compare_dq_after_clone, dq_drift_rollback_pct | Compare DQ after clone |
| TBLPROPERTIES override | clone_tbl_properties | Inline TBLPROPERTIES override |
| Copy options (permissions / ownership / tags / properties / security / constraints / comments) | copy_* | Stage 5 — Metadata replay |
| Validate after clone | validate_after_clone, validate_checksum | See scheduling.md and cicd.md |
| Auto-rollback | auto_rollback, rollback_threshold | See safety.md |
| Checkpointing | checkpoint | See safety.md and advanced-features.md |
| Approval workflow | require_approval | See advanced-clone.md |
| Impact check | impact_check | (CLI / API only; runs the existing impact-analysis surface against the clone scope) |
| Skip unused tables | skip_unused | See advanced-features.md |
| TTL policy | ttl | See advanced-clone.md and advanced-features.md |
| Clone template | template | See advanced-clone.md and advanced-features.md |
| Cross-workspace target | target_workspace, target_workspaces, fanout_max_parallel | See advanced-clone.md for cross-workspace + fanout |
| Generate report | generate_report | (CLI / API only; emits a JSON + HTML run report into reports/) |
| Show progress bar | show_progress | Reading the clone log |
| Verbose logging | verbose | (CLI flag; increases log level to DEBUG) |