Skip to main content

Advanced Cloning

Beyond basic catalog cloning, Clone Catalog supports data filtering, time-to-live policies, a plugin system for extensibility, and detailed execution plans for previewing exactly what will happen before a clone runs.

Data filtering

Clone only the rows you need by applying SQL filters at the table level.

Real-world scenario

Your production.sales.transactions table has 5 years of data — 2 billion rows. Your dev team only needs the last 3 months for testing. Data filtering lets you clone with a WHERE clause so the destination gets a smaller, faster-to-query subset.

Usage

# Apply a global filter to all tables
clxs clone \
--source production --dest dev \
--where "created_at >= '2026-01-01'"

# Per-table filters
clxs clone \
--source production --dest dev \
--table-filter "sales.transactions:created_at >= '2026-01-01'" \
--table-filter "sales.orders:region = 'US'" \
--table-filter "analytics.events:event_type IN ('click', 'purchase')"

Configuration

source_catalog: "production"
destination_catalog: "dev"

data_filters:
# Global filter applied to all tables (where applicable)
global_where: "created_at >= '2026-01-01'"

# Per-table overrides
table_filters:
- schema: "sales"
table: "transactions"
where: "created_at >= '2026-01-01' AND amount > 0"

- schema: "sales"
table: "orders"
where: "region = 'US'"

- schema: "analytics"
table: "events"
where: "event_type IN ('click', 'purchase')"

- schema: "hr"
table: "employees"
where: "status = 'active'"

How it works

When a data filter is applied, the tool uses CREATE TABLE AS SELECT (CTAS) instead of CREATE TABLE ... CLONE:

-- Without filter (standard clone)
CREATE TABLE dev.sales.transactions DEEP CLONE production.sales.transactions;

-- With filter (CTAS)
CREATE TABLE dev.sales.transactions AS
SELECT * FROM production.sales.transactions
WHERE created_at >= '2026-01-01';

Limitations

AspectStandard CloneFiltered Clone (CTAS)
Delta historyPreservedLost (new table)
Clone typeDeep or shallowAlways deep (full copy)
Time travelSupportedNot supported (new history starts)
PerformanceOptimized file copyFull data scan + write
Schema evolutionInheritedSnapshot at time of clone
caution

Filtered clones use CTAS, which creates a new table rather than a true Delta clone. This means the destination table loses Delta history, time travel capabilities, and cannot be incrementally updated. Use filtered clones only when you need a subset of data and don't need these features.


TTL policies

Set expiration dates on cloned catalogs so temporary environments are automatically cleaned up.

How Clone-Xs TTL differs from native Databricks

Databricks does not provide a native TTL mechanism at the catalog or schema level. The only built-in retention controls operate at the table level through Delta table properties:

-- Native Databricks: table-level only
ALTER TABLE my_table SET TBLPROPERTIES (
'delta.deletedFileRetentionDuration' = '30 days', -- controls VACUUM
'delta.logRetentionDuration' = '30 days' -- controls time travel
);

These properties control how long deleted data files and transaction log entries are retained for individual tables. They do not expire or drop the table itself — they only affect storage cleanup during VACUUM operations.

Clone-Xs fills this gap by implementing catalog-level TTL on top of Unity Catalog tags. Instead of managing retention on hundreds of individual tables, Clone-Xs sets a single expiration on the entire destination catalog and drops it (along with all its schemas, tables, views, and volumes) when the TTL expires.

AspectNative DatabricksClone-Xs TTL
LevelTable onlyCatalog
What it controlsFile retention for VACUUM and time travelCatalog lifecycle (creation to deletion)
What happens on expiryOld data files become eligible for VACUUM cleanupEntire catalog is dropped
ScopeIndividual table propertyAll objects in the catalog
SetupALTER TABLE ... SET TBLPROPERTIES per tableSingle --ttl flag at clone time
AutomationRequires scheduled VACUUM per tableBuilt-in cleanup scheduler (clxs ttl cleanup)
Use caseStorage optimizationEphemeral environment lifecycle management
info

Clone-Xs TTL and native Delta retention serve different purposes. Native retention keeps your tables healthy by cleaning up old files. Clone-Xs TTL manages the lifecycle of entire cloned environments. You may use both — Clone-Xs TTL to expire a PR catalog after 7 days, and native Delta retention to control file cleanup within long-lived catalogs.

Real-world scenario

Your CI pipeline creates a fresh cloned catalog for every pull request. Without TTL policies, these catalogs accumulate and waste storage. With a 7-day TTL, abandoned PR environments are automatically cleaned up, keeping costs under control.

Usage

# Clone with a 7-day TTL
clxs clone \
--source production --dest pr_1234 \
--ttl 7d

# Set TTL on an existing catalog
clxs ttl set --dest pr_1234 --days 14

# Check TTL status for all catalogs
clxs ttl check

# Clean up expired catalogs
clxs ttl cleanup

# Extend TTL (e.g., PR review is taking longer)
clxs ttl extend --dest pr_1234 --days 7

# Remove TTL (make permanent)
clxs ttl remove --dest pr_1234

Output (ttl check):

============================================================
TTL STATUS
============================================================
Catalog TTL Expires Status
pr_1234 7d 2026-03-21 09:15:00 ACTIVE (7 days left)
pr_1201 7d 2026-03-15 14:30:00 EXPIRING (1 day left)
pr_1198 7d 2026-03-12 11:00:00 EXPIRED
dev_sandbox 30d 2026-04-13 02:00:00 ACTIVE (30 days left)
============================================================
1 catalog expired. Run 'clxs ttl cleanup' to remove.

TTL duration formats

FormatExampleDescription
Nd7dN days
Nw2wN weeks
Nm6mN months
Ny1yN years
Nh12hN hours (for short-lived test envs)

Configuration

ttl:
default: "7d" # Default TTL for new clones
auto_cleanup: true # Automatically clean up expired catalogs
cleanup_schedule: "0 6 * * *" # Run cleanup daily at 6 AM
warning_threshold: "1d" # Notify when TTL is within 1 day
notification_channel: "slack" # Notify via Slack before expiry

How it works

Clone-Xs implements catalog-level TTL using Unity Catalog tags — since Databricks has no native catalog expiration feature. The lifecycle works in three stages:

1. Tag assignment — When you clone with --ttl, Clone-Xs calculates the expiration timestamp and stores it as a UC tag on the destination catalog:

ALTER CATALOG pr_1234 SET TAGS ('clone_catalog_ttl' = '2026-03-21T09:15:00Z');

2. Status monitoring — The clxs ttl check command reads the clone_catalog_ttl tag from all catalogs and compares against the current time to report status (ACTIVE, EXPIRING, or EXPIRED).

3. Cleanup — The clxs ttl cleanup command (or the automatic cleanup schedule) queries all catalogs for expired TTL tags and drops the entire catalog, including all schemas, tables, views, and volumes within it:

-- What cleanup executes for each expired catalog
DROP CATALOG IF EXISTS pr_1234 CASCADE;

Because this uses standard Unity Catalog tags, the TTL metadata is visible in Catalog Explorer and queryable via the Unity Catalog API — no external database or state file is required.

tip

In CI/CD pipelines, always set a TTL when creating PR-specific catalogs. Combine with --drop-catalog on the rollback command for immediate cleanup when the PR is merged or closed.


Clone templates

Pre-built configurations that set all clone options in one click.

Template long descriptions

Each of the 13 built-in templates now includes a detailed long_description field that explains:

  • The intended use case (e.g., "Create a lightweight dev environment with only the last 30 days of data")
  • Which settings are enabled and which are disabled (e.g., "Enables shallow clone, disables validation and TTL")
  • Why those defaults were chosen (e.g., "Shallow clone is used to minimize storage costs for short-lived environments")

The long description is displayed in the UI when a user expands a template card, giving full context before they commit to a configuration.

How templates populate the clone page

When a user selects a template, all configuration values are passed to the clone page as URL parameters. This means:

  • Every field on the clone form (source, destination, clone mode, TTL, filters, etc.) is pre-filled from the template
  • Users can review and override any value before starting the clone
  • The URL is shareable -- teammates can open the same pre-configured clone page

Storage Location auto-population

The clone page automatically populates the Storage Location field from the source catalog's metadata. When a source catalog is selected (either manually or via a template), the app reads the catalog's default storage location and fills it in as the suggested destination storage path. Users can override this value if needed.


Plugin system

Extend Clone Catalog with custom logic that runs at specific points in the clone lifecycle.

Real-world scenario

After every clone, your team needs to run OPTIMIZE on all large tables and ANALYZE TABLE to update statistics. Instead of adding post-clone hooks to every config file, you install the optimize plugin once and it runs automatically on every clone operation.

Usage

# List installed plugins
clxs plugin list

# Install a built-in plugin
clxs plugin install optimize

# Install from a directory
clxs plugin install /path/to/my-plugin

# Remove a plugin
clxs plugin remove optimize

# View plugin details
clxs plugin info optimize

Output (plugin list):

============================================================
INSTALLED PLUGINS
============================================================
Name Version Status Hooks
logging 1.0.0 active pre-clone, post-clone, on-error
optimize 1.0.0 active post-table, post-clone
analyze 1.0.0 active post-clone
============================================================

Built-in plugins

PluginDescriptionHooks
loggingEnhanced logging with structured JSON output and log rotationpre-clone, post-clone, on-error
optimizeRuns OPTIMIZE on cloned tables above a configurable size thresholdpost-table, post-clone
analyzeRuns ANALYZE TABLE ... COMPUTE STATISTICS after clone completespost-clone

Plugin directory structure

my-plugin/
plugin.yaml # Plugin metadata and configuration
plugin.py # Plugin implementation

plugin.yaml:

name: "my-custom-plugin"
version: "1.0.0"
description: "Custom post-clone processing"
author: "your-team@company.com"

hooks:
- event: "post-table"
handler: "on_table_cloned"
- event: "post-clone"
handler: "on_clone_complete"
- event: "on-error"
handler: "on_error"

config:
size_threshold_gb: 10
notify_on_complete: true

plugin.py:

from clone_catalog.plugins import PluginBase

class MyCustomPlugin(PluginBase):

def on_table_cloned(self, context):
"""Called after each table is cloned."""
table = context.table
size_gb = context.size_bytes / (1024 ** 3)

if size_gb > self.config.get("size_threshold_gb", 10):
self.execute_sql(
f"OPTIMIZE {context.dest_catalog}.{table.schema}.{table.name}"
)
self.logger.info(f"Optimized {table.schema}.{table.name} ({size_gb:.1f} GB)")

def on_clone_complete(self, context):
"""Called after the entire clone operation completes."""
self.logger.info(
f"Clone complete: {context.tables_cloned} tables in {context.duration}"
)

def on_error(self, context):
"""Called when an error occurs."""
self.logger.error(f"Error cloning {context.table}: {context.error}")

Auto-loading

Plugins in the ~/.clxs/plugins/ directory are loaded automatically. You can also specify a custom plugin directory:

plugins:
directory: "config/plugins/"
enabled:
- "logging"
- "optimize"
disabled:
- "analyze" # Installed but not active
note

Plugins run synchronously in the clone pipeline. A slow plugin will increase total clone time. Keep plugin logic lightweight or offload heavy work to async tasks.


Execution plan

Preview every SQL statement, cost estimate, and expected outcome before running a clone.

Real-world scenario

You are setting up a new clone pipeline for a 500-table catalog. Before running it against production, you want to see: exactly what SQL will be executed, how long it might take, and how much storage it will use. The execution plan gives you a detailed preview — like a database query plan but for your entire clone operation.

Usage

# Generate an execution plan (console output)
clxs plan \
--source production --dest staging

# Save plan as JSON (for CI/CD pipelines)
clxs plan \
--source production --dest staging \
--format json --output plan.json

# Save plan as HTML (shareable report)
clxs plan \
--source production --dest staging \
--format html --output plan.html

# Save plan as SQL (DBA review, manual execution)
clxs plan \
--source production --dest staging \
--format sql --output plan_statements.sql

# Capture SQL separately (in addition to console output)
clxs plan \
--source production --dest staging \
--capture-sql plan_statements.sql

Example: Real execution plan output

Running clxs plan --source production --dest staging produces:

======================================================================
CLONE EXECUTION PLAN
======================================================================
Source: production
Destination: staging
Clone Type: DEEP
Load Type: FULL

Total SQL Statements: 203
By Category:
CLONE : 166
CREATE_SCHEMA : 24
CREATE_VIEW : 4
CREATE_VOLUME : 9

Schemas: 24
Tables : 166 to clone, 4 to skip
Views : 4 to clone, 0 to skip
Functions : 0 to clone, 0 to skip
Volumes : 9 to clone, 0 to skip
======================================================================

Example: Captured SQL file

Using --capture-sql plan_statements.sql generates a .sql file with every write statement in execution order:

-- Clone-Xs Execution Plan
-- Source: production -> Destination: staging
-- Generated: 2026-03-15 22:00:36
-- Total statements: 203

-- [CREATE_SCHEMA] Statement 1
CREATE SCHEMA IF NOT EXISTS `staging`.`staging`;

-- [CREATE_SCHEMA] Statement 2
CREATE SCHEMA IF NOT EXISTS `staging`.`bronze`;

-- [CREATE_SCHEMA] Statement 3
CREATE SCHEMA IF NOT EXISTS `staging`.`assessment`;

-- ... (21 more schemas)

-- [CLONE] Statement 25
CREATE TABLE IF NOT EXISTS `staging`.`bronze`.`customer_data`
DEEP CLONE `production`.`bronze`.`customer_data`;

-- [CLONE] Statement 26
CREATE TABLE IF NOT EXISTS `staging`.`assessment`.`assessment_runs`
DEEP CLONE `production`.`assessment`.`assessment_runs`;

-- ... (164 more table clones)

-- [CREATE_VIEW] Statement 191
CREATE OR REPLACE VIEW `staging`.`test_reports`.`junit_summary_by_run`
AS SELECT run_id, COUNT(*) as total_tests, ...;

-- ... (3 more views)

-- [CREATE_VOLUME] Statement 195
CREATE VOLUME IF NOT EXISTS `staging`.`test_reports`.`test_artifacts`;

-- [CREATE_VOLUME] Statement 196
CREATE EXTERNAL VOLUME IF NOT EXISTS `staging`.`bronze`.`non-pii`
LOCATION 'abfss://non-pii@stdpdvuks01.dfs.core.windows.net/';

-- ... (7 more volumes)

Each statement is categorised with [CLONE], [CREATE_SCHEMA], [CREATE_VIEW], or [CREATE_VOLUME] — making it easy to review, filter, or run manually in a Databricks notebook.

Output formats

FormatFlagBest For
console (default)--format consoleQuick review in the terminal
json--format json --output plan.jsonCI/CD pipelines, programmatic validation
html--format html --output plan.htmlSharing with stakeholders, email reports
sql--format sql --output plan.sqlDBA review, manual execution
Capture SQL--capture-sql plan.sqlSave write statements alongside any format
tip

CI/CD usage: Run clxs plan --source prod --dest staging --format json --output plan.json in your pipeline as a validation step. Parse the JSON to check statement counts, verify no unexpected drops, and auto-approve or flag for review.

tip

DBA review: Use --capture-sql plan.sql to generate a SQL file that a DBA can review before approving the clone. The file contains only write statements (CREATE, CLONE, GRANT) — read queries used for discovery are excluded.