Clone → Xs
Enterprise-grade Unity Catalog Toolkit for Databricks — clone, compare, sync, and manage catalogs from CLI, Web UI, Desktop App, Databricks App, or REST API.
What it does
Clone-Xs replicates an entire Unity Catalog catalog into a new catalog, in the same workspace or a different one, preserving:
- Schemas — all schemas are recreated in the destination
- Tables — deep or shallow Delta Lake clone with time travel support
- Views — view definitions recreated with catalog references updated
- Functions — UDFs recreated in the destination
- Volumes — volume metadata and managed volume contents
- Permissions — grants, ownership, and access controls
- Tags — catalog, schema, table, and column-level tags
- Security policies — row filters and column masks
- Constraints — primary keys, foreign keys, not-null constraints
- Comments — table and column-level comments
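To make the deep/shallow and time-travel options for tables concrete, here is a minimal sketch of the kind of Delta `CLONE` statement such a tool could issue per table. The helper names (`quote`, `fqn`, `clone_statement`) and the choice of `CREATE OR REPLACE TABLE` are illustrative assumptions, not Clone-Xs internals; only the Databricks SQL clause forms (`DEEP CLONE`, `SHALLOW CLONE`, `VERSION AS OF`, `TIMESTAMP AS OF`) are standard.

```python
from datetime import datetime
from typing import Optional


def quote(name: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def fqn(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog name: catalog.schema.table."""
    return ".".join(quote(part) for part in (catalog, schema, table))


def clone_statement(
    source_catalog: str,
    dest_catalog: str,
    schema: str,
    table: str,
    deep: bool = True,
    version: Optional[int] = None,
    timestamp: Optional[datetime] = None,
) -> str:
    """Hypothetical helper: render one Delta CLONE statement for a table."""
    if version is not None and timestamp is not None:
        raise ValueError("pass either version or timestamp, not both")
    mode = "DEEP" if deep else "SHALLOW"
    sql = (
        f"CREATE OR REPLACE TABLE {fqn(dest_catalog, schema, table)} "
        f"{mode} CLONE {fqn(source_catalog, schema, table)}"
    )
    if version is not None:
        # Time travel by Delta table version.
        sql += f" VERSION AS OF {int(version)}"
    elif timestamp is not None:
        # Time travel by timestamp, formatted from a datetime to avoid
        # interpolating arbitrary strings into the SQL literal.
        sql += f" TIMESTAMP AS OF '{timestamp.strftime('%Y-%m-%d %H:%M:%S')}'"
    return sql
```

A deep clone copies the data files, while a shallow clone only references the source files, so the same helper covers both by switching one keyword.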
Key capabilities
| Capability | Description |
|---|---|
| Deep & Shallow Clone | Full data copy or metadata-only reference clone |
| Cross-Format Clone (Iceberg) | Land Delta destinations as Iceberg-readable via UniForm, or as physical Iceberg tables (USING iceberg). Iceberg sources clone to Delta with hidden-partitioning preflight refusal and auto-CTAS recovery for partition-evolution failures. See clone guide — target format |
| Convert table format (in-place) | Rewrite Iceberg / Parquet sources to Delta at the same FQN. D2 will add Delta→Iceberg / Parquet target cells; Hudi gated behind D3. Destructive on source — confirmation gate at API + UI. Per-target audit row in convert_operations Delta table. See Convert table format guide |
| Incremental Load | Only clone new objects added since last run |
| Time Travel | Clone tables at a specific version or timestamp |
| Data Filtering | Clone subsets with --where and --table-filter |
| Schema Drift Detection | Detect changes between source and destination |
| Cross-Workspace & Cross-Cloud Migration | Delta Sharing + DEEP CLONE pipeline to migrate a catalog — schemas, tables, views, SQL functions, volumes + files, grants, tags, ownership — across Databricks workspaces on AWS, Azure, or GCP. Saved target connections live in browser localStorage, never on the server — no PATs in clone_config.yaml, nothing for git secret-scanning to flag. Includes a same-metastore preflight check that fails fast (1–2s) instead of creating orphan recipients when both workspaces share a single UC metastore |
| Dry Run & Execution Plan | Preview all SQL with cost estimates |
| Auto-Rollback | Automatically undo clone if validation fails |
| Delta RESTORE Rollback | Non-destructive rollback using RESTORE TABLE ... TO VERSION AS OF with pre-clone version tracking |
| Checkpointing | Resume long clones from where they left off |
| Scheduled Cloning | Cron or interval-based scheduling with drift detection |
| Throttle Controls | Rate-limit clones with low/medium/high/max presets |
| Clone Templates | One-command cloning with predefined profiles |
| RBAC & Approvals | Control who can clone what, with approval workflows |
| TTL Policies | Auto-expire cloned catalogs after N days |
| Usage Analysis | Find and skip unused tables |
| Incremental Sync | Sync only changed tables using Delta version history |
| Dependency Analysis | View/function dependency graphs with creation order |
| Slack Bot | Trigger and monitor clone operations from Slack |
| Data Sampling | Preview and compare table data between catalogs |
| Metrics & History | Track throughput, failure rates, and operation history |
| Delta Audit Logging | Every operation logs to run_logs, clone_operations, and clone_metrics |
| Compliance Reports | Audit-ready reports covering PII, permissions, lineage |
| RTBF / Right to Be Forgotten | GDPR Article 17 erasure workflow — discover, delete, VACUUM, verify, certificate across all cloned catalogs. 34 legal bases from 18 jurisdictions |
| DSAR / Right of Access | GDPR Article 15 access request — discover and export subject data as CSV/JSON/Parquet with audit trail and 30-day deadline tracking |
| Clone Pipelines | Chain operations into reusable workflows — clone, mask, validate, notify, vacuum. 4 built-in templates, 3 failure policies, execution history |
| Data Observability | Unified health dashboard (0-100 score) combining freshness, volume, anomaly, SLA, and data quality metrics |
| REST API Server | Expose clone operations as HTTP endpoints |
| Plugin System | Extend with custom plugins from a marketplace |
| Pre-flight Checks | Validate connectivity, permissions, and config before cloning |
| Cost Estimation | Estimate storage and compute costs |
| Terraform / Pulumi Export | Generate IaC from your catalog |
| Notebook API | Run from Databricks notebooks via wheel or repo import |
| Storage Metrics | Analyze per-table storage (active, vacuumable, time-travel) via ANALYZE TABLE ... COMPUTE STORAGE METRICS |
| OPTIMIZE & VACUUM | Run table maintenance directly from the UI with multi-select and dry-run |
| Create Databricks Job | Create persistent scheduled jobs from UI or CLI — no manual JSON needed |
| Desktop App | Native macOS/Windows app via Electron — no terminal required |
| Databricks App | Deploy as a native Databricks App with automatic service principal auth |
| Demo Data Generator | Generate realistic demo catalogs with 10 industries, 200+ tables, medallion architecture, and comprehensive enrichment (PII tags, FK constraints, partitioning, SCD2, volumes). Also generates unstructured corpora — Documents (PDF/DOCX/PPTX/XLSX/EML, optional AI-drafted narrative), Media (PNG/WAV/MP4), Knowledge (wiki/Q&A/chat), Logs (NGINX/JSON/syslog/OTel), and Code (Python/JS/Java repos) — into UC Volumes or direct Delta tables for RAG, observability, and code-search demos. |
| Marketplace | Publish to Databricks Marketplace as Solution Accelerator |
| Analytics Dashboard | 10 stat cards, 5 charts, catalog health scores, pinned favorites, notifications |
| Notification Center | Real-time bell icon showing clone completions, failures, and TTL warnings |
| Catalog Health Score | Per-catalog health scoring (0-100) based on failure rates and operation history |
| Pinned Catalog Pairs | Quick-access favorites for frequently used source→destination pairs |
| Page State Persistence | Navigate away and come back — scan results are preserved across all pages |
| Auto Storage Location | Clone and Create Job pages auto-populate storage location from source catalog |
| Template Config Pass-through | Templates pre-fill all clone checkboxes, not just clone type |
| Master Data Management | First open-source Databricks-native MDM — entity resolution (6 match types), golden records, survivorship rules, data stewardship with SLA tracking, hierarchy management, industry templates (Healthcare, Financial, Retail, Manufacturing), reference data, DQ scorecards, consent management. 19 pages, 6 Delta tables, 21 API endpoints |
| Databricks Jobs Cloning | Clone job definitions within or across workspaces — with diff view, backup/restore, and cross-workspace migration |
| DLT Pipeline Cloning | Clone Delta Live Tables pipeline definitions — same workspace or cross-workspace with credential handling |
| 8-Portal Architecture | Clone-Xs, Governance, Data Quality, FinOps, Security, Automation, Infrastructure, MDM — each with dedicated sidebar and pages |
| DQX Integration | Databricks Labs DQX — profile tables, auto-generate rules, run check suites, persist results to Delta. Pairs with Trust Scores and Coverage Map |
| ODCS Data Contracts | Open Data Contract Standard — full CRUD with YAML import/export, validation, and DQX-backed enforcement |
| Trust Score Engine | Composite per-table 0-100 score from six dimensions (DQ pass rate, freshness, anomaly history, PII coverage, schema stability, lineage) with configurable weights |
| Compliance Automation | Map DQ controls to SOC2 / GDPR / HIPAA / CCPA / DORA frameworks with automated evidence collection and audit-ready reports |
| COPQ — Cost of Poor Data Quality | Quantify pipeline reruns, SLA breaches, engineer time, and downstream impact in dollars |
| Anomaly Correlation | Group correlated anomalies under root-cause groups across upstream/downstream tables |
| NL Rule Builder | Translate plain-English descriptions into executable DQ rule configs via the configured AI backend |
| Alert Routing | Smart deduplication, correlation, priority-ranking, and digest-mode delivery to teams via channels |
| Remediation Playbooks | If-this-then-that automation triggered on DQ failures, anomalies, SLA breaches, freshness staleness, schema drift |
| FinOps Suite | Billing, breakdown, compute, query costs, recommendations, storage optimization, budgets, trend dashboards backed by Databricks system tables |
| Ephemeral Environments | One-click sandbox creation with auto PII masking, DQ validation, cost budgets, and TTL-based cleanup |
| Continuous Sync | Streaming replication via Structured Streaming jobs (PREVIEW) for change-data-capture sync |
| Data Products Catalog | Internal marketplace for publishing and subscribing to curated data products with docs, quality guarantees, and SLAs |
| Lakehouse Federation | Browse foreign catalogs, manage connections, migrate to managed Delta |
| ML Assets Cloning | Clone Models + Feature Tables + Vector Indexes + Serving Endpoints across catalogs/workspaces |
| Advanced Tables | Clone Materialized Views, Streaming Tables, and Online Tables |
| Streaming Demo Profiles | 10 IoT device profiles (sensor, machine, car, smart-meter, wearable, POS, turbine, ATM, server, clickstream) emit JSON to UC Volume; auto-create Bronze + schedule as Databricks Job |
| Durable Job Tracking | Long-running operations (clone, sync, demo-data, IaC, batch reconciliation) survive page navigation and browser refresh — progress, logs, and chart history resume from server state |
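The Delta RESTORE Rollback row in the table above relies on knowing each destination table's version before the clone ran. The sketch below shows one way that bookkeeping could turn into rollback statements. The function name `rollback_plan` and the input shape (a mapping from table name to pre-clone version, with `None` for tables that did not exist yet) are assumptions made for illustration; only the `RESTORE TABLE ... TO VERSION AS OF` syntax comes from the table.

```python
from typing import Dict, List, Optional, Tuple


def rollback_plan(
    pre_clone_versions: Dict[str, Optional[int]],
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split tracked tables into RESTORE statements
    and a list of tables that the clone created from scratch.

    pre_clone_versions maps a fully qualified table name to the Delta
    version recorded before the clone, or None if the table was new.
    """
    restores: List[str] = []
    newly_created: List[str] = []
    for table in sorted(pre_clone_versions):
        version = pre_clone_versions[table]
        if version is None:
            # No earlier version exists, so RESTORE cannot apply here.
            # What to do with these tables is left to the caller.
            newly_created.append(table)
        else:
            restores.append(f"RESTORE TABLE {table} TO VERSION AS OF {version}")
    return restores, newly_created
```

Because `RESTORE` writes a new commit rather than deleting history, the rolled-back state stays visible in the table's Delta log, which is what makes this style of rollback non-destructive.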
Quick install
pip install clone-xs
Verify:
clxs --help
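The same check can be scripted, for example in a CI step or a notebook cell. This is a small sketch using only the Python standard library; the function name `check_install` is hypothetical, and the package and command names are taken from the install commands above.

```python
import shutil
from importlib import metadata
from typing import Dict, Optional


def check_install(
    package: str = "clone-xs", command: str = "clxs"
) -> Dict[str, Optional[str]]:
    """Report the installed package version and where the CLI is on PATH.

    Either value is None when the package or the command is not found.
    """
    try:
        version: Optional[str] = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {"package_version": version, "command_path": shutil.which(command)}


if __name__ == "__main__":
    print(check_install())
```

If `package_version` is set but `command_path` is `None`, the package is installed but its scripts directory is probably not on `PATH`.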
Why multiple run modes?
Clone-Xs provides several deployment options because different teams and workflows have different needs. Here's when to use each:
| Mode | How to run | Best for |
|---|---|---|
| CLI | clxs clone --source X --dest Y | Engineers who prefer the terminal. Scriptable, pipeable, works in CI/CD pipelines. Fastest for one-off clones. |
| Web UI | make web-start → http://localhost:3000 | Teams who need a visual interface. 33 pages covering clone, diff, sync, storage metrics, and more. Great for demos and non-technical stakeholders. |
| Desktop App | make desktop-dev | Users who want a native app without managing terminals or servers. Double-click to launch — the backend starts automatically. Available for macOS and Windows. |
| Databricks App | make deploy-dbx-app | Production teams who want Clone-Xs embedded in their Databricks workspace. Uses the workspace service principal for authentication, so no personal access tokens (PATs) are needed. Accessible to anyone with workspace access. |
| Wheel Package | pip install clone-xs | Notebook users and data engineers. Import Clone-Xs as a Python library in Databricks notebooks or jobs. Install once, call from any notebook cell. |
| Serverless Job | clxs clone --serverless --volume /Volumes/... | Cost-conscious teams. Uploads the wheel to a UC Volume and submits a serverless notebook job: no SQL warehouse cost, auto-scaling, and no cluster start-up wait. |
| REST API | clxs serve → http://localhost:8000/docs | Platform teams building internal tools. Embed Clone-Xs operations into custom dashboards, Slack bots, or CI/CD workflows via HTTP endpoints. |
| Databricks Job | clxs create-job --source X --dest Y --schedule "..." | Scheduled production clones. Creates a persistent Databricks Job with cron scheduling, email alerts, retries, and tags — runs unattended. |
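For scripted or CI/CD use of the CLI mode, the command line from the table can be assembled in Python and handed to `subprocess`. The sketch below only uses flags that appear in the table (`--source`, `--dest`, `--serverless`, `--volume`); the wrapper name `clone_command` is hypothetical, and the rule that `--serverless` needs a `--volume` is an assumption based on how the two flags appear together above.

```python
import shlex
from typing import List, Optional


def clone_command(
    source: str,
    dest: str,
    serverless: bool = False,
    volume: Optional[str] = None,
) -> List[str]:
    """Hypothetical wrapper: build the argv list for `clxs clone`."""
    cmd = ["clxs", "clone", "--source", source, "--dest", dest]
    if serverless:
        # Assumption: serverless runs need a UC Volume path for the wheel.
        if not volume:
            raise ValueError("serverless mode needs a volume path")
        cmd += ["--serverless", "--volume", volume]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try.
    print(shlex.join(clone_command("X", "Y")))
```

To actually run the clone, pass the list to `subprocess.run(clone_command("X", "Y"), check=True)`. Keeping the arguments as a list avoids shell quoting problems with catalog names.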
Next steps
- Quickstart — clone your first catalog in 5 minutes
- Setup — installation and configuration
- Authentication — configure credentials
- Advanced Cloning — data filtering, TTL, execution plans, plugins
- Safety & Rollback — auto-rollback, checkpointing, config lint, impact analysis
- Governance — RBAC, approval workflows, compliance reports
- RTBF — Right to Be Forgotten / GDPR Article 17 erasure workflows
- DSAR — Data Subject Access Request / GDPR Article 15
- Clone Pipelines — chain clone, mask, validate, notify into workflows
- Data Observability — unified health dashboard
- Delta Live Tables — discover, clone, and monitor DLT pipelines
- Scheduling & Automation — scheduled clones, templates, API server, throttling
- Analytics & Insights — usage analysis, metrics, history, data preview
- Storage Metrics — analyze and optimize table storage
- Create Job — schedule clone operations as Databricks Jobs
- Desktop App — run as a native desktop application
- Databricks App — deploy to your Databricks workspace
- Web UI — all 60+ pages across 8 portals
- CLI Reference — full command reference
- API Reference — REST API endpoint reference
- Changelog — version history