Clone → Xs

Enterprise-grade Unity Catalog Toolkit for Databricks — clone, compare, sync, and manage catalogs from CLI, Web UI, Desktop App, Databricks App, or REST API.

What it does

Clone-Xs replicates an entire Unity Catalog catalog to a new catalog in the same or a different workspace, preserving:

  • Schemas — all schemas are recreated in the destination
  • Tables — deep or shallow Delta Lake clone with time travel support
  • Views — view definitions recreated with catalog references updated
  • Functions — UDFs recreated in the destination
  • Volumes — volume metadata and managed volume contents
  • Permissions — grants, ownership, and access controls
  • Tags — catalog, schema, table, and column-level tags
  • Security policies — row filters and column masks
  • Constraints — primary keys, foreign keys, not-null constraints
  • Comments — table and column-level comments
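
To make the deep/shallow and time-travel options concrete, here is a minimal sketch of the Delta Lake clone statement a table clone could reduce to. It is not Clone-Xs's actual implementation: the helper `build_clone_sql` and the catalog, schema, and table names are hypothetical, and it assumes plain three-part names that need no backtick quoting. Only the `CREATE TABLE ... [DEEP | SHALLOW] CLONE ... [VERSION AS OF | TIMESTAMP AS OF]` syntax comes from Databricks SQL.

```python
from typing import Optional


def build_clone_sql(
    source: str,
    dest: str,
    deep: bool = True,
    version: Optional[int] = None,
    timestamp: Optional[str] = None,
) -> str:
    """Build a Delta CLONE statement (hypothetical helper, not a Clone-Xs API)."""
    if version is not None and timestamp is not None:
        raise ValueError("Pass either version or timestamp, not both")

    # DEEP copies the data files; SHALLOW only references the source files.
    mode = "DEEP" if deep else "SHALLOW"
    sql = f"CREATE TABLE IF NOT EXISTS {dest} {mode} CLONE {source}"

    # Optional time travel: clone the table as of a version or timestamp.
    if version is not None:
        sql += f" VERSION AS OF {version}"
    elif timestamp is not None:
        sql += f" TIMESTAMP AS OF '{timestamp}'"
    return sql


# Hypothetical names, for illustration only.
print(build_clone_sql("prod.sales.orders", "dev.sales.orders"))
print(build_clone_sql("prod.sales.orders", "dev.sales.orders", deep=False, version=12))
```

The generated strings could be run through any SQL execution path (a notebook cell or a SQL warehouse). A real tool would also need to quote identifiers and handle the other object types listed above.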

Key capabilities

| Capability | Description |
| --- | --- |
| Deep & Shallow Clone | Full data copy or metadata-only reference clone |
| Cross-Format Clone (Iceberg) | Land Delta destinations as Iceberg-readable via UniForm, or as physical Iceberg tables (USING iceberg). Iceberg sources clone to Delta with hidden-partitioning preflight refusal and auto-CTAS recovery for partition-evolution failures. See clone guide: target format |
| Convert table format (in-place) | Rewrite Iceberg / Parquet sources to Delta at the same FQN. D2 will add Delta→Iceberg / Parquet target cells; Hudi gated behind D3. Destructive on source; confirmation gate at API + UI. Per-target audit row in convert_operations Delta table. See Convert table format guide |
| Incremental Load | Only clone new objects added since last run |
| Time Travel | Clone tables at a specific version or timestamp |
| Data Filtering | Clone subsets with --where and --table-filter |
| Schema Drift Detection | Detect changes between source and destination |
| Cross-Workspace & Cross-Cloud Migration | Delta Sharing + DEEP CLONE pipeline to migrate a catalog (schemas, tables, views, SQL functions, volumes + files, grants, tags, ownership) across Databricks workspaces on AWS, Azure, or GCP. Saved target connections live in browser localStorage, never on the server: no PATs in clone_config.yaml, nothing for git secret-scanning to flag. Includes a same-metastore preflight check that fails fast (1–2s) instead of creating orphan recipients when both workspaces share a single UC metastore |
| Dry Run & Execution Plan | Preview all SQL with cost estimates |
| Auto-Rollback | Automatically undo clone if validation fails |
| Delta RESTORE Rollback | Non-destructive rollback using RESTORE TABLE ... TO VERSION AS OF with pre-clone version tracking |
| Checkpointing | Resume long clones from where they left off |
| Scheduled Cloning | Cron or interval-based scheduling with drift detection |
| Throttle Controls | Rate-limit clones with low/medium/high/max presets |
| Clone Templates | One-command cloning with predefined profiles |
| RBAC & Approvals | Control who can clone what, with approval workflows |
| TTL Policies | Auto-expire cloned catalogs after N days |
| Usage Analysis | Find and skip unused tables |
| Incremental Sync | Sync only changed tables using Delta version history |
| Dependency Analysis | View/function dependency graphs with creation order |
| Slack Bot | Trigger and monitor clone operations from Slack |
| Data Sampling | Preview and compare table data between catalogs |
| Metrics & History | Track throughput, failure rates, and operation history |
| Delta Audit Logging | Every operation logs to run_logs, clone_operations, and clone_metrics |
| Compliance Reports | Audit-ready reports covering PII, permissions, lineage |
| RTBF / Right to Be Forgotten | GDPR Article 17 erasure workflow: discover, delete, VACUUM, verify, certificate across all cloned catalogs. 34 legal bases from 18 jurisdictions |
| DSAR / Right of Access | GDPR Article 15 access request: discover and export subject data as CSV/JSON/Parquet with audit trail and 30-day deadline tracking |
| Clone Pipelines | Chain operations into reusable workflows: clone, mask, validate, notify, vacuum. 4 built-in templates, 3 failure policies, execution history |
| Data Observability | Unified health dashboard (0-100 score) combining freshness, volume, anomaly, SLA, and data quality metrics |
| REST API Server | Expose clone operations as HTTP endpoints |
| Plugin System | Extend with custom plugins from a marketplace |
| Pre-flight Checks | Validate connectivity, permissions, and config before cloning |
| Cost Estimation | Estimate storage and compute costs |
| Terraform / Pulumi Export | Generate IaC from your catalog |
| Notebook API | Run from Databricks notebooks via wheel or repo import |
| Storage Metrics | Analyze per-table storage (active, vacuumable, time-travel) via ANALYZE TABLE COMPUTE STORAGE METRICS |
| OPTIMIZE & VACUUM | Run table maintenance directly from the UI with multi-select and dry-run |
| Create Databricks Job | Create persistent scheduled jobs from UI or CLI; no manual JSON needed |
| Desktop App | Native macOS/Windows app via Electron; no terminal required |
| Databricks App | Deploy as a native Databricks App with automatic service principal auth |
| Demo Data Generator | Generate realistic demo catalogs with 10 industries, 200+ tables, medallion architecture, and comprehensive enrichment (PII tags, FK constraints, partitioning, SCD2, volumes). Also generates unstructured corpora into UC Volumes or direct Delta tables for RAG, observability, and code-search demos: Documents (PDF/DOCX/PPTX/XLSX/EML, optional AI-drafted narrative), Media (PNG/WAV/MP4), Knowledge (wiki/Q&A/chat), Logs (NGINX/JSON/syslog/OTel), and Code (Python/JS/Java repos). |
| Marketplace | Publish to Databricks Marketplace as Solution Accelerator |
| Analytics Dashboard | 10 stat cards, 5 charts, catalog health scores, pinned favorites, notifications |
| Notification Center | Real-time bell icon showing clone completions, failures, and TTL warnings |
| Catalog Health Score | Per-catalog health scoring (0-100) based on failure rates and operation history |
| Pinned Catalog Pairs | Quick-access favorites for frequently used source→destination pairs |
| Page State Persistence | Navigate away and come back: scan results are preserved across all pages |
| Auto Storage Location | Clone and Create Job pages auto-populate storage location from source catalog |
| Template Config Pass-through | Templates pre-fill all clone checkboxes, not just clone type |
| Master Data Management | First open-source Databricks-native MDM: entity resolution (6 match types), golden records, survivorship rules, data stewardship with SLA tracking, hierarchy management, industry templates (Healthcare, Financial, Retail, Manufacturing), reference data, DQ scorecards, consent management. 19 pages, 6 Delta tables, 21 API endpoints |
| Databricks Jobs Cloning | Clone job definitions within or across workspaces, with diff view, backup/restore, and cross-workspace migration |
| DLT Pipeline Cloning | Clone Delta Live Tables pipeline definitions, in the same workspace or cross-workspace with credential handling |
| 8-Portal Architecture | Clone-Xs, Governance, Data Quality, FinOps, Security, Automation, Infrastructure, MDM, each with dedicated sidebar and pages |
| DQX Integration | Databricks Labs DQX: profile tables, auto-generate rules, run check suites, persist results to Delta. Pairs with Trust Scores and Coverage Map |
| ODCS Data Contracts | Open Data Contract Standard: full CRUD with YAML import/export, validation, and DQX-backed enforcement |
| Trust Score Engine | Composite per-table 0-100 score from six dimensions (DQ pass rate, freshness, anomaly history, PII coverage, schema stability, lineage) with configurable weights |
| Compliance Automation | Map DQ controls to SOC2 / GDPR / HIPAA / CCPA / DORA frameworks with automated evidence collection and audit-ready reports |
| COPQ (Cost of Poor Data Quality) | Quantify pipeline reruns, SLA breaches, engineer time, and downstream impact in dollars |
| Anomaly Correlation | Group correlated anomalies under root-cause groups across upstream/downstream tables |
| NL Rule Builder | Translate plain-English descriptions into executable DQ rule configs via the configured AI backend |
| Alert Routing | Smart deduplication, correlation, priority-ranking, and digest-mode delivery to teams via channels |
| Remediation Playbooks | If-this-then-that automation triggered on DQ failures, anomalies, SLA breaches, freshness staleness, schema drift |
| FinOps Suite | Billing, breakdown, compute, query costs, recommendations, storage optimization, budgets, trend dashboards backed by Databricks system tables |
| Ephemeral Environments | One-click sandbox creation with auto PII masking, DQ validation, cost budgets, and TTL-based cleanup |
| Continuous Sync | Streaming replication via Structured Streaming jobs (PREVIEW) for change-data-capture sync |
| Data Products Catalog | Internal marketplace for publishing and subscribing to curated data products with docs, quality guarantees, and SLAs |
| Lakehouse Federation | Browse foreign catalogs, manage connections, migrate to managed Delta |
| ML Assets Cloning | Clone Models + Feature Tables + Vector Indexes + Serving Endpoints across catalogs/workspaces |
| Advanced Tables | Clone Materialized Views, Streaming Tables, and Online Tables |
| Streaming Demo Profiles | 10 IoT device profiles (sensor, machine, car, smart-meter, wearable, POS, turbine, ATM, server, clickstream) emit JSON to UC Volume; auto-create Bronze + schedule as Databricks Job |
| Durable Job Tracking | Long-running operations (clone, sync, demo-data, IaC, batch reconciliation) survive page navigation and browser refresh; progress, logs, and chart history resume from server state |
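
The Dependency Analysis capability mentions deriving a creation order from view and function dependency graphs. As a rough illustration of the idea, and not the tool's actual algorithm, the sketch below uses a topological sort from Python's standard library. The object names and the `creation_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def creation_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return objects ordered so each one comes after everything it depends on.

    `dependencies` maps an object name to the set of objects it reads from.
    Raises graphlib.CycleError if the graph contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical catalog objects: a table, a SQL function, and two views.
deps = {
    "sales.orders": set(),
    "sales.fx_rate": set(),
    "sales.v_orders_clean": {"sales.orders"},
    "sales.v_daily_revenue": {"sales.v_orders_clean", "sales.fx_rate"},
}

order = creation_order(deps)
print(order)  # base objects first, dependent views last

try:
    creation_order({"a": {"b"}, "b": {"a"}})
except CycleError:
    print("circular dependency detected")
```

Recreating views in this order means a view is never created before the table, view, or function it selects from. The relative order of independent objects is not significant.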

Quick install

pip install clone-xs

Verify:

clxs --help

Why multiple run modes?

Clone-Xs provides several deployment options because different teams and workflows have different needs. Here's when to use each:

| Mode | How to run | Best for |
| --- | --- | --- |
| CLI | clxs clone --source X --dest Y | Engineers who prefer the terminal. Scriptable, pipeable, works in CI/CD pipelines. Fastest for one-off clones. |
| Web UI | make web-start (http://localhost:3000) | Teams who need a visual interface. 33 pages covering clone, diff, sync, storage metrics, and more. Great for demos and non-technical stakeholders. |
| Desktop App | make desktop-dev | Users who want a native app without managing terminals or servers. Double-click to launch; the backend starts automatically. Available for macOS and Windows. |
| Databricks App | make deploy-dbx-app | Production teams who want Clone-Xs embedded in their Databricks workspace. Uses workspace service principal for authentication; no PAT tokens needed. Accessible to anyone with workspace access. |
| Wheel Package | pip install clone-xs | Notebook users and data engineers. Import Clone-Xs as a Python library in Databricks notebooks or jobs. Install once, call from any notebook cell. |
| Serverless Job | clxs clone --serverless --volume /Volumes/... | Cost-conscious teams. Uploads the wheel to a UC Volume and submits a serverless notebook job: $0 warehouse cost, auto-scaling, zero cluster wait time. |
| REST API | clxs serve (http://localhost:8000/docs) | Platform teams building internal tools. Embed Clone-Xs operations into custom dashboards, Slack bots, or CI/CD workflows via HTTP endpoints. |
| Databricks Job | clxs create-job --source X --dest Y --schedule "..." | Scheduled production clones. Creates a persistent Databricks Job with cron scheduling, email alerts, retries, and tags; runs unattended. |
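
Because the CLI mode is described as scriptable and suited to CI/CD, here is a hedged sketch of wrapping the `clxs clone --source X --dest Y` command from the table in a Python script. The wrapper functions and catalog names are hypothetical. It uses only the flags shown above, and it assumes (not confirmed by this page) that `clxs` follows the usual convention of a non-zero exit code on failure.

```python
import shlex
import shutil
import subprocess
from typing import List


def build_clone_command(source: str, dest: str) -> List[str]:
    """Assemble the argument list for `clxs clone` (flags taken from the table above)."""
    return ["clxs", "clone", "--source", source, "--dest", dest]


def run_clone(source: str, dest: str) -> int:
    """Run the clone and return the process exit code (hypothetical CI helper)."""
    if shutil.which("clxs") is None:
        raise FileNotFoundError("clxs not found on PATH; install it with: pip install clone-xs")
    cmd = build_clone_command(source, dest)
    print("Running:", shlex.join(cmd))
    # Assumption: a non-zero return code signals a failed clone.
    return subprocess.run(cmd, check=False).returncode


# Hypothetical catalog names; this only prints the command, it does not run it.
print(shlex.join(build_clone_command("prod_catalog", "dev_catalog")))
```

In a pipeline step, calling `run_clone("prod_catalog", "dev_catalog")` and exiting with its return value would let the CI system mark the step as failed when the clone does.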

Next steps