Architecture

High-level overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                              CLI (main.py)                                  │
│                                                                             │
│  clone │ diff │ compare │ validate │ sync │ rollback │ estimate │ snapshot  │
│  schema-drift │ generate-workflow │ export-iac │ init │ preflight │ search  │
│  stats │ profile │ monitor │ export │ config-diff │ completion │ auth      │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
                    ┌────────────┴─────────────┐
                    │     Config (config.py)   │
                    │  YAML + Profiles + CLI   │
                    │       Overrides          │
                    └────────────┬─────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
        ▼                        ▼                        ▼
┌───────────────┐    ┌───────────────────┐    ┌──────────────────┐
│  Clone Engine │    │  Analysis Engine  │    │  Export Engine   │
│               │    │                   │    │                  │
│ clone_catalog │    │ diff              │    │ export (CSV/JSON)│
│ clone_tables  │    │ compare           │    │ snapshot         │
│ clone_views   │    │ validation        │    │ terraform        │
│ clone_funcs   │    │ schema_drift      │    │ pulumi           │
│ clone_volumes │    │ data_profile      │    │ workflow gen     │
│ permissions   │    │ search            │    │ estimate         │
│ tags          │    │ stats             │    │                  │
│ security      │    │ monitor           │    │                  │
└───────┬───────┘    └─────────┬─────────┘    └────────┬─────────┘
        │                      │                       │
        └──────────────────────┼───────────────────────┘
                               │
                    ┌──────────┴──────────┐
                    │  Client (client.py) │
                    │  Auth  (auth.py)    │
                    │  Metadata Cache     │
                    │  RateLimiter        │
                    │  SQL Execution      │
                    └──────────┬──────────┘
                               │
                    ┌──────────┴──────────┐
                    │ Databricks SDK      │
                    │ WorkspaceClient     │
                    │ SQL Statement API   │
                    └─────────────────────┘

Module structure

Module	Purpose
`main.py`	CLI entry point, argument parsing, subcommand routing
`auth.py`	Authentication — PAT, service principal, OAuth, browser login
`client.py`	WorkspaceClient factory, SQL execution, rate limiting, retries, metadata caching
`metadata_cache.py`	Thread-safe TTL cache for SDK metadata (schemas, tables, views, etc.)
`config.py`	YAML config loading, profile support, CLI override merging
`clone_catalog.py`	Orchestrates full catalog clone (schemas → tables → views → functions → volumes)
`clone_tables.py`	Table cloning (deep/shallow, time travel, incremental)
`clone_views.py`	View recreation with catalog reference rewriting
`clone_functions.py`	UDF cloning
`clone_volumes.py`	Volume cloning (managed and external)
`permissions.py`	Permission copying (grants, ownership)
`tags.py`	Tag copying (catalog, schema, table, column)
`security.py`	Row filter and column mask cloning
`diff.py`	Schema-level diff between two catalogs
`compare.py`	Deep compare (row counts, checksums)
`validation.py`	Post-clone validation
`schema_drift.py`	Schema drift detection over time
`data_profile.py`	Column-level data profiling
`search.py`	Full-text search across catalog metadata
`stats.py`	Catalog statistics and inventory
`monitor.py`	Continuous monitoring mode
`sync.py`	Two-way sync between catalogs
`rollback.py`	Undo clone operations
`export.py`	CSV/JSON metadata export
`snapshot.py`	Point-in-time catalog snapshots
`estimate.py`	Storage and compute cost estimation
`terraform.py`	Terraform HCL export
`workflow.py`	Databricks Workflow JSON generation
`catalog_clone_api.py`	Notebook-friendly API wrapper

How cloning works

1. Pre-flight: verify connectivity, permissions, and warehouse status.
2. Create destination catalog: `CREATE CATALOG IF NOT EXISTS` (with a managed location if needed).
3. Discover schemas: query `information_schema.schemata` on the source.
4. For each schema (in parallel):
   - Create the schema in the destination
   - Clone tables (deep or shallow via `CREATE TABLE ... CLONE`)
   - Recreate views with updated catalog references
   - Recreate functions
   - Clone volumes
5. Copy metadata: permissions, tags, security policies, constraints, comments.
6. Validate: compare row counts and schema structure (only if the `--validate` flag is set).
7. Report: generate a summary with success/fail/skip counts.
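The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `clone_catalog.py` implementation: `run_sql`, `clone_schema`, and `clone_catalog_sketch` are hypothetical names, the schema and table lists are passed in rather than discovered from `information_schema`, and views, functions, volumes, and metadata copying are omitted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def clone_schema(run_sql: Callable[[str], None], source: str, dest: str,
                 schema: str, tables: Iterable[str],
                 clone_type: str = "DEEP") -> Counter:
    """Clone the tables of one schema and return success/fail counts."""
    counts: Counter = Counter()
    run_sql(f"CREATE SCHEMA IF NOT EXISTS `{dest}`.`{schema}`")
    for table in tables:
        try:
            run_sql(
                f"CREATE TABLE IF NOT EXISTS `{dest}`.`{schema}`.`{table}` "
                f"{clone_type} CLONE `{source}`.`{schema}`.`{table}`"
            )
            counts["success"] += 1
        except Exception:
            # A failed table is counted; the rest of the schema still runs.
            counts["fail"] += 1
    return counts


def clone_catalog_sketch(run_sql: Callable[[str], None], source: str, dest: str,
                         schemas: dict[str, list[str]],
                         max_workers: int = 4) -> Counter:
    """Create the destination catalog, then clone each schema in parallel."""
    run_sql(f"CREATE CATALOG IF NOT EXISTS `{dest}`")
    totals: Counter = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(clone_schema, run_sql, source, dest, name, tables)
            for name, tables in schemas.items()
        ]
        for future in as_completed(futures):
            totals.update(future.result())
    return totals
```

Passing `run_sql` as a callable keeps the sketch independent of any particular client; in practice it would wrap whatever executes statements against the SQL warehouse.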

SQL execution

All SQL is executed via the Databricks SQL Statement Execution API. This means:

- No cluster required: uses SQL warehouses (serverless or pro)
- Built-in rate limiting (configurable)
- Automatic retries with exponential backoff (3 attempts)
- Full SQL logging for debugging and audit
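To make the rate limiting and retry behaviour concrete, here is a minimal sketch of how the two could be combined. It is not the code in `client.py`: `RateLimiter.wait`, `execute_with_retry`, `min_interval`, and `base_delay` are assumed names and parameters, and the 1-second base delay is an illustrative choice. Only the "3 attempts with exponential backoff" figure comes from the description above.

```python
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Space calls at least `min_interval` seconds apart across threads."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot before releasing the lock.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)


def execute_with_retry(call: Callable[[], T], limiter: RateLimiter,
                       attempts: int = 3, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on failure with delays of base_delay * 2**attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        limiter.wait()
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the defaults, a statement that keeps failing is tried three times, waiting 1 second and then 2 seconds between attempts, before the last exception propagates to the caller.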

Caching

Clone-Xs uses two layers of in-memory caching to reduce redundant Databricks API calls: an auth cache and a metadata cache.

Auth cache

The auth.py module caches the WorkspaceClient instance with a 1-hour verification TTL. This avoids re-authenticating on every API call while ensuring stale credentials are detected.
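One way to picture the verification TTL is the sketch below. It is an assumption about the general pattern, not the code in `auth.py`: `VerifiedClientCache`, `factory`, and `verify` are hypothetical names, and the client is treated as an opaque object so the example runs without the Databricks SDK.

```python
import threading
import time
from typing import Callable, Generic, Optional, TypeVar

C = TypeVar("C")


class VerifiedClientCache(Generic[C]):
    """Reuse one client, re-checking its credentials once per TTL window."""

    def __init__(self, factory: Callable[[], C], verify: Callable[[C], bool],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._factory = factory      # builds a new authenticated client
        self._verify = verify        # returns False if credentials are stale
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._client: Optional[C] = None
        self._verified_at = 0.0

    def get(self) -> C:
        with self._lock:
            now = self._clock()
            if self._client is None:
                self._client = self._factory()
                self._verified_at = now
            elif now - self._verified_at >= self._ttl:
                # TTL expired: verify, and rebuild only if the check fails.
                if not self._verify(self._client):
                    self._client = self._factory()
                self._verified_at = now
            return self._client
```

With a real `WorkspaceClient`, `verify` could be a cheap authenticated call such as fetching the current user, wrapped to return `False` on an authentication error. That choice is illustrative and may differ from what Clone-Xs does.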

Metadata cache

The metadata_cache.py module provides a thread-safe, TTL-based cache for SDK metadata calls. All SDK wrapper functions in client.py are cached automatically:

Cached function	Key	What it stores
`list_schemas_sdk`	catalog + exclude list	Schema names
`list_tables_sdk`	catalog + schema	Table names, types, formats
`list_views_sdk`	catalog + schema	View names, definitions
`list_functions_sdk`	catalog + schema	Function names
`list_volumes_sdk`	catalog + schema	Volume names, types
`get_table_info_sdk`	full table name	Columns, owner, properties
`get_catalog_info_sdk`	catalog name	Owner, storage root

NOT cached: SQL queries (execute_sql), row counts, checksums, and mutating operations (delete_table_sdk).

TTL: 300 seconds (5 minutes) by default. Override with the CLXS_CACHE_TTL environment variable.

Auto-invalidation: The cache is automatically cleared for affected catalogs after clone, sync, and incremental sync jobs complete. It is also cleared when authentication credentials change.

Manual control: Use the /api/cache/stats, /api/cache/clear, and /api/cache/invalidate API endpoints to monitor and manage the cache.
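The TTL and invalidation behaviour described above can be illustrated with a small thread-safe cache. This is a sketch under stated assumptions rather than the contents of `metadata_cache.py`: the class and method names are hypothetical, and keys are assumed to be tuples of the form `(function_name, catalog, ...)` so that entries can be dropped per catalog. Only the `CLXS_CACHE_TTL` variable and the 300-second default come from this page.

```python
import os
import threading
import time
from typing import Any, Callable

# Default TTL in seconds, overridable through the environment.
DEFAULT_TTL = float(os.environ.get("CLXS_CACHE_TTL", "300"))


class MetadataCache:
    """Thread-safe cache whose entries expire `ttl` seconds after loading."""

    def __init__(self, ttl: float = DEFAULT_TTL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: dict[tuple, tuple[float, Any]] = {}

    def get_or_load(self, key: tuple, loader: Callable[[], Any]) -> Any:
        with self._lock:
            hit = self._entries.get(key)
            if hit is not None and self._clock() < hit[0]:
                return hit[1]
        # Load outside the lock so a slow SDK call does not block other keys.
        value = loader()
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate_catalog(self, catalog: str) -> int:
        """Drop every entry for one catalog; return how many were removed."""
        with self._lock:
            stale = [k for k in self._entries if len(k) > 1 and k[1] == catalog]
            for k in stale:
                del self._entries[k]
            return len(stale)

    def clear(self) -> None:
        with self._lock:
            self._entries.clear()

    def stats(self) -> dict[str, float]:
        with self._lock:
            return {"entries": len(self._entries), "ttl_seconds": self._ttl}
```

A wrapper such as `list_tables_sdk` would then call `cache.get_or_load(("list_tables_sdk", catalog, schema), loader)`, and a completed clone job would call `invalidate_catalog` for the destination catalog. Because the loader runs outside the lock, two threads that miss on the same key at the same moment may both load it; the later result simply overwrites the earlier one.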
