Architecture

High-level overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                                CLI (main.py)                                │
│                                                                             │
│  clone │ diff │ compare │ validate │ sync │ rollback │ estimate │ snapshot  │
│  schema-drift │ generate-workflow │ export-iac │ init │ preflight │ search  │
│  stats │ profile │ monitor │ export │ config-diff │ completion │ auth       │
└────────────────────────────────┬────────────────────────────────────────────┘
                                 │
                    ┌────────────┴────────────┐
                    │   Config (config.py)    │
                    │  YAML + Profiles + CLI  │
                    │        Overrides        │
                    └────────────┬────────────┘
                                 │
        ┌────────────────────────┼────────────────────────┐
        │                        │                        │
        ▼                        ▼                        ▼
┌───────────────┐      ┌───────────────────┐     ┌──────────────────┐
│ Clone Engine  │      │  Analysis Engine  │     │  Export Engine   │
│               │      │                   │     │                  │
│ clone_catalog │      │ diff              │     │ export (CSV/JSON)│
│ clone_tables  │      │ compare           │     │ snapshot         │
│ clone_views   │      │ validation        │     │ terraform        │
│ clone_funcs   │      │ schema_drift      │     │ pulumi           │
│ clone_volumes │      │ data_profile      │     │ workflow gen     │
│ permissions   │      │ search            │     │ estimate         │
│ tags          │      │ stats             │     │                  │
│ security      │      │ monitor           │     │                  │
└───────┬───────┘      └─────────┬─────────┘     └────────┬─────────┘
        │                        │                        │
        └────────────────────────┼────────────────────────┘
                                 │
                      ┌──────────┴──────────┐
                      │ Client (client.py)  │
                      │   Auth (auth.py)    │
                      │   Metadata Cache    │
                      │     RateLimiter     │
                      │    SQL Execution    │
                      └──────────┬──────────┘
                                 │
                      ┌──────────┴──────────┐
                      │   Databricks SDK    │
                      │   WorkspaceClient   │
                      │  SQL Statement API  │
                      └─────────────────────┘

Module structure

| Module | Purpose |
| --- | --- |
| main.py | CLI entry point, argument parsing, subcommand routing |
| auth.py | Authentication: PAT, service principal, OAuth, browser login |
| client.py | WorkspaceClient factory, SQL execution, rate limiting, retries, metadata caching |
| metadata_cache.py | Thread-safe TTL cache for SDK metadata (schemas, tables, views, etc.) |
| config.py | YAML config loading, profile support, CLI override merging |
| clone_catalog.py | Orchestrates full catalog clone (schemas → tables → views → functions → volumes) |
| clone_tables.py | Table cloning (deep/shallow, time travel, incremental) |
| clone_views.py | View recreation with catalog reference rewriting |
| clone_functions.py | UDF cloning |
| clone_volumes.py | Volume cloning (managed and external) |
| permissions.py | Permission copying (grants, ownership) |
| tags.py | Tag copying (catalog, schema, table, column) |
| security.py | Row filter and column mask cloning |
| diff.py | Schema-level diff between two catalogs |
| compare.py | Deep compare (row counts, checksums) |
| validation.py | Post-clone validation |
| schema_drift.py | Schema drift detection over time |
| data_profile.py | Column-level data profiling |
| search.py | Full-text search across catalog metadata |
| stats.py | Catalog statistics and inventory |
| monitor.py | Continuous monitoring mode |
| sync.py | Two-way sync between catalogs |
| rollback.py | Undo clone operations |
| export.py | CSV/JSON metadata export |
| snapshot.py | Point-in-time catalog snapshots |
| estimate.py | Storage and compute cost estimation |
| terraform.py | Terraform HCL export |
| workflow.py | Databricks Workflow JSON generation |
| catalog_clone_api.py | Notebook-friendly API wrapper |
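
The config.py row above mentions three sources of settings: the YAML file, profiles, and CLI overrides. The sketch below shows one plausible way to layer them. It is not taken from config.py: the precedence order (file defaults, then profile, then CLI), the top-level "profiles" key, the function name merge_config, and every setting name in the example are assumptions made for illustration. YAML parsing is left out, so the function works on an already-loaded dict.

```python
def merge_config(file_config, profile_name=None, cli_overrides=None):
    """Layer settings: file defaults, then the named profile, then CLI flags."""
    # Assumed file layout: top-level defaults plus an optional "profiles" mapping.
    merged = {k: v for k, v in file_config.items() if k != "profiles"}

    if profile_name is not None:
        profiles = file_config.get("profiles", {})
        if profile_name not in profiles:
            raise KeyError(f"unknown profile: {profile_name}")
        merged.update(profiles[profile_name])

    # CLI flags that were not passed (None) must not clobber file values.
    for key, value in (cli_overrides or {}).items():
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical settings, for illustration only.
file_config = {
    "source_catalog": "prod",
    "clone_type": "DEEP",
    "profiles": {"dev": {"destination_catalog": "dev", "clone_type": "SHALLOW"}},
}
effective = merge_config(file_config, "dev", {"clone_type": None, "max_workers": 8})
print(effective)
```

Skipping None values matters because argument parsers typically report unset flags as None; without that check, every omitted flag would erase the corresponding value from the file.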

How cloning works

  1. Pre-flight: verify connectivity, permissions, and warehouse status
  2. Create destination catalog: CREATE CATALOG IF NOT EXISTS (with a managed location if needed)
  3. Discover schemas: query information_schema.schemata on the source
  4. For each schema (in parallel):
    • Create the schema in the destination
    • Clone tables (deep or shallow via CREATE TABLE ... CLONE)
    • Recreate views with updated catalog references
    • Recreate functions
    • Clone volumes
  5. Copy metadata: permissions, tags, security policies, constraints, comments
  6. Validate: compare row counts and schema structure (only if the --validate flag is set)
  7. Report: generate a summary with success/fail/skip counts
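
The core of steps 2 to 4 and the tally from step 7 can be sketched in a few functions. This is a simplified illustration, not the code in clone_catalog.py: all function names are hypothetical, run_sql stands in for whatever executes a statement, the inventory dict replaces the information_schema discovery query, and functions, volumes, metadata copying, and validation are omitted. The view rewrite is a naive regex that would also touch matching text inside string literals.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def clone_table_sql(src_catalog, dst_catalog, schema, table, deep=True):
    """Build a CREATE TABLE ... CLONE statement for one table."""
    kind = "DEEP" if deep else "SHALLOW"
    return (
        f"CREATE TABLE IF NOT EXISTS `{dst_catalog}`.`{schema}`.`{table}` "
        f"{kind} CLONE `{src_catalog}`.`{schema}`.`{table}`"
    )


def rewrite_view_sql(definition, src_catalog, dst_catalog):
    """Point catalog-qualified references at the destination catalog."""
    # Match the catalog name (optionally backquoted) only where it starts a
    # dotted name, so schemas or tables with the same name are left alone.
    pattern = re.compile(rf"(?<![\w`.])`?{re.escape(src_catalog)}`?(?=\.)")
    return pattern.sub(f"`{dst_catalog}`", definition)


def clone_schema_sketch(run_sql, src_catalog, dst_catalog, schema, tables, deep=True):
    """Create one destination schema, clone its tables, and tally the results."""
    counts = {"success": 0, "failed": 0}
    run_sql(f"CREATE SCHEMA IF NOT EXISTS `{dst_catalog}`.`{schema}`")
    for table in tables:
        try:
            run_sql(clone_table_sql(src_catalog, dst_catalog, schema, table, deep))
            counts["success"] += 1
        except Exception:
            counts["failed"] += 1  # keep going: one bad table should not stop the schema
    return counts


def clone_catalog_sketch(run_sql, src_catalog, dst_catalog, inventory, max_workers=4):
    """Clone every schema in inventory ({schema: [tables]}) in parallel."""
    run_sql(f"CREATE CATALOG IF NOT EXISTS `{dst_catalog}`")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            schema: pool.submit(
                clone_schema_sketch, run_sql, src_catalog, dst_catalog, schema, tables
            )
            for schema, tables in inventory.items()
        }
        return {schema: future.result() for schema, future in futures.items()}
```

Passing run_sql as a parameter keeps the orchestration testable without a warehouse: a list's append method is enough to record the statements that would have been sent.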

SQL execution

All SQL is executed via the Databricks SQL Statement Execution API. This means:

  • No cluster required — uses SQL warehouses (serverless or pro)
  • Built-in rate limiting (configurable)
  • Automatic retries with exponential backoff (3 attempts)
  • Full SQL logging for debugging and audit
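
The retry behaviour listed above can be illustrated with a small wrapper. This is a sketch under assumptions, not the logic in client.py: the function name, the one-second base delay, and the choice to retry on any exception are all hypothetical (a real client would likely retry only transient errors). The three-attempt limit comes from the list above. In practice run_statement would wrap the SDK's statement_execution.execute_statement call.

```python
import time


def execute_with_retries(run_statement, statement, max_attempts=3,
                         base_delay=1.0, sleep=time.sleep):
    """Call run_statement(statement), retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_statement(statement)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The wait doubles after each failure: base, 2 * base, 4 * base, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so the backoff schedule can be checked in tests without actually waiting.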

Caching

Clone-Xs uses two layers of in-memory caching to reduce redundant Databricks API calls:

Auth cache

The auth.py module caches the WorkspaceClient instance with a 1-hour verification TTL. This avoids re-authenticating on every API call while ensuring stale credentials are detected.
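
One way to picture a verification TTL is a holder that reuses the same client object and re-checks it only when the last check is older than the TTL. The class below is a hypothetical sketch, not the auth.py implementation: the class name, the build_client and verify callables, and the decision to rebuild the client when verification fails are assumptions. Thread safety is omitted for brevity. With the Databricks SDK, verify could be a cheap call such as fetching the current user.

```python
import time


class VerifiedClientCache:
    """Reuse one client object, re-verifying it at most once per TTL window."""

    def __init__(self, build_client, verify, ttl_seconds=3600, clock=time.monotonic):
        self._build = build_client
        self._verify = verify          # should raise if the credentials are stale
        self._ttl = ttl_seconds
        self._clock = clock
        self._client = None
        self._verified_at = None

    def get(self):
        now = self._clock()
        if self._client is None:
            # First use: authenticate and verify once.
            self._client = self._build()
            self._verify(self._client)
            self._verified_at = now
        elif now - self._verified_at >= self._ttl:
            # The last check is too old: verify again before handing it out.
            try:
                self._verify(self._client)
            except Exception:
                # Assumed recovery: rebuild the client and verify the new one.
                self._client = self._build()
                self._verify(self._client)
            self._verified_at = now
        return self._client
```

Within the TTL window, get() returns the cached object without any network call, which is what avoids re-authenticating on every API request.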

Metadata cache

The metadata_cache.py module provides a thread-safe, TTL-based cache for SDK metadata calls. All SDK wrapper functions in client.py are cached automatically:

| Cached function | Key | What it stores |
| --- | --- | --- |
| list_schemas_sdk | catalog + exclude list | Schema names |
| list_tables_sdk | catalog + schema | Table names, types, formats |
| list_views_sdk | catalog + schema | View names, definitions |
| list_functions_sdk | catalog + schema | Function names |
| list_volumes_sdk | catalog + schema | Volume names, types |
| get_table_info_sdk | full table name | Columns, owner, properties |
| get_catalog_info_sdk | catalog name | Owner, storage root |

NOT cached: SQL queries (execute_sql), row counts, checksums, and mutating operations (delete_table_sdk).

TTL: 300 seconds (5 minutes) by default. Override with the CLXS_CACHE_TTL environment variable.

Auto-invalidation: The cache is automatically cleared for affected catalogs after clone, sync, and incremental sync jobs complete. It is also cleared when authentication credentials change.
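
The behaviour described above (thread safety, a TTL read from CLXS_CACHE_TTL with a 300-second default, and per-catalog invalidation) can be sketched as follows. This is an illustration, not the contents of metadata_cache.py: the class name, the method names, and the key layout (a tuple of function name, catalog, and further parts) are assumptions.

```python
import os
import threading
import time


class MetadataTTLCache:
    """Thread-safe cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds=None, clock=time.monotonic):
        if ttl_seconds is None:
            # Default of 300 seconds, overridable through the environment.
            ttl_seconds = float(os.environ.get("CLXS_CACHE_TTL", "300"))
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader() on a miss or expiry."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > self._clock():
                return entry[1]
        # Load outside the lock so one slow SDK call does not block other keys.
        value = loader()
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate_catalog(self, catalog):
        """Drop every entry for one catalog; assumes keys look like (func, catalog, ...)."""
        with self._lock:
            stale = [key for key in self._entries if key[1] == catalog]
            for key in stale:
                del self._entries[key]
            return len(stale)
```

Loading outside the lock means two threads that miss on the same key at the same time may both call the loader. For read-only metadata that is harmless, and it keeps the lock from being held during network calls.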

Manual control: Use the /api/cache/stats, /api/cache/clear, and /api/cache/invalidate API endpoints to monitor and manage the cache.