Analytics & Insights
Clone Catalog collects detailed metrics about your clone operations and provides tools to analyze catalog usage, review clone history, and preview data differences between source and destination.
Usage analysis
Identify which tables in your catalog are actually being used — and which can be skipped to save time and cost.
Real-world scenario
Your production catalog has 500 tables, but your clone takes 3 hours and costs $40 in compute per run. Usage analysis reveals that 180 tables have not been queried in the last 90 days. By skipping those unused tables, you cut the clone time to 1.5 hours and halve the cost.
Usage
# Analyze table usage patterns
clxs usage-analysis --source production
# Analyze with a custom lookback period
clxs usage-analysis --source production --days 90
# Get skip recommendations
clxs usage-analysis --source production --recommend
# Clone with automatic unused table skipping
clxs clone \
--source production --dest staging \
--skip-unused
Output:
============================================================
USAGE ANALYSIS: production (last 90 days)
============================================================
Total tables: 500
Frequently used (daily queries): 142 tables
Occasionally used (weekly queries): 98 tables
Rarely used (monthly queries): 80 tables
Unused (no queries in 90 days): 180 tables
Top 10 most queried tables:
sales.transactions 12,450 queries
analytics.daily_metrics 8,230 queries
sales.customers 5,120 queries
reporting.monthly_summary 3,890 queries
...
RECOMMENDATIONS:
Skip 180 unused tables to save:
Estimated time saved: ~1.5 hours
Estimated storage saved: ~320 GB (deep clone)
Tables recommended for skipping:
analytics.legacy_report_2024 0 queries 12.3 GB
sales.orders_backup 0 queries 8.7 GB
marketing.campaign_archive 0 queries 5.2 GB
...
============================================================
Table categories
| Category | Definition | Default Action |
|---|---|---|
| Frequently used | Queried daily | Always clone |
| Occasionally used | Queried weekly | Clone by default |
| Rarely used | Queried monthly | Clone by default, flag for review |
| Unused | No queries in lookback period | Skip with --skip-unused |
Data source
Usage analysis queries Databricks system tables (system.access.audit) to determine query patterns. The service principal running Clone Catalog needs SELECT access to these system tables.
-- Underlying query (simplified)
SELECT
t.table_schema,
t.table_name,
COUNT(a.event_time) AS query_count,
MAX(a.event_time) AS last_queried
FROM information_schema.tables t
LEFT JOIN system.access.audit a
ON a.request_params.table_name = t.table_name
WHERE a.event_time > DATEADD(DAY, -90, CURRENT_DATE())
GROUP BY t.table_schema, t.table_name
System tables may have a lag of up to 24 hours. Usage analysis reflects historical patterns, not real-time query activity.
Dashboard
The Analytics Dashboard provides a real-time overview of all clone activity, powered by Delta tables.
Data source
The dashboard queries three Delta tables directly instead of relying on an in-memory job store:
| Delta Table | Purpose | Key Columns |
|---|---|---|
run_logs | One row per clone run | tables_cloned, tables_failed, total_size_bytes |
clone_operations | One row per table-level operation | tables_skipped, clone_mode, trigger, destination_existed |
clone_metrics | Aggregated performance metrics | user_name, status, job_type |
Because the data lives in Delta, dashboards survive app restarts and can be queried from Databricks SQL or any BI tool connected to the catalog.
Stat cards
The dashboard displays 10 stat cards at the top of the page:
| Card | Source Table |
|---|---|
| Total Runs | run_logs |
| Tables Cloned | run_logs.tables_cloned |
| Tables Failed | run_logs.tables_failed |
| Total Size | run_logs.total_size_bytes |
| Success Rate | clone_operations |
| Tables Skipped | clone_operations.tables_skipped |
| Avg Duration | clone_metrics |
| Active Users | clone_metrics.user_name |
| Catalog Health Score | Composite score derived from success rate, failure trends, and skipped-table ratio |
| Last Run Status | run_logs (most recent row) |
Charts
Five charts visualize trends and breakdowns:
- Clone Runs Over Time -- daily run count from
run_logs - Success vs Failure -- stacked bar from
clone_operations.status - Clone Mode Distribution -- pie chart from
clone_operations.clone_mode - Duration Trend -- line chart from
clone_metrics - Size by Catalog -- bar chart from
run_logs.total_size_bytes
Insight tables
Two insight tables sit below the charts:
- Recent Clone Operations -- the last 50 rows from
clone_operations, showing source, destination, clone mode, trigger, and whether the destination already existed. - Top Failing Tables -- tables with the highest failure count, aggregated from
clone_operations.
Catalog Health Score
The Catalog Health Score card computes a 0-100 score for each catalog pair based on:
- Clone success rate (weight 50%)
- Trend direction over the last 7 days (weight 30%)
- Ratio of skipped tables to total tables (weight 20%)
A score below 70 is flagged amber; below 50 is flagged red.
Notification Center
The Notification Center surfaces important events without requiring users to poll the dashboard:
- Clone failures and partial failures
- TTL expiration warnings (catalogs expiring within 24 hours)
- Health Score drops below threshold
- Long-running clones that exceed the P95 duration
Notifications appear as a bell icon badge in the app header and are stored in the clone_metrics table for audit.
Pinned Catalog Pairs
Users can pin frequently monitored source-destination catalog pairs to the top of the dashboard. Pinned pairs persist across sessions and show a condensed summary card with last run status, health score, and next scheduled run.
The Demo Data Generator (/demo-data) can pre-populate the dashboard with 20 fake clone operations via audit log enrichment, so new installations see meaningful data on the Dashboard immediately.
Integration notes
- Explorer cost cards read the storage price from Settings (persisted in localStorage), so changing the price on the Settings page is immediately reflected in Explorer stat cards.
- Column usage analytics falls back to
information_schema.columnswhensystem.access.column_lineageis unavailable, ensuring column data is always displayed even without system table access. - Cost Estimator passes the
price_per_gbvalue from Settings to the API (POST /api/estimate), allowing cost projections to use the user-configured storage price.
Metrics collection
Track clone performance over time and export metrics to your observability platform.
Real-world scenario
Your team runs nightly clones across 5 catalogs. You want to track clone duration trends, failure rates, and throughput on a Grafana dashboard. Clone Catalog exports Prometheus-compatible metrics that you scrape and visualize.
Usage
# View current metrics
clxs metrics
# View metrics filtered by source catalog
clxs metrics --source monitoring.clone_metrics
# View metrics in JSON format
clxs metrics --format json
# Enable Prometheus metrics endpoint (when running in serve mode)
clxs serve --port 9090
Output:
============================================================
CLONE METRICS SUMMARY
============================================================
Total operations (last 30 days): 62
Success rate: 96.8%
Average duration: 45m 12s
Median duration: 38m 05s
P95 duration: 1h 22m
Average throughput: 5.4 tables/min
Total data cloned: 4.2 TB
Failure breakdown:
Warehouse timeout: 1
Permission denied: 1
Trend (last 7 days):
Mon: 45m ████████████████████
Tue: 42m ███████████████████
Wed: 44m ████████████████████
Thu: 51m ██████████████████████
Fri: 39m █████████████████
Sat: 38m █████████████████
Sun: 40m ██████████████████
============================================================
Metrics collected
| Metric | Type | Description |
|---|---|---|
clone_duration_seconds | Histogram | Total clone operation duration |
clone_tables_total | Counter | Total tables cloned |
clone_tables_failed | Counter | Tables that failed to clone |
clone_throughput_tables_per_min | Gauge | Current cloning throughput |
clone_data_bytes_total | Counter | Total bytes cloned |
validation_pass_rate | Gauge | Percentage of tables passing validation |
rollback_count | Counter | Number of rollback operations |
Export destinations
metrics:
enabled: true
collection_interval: 60 # Collect every 60 seconds during clone
export:
# Delta table (queryable from Databricks)
delta:
enabled: true
table: "monitoring.clone_metrics"
catalog: "ops"
# JSON file
json:
enabled: false
path: "metrics/"
# Prometheus endpoint
prometheus:
enabled: true
port: 9090
path: "/metrics"
# Webhook (send metrics to external system)
webhook:
enabled: false
url: "https://your-monitoring-system.com/api/metrics"
headers:
Authorization: "Bearer ${METRICS_API_KEY}"
When exporting to a Delta table, you can build Databricks SQL dashboards directly on top of the metrics data — no external tools needed.
Clone history
Review past clone operations with git-log style output and diff between runs.
Real-world scenario
A data engineer notices that the staging environment has stale data. They want to check: when was the last clone? Did it succeed? What changed between the last two runs? Clone history answers all of these questions without digging through log files.
Usage
# List recent clone operations
clxs history list
# Show details of a specific operation
clxs history show clone-20260314-020000
# Diff two operations (what changed between runs)
clxs history diff \
clone-20260313-020000 \
clone-20260314-020000
# Filter by catalog
clxs history list --source production
# Limit results
clxs history list --limit 5
Output (history list):
============================================================
CLONE HISTORY
============================================================
ID Source Dest Status Duration Tables Date
clone-20260314-020000 production staging SUCCESS 42m 15s 247 2026-03-14 02:00
clone-20260313-020000 production staging SUCCESS 44m 02s 245 2026-03-13 02:00
clone-20260312-020000 production staging FAILED 12m 33s 89/247 2026-03-12 02:00
clone-20260311-020000 production staging SUCCESS 41m 50s 245 2026-03-11 02:00
clone-20260310-020000 production staging ROLLED_BACK 45m 10s 247 2026-03-10 02:00
============================================================
Output (history diff):
============================================================
CLONE DIFF: clone-20260313-020000 → clone-20260314-020000
============================================================
New tables cloned (2):
+ analytics.campaign_results
+ sales.promotions_q1
Tables removed from clone (0):
(none)
Row count changes:
sales.transactions: 8,234,567 → 8,312,045 (+77,478)
sales.customers: 1,234,567 → 1,234,890 (+323)
analytics.daily_metrics: 365,000 → 366,000 (+1,000)
Duration change: 44m 02s → 42m 15s (-1m 47s)
============================================================
History storage
Clone history is stored locally in .clxs/history/ as JSON files. Each operation creates a timestamped record with full details including:
- Source and destination catalogs
- All options and flags used
- Per-table status (success/failure/skipped)
- Duration and throughput metrics
- Validation results (if enabled)
History files are retained for 90 days by default. Configure history.retention_days in your config to adjust.
Data preview
Compare source and destination data side by side to visually verify clone accuracy.
Real-world scenario
After cloning, a data analyst reports that the numbers in a dashboard "look wrong." Instead of running manual SQL queries against both catalogs, you use data preview to quickly compare the source and destination tables side by side and spot the differences.
Usage
# Preview a specific table
clxs preview \
--source production --dest staging \
--table sales.orders
# Preview all tables (summary mode)
clxs preview \
--source production --dest staging \
--all
# Limit rows for preview
clxs preview \
--source production --dest staging \
--table sales.orders --limit 20
Output (single table):
============================================================
DATA PREVIEW: sales.orders
============================================================
Source (production): 8,312,045 rows
Destination (staging): 8,312,045 rows
Status: MATCH
Sample comparison (first 10 rows):
┌──────────┬─────────────┬────────────┬─────────────┬────────────┐
│ order_id │ customer_id │ amount │ src_status │ dest_status│
├──────────┼─────────────┼────────────┼─────────────┼────────────┤
│ 1001 │ 42 │ 299.99 │ shipped │ shipped │
│ 1002 │ 87 │ 149.50 │ delivered │ delivered │
│ 1003 │ 15 │ 89.00 │ pending │ pending │
│ ... │ ... │ ... │ ... │ ... │
└──────────┴─────────────┴────────────┴─────────────┴────────────┘
All sampled rows match.
============================================================
Output (all tables — summary):
============================================================
DATA PREVIEW SUMMARY: production vs staging
============================================================
Tables compared: 247
Matching: 245
Mismatched: 2
Mismatched tables:
sales.daily_agg:
Source rows: 1,043,289
Dest rows: 1,043,201
Difference: -88 rows (0.008%)
hr.payroll:
Source rows: 15,232
Dest rows: 15,230
Difference: -2 rows (0.013%)
============================================================
Difference highlighting
When mismatches are found, use the diff command to compare catalogs in detail:
clxs diff \
--source production --dest staging
Rows in source but not in destination (88):
┌──────────┬────────────┬──────────┐
│ date │ region │ total │
├──────────┼────────────┼──────────┤
│ 2026-03-14│ EMEA │ 45,230 │
│ 2026-03-14│ APAC │ 32,100 │
│ ... │ ... │ ... │
└──────────┴────────────┴──────────┘
Data preview queries both source and destination tables. For very large tables, use --limit to restrict the number of rows compared, or use --checksum for hash-based comparison without transferring row data.