Skip to main content

Analytics & Insights

Clone Catalog collects detailed metrics about your clone operations and provides tools to analyze catalog usage, review clone history, and preview data differences between source and destination.

Usage analysis

Identify which tables in your catalog are actually being used — and which can be skipped to save time and cost.

Real-world scenario

Your production catalog has 500 tables, but your clone takes 3 hours and costs $40 in compute per run. Usage analysis reveals that 180 tables have not been queried in the last 90 days. By skipping those unused tables, you cut the clone time to 1.5 hours and halve the cost.

Usage

# Analyze table usage patterns
clxs usage-analysis --source production

# Analyze with a custom lookback period
clxs usage-analysis --source production --days 90

# Get skip recommendations
clxs usage-analysis --source production --recommend

# Clone with automatic unused table skipping
clxs clone \
--source production --dest staging \
--skip-unused

Output:

============================================================
USAGE ANALYSIS: production (last 90 days)
============================================================
Total tables: 500

Frequently used (daily queries): 142 tables
Occasionally used (weekly queries): 98 tables
Rarely used (monthly queries): 80 tables
Unused (no queries in 90 days): 180 tables

Top 10 most queried tables:
sales.transactions 12,450 queries
analytics.daily_metrics 8,230 queries
sales.customers 5,120 queries
reporting.monthly_summary 3,890 queries
...

RECOMMENDATIONS:
Skip 180 unused tables to save:
Estimated time saved: ~1.5 hours
Estimated storage saved: ~320 GB (deep clone)

Tables recommended for skipping:
analytics.legacy_report_2024 0 queries 12.3 GB
sales.orders_backup 0 queries 8.7 GB
marketing.campaign_archive 0 queries 5.2 GB
...
============================================================

Table categories

CategoryDefinitionDefault Action
Frequently usedQueried dailyAlways clone
Occasionally usedQueried weeklyClone by default
Rarely usedQueried monthlyClone by default, flag for review
UnusedNo queries in lookback periodSkip with --skip-unused

Data source

Usage analysis queries Databricks system tables (system.access.audit) to determine query patterns. The service principal running Clone Catalog needs SELECT access to these system tables.

-- Underlying query (simplified)
SELECT
t.table_schema,
t.table_name,
COUNT(a.event_time) AS query_count,
MAX(a.event_time) AS last_queried
FROM information_schema.tables t
LEFT JOIN system.access.audit a
ON a.request_params.table_name = t.table_name
WHERE a.event_time > DATEADD(DAY, -90, CURRENT_DATE())
GROUP BY t.table_schema, t.table_name
note

System tables may have a lag of up to 24 hours. Usage analysis reflects historical patterns, not real-time query activity.


Dashboard

The Analytics Dashboard provides a real-time overview of all clone activity, powered by Delta tables.

Data source

The dashboard queries three Delta tables directly instead of relying on an in-memory job store:

Delta TablePurposeKey Columns
run_logsOne row per clone runtables_cloned, tables_failed, total_size_bytes
clone_operationsOne row per table-level operationtables_skipped, clone_mode, trigger, destination_existed
clone_metricsAggregated performance metricsuser_name, status, job_type

Because the data lives in Delta, dashboards survive app restarts and can be queried from Databricks SQL or any BI tool connected to the catalog.

Stat cards

The dashboard displays 10 stat cards at the top of the page:

CardSource Table
Total Runsrun_logs
Tables Clonedrun_logs.tables_cloned
Tables Failedrun_logs.tables_failed
Total Sizerun_logs.total_size_bytes
Success Rateclone_operations
Tables Skippedclone_operations.tables_skipped
Avg Durationclone_metrics
Active Usersclone_metrics.user_name
Catalog Health ScoreComposite score derived from success rate, failure trends, and skipped-table ratio
Last Run Statusrun_logs (most recent row)

Charts

Five charts visualize trends and breakdowns:

  1. Clone Runs Over Time -- daily run count from run_logs
  2. Success vs Failure -- stacked bar from clone_operations.status
  3. Clone Mode Distribution -- pie chart from clone_operations.clone_mode
  4. Duration Trend -- line chart from clone_metrics
  5. Size by Catalog -- bar chart from run_logs.total_size_bytes

Insight tables

Two insight tables sit below the charts:

  • Recent Clone Operations -- the last 50 rows from clone_operations, showing source, destination, clone mode, trigger, and whether the destination already existed.
  • Top Failing Tables -- tables with the highest failure count, aggregated from clone_operations.

Catalog Health Score

The Catalog Health Score card computes a 0-100 score for each catalog pair based on:

  • Clone success rate (weight 50%)
  • Trend direction over the last 7 days (weight 30%)
  • Ratio of skipped tables to total tables (weight 20%)

A score below 70 is flagged amber; below 50 is flagged red.

Notification Center

The Notification Center surfaces important events without requiring users to poll the dashboard:

  • Clone failures and partial failures
  • TTL expiration warnings (catalogs expiring within 24 hours)
  • Health Score drops below threshold
  • Long-running clones that exceed the P95 duration

Notifications appear as a bell icon badge in the app header and are stored in the clone_metrics table for audit.

Pinned Catalog Pairs

Users can pin frequently monitored source-destination catalog pairs to the top of the dashboard. Pinned pairs persist across sessions and show a condensed summary card with last run status, health score, and next scheduled run.

tip

The Demo Data Generator (/demo-data) can pre-populate the dashboard with 20 fake clone operations via audit log enrichment, so new installations see meaningful data on the Dashboard immediately.

Integration notes

  • Explorer cost cards read the storage price from Settings (persisted in localStorage), so changing the price on the Settings page is immediately reflected in Explorer stat cards.
  • Column usage analytics falls back to information_schema.columns when system.access.column_lineage is unavailable, ensuring column data is always displayed even without system table access.
  • Cost Estimator passes the price_per_gb value from Settings to the API (POST /api/estimate), allowing cost projections to use the user-configured storage price.

Metrics collection

Track clone performance over time and export metrics to your observability platform.

Real-world scenario

Your team runs nightly clones across 5 catalogs. You want to track clone duration trends, failure rates, and throughput on a Grafana dashboard. Clone Catalog exports Prometheus-compatible metrics that you scrape and visualize.

Usage

# View current metrics
clxs metrics

# View metrics filtered by source catalog
clxs metrics --source monitoring.clone_metrics

# View metrics in JSON format
clxs metrics --format json

# Enable Prometheus metrics endpoint (when running in serve mode)
clxs serve --port 9090

Output:

============================================================
CLONE METRICS SUMMARY
============================================================
Total operations (last 30 days): 62
Success rate: 96.8%
Average duration: 45m 12s
Median duration: 38m 05s
P95 duration: 1h 22m
Average throughput: 5.4 tables/min
Total data cloned: 4.2 TB

Failure breakdown:
Warehouse timeout: 1
Permission denied: 1

Trend (last 7 days):
Mon: 45m ████████████████████
Tue: 42m ███████████████████
Wed: 44m ████████████████████
Thu: 51m ██████████████████████
Fri: 39m █████████████████
Sat: 38m █████████████████
Sun: 40m ██████████████████
============================================================

Metrics collected

MetricTypeDescription
clone_duration_secondsHistogramTotal clone operation duration
clone_tables_totalCounterTotal tables cloned
clone_tables_failedCounterTables that failed to clone
clone_throughput_tables_per_minGaugeCurrent cloning throughput
clone_data_bytes_totalCounterTotal bytes cloned
validation_pass_rateGaugePercentage of tables passing validation
rollback_countCounterNumber of rollback operations

Export destinations

metrics:
enabled: true
collection_interval: 60 # Collect every 60 seconds during clone

export:
# Delta table (queryable from Databricks)
delta:
enabled: true
table: "monitoring.clone_metrics"
catalog: "ops"

# JSON file
json:
enabled: false
path: "metrics/"

# Prometheus endpoint
prometheus:
enabled: true
port: 9090
path: "/metrics"

# Webhook (send metrics to external system)
webhook:
enabled: false
url: "https://your-monitoring-system.com/api/metrics"
headers:
Authorization: "Bearer ${METRICS_API_KEY}"
tip

When exporting to a Delta table, you can build Databricks SQL dashboards directly on top of the metrics data — no external tools needed.


Clone history

Review past clone operations with git-log style output and diff between runs.

Real-world scenario

A data engineer notices that the staging environment has stale data. They want to check: when was the last clone? Did it succeed? What changed between the last two runs? Clone history answers all of these questions without digging through log files.

Usage

# List recent clone operations
clxs history list

# Show details of a specific operation
clxs history show clone-20260314-020000

# Diff two operations (what changed between runs)
clxs history diff \
clone-20260313-020000 \
clone-20260314-020000

# Filter by catalog
clxs history list --source production

# Limit results
clxs history list --limit 5

Output (history list):

============================================================
CLONE HISTORY
============================================================
ID Source Dest Status Duration Tables Date
clone-20260314-020000 production staging SUCCESS 42m 15s 247 2026-03-14 02:00
clone-20260313-020000 production staging SUCCESS 44m 02s 245 2026-03-13 02:00
clone-20260312-020000 production staging FAILED 12m 33s 89/247 2026-03-12 02:00
clone-20260311-020000 production staging SUCCESS 41m 50s 245 2026-03-11 02:00
clone-20260310-020000 production staging ROLLED_BACK 45m 10s 247 2026-03-10 02:00
============================================================

Output (history diff):

============================================================
CLONE DIFF: clone-20260313-020000 → clone-20260314-020000
============================================================
New tables cloned (2):
+ analytics.campaign_results
+ sales.promotions_q1

Tables removed from clone (0):
(none)

Row count changes:
sales.transactions: 8,234,567 → 8,312,045 (+77,478)
sales.customers: 1,234,567 → 1,234,890 (+323)
analytics.daily_metrics: 365,000 → 366,000 (+1,000)

Duration change: 44m 02s → 42m 15s (-1m 47s)
============================================================

History storage

Clone history is stored locally in .clxs/history/ as JSON files. Each operation creates a timestamped record with full details including:

  • Source and destination catalogs
  • All options and flags used
  • Per-table status (success/failure/skipped)
  • Duration and throughput metrics
  • Validation results (if enabled)
note

History files are retained for 90 days by default. Configure history.retention_days in your config to adjust.


Data preview

Compare source and destination data side by side to visually verify clone accuracy.

Real-world scenario

After cloning, a data analyst reports that the numbers in a dashboard "look wrong." Instead of running manual SQL queries against both catalogs, you use data preview to quickly compare the source and destination tables side by side and spot the differences.

Usage

# Preview a specific table
clxs preview \
--source production --dest staging \
--table sales.orders

# Preview all tables (summary mode)
clxs preview \
--source production --dest staging \
--all

# Limit rows for preview
clxs preview \
--source production --dest staging \
--table sales.orders --limit 20

Output (single table):

============================================================
DATA PREVIEW: sales.orders
============================================================
Source (production): 8,312,045 rows
Destination (staging): 8,312,045 rows
Status: MATCH

Sample comparison (first 10 rows):
┌──────────┬─────────────┬────────────┬─────────────┬────────────┐
│ order_id │ customer_id │ amount │ src_status │ dest_status│
├──────────┼─────────────┼────────────┼─────────────┼────────────┤
│ 1001 │ 42 │ 299.99 │ shipped │ shipped │
│ 1002 │ 87 │ 149.50 │ delivered │ delivered │
│ 1003 │ 15 │ 89.00 │ pending │ pending │
│ ... │ ... │ ... │ ... │ ... │
└──────────┴─────────────┴────────────┴─────────────┴────────────┘
All sampled rows match.
============================================================

Output (all tables — summary):

============================================================
DATA PREVIEW SUMMARY: production vs staging
============================================================
Tables compared: 247
Matching: 245
Mismatched: 2

Mismatched tables:
sales.daily_agg:
Source rows: 1,043,289
Dest rows: 1,043,201
Difference: -88 rows (0.008%)

hr.payroll:
Source rows: 15,232
Dest rows: 15,230
Difference: -2 rows (0.013%)
============================================================

Difference highlighting

When mismatches are found, use the diff command to compare catalogs in detail:

clxs diff \
--source production --dest staging
  Rows in source but not in destination (88):
┌──────────┬────────────┬──────────┐
│ date │ region │ total │
├──────────┼────────────┼──────────┤
│ 2026-03-14│ EMEA │ 45,230 │
│ 2026-03-14│ APAC │ 32,100 │
│ ... │ ... │ ... │
└──────────┴────────────┴──────────┘
caution

Data preview queries both source and destination tables. For very large tables, use --limit to restrict the number of rows compared, or use --checksum for hash-based comparison without transferring row data.