Changelog
All notable changes to Clone-Xs are documented here.
v1.0.0 — Live Capture tab with image-grounded multimodal AI
Released 2026-05-12.
A sixth unstructured-data tab — Live Capture — joins
/demo-data (see
guide/unstructured-demo-data → Live Capture).
Instead of synthesising bytes on the server, captures arrive from the
user's browser webcam (one HTTP multipart request per snapshot or
video chunk) and land synchronously in a UC Volume + Delta catalog
table that carries both file_path and inline content BINARY.
Added — Live Capture orchestrator
- New module src/demo_capture.py: init_capture_target, handle_frame, list_recent. No JobManager and no batching: each capture is one synchronous HTTP request the handler completes before returning, so the UI's Recent strip updates immediately.
- New router api/routers/demo_capture.py with three endpoints:
  - POST /api/capture/init: idempotent volume + table create (called on tab mount).
  - POST /api/capture/frame: multipart upload → Volume upload + INSERT row.
  - GET /api/capture/recent: recent metadata rows for the live UI (no inline BINARY in the payload).
- Combined-shape table at <catalog>.<schema>.demo_capture_catalog (default name; override via the Table name field). Created with CREATE TABLE IF NOT EXISTS so captures accumulate across browser sessions; existing tables get newer columns added on the next call via ALTER TABLE ADD COLUMN IF NOT EXISTS.
- Per-tab session isolation. Each browser tab generates a session_id on mount and the Recent strip filters by it server-side, so concurrent users don't see each other's captures.
- Best-effort submitted_by. Pulls the caller's email from client.current_user.me(). Captures never block on this: if the SDK call fails, the row lands with NULL submitted_by and the upload still succeeds.
Added — Six AI-derived fields per photo, in one consolidated call
When AI mode is on and a Databricks Foundation Model is selected, every photo capture triggers one multimodal call returning all six fields as a JSON blob:
| Field | Purpose | Length |
|---|---|---|
| caption | 1-sentence visual caption | ≤14 words |
| alt_text | accessibility text | ≤18 words |
| summary | scene description | 2–3 sentences |
| tags | comma-separated visual keywords | 5–8 single words |
| detected_text | OCR of any visible text | empty if none readable |
| scene_category | high-level scene class | 1–2 words |
- New helper maybe_ai_json in src/ai_drafter.py. Mirrors the existing maybe_ai ergonomics but parses a JSON response (with code-fence stripping and brace-slicing for noisy outputs) and falls back per key to a fallback_dict on any failure. Six AI calls collapse into one, which matters on free-tier endpoints.
- Image-grounded only for photos. Photos with image/jpeg|png|webp MIME types are forwarded as inline base64 via the OpenAI-style image_url content block (Llama 4 Maverick / Claude 3.7 Sonnet on Databricks Model Serving accept this shape). Video chunks (webm / mp4) bypass the vision endpoint and use a metadata-only prompt; visual-only fields (detected_text, scene_category) are forced to "" / "unknown" so SQL aggregates aren't polluted with hallucinated values.
- Databricks Model Serving only. Live Capture never uses the Anthropic API path. The endpoint comes from the same X-Databricks-Model header the Documents tab uses, sourced from Settings.
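The parsing behaviour described for maybe_ai_json (code-fence stripping, brace-slicing, per-key fallback) can be sketched roughly as follows. This is an illustration only: parse_json_with_fallback and its signature are hypothetical names, not the helper shipped in src/ai_drafter.py, and the rule that only string values are accepted is an assumption.

```python
import json
import re
from typing import Any, Dict

# Matches an opening fence (optionally with a language tag) at the start of
# the reply, or a closing fence at the end.
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n?|\n?```\s*$")


def parse_json_with_fallback(raw: str, fallback: Dict[str, Any]) -> Dict[str, Any]:
    """Parse a noisy model reply into a dict, filling gaps from `fallback`.

    Hypothetical helper: any parse failure returns a copy of `fallback`,
    and any missing or non-string field falls back key by key.
    """
    text = _FENCE_RE.sub("", raw.strip())
    # Brace-slicing: keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return dict(fallback)
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return dict(fallback)
    if not isinstance(parsed, dict):
        return dict(fallback)
    result: Dict[str, Any] = {}
    for key, default in fallback.items():
        value = parsed.get(key)
        # Assumption: all six fields are strings, so other types fall back.
        result[key] = value if isinstance(value, str) else default
    return result
```

Because the fallback dict drives the output keys, extra keys the model invents are dropped and a partially valid reply still yields a complete row.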
Added — Strict vs Permissive description style toggle
A new segmented control next to the AI mode toggle picks the prompt style:
- Strict (default): industry-neutral, demographics-neutral. No gender / age / ethnicity / profession claims; people are referred to as "a person" and only directly observable features are described. Fixes the failure mode where industry priming caused the model to label any person at a desk in healthcare mode as "nurse".
- Permissive: vivid description. Industry priming is back on and the model may describe apparent gender / profession when the scene supports it. The caller accepts the bias risk.
Defence-in-depth: any unknown style value from the wire (typo,
enum drift) clamps back to strict server-side. The router accepts
the choice as a description_style form field on
POST /api/capture/frame.
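A minimal sketch of the server-side clamp, assuming the two wire values are the lowercase strings "strict" and "permissive" and that matching is exact. The function name clamp_description_style is hypothetical.

```python
from typing import Optional

# Assumption: these are the only two accepted wire values.
ALLOWED_STYLES = frozenset({"strict", "permissive"})


def clamp_description_style(value: Optional[str]) -> str:
    """Return the requested style, or "strict" for anything unrecognised."""
    return value if value in ALLOWED_STYLES else "strict"
```

A typo such as "permisive" or a missing form field therefore lands on the safer prompt rather than raising a validation error.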
Added — UI: Live Capture tab with rendered AI fields
- New tab at /demo-data → Live Capture with three modes: Take photo, Burst photos (interval-driven), Record video (MediaRecorder, chunked with an operator-set chunk length).
- Recent strip now renders the AI work per tile: a 1-line truncated summary, scene_category as a small pill, tags as chips (max 4 visible), and detected_text as an OCR caption. Previously the strip rendered only file size / capture id.
- Description style segmented control (Strict / Permissive) beside the AI mode toggle, disabled until AI mode is on.
Changed — Migration logging
ALTER TABLE ADD COLUMN IF NOT EXISTS failures in _ensure_capture_table now log at warning level instead of debug, so a genuine migration failure shows up in the API log instead of silently leading to "column not found" on the next INSERT.
Unreleased — AI-drafted narrative content + token budget for the Documents tab
The Documents tab on /demo-data (see
guide/unstructured-demo-data → Documents)
gains an AI mode that drafts narrative text via a user-picked
Databricks Model Serving endpoint, with a per-job token budget and
graceful template fallback. Pure-template generation continues to
work unchanged when AI is off or unconfigured.
Added — _AIAdapter for narrative drafting
- New class _AIAdapter in src/demo_documents.py. Wraps AIService._call_llm with a per-job token counter and a system prompt tuned for synthetic-document text ("output ONLY the requested content, no preamble, no markdown"). The adapter is constructed once per job and threaded through every generator as an ai_client parameter. Generators don't need to know the budget exists: they call .draft(prompt, fallback) and the adapter degrades to the fallback when the budget is exhausted.
- Dual-backend routing. When the request carries an X-Databricks-Model: <endpoint-name> header, the adapter routes through Databricks Model Serving; the UI's api-client sets the header automatically from localStorage.dbx_model whenever the user has picked an endpoint in Settings (same pattern the AI assistant uses). Otherwise it falls back to the Anthropic API path (ANTHROPIC_API_KEY). When neither is configured, the runner logs a warning and proceeds in template-only mode.
- Per-job token budget. New ai_token_budget field on DemoDocumentsRequest (default 50,000, range 0–10,000,000). Default ≈ $0.50 on Sonnet at typical max_tokens. Set to 0 to disable AI even when realistic_content=True. Accounting is conservative: every call charges the full requested max_tokens (the SDK doesn't surface usage), biasing toward stopping early for cost safety.
- Job summary fields. Completion now reports ai_backend (e.g. "databricks:my-endpoint" or "anthropic"), ai_calls, ai_tokens_used, and ai_fallbacks so operators can see how the budget was spent.
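The conservative budget accounting can be illustrated with a small stand-in for the adapter. BudgetedDrafter, its constructor, and the default max_tokens of 512 are assumptions made for this sketch; only the .draft(prompt, fallback) call shape and the counter names echo the entries above.

```python
from typing import Callable


class BudgetedDrafter:
    """Hypothetical stand-in for _AIAdapter: charges max_tokens per call."""

    def __init__(self, call_llm: Callable[[str, int], str], token_budget: int) -> None:
        self._call_llm = call_llm      # injected LLM call: (prompt, max_tokens) -> text
        self.token_budget = token_budget
        self.tokens_used = 0           # reported as ai_tokens_used
        self.calls = 0                 # reported as ai_calls
        self.fallbacks = 0             # reported as ai_fallbacks

    def draft(self, prompt: str, fallback: str, max_tokens: int = 512) -> str:
        # Refuse the call if the full max_tokens would overshoot the budget.
        if self.tokens_used + max_tokens > self.token_budget:
            self.fallbacks += 1
            return fallback
        # Conservative accounting: charge the full request up front because
        # real usage is not available.
        self.tokens_used += max_tokens
        self.calls += 1
        try:
            text = self._call_llm(prompt, max_tokens).strip()
        except Exception:
            self.fallbacks += 1
            return fallback
        if not text:
            self.fallbacks += 1
            return fallback
        return text
```

With a budget of 1,000 and 512-token calls, the first draft goes to the model and the second returns the template text, which is the "stop early" bias the entry describes. A budget of 0 never reaches the model at all.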
Added — Distinctness primitives + expanded industry context
To avoid the "every PDF reads identical" problem on a 10,000-row corpus, the generators gain three small primitives — used regardless of AI mode:
- _rotate(*variants): random.choice over phrasing variants for openings, salutations, transitions.
- _maybe_section(prob): random optional inclusion of secondary sections so document length and shape vary.
- _INDUSTRY_CONTEXT registry expansion: 2–3× more diagnosis codes, treatment codes, department names, transaction types, store codes, product categories, and services across all ten industries. Sized large enough that a 10,000-row corpus has visible variety without AI mode.
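As a rough sketch of how the first two primitives might look, the functions below follow the one-line descriptions above. The optional rng parameter is an addition for reproducibility and is not taken from the source.

```python
import random
from typing import Optional


def rotate(*variants: str, rng: Optional[random.Random] = None) -> str:
    """Pick one phrasing variant at random (sketch of _rotate)."""
    return (rng or random).choice(variants)


def maybe_section(prob: float, rng: Optional[random.Random] = None) -> bool:
    """Return True with probability `prob` (sketch of _maybe_section)."""
    return (rng or random).random() < prob


# Usage: vary the opening and optionally append a secondary section.
rng = random.Random(7)
lines = [rotate("Dear customer,", "Hello,", "To whom it may concern,", rng=rng)]
if maybe_section(0.4, rng=rng):
    lines.append("Additional notes: none.")
```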
Changed — Documents request model
- realistic_content description updated to call out both backends ("a Databricks Model Serving endpoint picked in Settings or ANTHROPIC_API_KEY") instead of Anthropic only.
- New ai_token_budget field accepted on POST /api/generate/demo-documents. Older clients omitting it pick up the default; no breaking change.
- Router accepts the new X-Databricks-Model header and forwards it to the JobManager as ai_endpoint_name in the job config.
Unreleased — Code tab + dynamic catalog/schema/volume picker on /demo-data
Adds a fifth unstructured tab and unifies destination selection across all five tabs behind a single picker component. See guide/unstructured-demo-data → Code.
Added — Code tab
- Three generators in src/demo_code.py: python_repo (src/ package + tests + README + pyproject.toml), js_repo (ES6 with package.json), java_repo (src/main/java + src/test/java + pom.xml). Each repo is ~25–35 files.
- Per-type cap is 50 repos (≈1,500 source files per type), intentionally lower than Documents/Knowledge because building the per-repo file set is non-trivial.
- direct_table is one row per source file with content STRING inline, the natural shape for code-search embeddings, which work at the file level, not the repo level. The schema is (repo_name, language, file_path, content STRING, …).
- Endpoints: GET /api/generate/demo-code/types, POST /api/generate/demo-code/preview, POST /api/generate/demo-code.
- UI: ui/src/app/demo-data/CodeTab.tsx, same shape as Documents/Media/Knowledge/Logs (destination radio, picker, industry, type grid, preview).
Added — CatalogSchemaVolumePicker shared component
- New file: ui/src/components/CatalogSchemaVolumePicker.tsx. Replaces free-text catalog/schema/volume Input fields across all five unstructured tabs (Documents, Media, Knowledge, Logs, Code).
- Three dropdowns + custom-name fallback per field. Each field shows existing names from the workspace plus a "Custom name… (create new)" option that swaps in a free-text input. The runner auto-creates new schemas and volumes on submit via CREATE SCHEMA IF NOT EXISTS / CREATE VOLUME IF NOT EXISTS.
- API endpoints called: GET /api/catalogs, GET /api/catalogs/{catalog}/schemas, GET /api/auth/volumes. Volumes are filtered to the chosen catalog.schema scope; the schemas fetch is skipped while the user is still typing a custom catalog name.
- Volume picker disables on direct_table. The label flips to "(unused for direct_table)" but the field stays visible so layout doesn't shift.
Unreleased — Logs tab on /demo-data
Adds a Logs tab generating synthetic log corpora for observability, SIEM, and anomaly-detection demos. See guide/unstructured-demo-data → Logs.
Added — Four log generators
- nginx_access: combined-log-format with a 24-hour traffic curve peaking at 10 and 16 UTC; status distribution ~94% 2xx / 4% 3xx / 1% 4xx / 1% 5xx.
- app_json: JSON Lines, level mix ~94% INFO / 5% WARN / 1% ERROR with realistic message templates.
- syslog: RFC 5424 with a per-industry service registry (e.g. auth-svc, billing-svc for financial; pacs-gw, ehr-api for healthcare).
- otel_trace: OpenTelemetry span trees, 3–8 spans per trace with parent_span_id wired so traces render correctly in Tempo / Jaeger / Databricks observability dashboards.
Added — Two extra cadence inputs
- lines_per_file (default 1,000, range 1–100,000): lets a single file represent anything from a 5-minute slice to a full-day log without changing the file count.
- days_back (default 7, range 1–365): files are spread evenly across days_back UTC days with peak-hour clustering inside each day, so a 7-day corpus produces a realistic weekly pattern.
Added — direct_table writes one row per LINE (not per file)
- Schema for direct_table:

      CREATE OR REPLACE TABLE <fqn> (
        log_id STRING,
        log_type STRING,
        service STRING,
        ts TIMESTAMP,
        level STRING,
        message STRING,
        attrs MAP<STRING, STRING>,
        generated_at TIMESTAMP
      ) USING delta;

  Operators query attrs['status'] etc. without reshaping. The per-file volume_with_catalog schema is preserved separately for file-level metadata demos.
- Per-type cap: 1,000 files. With lines_per_file=100,000 that's 100 M rows max per type per submit.
- Endpoints: GET /api/generate/demo-logs/types, POST /api/generate/demo-logs/preview, POST /api/generate/demo-logs.
Unreleased — Documents, Media, Knowledge tabs on /demo-data
Introduces three new tabs on the existing /demo-data page that
generate unstructured demo corpora — files (and inline-bytes
Delta tables) instead of typed Delta columns. See the full guide at
guide/unstructured-demo-data.
Added — Documents tab
- 29 document types in a registry in src/demo_documents.py: 9 industry-aware originals (pdf_claim, pdf_invoice, pdf_contract, docx_letter, docx_report, pptx_deck, xlsx_budget, xlsx_inventory, eml_message) plus 20 industry-specific additions (lab reports, account statements, BOL/customs forms, property listings, syllabi, …). The picker filters to types that make sense for the chosen industry; e.g. pdf_lab_report only appears when industry=healthcare.
- Three destinations: volume (files only), volume_with_catalog (default; files + Delta index, one row per file), direct_table (content BINARY inline; no Volume writes).
- Per-type cap: 10,000.
- Dependency gate: requires clone-xs[documents] (reportlab, python-docx, python-pptx, openpyxl). The /types endpoint surfaces available: false with an install hint when missing, and POST /demo-documents returns a structured 503 with the install command instead of a generic error.
- Endpoints: GET /api/generate/demo-documents/types, POST /api/generate/demo-documents/preview, POST /api/generate/demo-documents.
- UI: ui/src/app/demo-data/DocumentsTab.tsx.
Added — Media tab
- Five generators in src/demo_media.py: img_xray (512×512 grayscale), img_scan (800×1000 off-white scanned-doc look), img_photo (600×400 stock-photo placeholder), audio_voicemail (2-second sine + Faker transcript line), video_clip (320×240 H.264 MP4 at 15 fps).
- Per-type cap: 5,000 (lower than Documents because media files are larger).
- Dual dependency probe. /types returns both available (Pillow, required for images and the voicemail transcript path) and ffmpeg_available (required only for video_clip). When ffmpeg_available: false the UI greys out the Video Clip checkbox; the four other types remain selectable.
- direct_table caveat for video. Delta has a ~16 MB row-size cap that a busy video_clip run can blow through. The runner doesn't split or truncate today (v2 work). Video-heavy demos should pick volume_with_catalog; direct-table video demos should keep counts low.
- Endpoints: GET /api/generate/demo-media/types, POST /api/generate/demo-media/preview, POST /api/generate/demo-media.
- UI: ui/src/app/demo-data/MediaTab.tsx.
Added — Knowledge tab
- Three generators in src/demo_knowledge.py: wiki_article (markdown body + YAML frontmatter), qa_pair (one-question-per-file JSON), chat_thread (Slack-export-shaped JSONL threads).
- No extra deps: pure stdlib + Faker. The /types endpoint always returns available: true.
- Per-industry topic IA. Each output file lands in a <topic> sub-directory (e.g. wiki_article/billing/…) so RAG demos can filter on topic without parsing filenames.
- direct_table content type is STRING (not BINARY). Knowledge bodies are text and should be queryable inline: SELECT content FROM demo_knowledge WHERE topic='billing' AND content LIKE '%refund%'.
- Per-type cap: 10,000.
- Endpoints: GET /api/generate/demo-knowledge/types, POST /api/generate/demo-knowledge/preview, POST /api/generate/demo-knowledge.
- UI: ui/src/app/demo-data/KnowledgeTab.tsx.
Added — Shared validation
The five unstructured request models (DocumentsRequest, MediaRequest, KnowledgeRequest, LogsRequest, CodeRequest) share validators:
- Catalog / schema / volume must each be a single Unity Catalog identifier (no dotted FQNs).
- volume is required when destination is volume or volume_with_catalog; ignored on direct_table.
- counts keys must appear in types (catches stale form state).
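The three shared rules can be expressed as a plain function, sketched below without Pydantic so it runs standalone. The function name, its argument list, and the identifier regex (letters, digits, underscore) are assumptions for illustration; the real validators and the exact identifier charset may differ.

```python
import re
from typing import Dict, List, Optional

# Assumption: a "single identifier" is letters, digits and underscores only.
_IDENT_RE = re.compile(r"^[A-Za-z0-9_]+$")
_VOLUME_DESTINATIONS = {"volume", "volume_with_catalog"}


def validate_request(
    catalog: str,
    schema: str,
    volume: Optional[str],
    destination: str,
    types: List[str],
    counts: Dict[str, int],
) -> List[str]:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors: List[str] = []
    for label, value in (("catalog", catalog), ("schema", schema)):
        if not _IDENT_RE.match(value):
            errors.append(f"{label} must be a single identifier, got {value!r}")
    if destination in _VOLUME_DESTINATIONS:
        if not volume:
            errors.append(f"volume is required when destination is {destination}")
        elif not _IDENT_RE.match(volume):
            errors.append(f"volume must be a single identifier, got {volume!r}")
    # On direct_table the volume value is ignored entirely.
    stale = sorted(set(counts) - set(types))
    if stale:
        errors.append(f"counts has keys not present in types: {stale}")
    return errors
```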
Unreleased — Streaming Events form: presets, configurable limits, warehouse-impact hints, chart polish
A focused round of ergonomics on the /demo-data Streaming Events
tab. No public API contract changes for the existing POST /api/generate/demo-data/streaming request; three new GET/PATCH
endpoints surface the form-bounds config so workspace admins can
widen or narrow the form without code changes.
Added — Configurable streaming-form bounds
The form's three cadence inputs (events_per_batch,
interval_seconds, total_duration_seconds) used to have hardcoded
min/max/default values in three places (UI clamp logic, Pydantic
validators, runner defaults). All three now read from a single
source admins can edit.
- New file: config/streaming_limits.json. Stores the per-field {default, min, max} for the three streaming-form fields. Independent of clone_config.yaml: these are UX form bounds, not clone orchestration. Created on first save via the Settings page; until then the API serves built-in defaults.
- New helper: src/config.get_streaming_limits() and set_streaming_limits(). The read is mtime-cached, so streaming validation is a dict access, not file I/O. set_streaming_limits does merge-on-write so partial updates don't have to resend the whole shape, and writes atomically via .tmp + os.replace. It validates min ≤ default ≤ max per field before persisting, so the file is never written into a state that would 422 every subsequent streaming request.
- Pydantic validators converted from Field(ge, le) to @field_validator. StreamingEmissionRequest, StreamingScheduleRequest, and ZerobusSnippetRequest all read bounds via _check_streaming_bound at request time. Defaults switched to Field(default_factory=lambda: _streaming_default(...)) so the API's default value tracks edits to config/streaming_limits.json without a server restart. Sub-second interval_seconds is preserved (min=0.1) so existing direct-API callers using fractional cadence don't break.
- Runner defaults read from config too. src/demo_streaming.py now uses _limits["events_per_batch"]["default"] instead of a hardcoded 100 when the caller's config dict omits the field. Same for the other two fields.
- New endpoint: GET /api/config/streaming-limits. Returns the current form bounds. Used by the Settings page card.
- New endpoint: PATCH /api/config/streaming-limits. Partial updates supported. Returns 400 with a descriptive message on invariant violation. The cache invalidates so the next form fetch picks up the new bounds within a second.
- New endpoint: GET /api/generate/demo-data/streaming/limits. Focused endpoint the /demo-data page reads on mount; same source as the config endpoint, with no need to fetch the full blob.
- New Settings card: Settings → Performance → Streaming Form Limits. Three-row × three-column grid (one row per field × default/min/max). The Save button calls the PATCH endpoint with full state; the Reset button reverts to built-in defaults locally (the admin still has to click Save to persist). Runs the same client-side invariant check as the server before round-tripping.
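The merge-on-write, invariant check, and atomic write described for set_streaming_limits could look roughly like this. The function names and the BUILTIN values are placeholders for illustration; only the events_per_batch default of 100 and the interval_seconds minimum of 0.1 come from the entries above.

```python
import json
import os
from pathlib import Path
from typing import Dict

Limits = Dict[str, Dict[str, float]]

# Placeholder built-ins; most numbers here are invented for the sketch.
BUILTIN: Limits = {
    "events_per_batch": {"default": 100, "min": 1, "max": 1_000_000},
    "interval_seconds": {"default": 5, "min": 0.1, "max": 3600},
    "total_duration_seconds": {"default": 60, "min": 1, "max": 86_400},
}


def merge_limits(current: Limits, patch: Limits) -> Limits:
    """Overlay a partial patch and enforce min <= default <= max per field."""
    merged = {name: dict(bounds) for name, bounds in current.items()}
    for name, bounds in patch.items():
        if name not in merged:
            raise ValueError(f"unknown streaming field: {name}")
        merged[name].update(bounds)
    for name, bounds in merged.items():
        if not bounds["min"] <= bounds["default"] <= bounds["max"]:
            raise ValueError(f"{name}: expected min <= default <= max, got {bounds}")
    return merged


def save_limits(path: Path, current: Limits, patch: Limits) -> Limits:
    """Validate first, then write to a temp file and swap it into place."""
    merged = merge_limits(current, patch)  # raises before any file is touched
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(merged, indent=2))
    os.replace(tmp, path)  # atomic rename on the same filesystem
    return merged
```

Validating before the temp file is written means a rejected PATCH leaves the previous file byte-for-byte intact.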
Added — Performance presets row on /demo-data
One-click bundles of destination + cadence tuned for different
throughput tiers. Picking a preset sets destination,
events_per_batch, interval_seconds, and total_duration_seconds
in one click. Active preset auto-detected by exact-match comparison;
manually editing any field flips the indicator to Custom.
- Four presets shipped: Demo (volume_bronze / 100 / 5s / 60s), Direct small batches (direct_table / 50K / 1s / 300s), Bulk files (volume_bronze / 100K / 2s / 300s), Streaming Zerobus (zerobus / 1M / 5s / 600s).
- Clamping to admin-configured bounds. Preset values pass through the same clamp as manual edits: if events_per_batch.max has been narrowed in Settings, a preset whose batch size exceeds the cap applies clamped values and a toast.warning explains the gap.
- Zerobus preset gated. Disabled (with a tooltip explaining why) when the Zerobus SDK isn't installed or Premium tier isn't available, the same gating as the destination radio.
- Active-preset highlight. The matching preset gets the brand #E8453C accent border; a "Custom — current settings don't match any preset" hint appears below the row when the user has drifted off-preset.
Added — Per-destination warehouse-impact indicators
Each radio card under Destination now surfaces a one-line italic note explaining how that destination uses the SQL warehouse:
- volume: "Warehouse: not used. Files write directly to UC Volume." (emerald)
- volume_bronze: "Warehouse: one-time CREATE OR REFRESH STREAMING TABLE. Refresh runs on its own DBSQL Serverless pool." (emerald)
- direct_table: "Warehouse: every tick. INSERT VALUES is single-driver-bound — pick the largest serverless you can." (amber)
- zerobus: "Warehouse: one-time DDL only (CREATE TABLE + GRANTs). Idle during streaming. Smallest warehouse is fine." (emerald)
Color is currentColor + text-emerald-{600,400} /
text-amber-{600,400}, so it adapts to all 10 themes. The amber
note on direct_table is the highest-leverage hint — INSERT VALUES
throughput is bounded by the warehouse driver's parse speed, which
no other destination cares about.
Added — Throughput chart enhancements
The streaming progress card's throughput chart switched from
<LineChart> to <ComposedChart> and gained:
- Tooltip label fix. Both lines previously rendered as "Events / tick" because the formatter checked name (the legend label, which Recharts maps from the name prop) instead of dataKey. It now uses dataKey, so "Cumulative events" and "Events / tick" are always distinguished.
- K/M/B number formatting. New top-level fmtN helper. Y-axis ticks render 3M instead of 3000000; tooltip values render the same. Major readability win once batch size passes ~10K.
- Subtle area fill under the cumulative line via a <linearGradient> from 25% alpha at top to 2% at bottom. Gives the line visual weight without dominating.
- Expected-throughput reference line. Horizontal dashed line on the per-tick axis at the configured events_per_batch, labeled "expected N/tick". Hidden when the configured value is less than 1% of peak per-tick delta (e.g. the user changed the form to 100 after running with 1M batches): at that scale the line is flush against the X-axis and the label collides with the last X-tick.
- Per-tick error markers. Snapshot history captures tick_errors alongside events_emitted; the chart computes hasError per snapshot from errorDelta > 0 and renders a red ⨯ circle on any tick where errors went up. A separate hidden <Line> carries the custom dot so the visual doesn't interfere with the cumulative <Area>.
- Theme-aware colors. All hardcoded #374151 strokes replaced with currentColor + className="text-muted-foreground" so the chart renders correctly across light / dark / midnight / sunset / high-contrast / ocean / forest / solarized / rose / slate themes.
- Taller chart (160 → 220px). With axis labels on both Y axes ("cumulative" / "per tick"), the previous height was cramped.
- Y-axis labels and X-axis spacing fixes. Reference line label position changed from insideTopRight (which collided with the last X-axis tick) to insideTopLeft. Bottom margin bumped 18 → 30 so the X-axis title and Legend no longer crowd each other. Right margin 16 → 24 so the last X-tick has breathing room.
Doc updates
- Demo Data Generator guide gained four new subsections: Performance presets, warehouse-impact column on the destination modes table, Throughput chart, and Form-bound limits (with cross-links to the new endpoints).
- API reference gained three new endpoint entries: GET /api/config/streaming-limits, PATCH /api/config/streaming-limits, GET /api/generate/demo-data/streaming/limits.
Tests
All 72 existing streaming tests still pass after the Pydantic refactor. Smoke-tested end-to-end via TestClient: GET with no file returns built-in fallback, PATCH with partial update writes the file, GET reflects the new bounds, Pydantic accepts a value that was 422'd before the PATCH, invalid PATCH (min > max, default outside range) returns 400 with descriptive detail.
v0.9.0 — N×N table-format converter, Zerobus PAT auth + reliability hardening
Turns the four cheap CTAS cells from "skipped" to working in the convert page (so the matrix now ships six format pairs end-to-end), plus a substantial reliability + ergonomics pass on the Zerobus streaming destination. All new paths are additive; defaults are unchanged from v0.8.x.
Added — N×N table-format converter
The convert page handles six format pairs end-to-end now (was: two). Hudi remains gated behind a Job-cluster runtime decision (D3, not yet shipped).
- Four new pairs unlocked. (DELTA, ICEBERG), (PARQUET, ICEBERG), (DELTA, PARQUET), (ICEBERG, PARQUET) are now executable. Combined with the original D1 pair set, the total is {(PARQUET, DELTA), (ICEBERG, DELTA), (DELTA, ICEBERG), (PARQUET, ICEBERG), (DELTA, PARQUET), (ICEBERG, PARQUET)}.
- Strategy registry. New src/format_strategies.py ships four primitives (enable_uniform_plan, ctas_iceberg_plan, ctas_iceberg_inplace_plan, ctas_parquet_inplace_plan), each returning a Plan of labelled PlanSteps. The convert_table_format orchestrator picks the right primitive for each (source, target) pair via a _dispatch_strategy lookup. The audit row's new strategy_used column records which path ran (convert_to_delta, uniform, ctas_iceberg, ctas_parquet).
- iceberg_physical flag on ConvertToDeltaRequest. Only meaningful for (DELTA, ICEBERG) rows. false (default) picks the UniForm-update path (no data movement, table stays Delta with Iceberg metadata). true picks the temp+rename CTAS path that produces a real Iceberg table; UC reports Data source: Iceberg. Mirrors the same flag on CloneRequest.
- keep_backup flag on ConvertToDeltaRequest. For temp+rename CTAS pairs (any → ICEBERG/PARQUET when not UniForm), true (default) renames the source aside as {fqn}_pre_convert_<utc> for reversibility. false drops the source after the rename, which is non-recoverable.
- Per-pair compatibility preflight. New src/format_compat.py runs DESCRIBE TABLE EXTENDED before strategy dispatch and refuses pairs with known incompatibilities. Today's checks: (ICEBERG, *) refuses hidden-partition Iceberg sources (delegates to clone_iceberg.preflight_iceberg_source); (DELTA, ICEBERG) and (DELTA, PARQUET) refuse GENERATED ALWAYS / identity columns. A refusal returns status="skipped" with a structured reason and emits no SQL. The preflight itself is skipped on dry-run so operators can preview the plan against known-incompatible sources.
- Plan / PlanStep execution model. Every strategy now builds a multi-step Plan up-front (no execute-and-then-build). On step failure, the exception is wrapped with the step's label (step 'disable deletion vectors' failed: …) so operators see which DDL blew up without parsing the SQL. Dry-run renders every step in the log so the wizard preview shows the full sequence, not just the first statement.
- Audit schema migration. convert_operations gained destination_format STRING (D1) and strategy_used STRING (D2), applied via idempotent ALTER TABLE ADD COLUMN IF NOT EXISTS + UPDATE … WHERE col IS NULL backfill on first call. Pre-D1 rows backfilled to "DELTA"; pre-D2 rows left empty.
- UI page rename. /convert-to-delta → /convert (the old name was misleading once the page handles every target). The old URL keeps working via a <Navigate to="/convert" replace /> redirect. Sidebar entry updates to "Convert table format". Doc page renamed to docs/docs/guide/convert.md.
- UI per-row target dropdown. Each cart row gets a target-format select; the Default target format selector applies to newly-added rows only. Hudi option present-but-disabled with a tooltip referencing the runtime sponsorship gate. Pre-submit validation against a client-side SUPPORTED_PAIRS set so unsupported pairs render an inline warning before the user clicks Submit.
- Status-badge colour fix. The status chips in Results and Recent Runs ("converted" / "failed" / "skipped") now render in the correct emerald / red / grey palette. Earlier they all rendered as the brand-red default-variant Badge because the per-status Tailwind classes were being overridden by bg-primary. Fix: pass variant="outline" so the variant adds no background and the utility classes win cleanly.
- Page copy refresh. Banner ("rewritten in place to the chosen target format" instead of D1's hard-coded "rewritten to Delta in place"), default target dropdown labels (strategy-aware: "DELTA — CONVERT TO DELTA (in-place)" / "ICEBERG — UniForm metadata, or physical CTAS (toggle below)" / "PARQUET — CTAS (loses Delta history)"), and confirmation dialog text updated to match the N×N reality.
- nonConvertibleReason is target-aware. The table browser used to grey out every Delta source with "already Delta", which was wrong once Delta could be a source for ICEBERG/PARQUET targets. It now takes the chosen target as a second arg and only marks identity rows (source = target) as "already X".
- Tests: +9 D2 tests covering each per-pair cell, the keep_backup-off DROP path, the compat-preflight refusal path, dry-run-skips-preflight, and the supported-pairs registry shape. Total Zerobus + convert + format-strategies suite: 52 tests, all passing alongside the unchanged 2025+ existing tests.
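To make the dispatch and labelled-step ideas concrete, here is a compact sketch. The pair-to-strategy mapping is inferred from the entries above rather than copied from src/format_strategies.py, and dispatch_strategy, uniform_plan, and run_plan are hypothetical names with simplified signatures. The three UniForm statements follow the sequence described under v0.8.0.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class PlanStep:
    label: str  # human-readable name used in logs and error messages
    sql: str


@dataclass
class Plan:
    strategy: str
    steps: List[PlanStep] = field(default_factory=list)


# Assumed mapping from (source, target) to the strategy_used value.
_STRATEGY_BY_PAIR: Dict[Tuple[str, str], str] = {
    ("PARQUET", "DELTA"): "convert_to_delta",
    ("ICEBERG", "DELTA"): "convert_to_delta",
    ("DELTA", "ICEBERG"): "uniform",
    ("PARQUET", "ICEBERG"): "ctas_iceberg",
    ("DELTA", "PARQUET"): "ctas_parquet",
    ("ICEBERG", "PARQUET"): "ctas_parquet",
}


def dispatch_strategy(source: str, target: str, iceberg_physical: bool = False) -> Optional[str]:
    """Return the strategy name for a pair, or None when unsupported."""
    strategy = _STRATEGY_BY_PAIR.get((source.upper(), target.upper()))
    if strategy == "uniform" and iceberg_physical:
        return "ctas_iceberg"  # physical Iceberg instead of UniForm metadata
    return strategy


def uniform_plan(fqn: str) -> Plan:
    """Build the three UniForm steps up-front, in the mandatory order."""
    return Plan("uniform", [
        PlanStep("disable deletion vectors",
                 f"ALTER TABLE {fqn} SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')"),
        PlanStep("purge deletion vectors", f"REORG TABLE {fqn} APPLY (PURGE)"),
        PlanStep("enable UniForm",
                 f"ALTER TABLE {fqn} SET TBLPROPERTIES ("
                 "'delta.universalFormat.enabledFormats' = 'iceberg', "
                 "'delta.enableIcebergCompatV2' = 'true', "
                 "'delta.columnMapping.mode' = 'name')"),
    ])


def run_plan(plan: Plan, execute: Callable[[str], None], dry_run: bool = False) -> List[str]:
    """Log every step; execute unless dry_run; wrap failures with the label."""
    log: List[str] = []
    for step in plan.steps:
        log.append(f"{plan.strategy}: {step.label}")
        if dry_run:
            continue
        try:
            execute(step.sql)
        except Exception as exc:
            raise RuntimeError(f"step '{step.label}' failed: {exc}") from exc
    return log
```

Building the whole Plan before executing anything is what lets a dry run list all three steps even though none of them runs.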
Added — Zerobus PAT auth (zerobus_auth_mode: "pat")
The Zerobus streaming destination (demo-data guide) gained a second auth path so users without a service principal can still stream.
- zerobus_auth_mode: Literal["oauth", "pat"] on StreamingEmissionRequest. Default "oauth" preserves the original SP-based flow. Setting "pat" makes the runner skip the form's SP fields and instead lift client.config.token off the logged-in WorkspaceClient, passing it via a custom HeadersProvider (subclass of zerobus.sdk.shared.HeadersProvider) that returns Authorization: Bearer <pat> on every gRPC request.
- open_zerobus_stream(pat=…) parameter. When pat is non-empty, the SDK is given the headers provider and client_id / client_secret are passed as empty strings (the SDK ignores them when headers_provider is set, per the create_stream signature in zerobus/sdk/sync/zerobus_sdk.py:282).
- API model: conditional validation. When auth_mode='oauth' the validator requires server_endpoint + client_id + client_secret. When auth_mode='pat' only server_endpoint is required: the form's SP fields are hidden in PAT mode and the _zerobus_requires_credentials validator omits them from the missing-fields check.
- UI step-by-step layout. The credentials block is now a 5-step vertical stepper with numbered circles that swap to green checkmarks once each step's predicate is satisfied: (1) auth mode → (2) server endpoint → (3) credentials (SP fields or PAT info card depending on mode) → (4) verify (OAuth-only, optional) → (5) catalog storage (optional). The bulky "One-time admin prerequisite" callout is collapsed into a <details> block at the top.
- Caveat surface. PAT mode shows an inline amber note: the Zerobus server may still reject PATs that lack the right scopes; if invalid_client shows up in PAT mode, fall back to OAuth and supply an SP.
Added — Zerobus reliability hardening
Several footguns surfaced during real Premium/Enterprise testing. All landed as additive fixes; none change the public API contract.
- Pre-flight existence check.
ensure_zerobus_tablenow doesSHOW CATALOGS/SHOW SCHEMAS IN <cat> LIKE <schema>before issuing CREATE. Workspaces without a metastore default storage root rejectCREATE CATALOG IF NOT EXISTSwithINVALID_STATE— even when the catalog already exists, because Databricks evaluates the storage prerequisite before the IF-NOT-EXISTS short-circuit. Doing SHOW first lets us skip CREATE entirely in the idempotent case. - Optional
zerobus_catalog_location. New form field accepts an abfss:// / s3:// / gs:// URI. When populated, the runner appends MANAGED LOCATION '<path>' to the CREATE CATALOG. Required only on workspaces without a default storage root; ignored when the catalog already exists. SQL injection guard: single quotes inside the path are doubled.
- Auto-grant CREATE TABLE on schema. _grant_zerobus_perms now applies four grants instead of three: USE CATALOG, USE SCHEMA, CREATE TABLE on schema, and MODIFY, SELECT on the table. The new grant lets the SP create additional tables for follow-up Zerobus runs without re-granting per table. It stops short of ALL PRIVILEGES: the SP can't drop or alter the schema itself.
- Stream auto-reopen. When the per-tick ingest_batch_zerobus raises with Stream is closed, the runner catches it, calls a closure that re-opens the stream with the same args (fresh gRPC connection + auth), and continues with the next tick. The current batch is lost; subsequent ticks land against the fresh stream. A new stream_reopens counter surfaces in the streaming progress dict + final result. This works around an SDK recovery=True that doesn't fire reliably for the status: Internal close we observe in practice.
- wait_for_offset per batch. ingest_record_offset is fire-and-buffer: it returns an offset immediately without waiting for the server to commit. After each batch, the runner now blocks on stream.wait_for_offset(last_offset) to ensure records are actually committed before the next tick. Without this, the runner reported rows_inserted: 600 against an empty destination table because all records were sitting in the local SDK buffer when the server tore down the stream.
- flush() before close() in close_zerobus_stream. Drains pending records from the SDK's local buffer before closing the gRPC stream. Resilient on flush failure (still attempts close so the connection doesn't leak). Per-tick wait_for_offset covers the in-stream case; this covers the end-of-run case.
- TIMESTAMP / DATE encoding for JSON records. Per the upstream Zerobus README's Delta type-mapping table, TIMESTAMP / TIMESTAMP_NTZ map to int64 (microseconds since epoch) and DATE to int32 (days since 1970-01-01). The shared DEVICE_PROFILES generators emit now.isoformat() for the volume_bronze / direct_table paths; the new encode_record_for_zerobus(record, columns) helper rewrites timestamps and dates at the SDK boundary so the JSON wire shape matches what the Zerobus server's decoder expects. Symptom of getting this wrong (and what we hit in practice): Record decoder/encoder error: invalid digit found in string at line 1 column N.
- Azure region detection. derive_zerobus_endpoint now resolves Azure workspaces' regions via the same DNS-CNAME-walking approach used for AWS. Azure workspace hostnames alias through <region>.azuredatabricks.net (e.g. uksouth) before terminating at ingress.<region>.azuredatabricks.net; the resolver matches both. Earlier the helper unconditionally returned region: null for Azure, prompting the user to look it up in the Portal. GCP region detection remains a defer-to-user case (DNS topology there is patchy).
- Per-tick error visibility in the streaming UI. The runner's per-tick try/except block previously logged and swallowed errors: a job where every tick failed silently surfaced as Completed — 0 events with the real cause buried in API server logs. The streaming progress dict now carries last_error (str) and tick_errors (int); the UI's job panel renders an amber callout below the metrics grid when tick_errors > 0, showing the exception type + message verbatim.
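The TIMESTAMP / DATE rewrite described above can be sketched as follows. This is a minimal illustration, not the project's encode_record_for_zerobus: the function names, the shape of the columns mapping (column name to Delta type string), and the choice to treat naive timestamps as UTC are all assumptions.

```python
from datetime import date, datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_DATE = date(1970, 1, 1)


def encode_value_sketch(value, column_type):
    """Rewrite an ISO-8601 string into the integer form for TIMESTAMP / DATE columns."""
    if not isinstance(value, str):
        return value  # already numeric, None, etc.: pass through untouched
    kind = column_type.upper()
    if kind in ("TIMESTAMP", "TIMESTAMP_NTZ"):
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are interpreted as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        delta = parsed - EPOCH
        # Integer arithmetic avoids float rounding on large epoch values.
        return delta.days * 86_400_000_000 + delta.seconds * 1_000_000 + delta.microseconds
    if kind == "DATE":
        return (date.fromisoformat(value[:10]) - EPOCH_DATE).days
    return value


def encode_record_sketch(record, columns):
    """Apply the per-column rewrite; columns maps column name -> Delta type (hypothetical shape)."""
    return {name: encode_value_sketch(value, columns.get(name, "")) for name, value in record.items()}
```

Converting at the SDK boundary keeps the generators free to emit ISO strings for the other sinks, which matches the split the entry describes.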
v0.8.0 — Iceberg cross-format clone, in-place CONVERT TO DELTA, format-aware audit
Iceberg ↔ Delta cross-format clone, with two follow-up paths: physical Iceberg target and in-place CONVERT TO DELTA. All paths shipped behind explicit opt-in flags; defaults are unchanged from v0.7.x.
Added — Iceberg cross-format clone
- target_format: ICEBERG on CloneRequest. When the source is Delta, after a successful DEEP CLONE the target gets a 3-step UniForm enable: disable delta.enableDeletionVectors, REORG TABLE … APPLY (PURGE), then SET TBLPROPERTIES for delta.universalFormat.enabledFormats=iceberg + delta.enableIcebergCompatV2=true + delta.columnMapping.mode=name. External Iceberg engines (Snowflake, Trino, Athena, Iceberg-aware Spark) can now read the Delta destination without a separate copy. The 3-step ordering is mandatory: Databricks' IcebergCompatV2 validator rejects any other sequence with DELTA_ICEBERG_COMPAT_VIOLATION.DELETION_VECTORS_SHOULD_BE_DISABLED.
- Iceberg-source preflight refusal (Phase B). New module src/clone_iceberg.py runs DESCRIBE TABLE EXTENDED before any DDL and refuses sources that use hidden-partition transforms (bucket(N, col), truncate(N, col), years(col), months(col), days(col), hours(col)). Hidden partitioning has no Delta equivalent; silently dropping it would change partition pruning semantics on the target. The error message names the offending transform and points at CONVERT TO DELTA as the workaround.
- Auto-CTAS recovery for known Iceberg failures (Phase B). When CREATE TABLE … DEEP CLONE fails with partition evolution or truncated-decimal errors on an Iceberg source, Clone-Xs automatically retries as CREATE TABLE … AS SELECT * FROM source. The recovered target lands at Delta version 0 (history is lost); a WARN line in the run log makes the fallback explicit.
- Cross-workspace UniForm. Delta-Sharing-based clones (clone_cross_workspace.py) honour target_format: ICEBERG too: UniForm enable runs on the target after each successful share-based DEEP CLONE.
- Iceberg type-mapping caveats log (Phase C1). Every Iceberg-source clone emits one INFO line listing the lossy mappings (uuid → string, fixed → binary, time unsupported, timestamptz zone loss). It's a log, not a runtime detector: UC surfaces Iceberg types as their already-Sparkified equivalents, so a programmatic schema scan can't see them.
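As a rough illustration of the hidden-partition preflight, the sketch below scans partition expressions for the six transforms listed above. It assumes the expressions have already been pulled out of the DESCRIBE TABLE EXTENDED output as plain strings; that extraction step, the function names, and the error wording are hypothetical and not taken from src/clone_iceberg.py.

```python
import re

# The six hidden-partition transforms named in the changelog entry.
HIDDEN_TRANSFORM = re.compile(r"\b(bucket|truncate|years|months|days|hours)\s*\(", re.IGNORECASE)


def find_hidden_transforms(partition_exprs):
    """Return the partition expressions that use a hidden-partition transform."""
    return [expr.strip() for expr in partition_exprs if HIDDEN_TRANSFORM.search(expr)]


def preflight_sketch(table_fqn, partition_exprs):
    """Refuse the clone before any DDL runs if a hidden transform is present."""
    offending = find_hidden_transforms(partition_exprs)
    if offending:
        raise ValueError(
            f"{table_fqn} uses hidden partitioning ({', '.join(offending)}); "
            "consider CONVERT TO DELTA instead"
        )
```

Identity partitions such as a plain region column pass through, because the pattern only matches a transform name followed by an opening parenthesis.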
Added — Physical Iceberg target
- iceberg_physical: true on CloneRequest. New flag that, combined with target_format: ICEBERG, swaps the UniForm path for CREATE TABLE dst USING iceberg AS SELECT * FROM src. UC reports the destination as Data source: Iceberg rather than Delta. Trade-offs: loses Delta history, loses Delta-only features (deletion vectors, change feed, row tracking), ignores time-travel arguments with a WARN (CTAS doesn't accept TIMESTAMP / VERSION AS OF). Requires DBR 15+ and Iceberg-managed-table support enabled on the workspace.
- UI toggle in the clone wizard. New "Physical Iceberg target" checkbox under the Target Format radio group, visible only when ICEBERG is selected. Inline help text spells out the trade-offs and the workspace-capability requirement.
Added — In-place CONVERT TO DELTA
- POST /api/convert-to-delta endpoint. New synchronous endpoint that mutates Iceberg / Parquet sources to Delta in-place. Distinct from /api/clone because there's no destination: the same FQN keeps pointing at the same data, but the underlying format changes. Two-layer safety gate: a Pydantic validator rejects requests without confirm_destructive: true (or dry_run: true); a module-level check in convert_tables_to_delta re-checks the same flag.
- Auto-skip non-convertible inputs. Already-Delta tables, STREAMING_TABLE, MATERIALIZED_VIEW, VIEW, and unsupported formats (CSV, JSON, etc.) skip with a clear reason; no SQL is sent to the warehouse for these.
- Audit trail (convert_operations Delta table). New helpers ensure_convert_audit_table + log_convert_result in src/audit_trail.py, sibling to the existing clone_operations table. One row per (operation_id, target_fqn) with status / source_format / dry_run / duration / error captured. Init failures fall through to running without audit (best-effort, matches the clone path).
- Web UI (ui/src/app/convert-to-delta/page.tsx). Two-column layout: catalog → schema → tables browser on the left (powered by a new GET /catalogs/{c}/{s}/tables/with-format endpoint that surfaces data_source_format for picker auto-fill), selected-targets cart on the right. Non-convertible rows are visible-but-disabled with inline reason captions. Free-text manual-FQN entry is anchored as an escape hatch for cross-catalog batches. Confirmation modal requires the user to type CONVERT before the destructive submit unlocks; dry_run defaults to true.
- Sidebar entry. New "Convert to Delta" item under Operations, between Clone and Sync.
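The two-layer safety gate can be pictured roughly as below. To stay self-contained this sketch uses a stdlib dataclass as a stand-in for the Pydantic model; the class name, field defaults, and plan_conversion helper are hypothetical, and the emitted statement text is illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConvertRequestSketch:
    """Hypothetical request shape; the real model is a Pydantic class."""
    tables: List[str]
    dry_run: bool = True  # assumption: mirrors the UI default
    confirm_destructive: bool = False

    def __post_init__(self) -> None:
        # Layer 1: reject at request binding.
        if not (self.dry_run or self.confirm_destructive):
            raise ValueError("set confirm_destructive=true or dry_run=true")


def plan_conversion(request: ConvertRequestSketch) -> List[str]:
    # Layer 2: re-check inside the module in case a caller bypassed validation.
    if not (request.dry_run or request.confirm_destructive):
        raise PermissionError("destructive convert not confirmed")
    prefix = "-- dry run: " if request.dry_run else ""
    return [f"{prefix}CONVERT TO DELTA {fqn}" for fqn in request.tables]
```

The second check looks redundant, but it still fires when a request object is built or mutated outside the API layer, which is the case the module-level re-check exists for.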
Added — Operability fixes
- Streaming-table skip is now logged + counted. Previously clone_tables_in_schema silently dropped non-MANAGED/EXTERNAL table types in get_tables(), producing confusing "1 table planned, 0/0/0 results" runs. Now skip lines like [SKIP] Skipping non-clonable table type STREAMING_TABLE: iot.bronze_pos_terminal appear in the log and the skipped counter is bumped, matching the existing skip paths for excluded / regex-filtered / DLT-prefix tables.
- DataSourceFormat SDK enum normalised at the boundary. New _normalize_format helper in src/client.py unwraps the SDK's DataSourceFormat enum to its .value string before downstream code sees it. Fixes a 'DataSourceFormat' object has no attribute 'upper' crash in the per-schema format-rollup that surfaced once non-clonable tables stopped being pre-filtered.
- UniForm 3-step ordering documented in clone.md. New subsection under "Mixed-format sources" explains why disable DV → REORG PURGE → SET IcebergCompatV2 is mandatory. Earlier docs only mentioned the final SET TBLPROPERTIES.
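The mandatory ordering can be made concrete by generating the three statements in sequence. The property names come from the entries above; the exact SQL text (notably using ALTER TABLE … SET TBLPROPERTIES for the first step) and the function name are assumptions, not a copy of what Clone-Xs emits.

```python
def uniform_enable_statements(table_fqn):
    """Return the UniForm enable steps in the order the validator requires."""
    return [
        # Step 1: deletion vectors must be off before IcebergCompatV2 is set.
        f"ALTER TABLE {table_fqn} SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')",
        # Step 2: rewrite files so no existing deletion vectors remain.
        f"REORG TABLE {table_fqn} APPLY (PURGE)",
        # Step 3: only now enable UniForm with column mapping by name.
        f"ALTER TABLE {table_fqn} SET TBLPROPERTIES ("
        "'delta.universalFormat.enabledFormats' = 'iceberg', "
        "'delta.enableIcebergCompatV2' = 'true', "
        "'delta.columnMapping.mode' = 'name')",
    ]
```

Returning a list rather than executing inline keeps the order testable without a warehouse connection.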
Fixed
- Free Edition daily-limit error gets a friendly toast. UI client (ui/src/lib/api-client.ts) now matches free edition / daily compute limit keywords in error responses and surfaces a clear "your workspace has used up its free daily compute" message instead of the raw backend exception. 10s toast duration so users have time to read it.
- exclude_schemas undefined name in clone_catalog.process_schema. Pulled from config like the rest of the schema-level options. Was an F821 ruff failure on feature/enhance-clone-functionality.
Tested
- 1967 unit + integration tests pass (was 1900 pre-session). New coverage: 17 tests for Iceberg preflight + CTAS fallback (test_clone_iceberg.py), 14 tests for CONVERT TO DELTA module + endpoint (test_convert_to_delta.py, test_router_convert_to_delta.py), 3 tests for the format-enum normaliser, 4 for streaming-table skip path, 5 for the audit callback wiring, 3 for the physical Iceberg path, 3 for the UniForm 3-step DDL.
v0.7.1 — UI state persistence, deferred Bronze auto-create, Data Lab deep-links
Added
- Durable in-flight job tracking across UI navigation. New useDurableJob hook (in ui/src/hooks/useDurableJob.ts) fuses sessionStorage-backed job IDs, auto-reconnect on remount, tab-visibility-aware polling, and a capped progress-history ring buffer. Pages with long-running operations (clone, sync, incremental-sync, demo-data batch + streaming, generate IaC, governance reconciliation row/column/deep) survive page navigation and browser refresh: coming back mid-job resumes from the last server-known state instead of resetting to a blank form.
- usePersistedState hook (ui/src/hooks/usePersistedState.ts) and a 30-page sweep migrating filter dropdowns, search inputs, tab selectors, catalog/days pickers and other navigation-aid inputs from useState to sessionStorage-backed state. Form fields about to be POSTed (notes, descriptions, YAML, SQL, credentials, typed-confirm fields) intentionally stay local.
- JobContext extensions (ui/src/contexts/JobContext.tsx): added jobId, progressHistory, updateJob, appendProgress to the JobEntry shape so durable in-flight jobs can persist progress series (used by the streaming throughput chart).
- Notebook runtime persistence: useNotebook now mirrors cell results / errors / view modes / params to sessionStorage so navigating away from /notebooks and back doesn't re-execute the queries against Databricks.
- Explore page query caching: catalog tree, schemas, tables, table-info drawer, functions, volumes, UC objects, table-usage, trend, and views queries converted to TanStack Query with 5–10 min staleTime. Combined with the global localStorage persister, returning to /explore within the staleness window hits the cache instead of re-querying Databricks.
- Data Lab deep-link auto-run: /data-lab#q=<base64-sql>&run=1 now pre-fills SQL and fires runQuery() on arrival. Used by the new "Query latest rows →" link on the Demo Data streaming card to jump straight into a SELECT * FROM bronze_<profile> ORDER BY captured_at DESC LIMIT 100 against the just-created Bronze table.
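A deep link of the shape /data-lab#q=<base64-sql>&run=1 could be built and read back roughly as follows. The changelog does not say which base64 variant the UI expects or whether the value is percent-encoded, so standard base64 plus percent-encoding is an assumption here, and both helper names are hypothetical.

```python
import base64
from urllib.parse import quote, unquote

PREFIX = "/data-lab#q="
RUN_SUFFIX = "&run=1"


def data_lab_link(sql, auto_run=True):
    """Build a Data Lab deep link carrying the SQL in the URL fragment."""
    encoded = base64.b64encode(sql.encode("utf-8")).decode("ascii")
    # Percent-encode so '+', '/' and '=' survive inside the fragment (assumption).
    link = PREFIX + quote(encoded, safe="")
    return link + RUN_SUFFIX if auto_run else link


def decode_data_lab_link(link):
    """Recover (sql, auto_run) from a link produced by data_lab_link."""
    fragment = link[len(PREFIX):]
    auto_run = fragment.endswith(RUN_SUFFIX)
    if auto_run:
        fragment = fragment[: -len(RUN_SUFFIX)]
    return base64.b64decode(unquote(fragment)).decode("utf-8"), auto_run
```

Keeping the SQL in the fragment rather than the query string means it is not sent to the server as part of the page request.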
Fixed
- Bronze auto-create no longer trips CF_EMPTY_DIR_FOR_SCHEMA_INFERENCE. create_bronze_streaming_table was previously called before the streaming loop emitted any JSON batches, so read_files() had nothing to infer schema from. Bronze creation is now deferred until after the first batch lands; the same fix applies to every device profile.
- Marketplace UI page restored to git tracking. The repo's .gitignore had a non-anchored marketplace/ rule that swallowed ui/src/app/marketplace/page.tsx. The rule is now anchored to /marketplace/ so the UI page can be tracked.
- Ruff lint clean. Resolved 26 ruff errors in src/ (E402 module-level imports below logger = …, F401 unused imports, E713 not (x in y) → x not in y).
- Streaming Bronze "Query latest rows" link no longer produces empty backticks. Reads catalog/schema/profile from the streaming-job result (server-authoritative) instead of the form state, which can be empty when the durable job hydrates from sessionStorage on a fresh load.
Changed
- GitHub Actions bumped to Node 24 versions to silence Node 20 deprecation warnings (checkout v4→v5, setup-node v4→v5, setup-python v5→v6, upload-artifact v4→v6, download-artifact v4→v5, upload-pages-artifact v3→v4, deploy-pages v4→v5).
v0.7.0 — DQX, ODCS, FinOps, MDM, Compliance, Data Products, Streaming Demo, Persistent UI
Added — Data Quality
- DQX integration (
src/dqx_engine.py,api/routers/governance.py) — Databricks Labs DQX profiling, rule generation, check execution, and result persistence. UI at/governance/dqx. - Expectation Suites (
src/expectation_suites.py,/api/data-quality/suites) — group DQ rules + DQX checks into named reusable suites; run a suite end-to-end and persist results. UI at/data-quality/expectations. - Trust Score Engine (
src/trust_score.py,/api/trust-scores) — composite per-table 0–100 score from six dimensions (DQ pass rate, freshness, anomaly history, PII coverage, schema stability, lineage completeness). Configurable weights. UI at/data-quality/trust-scores. - DQ Coverage Map (
src/coverage_map.py,/api/coverage) — cross-references information_schema against DQ rules, SLA, PII scans, profiling, and contracts to compute per-table coverage percentage. UI at/data-quality/coverage. - COPQ — Cost of Poor Data Quality (
src/copq.py,/api/copq) — quantifies pipeline reruns, SLA breaches, engineer time, and downstream impact in dollars. UI at/finops/copq. - Anomaly correlation engine (
src/anomaly_correlation.py,/api/anomaly-correlations) — groups correlated anomalies under root-cause groups across upstream/downstream tables. UI at/data-quality/correlations. - NL Rule Builder (
src/nl_rule_builder.py,/api/nl-rules) — translate plain-English rule descriptions into executable DQ rule configs via the configured AI backend. UI at/governance/nl-rules. - Alert routing (
src/alert_routing.py,/api/alerts) — smart deduplication, correlation, priority-ranking, and routing of alerts to teams via channels. Supports digest mode. UI at/data-quality/alert-routing.
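For the Trust Score Engine entry in the list above, a composite 0–100 score with configurable weights could be computed as a weighted mean. The changelog names the six dimensions but not the formula, so the weighted mean, the 0..1 input scale, and the dimension key names below are all assumptions.

```python
DIMENSIONS = (
    "dq_pass_rate",
    "freshness",
    "anomaly_history",
    "pii_coverage",
    "schema_stability",
    "lineage_completeness",
)


def trust_score_sketch(scores, weights=None):
    """Combine six per-dimension scores (each 0..1) into one 0..100 value."""
    weights = weights or {name: 1.0 for name in DIMENSIONS}
    total_weight = sum(weights[name] for name in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        min(max(scores[name], 0.0), 1.0) * weights[name]  # clamp each input to 0..1
        for name in DIMENSIONS
    )
    return round(100.0 * weighted / total_weight, 1)
```

Dividing by the weight total means callers can pass any positive weights without normalising them first.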
Added — Governance & Compliance
- ODCS Data Contracts (
src/data_contracts.py,/api/governance/odcs) — full Open Data Contract Standard CRUD with YAML import/export, validation, and DQX integration. UI at/governance/odcs. - Compliance automation (
src/compliance_engine.py,/api/compliance) — maps DQ controls to SOC2 / GDPR / HIPAA / CCPA / DORA frameworks with automated evidence collection and audit-ready reports. UI at/compliance/frameworks. - Remediation playbooks (
src/playbooks.py,/api/playbooks) — if-this-then-that automation triggered on DQ failures, anomalies, SLA breaches, freshness staleness, schema drift. UI at/automation/playbooks. - Data Products catalog (
src/data_products.py,/api/data-products) — internal marketplace for publishing and subscribing to curated data products with docs, quality guarantees, and SLAs.
Added — Master Data, Federation, ML
- MDM (Master Data Management) (
src/mdm.py,/api/mdm) — entity resolution, survivorship, golden records, hierarchies, stewardship, cross-domain matching. UI under/mdm/*. - Lakehouse Federation (
src/federation.py,/api/federation) — browse foreign catalogs, manage connections, migrate to managed Delta. UI at/federation. - ML Assets (
src/clone_feature_tables.py,clone_models.py,clone_serving_endpoints.py,clone_vector_search.py,/api/ml-assets) — clone Models + Feature Tables + Vector Indexes + Serving Endpoints. UI at/ml-assets. - Advanced Tables (
src/clone_advanced_tables.py,/api/advanced-tables) — clone Materialized Views, Streaming Tables, Online Tables. UI at/advanced-tables.
Added — Operations
- Continuous Sync (streaming replication) (
src/continuous_sync.py,/api/continuous-sync) — Structured Streaming job spec for change-data-capture sync. PREVIEW. - Ephemeral Environments (
src/environment_manager.py,/api/environments) — one-click sandbox creation with auto PII masking, DQ validation, cost budgets, and TTL-based cleanup. UI at/environments. - FinOps suite (
src/azure_costs.py,src/finops_queries.py,/api/finops) — cost dashboards (billing, breakdown, compute, query costs, recommendations, storage optimization, budgets, trends, warehouses) backed by Databricks system tables. UI under/finops/*. - System Insights (
src/system_insights.py,/api/system-insights) — workspace billing, optimization opportunities, job costs, query costs from system tables. UI at/system-insights.
Added — Demo Data
- 10 streaming device profiles in
src/demo_streaming.py:generic_sensor,industrial_machine,car_obd2,smart_meter,wearable_health,pos_terminal,wind_turbine,atm_transaction,server_metrics,clickstream. Each emits batched JSON to a UC Volume; Auto Loader / DLT consumes the files. - Schedule streaming as a Databricks Job (
/api/demo-data/streaming/schedule) — generates a self-contained notebook + creates a real Databricks Job with the chosen Quartz schedule and tagscreated_by=clone-xs. - Auto-create Bronze streaming table (opt-in) —
CREATE OR REFRESH STREAMING TABLE … AS SELECT * FROM STREAM read_files(...)on DBSQL Serverless; failure-isolated so file emission keeps working when CREATE is denied. - Manage Catalogs tab on
/demo-data— list every catalog the user can read with metadata, demo-only filter, typed-confirm drop modal. - Star schema modeling layer (
src/demo_models.py) and locale-aware Faker pools (src/demo_faker.py). - Anomaly injection (
src/demo_anomalies.py) — labeled anomalies for ML training datasets.
Added — Portal Model
- Multi-portal sidebar / app shell. The UI now organises pages into eight portals: Clone-Xs (default), Governance, Data Quality, FinOps, Security, Automation, Infrastructure, MDM. Switch via the portal-picker in the header (
ui/src/components/PortalSwitcher.tsx). Portals can be enabled/disabled per workspace in Settings.
Improvements
- Reconciliation suite — row-level (
/reconciliation/batch-validate), column-level (/reconciliation/batch-compare), and deep (/reconciliation/batch-deep-validate) batch validation with WebSocket progress streams. UI under/governance/reconciliation/*. - Cross-metastore reconciliation (
src/cross_metastore_recon.py) — for migrated catalogs. - Lakehouse Monitor integration (
src/lakehouse_monitor.py,/api/lakehouse-monitor) — discover, clone, manage Databricks quality monitors. UI at/lakehouse-monitor. - Persistent runtime state (sessionStorage) for ~30 analysis-result pages — hitting the same page twice no longer re-queries Databricks within a 30-minute window.
Unreleased — Streaming demo: clickstream profile + bug fix for unreachable profiles
Added
- New clickstream device profile for the streaming demo: web/mobile event stream with user_id, session_id, event_type, page_url, referrer, user_agent, device_type. Sessions rotate every ~30 events per user (drives Bronze→Silver sessionization demos); user_agent and device_type are sticky per user (preserves identity across events for analytics joins). Default 500 distinct users; weighted event distribution biases toward page_view with rarer submit / purchase to mirror funnel drop-off.
- Two new guard tests in tests/test_demo_streaming.py to prevent silent drift across the registry, the Pydantic Literal, and the scheduled-notebook generator source:
  - test_pydantic_literal_matches_registry: fails CI if the StreamingEmissionRequest.profile Literal goes out of sync with DEVICE_PROFILES keys.
  - test_schedule_notebook_source_covers_all_profiles: fails CI if _PROFILE_GENERATORS_SOURCE is missing a profile (which would crash the scheduled Job at runtime with NameError on init_state).
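The Literal-versus-registry guard amounts to comparing the Literal's arguments with the registry keys. The sketch below shows the technique with a three-profile stand-in registry; the real DEVICE_PROFILES contents and the real test body are not reproduced here, and the names ending in _sketch are hypothetical.

```python
from typing import Literal, get_args

# Stand-in registry: the real DEVICE_PROFILES maps names to generator config.
DEVICE_PROFILES_SKETCH = {
    "generic_sensor": object(),
    "industrial_machine": object(),
    "car_obd2": object(),
}

ProfileNameSketch = Literal["generic_sensor", "industrial_machine", "car_obd2"]


def literal_drift(literal_type, registry):
    """Return (missing_from_literal, missing_from_registry) as sorted lists."""
    allowed = set(get_args(literal_type))
    known = set(registry)
    return sorted(known - allowed), sorted(allowed - known)


def test_literal_matches_registry_sketch():
    # Fails loudly, naming the profiles that drifted on either side.
    assert literal_drift(ProfileNameSketch, DEVICE_PROFILES_SKETCH) == ([], [])
```

Reporting both directions of drift makes the CI failure say which side needs updating, rather than only that the two sets differ.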
Fixed
- Pydantic profile Literal was rejecting 6 of 9 dropdown options. The UI exposed smart_meter, wearable_health, pos_terminal, wind_turbine, atm_transaction, and server_metrics profiles, but the request model's Literal only listed the original 3, so users selecting any of the other 6 got a 422 at the /demo-data/streaming endpoint. The Literal now covers all 10 profiles, kept in sync via the new guard test.
- Scheduled-notebook generator covers all profiles. _PROFILE_GENERATORS_SOURCE previously inlined only 3 profile generators; the other 6 (and now clickstream) all have inlined source so users can schedule any profile without editing the notebook by hand.
Tested
- 4 new tests in
tests/test_demo_streaming.py: clickstream event shape, session-rotation behaviour (sessions change after ~30 events), per-useruser_agentstickiness, plus the two guard tests above. - All prior tests preserved. Full suite: 1828 passing (was 1815 → +13 from this batch).
Unreleased — Demo Data Generator: Manage Catalogs tab + Schedule streaming as Databricks Job
Added
- New "Manage Catalogs" tab on
/demo-data— lists every catalog the user can read, with metadata (schemas / tables / demo-tables / owner) and a per-row drop action with a typed-confirmation modal (must type the catalog name to arm the destructive Confirm button). Reuses the existingDELETE /demo-data/{catalog}endpoint — no new destructive paths. "Demo only" toggle filters to catalogs flagged withdemo.generated_by = 'clone-xs'TBLPROPERTIES on at least one table. - New endpoint
GET /demo-data/catalogsinapi/routers/generate.py— fans out per-catalog probes viaThreadPoolExecutor(max_workers=5), queries<catalog>.information_schema.table_propertiesfor the demo signal, returns{catalogs: [...], demo_only, total}. Per-catalog probe failures (auth denied on information_schema) surface aserroron the row; one broken catalog doesn't hide the others. Top-level catalog enumeration failure returns{catalogs: [], error}rather than 500. - Schedule streaming as a Databricks Job — new "Schedule on Databricks" button beside Start/Stop on the Streaming tab. Opens a modal collecting Quartz cron + timezone + Job name + Serverless toggle + (advanced) notebook path. Submits to a new
POST /demo-data/streaming/scheduleendpoint that:- Generates a self-contained Python notebook inlining the relevant device-profile generator + emission loop. The notebook reads its parameters via
dbutils.widgets.get(...)so reruns can vary catalog/cadence without regenerating. - Uploads the notebook to
/Users/<me>/clxs/streaming_<profile>_<isoZ>viaclient.workspace.upload(...). - Creates a real Databricks Job via
client.jobs.create(...)with the Quartz schedule + the uploaded notebook as anotebook_task+ tagscreated_by=clone-xs, kind=streaming-emit, profile=<profile>so the existingGET /clone-jobslisting automatically includes scheduled streams. - Defaults to Serverless compute so users don't need to provision a cluster; falls back to a Single-Node job cluster spec when the user opts out.
StreamingScheduleRequestmodel inapi/models/demo.py— extendsStreamingEmissionRequest(inherits catalog/schema/volume/profile/cadence/auto-create-bronze) and addsname,schedule_quartz_cron(with shape validator: 6 or 7 fields),timezone_id,notebook_path,use_serverless. Pydantic catches empty / wrong-field-count cron at request binding.- Quick-pick cron presets in the Schedule modal: Every 5 min, Top of hour, Weekdays 9am.
useDemoCatalogs,useDemoCatalogDrop,useStreamingSchedulehooks inui/src/hooks/useApi.ts.
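The cron shape check mentioned above (6 or 7 whitespace-separated fields) is simple enough to sketch. The function name is hypothetical, and the three preset expressions are illustrative Quartz strings for the preset labels; the changelog does not list the actual values the modal uses.

```python
# Illustrative Quartz expressions for the three preset labels (assumed values).
QUARTZ_PRESETS_SKETCH = {
    "Every 5 min": "0 0/5 * * * ?",
    "Top of hour": "0 0 * * * ?",
    "Weekdays 9am": "0 0 9 ? * MON-FRI",
}


def validate_quartz_cron_shape(expression):
    """Check field count only: Quartz uses 6 fields, or 7 with an optional year."""
    fields = expression.split()
    if len(fields) not in (6, 7):
        raise ValueError(
            f"Quartz cron needs 6 or 7 fields, got {len(fields)}: {expression!r}"
        )
    return " ".join(fields)  # normalise internal whitespace
```

A shape check like this catches the common mistake of pasting a 5-field Unix cron, while leaving full syntax validation to the Jobs API.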
Non-breaking
- The Batch tab's existing form is untouched — its 4 nested tabs (Basics / Catalog Options / Data Quality & ML / Architecture) already provided the logical grouping the original plan called out.
- The existing in-process
POST /demo-data/streamingStart/Stop flow is unchanged. "Schedule on Databricks" is a sibling action; users who never click it see no behaviour change. - The existing inline
window.confirm()delete on the Batch tab is preserved for backwards compatibility. The Manage tab adds a stricter typed-confirm modal but doesn't remove the existing path. - All 1796 prior tests stay green; the 19 new tests only add coverage. Total: 1815 passing.
Tested
- 4 new tests in
tests/test_demo_data_catalogs.py: default listing returns all visible catalogs,demo_only=truefilter works, per-catalog probe failure surfaces aserrorfield (failure isolation), top-levelcatalogs.list()failure returns empty list with error. - 15 new tests in
tests/test_demo_streaming_schedule.py: per-profile notebook content (no cross-contamination between profiles, dbutils.widgets coverage),create_streaming_jobtags + schedule + Serverless skip-cluster path, end-to-end orchestration,StreamingScheduleRequestcron-shape validator + inherited validators, endpoint dispatch + 500 on SDK failure + 422 on empty cron.
Out of scope (deferred)
- Bulk drop on Manage tab — single-catalog only in v1. Bulk select is a follow-up if users ask.
- Job lifecycle management for scheduled streams (pause / resume / delete from Clone-Xs UI). v1 creates the Job and links to the Databricks Jobs UI for management.
- Packaging clone-xs as a wheel so the scheduled notebook can
importrather than inline. v1 inlines so the notebook is self-contained — wheel-based packaging is a follow-up that lets us ship richer features without ballooning the notebook. - YAML-loadable custom device profiles for the schedule path — the three built-in profiles cover today's IoT demo asks.
Unreleased — Demo Data Generator: streaming emission for IoT (file-based to UC Volume)
Added
- New "Streaming emission" card on
/demo-data— file-based IoT event emission for three built-in device profiles (generic_sensor,industrial_machine,car_obd2). The runner spawns as a background job that drops JSON event batches into a UC Volume on a configurable cadence (events-per-batch × interval-seconds × total-duration-seconds). Auto Loader / DLT consumes the files; this is the path 90% of Databricks customers use to onboard streams. UI shows live progress (events emitted / files written / current batch path) and the canonical Auto Loader SQL snippet for copy-paste. - New module
src/demo_streaming.py(~330 LOC) —DEVICE_PROFILESregistry + per-profile event generators (stateful, so values jitter around stable per-device baselines),emit_batch,write_batch_to_volume(uploads JSON viaclient.files.upload),run_streaming_emission(the loop), andcreate_bronze_streaming_table. - Auto-create Bronze streaming table (opt-in checkbox) — when enabled, the runner additionally executes
CREATE OR REFRESH STREAMING TABLE <catalog>.<schema>.bronze_<profile> SCHEDULE EVERY N MINUTES AS SELECT * FROM STREAM read_files('/Volumes/.../events_volume/<profile>/', format => 'json'). Runs on existing DBSQL serverless — no cluster or DLT pipeline. Failure isolation: if Serverless isn't enabled orCREATE TABLEis denied, the runner captures the error and continues file emission; UI shows an amber warning + falls back to the manual SQL snippet so the user can run it themselves after upgrading. - New endpoints in
api/routers/generate.py:POST /demo-data/streaming— submits astreaming-emitjob, returns{job_id}.POST /demo-data/streaming/{job_id}/stop— flips the runner'sstop_requestedflag (idempotent; runner sleeps in 0.5s slices so latency-to-stop is bounded).GET /demo-data/streaming/auto-loader-sql?catalog=…&schema=…&profile=…— returns the canonical SQL snippet so the UI panel and the auto-create path emit identical DDL.
StreamingEmissionRequestinapi/models/demo.py— Pydantic model withLiteralprofile validator, range-clampedevents_per_batch(1..10000),interval_seconds(0.1..300),total_duration_seconds(1..3600 — 1-hour cap for v1),auto_create_bronze,bronze_refresh_minutes(1..60).useStreamingEmit+useStreamingStophooks (ui/src/hooks/useApi.ts) — TanStack Query mutations matching the existing demo-data-generator hook shape.- Live progress integration: the existing
JobManager._run_jobmutation pattern is reused — runner writesevents_emitted,files_written,current_batch_path,elapsed_seconds,tickstoself.jobs[job_id]["progress"]each tick; UI polls/api/jobs/{id}every 2s and renders the dict.
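Sleeping in 0.5s slices, as the stop endpoint entry describes, bounds how long a stop request can go unnoticed. The sketch below shows one way to write such a wait; the function name and the injectable sleep parameter (added so the loop can be exercised without real delays) are assumptions.

```python
import time


def sleep_with_stop(seconds, stop_check, slice_seconds=0.5, sleep=time.sleep):
    """Wait up to `seconds`, polling stop_check between slices.

    Returns True if the full interval elapsed, False if a stop was seen.
    """
    remaining = seconds
    while remaining > 0:
        if stop_check():
            return False
        chunk = min(slice_seconds, remaining)
        sleep(chunk)
        remaining -= chunk
    return True
```

With a 300-second interval, a stop request is still noticed within roughly one slice instead of at the end of the interval.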
Tested
- 23 new tests in
tests/test_demo_streaming.py: registry shape, per-profile event-shape + value-range invariants,emit_batchround-robin behaviour,write_batch_to_volumepath construction + JSON serialisation,run_streaming_emissionhonouringtotal_duration_seconds(mocked clock) +stop_checkearly termination, unknown-profile defense-in-depth ValueError,create_bronze_streaming_tableSQL shape + DBSQL-Serverless failure isolation,get_auto_loader_sqlmatching runner-emitted DDL, request-model validators, and four endpoint dispatch tests (start, stop, stop-404, auto-loader-sql). - All other tests preserved.
Out of scope (deferred follow-ups)
- YAML-loadable custom device profiles — the three profiles are built-in. Custom YAML profiles can come via the existing
demo_industry_loaderpattern. - Direct Kafka / Event Hubs emission — file-based via Volume covers the common case.
- Spark Structured Streaming
ratesource — needs a running cluster. - Silver/Gold downstream tables — Bronze only; cleansing/aggregation is customer-specific.
- Format options beyond JSON —
client.files.uploadis content-agnostic, so CSV/Parquet are easy follow-ups. - Realistic Faker data for VINs / lat-lng — v1 uses simple random with plausible ranges; the existing
realistic_dataflag could be hooked in.
Unreleased — Cleanup tab: small-files detection, DROP-script export, saved presets, per-finding cost
Closes the four deferred items from the original Cleanup tab batch:
Added
- Per-finding Save / mo column on the Cleanup findings table: shows projected monthly storage savings per row (size_bytes × price_per_gb / 1024³). Only renders for MANAGED stale findings with stats; everything else shows "—" so users don't conflate "unknown" with "$0". Pairs with the headline "Save / month" summary card shipped previously.
- Many-small-files detection (opt-in DESCRIBE DETAIL enrichment):
  - New check_small_files: bool = False parameter on detect_stale_tables and detect_stale_tables_multi. When true, the scan runs DESCRIBE DETAIL in parallel (max 8 concurrent) on up to 200 candidate tables already in the findings list and enriches them with num_files + avg_file_size_bytes.
  - Heuristic: num_files >= 50 AND avg_file_size < 64 MB flags a table for compaction. Suggested action becomes "OPTIMIZE (compacts small files)" for findings where it's actionable; intentionally preserves higher-priority actions (Run OPTIMIZE (collects stats), Review for drop, EXTERNAL/VIEW review hints) since compacting before a likely drop is wasted work.
  - Cleanup tab gains a "Detect small-files (slower)" toggle, a "Small files" filter chip (only when the enrichment ran), and a Files column showing num_files with an amber ⚠ when flagged. Tooltip shows avg MB/file.
- Export DROP script bulk-action button: select stale findings → "Export DROP script" downloads
clxs-cleanup-drop-<timestamp>.sqlwith oneDROP TABLE IF EXISTSper row, grouped by catalog with header comments. The app never executes drops — user reviews the script and runs it manually. Honors the original "maintenance ops only" UI choice while still surfacing the destructive workflow when users want it. - Saved scan presets (localStorage): "Save current as preset" captures
{mode, catalogs, days_threshold, min_size_mb, check_small_files}under a user-named key (clxs-cleanup-presets). Pills above the scan controls show saved presets with one-click apply + per-preset delete. Survives page reloads but not browser clears — durable persistence is tied to scheduled scans (deferred).
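The small-files heuristic and the per-finding savings formula stated above translate directly into code. The thresholds and the formula come from the entries; the function names and the choice of binary megabytes for the 64 MB limit are assumptions.

```python
SMALL_FILE_MIN_COUNT = 50
SMALL_FILE_MAX_AVG_BYTES = 64 * 1024**2  # assumption: 64 MB means 64 MiB


def is_small_files_candidate(num_files, avg_file_size_bytes):
    """Flag a table for compaction: many files AND a small average file size."""
    return num_files >= SMALL_FILE_MIN_COUNT and avg_file_size_bytes < SMALL_FILE_MAX_AVG_BYTES


def monthly_savings(size_bytes, price_per_gb):
    """Projected monthly storage savings: size_bytes x price_per_gb / 1024^3."""
    return size_bytes * price_per_gb / 1024**3
```

For example, a table with 200 files averaging 32 MB is flagged, while 200 files averaging 128 MB or 10 files of any size are not.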
Tested
- 4 new tests in
tests/test_stale_detection.py(TestSmallFilesEnrichment): default-off behaviour preserved (no DESCRIBE DETAIL when toggle off), heuristic flags 200×32MB-files candidate, well-sized files pass through unflagged, per-table DESCRIBE DETAIL failure swallowed without aborting the scan. - Existing 24 stale-detection tests preserved (the new parameter is optional with safe default).
- All other tests (1,769 prior) preserved. Total: 1,773 passing.
Out of scope (deferred)
- Scheduled scans — saved presets ship as the persistence half; cron-style execution + notifications + result history are a real product feature deserving its own batch (jobs runner, durable storage, notifications).
- Real DROP execution from UI — script export covers the workflow with zero blast radius. If users want one-click drops, follow-up with a typed-confirmation modal pattern (preview already in the original AskUserQuestion).
Unreleased — Catalog Explorer: FinOps trend, catalog diff detail, permissions audit
This batch ships three composable governance / FinOps capabilities on top of the multi-catalog Explorer:
FinOps — cost rollup + 30-day trend
- $/month rollup on the Cleanup tab summary cards: converts total_reclaimable_bytes to monthly spend using the configured price_per_gb, plus a yearly sub-line. The Per-Catalog Rollup card on Multi Overview also shows per-catalog $/mo so users can spot the dominant cost catalog at a glance.
- New module src/catalog_size_history.py — auto-creates <audit_catalog>.clone_xs.catalog_size_history (Delta) on first write and upserts one row per (date, catalog) carrying num_tables, num_schemas, total_size_bytes, total_rows, captured_at. Idempotent by (date, catalog): re-clicking Explore the same day overwrites today's row. Best-effort everywhere — never raises into /stats.
- Opportunistic snapshots: POST /stats (single + multi paths) now calls record_snapshots_from_stats(...) after returning, fire-and-forget. No scheduler needed; the trend chart fills in over time as users browse.
- New endpoint GET /catalog-size-history?catalogs=a,b,c&days=30 — reads back per-catalog daily snapshots; returns [] gracefully when the audit catalog isn't configured or the table doesn't exist yet (UI renders an empty-state hint).
- Size Trend chart on the Multi Overview tab: a recharts LineChart with one line per selected catalog, GB on the Y-axis. Shows a "needs ≥2 days of snapshots" badge when there isn't enough history yet.
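As a rough illustration of the $/month rollup, the conversion from reclaimable bytes to monthly and yearly spend might look like the sketch below. The function name `cost_rollup`, the output keys, and the use of binary gigabytes (1024^3 bytes) are assumptions; the changelog does not say which GB definition the app uses.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def cost_rollup(total_reclaimable_bytes, price_per_gb):
    """Convert a byte count into monthly and yearly storage spend.

    price_per_gb is the configured monthly price for one GB of storage.
    """
    size_gb = total_reclaimable_bytes / BYTES_PER_GB
    monthly = round(size_gb * price_per_gb, 2)
    return {
        "size_gb": round(size_gb, 2),
        "monthly_usd": monthly,
        "yearly_usd": round(monthly * 12, 2),  # the yearly sub-line
    }
```

For example, 500 GB at 0.023 per GB per month comes out at 11.50 per month and 138.00 per year.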
Catalog diff — column drift + size delta
- New module
src/catalog_diff_detail.py—compare_catalogs_detailed(...)wraps the existingsrc.diff.compare_catalogs(presence/absence) and overlays per-common-table drift:columns_only_in_source,columns_only_in_dest,column_type_changes,size_delta_bytes,row_delta. One bulkinformation_schemaquery per side joinscolumns+table_properties; ~3-5s on a 500-table catalog vs 30+s for the per-table/comparepath. - Skips classification on partial failure: if either bulk query fails, the response keeps the presence/absence diff with
drift: []and adrift_errorsentry — avoids phantom "all columns added/removed" findings that would otherwise appear. - New endpoint
POST /diff-detail— sameCatalogPairRequestshape as/diff, returns the combined response. Existing/diffendpoint unchanged for backwards compatibility. - Drifted Tables section on the existing
/diffUI page — switches the page from/diffto/diff-detailand renders a new card with summary badges (cols added / removed / type changes / total size Δ) plus a DataTable with per-row inline expansion showing the actual drifted column names. Existing presence/absence sections unchanged.
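The per-table drift overlay can be pictured as a set comparison over two column maps. The sketch below assumes each side has already been reduced to a `{column_name: data_type}` dict for one common table; the function name `column_drift` and the inner keys of each type-change entry are illustrative, not taken from src/catalog_diff_detail.py.

```python
def column_drift(source_cols, dest_cols):
    """Compare two {column_name: data_type} maps for the same table."""
    src, dst = set(source_cols), set(dest_cols)
    type_changes = [
        {"column": c, "source_type": source_cols[c], "dest_type": dest_cols[c]}
        for c in sorted(src & dst)
        if source_cols[c] != dest_cols[c]
    ]
    return {
        "columns_only_in_source": sorted(src - dst),
        "columns_only_in_dest": sorted(dst - src),
        "column_type_changes": type_changes,
    }
```

The partial-failure rule matters here: if one side's map were empty because its bulk query failed, this comparison would report every column as added or removed, which is why the response falls back to drift: [] in that case.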
Permissions audit — risky GRANTs + PII × access overlay
- New module
src/permissions_audit.py — audit_catalog_permissions(...) bulk-queries <catalog>.information_schema.table_privileges and classifies every (principal × table × privilege) cluster into CRITICAL / HIGH / MEDIUM / LOW based on:
  - Public groups (account users, users) — escalate any read/write privilege.
  - Destructive privileges (ALL PRIVILEGES, MODIFY) — escalate for any non-owner principal.
  - PII intersection (opt-in) — passing a pii_columns list (from scan_catalog_for_pii) escalates findings on PII-bearing tables one risk level. The marquee finding: public-group SELECT on a PII table = CRITICAL.
- New endpoint
POST /permissions-auditwith newPermissionsAuditRequestmodel (inheritsCatalogRequest, addspii_intersection: bool = False). Whenpii_intersection=true, runsscan_catalog_for_piiinline first (no sample data, no UC tags) and threads the results into the auditor. - Pure classifier helpers
_classify_finding,_is_public,_principal_typeare exposed for unit-test isolation. The classifier is the contract — easy to extend with new rules later. - New "Audit" tab on
/explore: PII overlay toggle + Run audit button, summary cards (CRITICAL / HIGH / MEDIUM / Tables audited), filter chips (All / CRITICAL only / HIGH+ / PII tables only), findings table with risk badges, principal-type chips, privilege list, suggested action. Single-catalog only in v1 — multi shows a "switch to Single mode" hint.
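One way to read the classifier rules is as a small pure function. The sketch below is an interpretation under stated assumptions: the baseline of LOW, the choice of HIGH as the level that public-group and destructive grants escalate to, and the name `classify` are all guesses. Only the escalation triggers and the "public-group SELECT on a PII table = CRITICAL" outcome come from the notes above.

```python
LEVELS = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]
PUBLIC_GROUPS = {"account users", "users"}
DESTRUCTIVE = {"ALL PRIVILEGES", "MODIFY"}


def classify(principal, privilege, is_owner=False, on_pii_table=False):
    """Return a risk level for one (principal, privilege) grant on a table."""
    level = 0  # assumed baseline: LOW
    if principal.lower() in PUBLIC_GROUPS:
        level = max(level, 2)  # assumed: public-group grants start at HIGH
    if privilege.upper() in DESTRUCTIVE and not is_owner:
        level = max(level, 2)  # assumed: destructive non-owner grants are HIGH
    if on_pii_table:
        level = min(level + 1, len(LEVELS) - 1)  # PII bumps one level, capped
    return LEVELS[level]
```

Because the function takes plain values and returns a string, each rule can be unit-tested without a workspace client, which matches the stated reason for exposing the helpers.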
Tested
- 13 new tests in
tests/test_catalog_size_history.py(idempotent record_snapshot, swallows SQL failures, single vs multi response shape, get_history graceful degradation, endpoint dispatch). - 11 new tests in
tests/test_catalog_diff_detail.py(column drift detection, signed size deltas, no-drift filter, partial-failure fallback, endpoint dispatch). - 15 new tests in
tests/test_permissions_audit.py(classifier rules including the marquee PII × public-group → CRITICAL escalation, principal-type inference, PII overlay opt-in, sort order, INFO findings dropped from response, endpoint dispatch with/without PII overlay). - All existing tests preserved.
Out of scope (deferred follow-ups)
- Scheduled daily snapshots — opportunistic recording on
/statscovers active catalogs; a scheduled job would cover dormant ones. Hold for now. - Bulk REVOKE action from the Audit tab. v1 surfaces findings only — users execute revokes via SQL.
- Catalog diff trend — would track the diff over time. Today's snapshot is sufficient; revisit if customers ask.
Unreleased — Catalog Explorer: Cleanup tab (stale & orphan detection)
Added
- New "Cleanup" tab on
/explore— joins per-table stats (information_schema size + ANALYZE-derived rows) with read activity (system.access.audit, 90-day window) and classifies each table into HIGH / MEDIUM / LOW risk plus a suggested action. Single AND multi-catalog modes both supported (multi adds a Catalog column to the findings table). v1 ships with safe maintenance ops only — destructiveDROPis out of scope; stale tables surface "Review for drop" as a read-only hint. - New module
src/stale_detection.py — detect_stale_tables(client, wid, catalog, days_threshold=90, min_age_days=7, min_size_bytes=0, exclude_schemas=...) orchestrates the join + classification. Pure helpers (_classify_table, _risk_level, _suggested_action) are exposed for unit testing. Risk rules:
  - HIGH — never-accessed + MANAGED + size_bytes >= 10 GB
  - MEDIUM — stale + MANAGED, OR no-stats with rows
  - LOW — stale + EXTERNAL or VIEW (informational, can't drop from UI)
  - NONE — fresh + analyzed (filtered out of findings)
- New module
src/stale_detection_multi.py—detect_stale_tables_multifans the per-catalog scan out across N catalogs in parallel (max 3 concurrent — joining usage + stats per catalog hits two system queries, lower thanstats_multi's 5). Each finding stamped with its owningcatalog; per-catalog rollups live underper_catalog; per-catalog scan failures captured undererrorsinstead of aborting the request. - New endpoint
POST /stale-scaninapi/routers/analysis.py— dispatches single vs multi onsource_catalogs(mirrors the/statsand/pii-scanpatterns). NewStaleScanRequestmodel with Pydantic validators clampingdays_thresholdto1..365(audit window naturally caps at 90 anyway). min_age_days=7filter skips brand-new tables — a table altered yesterday wouldn't have read activity in any window, so flagging it as "never accessed" would be a false positive.- Cleanup tab UI (
ui/src/app/explore/page.tsx):- Threshold inputs (days + min size MB) + "Run scan" button.
- Summary cards: Findings | HIGH | MEDIUM | LOW | Total reclaimable size.
- Filter chips: All | HIGH only | Never accessed | Stale | No stats.
- Findings table with checkbox column for bulk-select, drill-through to existing
TableDetailDrawer, per-row OPTIMIZE / VACUUM / Open buttons. - Bulk-action toolbar (renders when ≥1 row selected): "OPTIMIZE selected" / "VACUUM selected" → opens a modal that runs the existing
POST /optimize/POST /vacuumwithdry_run=true, shows the predicted output, then re-runs withdry_run=falseon user confirmation. No new maintenance endpoints needed — the bulk action reuses what was already there. - Multi-mode rows are grouped by their owning catalog before being submitted so each
POST /optimizecall carries the rightsource_catalog.
- Shared validator constant
_NEITHER_CATALOG_MSGinapi/models/analysis.py— the four "single OR multi" request models (StatsRequest,SearchRequest,PIIScanRequest,StaleScanRequest) reference one source of truth instead of duplicating the error message.
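The risk rules listed for the Cleanup tab could be expressed roughly as below. This is a sketch, not the actual `_risk_level` helper: the `access` labels, the `has_stats` flag, the binary 10 GB constant, and the treatment of never-accessed MANAGED tables under 10 GB as MEDIUM are assumptions filling gaps the rule list leaves open.

```python
TEN_GB = 10 * 1024 ** 3  # assumption: binary gigabytes


def risk_level(access, table_type, size_bytes, has_stats=True):
    """Classify one table. `access` is "never", "stale", or "fresh" (hypothetical labels)."""
    unused = access in ("never", "stale")
    if unused and table_type != "MANAGED":
        return "LOW"  # EXTERNAL / VIEW are capped at LOW
    if access == "never" and (size_bytes or 0) >= TEN_GB:
        return "HIGH"  # boundary is inclusive
    if unused:
        return "MEDIUM"  # stale (or small never-accessed) MANAGED table
    if not has_stats:
        return "MEDIUM"  # fresh but never analyzed
    return "NONE"  # fresh + analyzed: filtered out of findings
```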
Tested
- 19 new tests in
tests/test_stale_detection.pycovering classification rules (HIGH/MEDIUM/LOW thresholds, EXTERNAL/VIEW caps),min_age_daysskipping brand-new tables,min_size_bytesfiltering, NULLsize_bytes→Run OPTIMIZEaction, the 10-GB HIGH-risk inclusivity boundary, audit-failure fallback to stats-only signal, and/stale-scanendpoint dispatch + validator behaviour. - 5 new tests in
tests/test_stale_detection_multi.py(catalog stamping, summary aggregation, per_catalog rollup, failure isolation, empty-list rejection). - All existing tests preserved.
Out of scope (deferred follow-ups)
- Destructive actions (
DROP TABLE) — surfaced as a hint only. Users execute via SQL or the existing CLI rollback path. - Many-small-files OPTIMIZE candidates — would need per-table
DESCRIBE DETAILon the slow path. - Scheduled scans / saved findings history — re-running the scan is one click; persistence is a future Audit Trail integration.
- Cost rollup ($/month per finding) — straightforward extension once storage price config flows through.
Unreleased — Catalog Explorer: multi-catalog tab fan-outs (Option B)
Added
- Functions / Volumes / PII / Feature Store / Search are now multi-aware on
/explore. The "pick one catalog to view" placeholder cards are gone — each tab fans out across the user's selected catalogs and renders a unified result with a leading Catalog column for sort/filter. Concretely:- Functions tab: new
POST /functions/multiendpoint backed bysrc/functions_listing.py:list_functions_multifans the per-catalog UDF query out across N catalogs in aThreadPoolExecutor(max 5 concurrent), stamps each row with its owning catalog, and returns{functions, per_catalog, errors, catalogs}. Single-catalogGET /functions/{catalog}is unchanged — both routes share the extractedlist_functions_for_catalog(client, wid, catalog)helper. - Volumes tab: no backend change —
/auth/volumesalready returned all volumes the user can read; the UI just filters the global list against the active catalog selection (Set membership) instead of one catalog. - PII Detection tab: new
src/pii_multi.py:scan_catalogs_for_pii_multifansscan_catalog_for_piiacross N catalogs (max 3 concurrent — PII sampling is heavier than stats). Returns one merged report with per-detection catalog stamping, summedtotal_columns_scanned/pii_columns_found, a worst-case rolluprisk_level(NONE < LOW < MEDIUM < HIGH), and aper_catalogblock. Masking rules are re-keyed with a<catalog>.prefix so two catalogs sharing<schema>.<table>.<column>don't collide./pii-scandispatches onsource_catalogsvssource_catalog. - Search tab: new
src/search_multi.py:search_tables_multifans the regex search out across N catalogs in parallel and merges. Each match (table or column) is stamped with its owning catalog.SearchRequestnow accepts eithersource_catalog(single) orsource_catalogs(multi) — Pydanticmodel_validatorrequires at least one. Inline-fixed a latent rendering bug where the Search tab readsearch.data.lengthagainst a dict response — both single and multi modes now readmatched_tables/matched_columnsfrom the dict. - Feature Store tab: client-derived from the merged stats
tables[](already cross-catalog from Option A), so the only change is the new Catalog column in multi mode.
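The PII merge step (summed totals, worst-case risk, catalog-prefixed masking rules) might be sketched like this. The input shape, the function name `merge_pii_reports`, and the contents of each per_catalog entry are assumptions; the sketch also leaves out the failed-catalog path, which the test list says reports UNKNOWN risk.

```python
RISK_ORDER = ["NONE", "LOW", "MEDIUM", "HIGH"]


def merge_pii_reports(reports):
    """Merge {catalog: single_catalog_report} into one combined report."""
    merged = {
        "total_columns_scanned": 0,
        "pii_columns_found": 0,
        "risk_level": "NONE",
        "masking_rules": {},
        "per_catalog": {},
    }
    for catalog, report in reports.items():
        merged["total_columns_scanned"] += report["total_columns_scanned"]
        merged["pii_columns_found"] += report["pii_columns_found"]
        # Worst-case rollup: keep the highest risk seen so far.
        if RISK_ORDER.index(report["risk_level"]) > RISK_ORDER.index(merged["risk_level"]):
            merged["risk_level"] = report["risk_level"]
        # Re-key rules with a catalog prefix so identical schema.table.column keys don't collide.
        for key, rule in report["masking_rules"].items():
            merged["masking_rules"][f"{catalog}.{key}"] = rule
        merged["per_catalog"][catalog] = {
            "risk_level": report["risk_level"],
            "pii_columns_found": report["pii_columns_found"],
        }
    return merged
```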
Comparison views (B2)
- Size Share by Catalog donut — per-catalog relative size contribution alongside the rollup, so users can spot the dominant catalog at a glance.
- Top Schemas (per catalog, by size) — side-by-side cards, one per catalog, each showing top-8 schemas as a horizontal bar chart of size. Lets users compare which schemas live where without scrolling the merged flat list.
Tested
- 8 new tests in
tests/test_functions_multi.py(catalog stamping, per_catalog rollup, failure isolation, empty-list rejection, endpoint dispatch, invalid-catalog rejection) - 8 new tests in
tests/test_search_multi.py(catalog stamping for tables + columns, per_catalog tables/columns split, failure isolation, endpoint dispatch single vs multi, validator rejects neither) - 7 new tests in
tests/test_pii_multi.py(catalog stamping on detections, summed totals, worst-case risk rollup, masking-rule key collision avoidance, per-catalog failure → UNKNOWN risk, endpoint dispatch, validator) - All existing tests preserved.
Out of scope (deferred follow-ups)
- Per-catalog comparison "diff" view (which schemas exist in catalog A but not B). Today's side-by-side rollup gets users 80% of the way; a true diff is a follow-up if customers ask.
Unreleased — Catalog Explorer: multi-catalog selection
Added
- Multi-catalog mode on
/explore: a new "Single / Multi" pill next to the catalog picker switches the page between the existing single-catalog flow and a checkbox-popover picker that emitsstring[]. Aggregate stats (Schemas / Tables / Total Size / Total Rows) sum across the selected catalogs; the Tables tab gains a leading Catalog column for sort/filter; the Overview tab adds a Per-Catalog Rollup card showing each catalog's contribution. - New module
src/stats_multi.py—catalog_stats_multi(client, warehouse_id, catalogs, exclude_schemas, fast=True, max_parallel=5)fans the per-catalog stats run out across N catalogs in aThreadPoolExecutorand merges responses. Wall-clock latency is the slowest catalog, not the sum (3-catalog Multi explore completes in ~1-3s on the fast path). - Failure isolation: one catalog inaccessible (auth / mid-deletion) does NOT abort the whole request — the response carries
errors: [{catalog, error}] while the rest of the catalogs surface normally; the UI renders failed catalogs in red on the Per-Catalog Rollup card.
- StatsRequest (new model in api/models/analysis.py) — subclasses CatalogRequest, accepts either source_catalog: str (single, existing contract) or source_catalogs: list[str] (new), with a Pydantic model_validator requiring at least one. Other endpoints (search, estimate, storage-metrics, profile, snapshot, export) keep the unmodified CatalogRequest so their single-catalog contract is unchanged.
- /stats dispatch: when source_catalogs is non-empty the route dispatches to catalog_stats_multi; otherwise the existing fast flag picks catalog_stats_fast vs catalog_stats. Single-catalog callers see no behavioural change.
- useStats hook (ui/src/hooks/useApi.ts): now accepts { source_catalog?, source_catalogs?, fast? } and persists multi responses to sessionStorage under clxs-stats-multi-<sorted-csv>-<mode> (sorted so [a,b] and [b,a] share a slot). getCachedStats accepts either a single catalog string (legacy) or an array.
- CatalogPicker component: opt-in multi prop renders a checkbox popover with "Select all / Clear" controls; click-outside closes the popover. Single-mode rendering unchanged.
- Single-only tabs gracefully degrade: Functions / Volumes / PII Detection / Feature Store / Search render a "This tab requires a single catalog" placeholder card with a "Switch to Single" button when N>1, instead of running per-catalog (deferred to a follow-up batch).
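The fan-out with failure isolation follows a common ThreadPoolExecutor pattern, sketched below. The names `fan_out` and `scan_one` are hypothetical and the response shape is simplified; this shows the general technique rather than the body of catalog_stats_multi.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fan_out(catalogs, scan_one, max_parallel=5):
    """Run scan_one(catalog) for every catalog in parallel, isolating failures."""
    if not catalogs:
        raise ValueError("at least one catalog is required")
    per_catalog, errors = {}, []
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(scan_one, c): c for c in catalogs}
        for future in as_completed(futures):
            catalog = futures[future]
            try:
                per_catalog[catalog] = future.result()
            except Exception as exc:  # one bad catalog must not abort the rest
                errors.append({"catalog": catalog, "error": str(exc)})
    return {"per_catalog": per_catalog, "errors": errors, "catalogs": list(catalogs)}
```

Because each future is awaited inside its own try block, an inaccessible catalog ends up in errors while the others still return results, and wall-clock time tracks the slowest catalog rather than the sum.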
Tested
- 15 new tests in
tests/test_stats_multi.py: merge correctness (totals sum, table-row catalog stamping, schema-row stamping, per_catalog rollup populated, top-N recomputed cross-catalog), per-catalog failure isolation, fast vs detailed path selection, empty list raises, endpoint dispatch (source_catalogs routes to multi, source_catalog routes to single, neither returns 422, empty source_catalogs falls back). tests/test_stats_fast.py:TestEndpointDispatchextended to cover the multi routing.
Out of scope (deferred — Option B)
- Multi-aware Functions / Volumes / PII / Feature Store / Search tabs (would require per-tab cross-catalog endpoints).
- Comparison views (per-catalog donut diff, side-by-side schema rollup).
Unreleased — Demo Data Generator: Star Schema modeling layer
Added
- New
data_modelfield onDemoDataRequest(Literal["flat", "star_schema"], defaultflat). When set tostar_schema, the orchestrator builds a<industry>_starschema on top of the existing flat industry tables (CTAS materialisation, ~5% extra runtime), with fact / dimension tables following Kimball conventions and DBT-style naming. - New module
src/demo_models.py—STAR_SCHEMA_REGISTRYcovering all 10 built-in industries (healthcare, financial, retail, telecom, manufacturing, energy, education, real_estate, logistics, insurance), plusgenerate_star_schema(client, warehouse_id, catalog, industry, …)andgenerate_star_schemas_for_industries(...). - Naming conventions (DBT-style): schemas as
<industry>_star; facts asfct_<entity>(e.g.fct_claims,fct_transactions,fct_order_items); dims asdim_<entity>(e.g.dim_patient,dim_customer,dim_product); surrogate keys as<entity>_sk(BIGINT generated viarow_number()); audit cols on dims (valid_from,valid_to,is_current). - Universal
dim_dateper Star schema, generated viasequence(date(start_date), date(end_date), interval 1 day)plus year/quarter/month/week/day_of_week/is_weekend columns. - Derived dims — extracted from fact-column DISTINCT values where the flat layer doesn't have a corresponding dim table (e.g.
dim_diagnosisfromclaims.diagnosis_code). - Fact CTAS preserves original FK columns alongside the new surrogate keys, so the fact remains queryable without dim joins; users choose which keys to use depending on demo style.
schema_only=Trueproduces empty-shell DDL for the Star layer too — tables exist with the right shape (including SCD2 audit columns) but zero rows. Generation completes in seconds.- Result shape additions: when
data_model="star_schema", the run summary gainsdata_model,star_schema.schemas_created,star_schema.facts_created,star_schema.dims_created, andstar_schema.per_industryblocks. - /demo-data UI: new "Data modeling pattern" dropdown (Flat / Star Schema) with an inline explainer card; completion summary renders a "Star Schema modeling layer" panel listing per-industry schemas and fact/dim counts.
- Per-industry failure isolation: one industry's CTAS failure doesn't abort the rest —
per_industry[i].errorcarries the failure reason while other industries' Star schemas land normally. docs/docs/guide/demo-data.md— new "Data modeling patterns" section covering layout, naming conventions, per-industry coverage matrix, the CTAS algorithm, sample query, and known trade-offs (storage cost, SCD2 history scope).
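To make the naming conventions concrete, here is a hedged sketch of how a conformed-dimension CTAS statement could be assembled. The function `dim_ctas_sql`, the assumption that the flat tables live in a schema named after the industry, and the TIMESTAMP types for the audit columns are illustrative; the actual SQL in src/demo_models.py may differ.

```python
def dim_ctas_sql(catalog, industry, entity, source_table, business_key):
    """Build a CTAS statement for one conformed dimension (illustrative only)."""
    target = f"{catalog}.{industry}_star.dim_{entity}"  # <industry>_star schema, dim_ prefix
    source = f"{catalog}.{industry}.{source_table}"     # assumed flat-layer location
    return (
        f"CREATE OR REPLACE TABLE {target} AS\n"
        f"SELECT\n"
        f"  CAST(row_number() OVER (ORDER BY src.{business_key}) AS BIGINT) AS {entity}_sk,\n"
        f"  src.*,\n"
        f"  current_timestamp() AS valid_from,\n"
        f"  CAST(NULL AS TIMESTAMP) AS valid_to,\n"
        f"  true AS is_current\n"
        f"FROM {source} AS src"
    )
```

Calling `dim_ctas_sql("demo", "healthcare", "patient", "patients", "patient_id")` yields a statement that creates demo.healthcare_star.dim_patient with a patient_sk surrogate key ahead of the original columns.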
Tested
- 15 unit tests in
tests/test_demo_models.pycovering: registry shape (all 10 industries present, fct_/dim_ prefixes, FK references resolve), conformed dim CTAS (surrogate key + audit cols), derived dim CTAS (DISTINCT), fact CTAS (LEFT JOINs each registered dim, pass-through when no FKs), unknown-industry skip, schema_only DDL-only path, multi-industry orchestration with per-industry failure isolation. - 2 orchestrator integration tests (
data_model="flat"is a no-op;data_model="star_schema"attaches the result block).
Out of scope (deferred)
- Data Vault 2.0 (h_/l_/s_ tables with hash keys + load metadata)
- One Big Table (denormalised wide tables)
- Snowflake (normalised dim hierarchies)
- SCD2 row history (v1 dims have audit columns but a single row per business key — real history infrastructure deferred)
Unreleased — Demo Data Generator enhancements (4-theme batch)
Added
- Theme 1 — Realism (Faker): new
src/demo_faker.pybuilds locale-aware name / email / phone / SSN pools at generation time and embeds them as SQLarray(...)literals.realistic_data: trueonDemoDataRequestrewrites the legacy'James'/'Mary'/'patient1@example.com'/'555-XXXXXXX'patterns. Per-locale (en_US,en_GB,de_DE,fr_FR,ja_JP,zh_CN,hi_IN) + optionalseedfor deterministic output. - Theme 2 — DQ profiles + ML training labels: new
src/demo_anomalies.pywith named profiles (clean/realistic/dirty) controlling null/dup/outlier rates, andinject_labeled_anomaliesaddingis_fraud(financial.transactions),churn_risk(telecom.subscribers),is_anomaly(healthcare.encounters + manufacturing.sensor_readings) at a configurableanomaly_rate. Surfaces ananomaliesblock on the result for the UI to render. - Theme 3 — Referential integrity audit: new
_FK_RELATIONSHIPS registry + _validate_referential_integrity runs sampled LEFT JOIN ... WHERE parent.pk IS NULL checks across registered FKs after generation. Surfaces a referential_integrity block with per-FK orphan counts on the result. Skipped on schema_only=true and when validate_referential_integrity=false.
- Theme 4 — UI insight + extensibility:
schema_only: trueskips every INSERT/UPDATE/DELETE — DDL-only generation completes in seconds for CI smoke + DDL-template verification. Volumes still create as DDL but skip the sample CSV writes.- New
POST /api/generate/demo-data/previewreturns per-industry row/size/cost/duration estimates without submitting a job. The /demo-data UI surfaces this as a "Per-industry breakdown" tile alongside the existing static estimate. - "Export JSON" button on /demo-data downloads the form state as a round-trippable preset.
- FK relationship diagram on the result panel visualises the audit's per-FK orphan-free / orphan rows.
- New
src/demo_industry_loader.pyparses YAML custom industry templates, validates the schema (fail-fast on malformed YAML, missing keys, reserved names), merges into the runtimeINDUSTRIESdict for the run duration. Pass paths viacustom_industriesonDemoDataRequest.
api/models/demo.py: 9 new optional fields (schema_only,realistic_data,locale,seed,validate_referential_integrity,dq_profile,anomaly_rate,inject_anomalies,custom_industries) with field validators. All defaults preserve existing behaviour — pre-batch callers see no shape change.- /demo-data UI: locale dropdown + seed input, DQ-profile dropdown + anomaly-rate slider + inject-anomalies toggle, schema-only checkbox, Per-industry breakdown tile, Export JSON button, FK integrity audit panel + Labeled training columns rollup on the completion summary.
- Faker dep:
faker>=20.0added topyproject.tomldependencies. Imported lazily — only fires whenrealistic_data=true.
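The sampled orphan check from Theme 3 can be illustrated with a small query builder. The function name `orphan_check_sql`, the LIMIT-based sampling, and the extra filter that ignores NULL foreign keys are assumptions added for the example; only the LEFT JOIN ... WHERE parent.pk IS NULL shape comes from the notes above.

```python
def orphan_check_sql(child_table, fk_column, parent_table, pk_column, sample_rows=10000):
    """Build a query counting sampled child rows whose FK has no matching parent."""
    return (
        f"SELECT count(*) AS orphan_count\n"
        f"FROM (SELECT {fk_column} FROM {child_table} LIMIT {sample_rows}) AS child\n"
        f"LEFT JOIN {parent_table} AS parent\n"
        f"  ON child.{fk_column} = parent.{pk_column}\n"
        f"WHERE parent.{pk_column} IS NULL\n"
        f"  AND child.{fk_column} IS NOT NULL"  # assumption: NULL FKs are not orphans
    )
```

A result of zero for a given FK corresponds to the "orphan-free" state the FK relationship diagram renders.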
Tested
- 13 new tests in
tests/test_demo_industry_loader.py(valid YAML, missing files, malformed YAML, missing required keys, reserved-name rejection, table-shape validation, duplicate detection, base-not-mutated invariant) - 19 new tests in
tests/test_demo_anomalies.py(DQ profile rates, clean=no-op, dirty>realistic, ALTER+UPDATE shape, anomaly_rate validation, orchestrator surfacesanomaliesblock) - 9 new tests in
tests/test_demo_referential_integrity.py(registry shape, sampled LEFT JOIN, orphan counts, per-FK failure isolation, orchestrator opt-out paths) - 15 new tests in
tests/test_demo_faker.py(pool shapes, determinism, locale, idempotent substitution, missing-dep error) - 7 new tests in
tests/test_router_generate_preview.py(helper edge cases + endpoint validation) - 1 new test in
tests/test_demo_generator.py(schema_only skips INSERTs)
Unreleased — Continuous sync executor (Feature 6)
Added
- Continuous sync moved from preview-only to executor. The v0.11.0
src/continuous_sync.pyonly generated a streaming plan; this release addssrc/continuous_sync_runner.pywhich submits the plan to Databricks Jobs (client.jobs.submit), tracks run-ids in a process-local registry, classifies run state into user-facing health (starting/running/stopping/stopped/failed/idle/unknown), and exposes start/stop/restart controls. - 5 new endpoints under
/api/continuous-sync:
  - POST /start — submit a stream, get back {stream_id, run_id, status}.
  - GET /streams — list registered streams (cached) or ?refresh=true to poll Databricks per stream.
  - GET /streams/{stream_id} — detail view, always polls fresh state.
  - POST /streams/{stream_id}/stop — idempotent cancel.
  - POST /streams/{stream_id}/restart — cancel + new submit, same stream_id, new run_id.
- Re-attachment after API server restart:
discover_existing_streams(client)scansjobs.list_runsfor runs whoserun_namestarts withclxs-continuous-sync-and re-populates the registry. Streams running on Databricks survive an API server bounce; the runner finds them again on startup. - Stable
stream_id: hash of(source, dest, schema, sorted(tables)). Callingstarttwice with the same parameters reuses the existing record — no ghost entries from idempotent retry. docs/docs/guide/sync.md— "Continuous sync" section with the lifecycle, API examples, prerequisites (CDF + PK + checkpoint write permissions), failure-mode recovery, and explicit limitations (24h+ smoke testing is a manual ops exercise, not part of the unit suite).
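A stable stream_id of the kind described above can be derived by hashing a canonical encoding of the parameters. The sketch below assumes SHA-256 over a JSON array and a 16-character prefix; the changelog only says the id is a hash of (source, dest, schema, sorted(tables)), so the hash function and length are guesses.

```python
import hashlib
import json


def stream_id(source, dest, schema, tables):
    """Derive a deterministic id; table order does not affect the result."""
    payload = json.dumps([source, dest, schema, sorted(tables)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Sorting the table list before hashing is what makes a retried start call with the same tables in a different order land on the existing registry record instead of creating a ghost entry.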
Tested
- 36 unit tests in
tests/test_continuous_sync_runner.pycovering: every documented Databrickslife_cycle_state×result_statemapping (13 tests), stream-id stability (sorted-table-list invariance, dest-change differentiation), submit-success + record registration, submit-failure marks failed without raising, invalid-plan ValueError surfacing, stop with cancel + idempotency on already-stopped, stop without run_id (skip cancel), cancel-failure logged not raised, restart preserves stream_id and submits fresh run, refresh translates RUN states + captures state_message on failure, list with/without refresh, get_stream/restart KeyError on unknown id, discover_existing_streams (rediscovery + skip-already-known + list_runs failure), and serialisation round-trip. - 9 router tests in
tests/test_router_continuous_sync.pycovering: legacy plan endpoint still returns preview spec, plan-rejects-no-tables-no-schema (400),POST /startreturns{run_id, status: starting, stream_id}, invalid plan via /start surfaces as 400, list-after-start, 404 on get/stop/restart for unknown stream_id, full start→stop lifecycle marks stopped + invokescancel_run.
Unreleased — Multi-target fanout (UI + backend)
Added
- /clone Step 1: Multi-target fan-out picker — new "Fan out to multiple targets (parallel multi-region clone)" checkbox under "Clone to a different workspace". Off (default): the single-target dropdown stays as-is. On: replaced by a multi-select of saved target connections + a parallel numeric input (default 5). Selected count is shown live ("Targets (3 of 7 selected)"). Submission payload switches from target_workspace (singular) to target_workspaces (plural) plus fanout_max_parallel, dispatching to the clone_fanout orchestrator.
- /clone Step 3: Preview tile now reflects fanout — destination summary shows "Fan out → N targets" with the picked names, and a dedicated "Fanout targets" card lists each selected workspace with its host + warehouse for sanity-check before run. Pipeline diagram is hidden in fanout mode (N stacked diagrams would be visually noisy).
- /clone Step 4: Per-target rollup — when the result has mode: "fanout", the success/failure card renders a per-target row (✓/✗ icon, host, tables/bytes/duration on success, error string on failure). Aggregate badge (SUCCESS / PARTIAL / FAILED) coloured by status.
- normalizeResult extended for fanout-shaped results — same flat-field mapping that worked for single-target cross-workspace results applies, so older job records without canonical aliases still render correctly.
Unreleased — Multi-target fanout (target_workspaces)
Added
- New
target_workspacesfield (list ofTargetWorkspace) onCloneRequest— when set, the job is routed to a new fanout orchestrator that runs N cross-workspace clones in parallel, one per target. Use case: N-region DR replication where the same source catalog needs to land ineu,us, andapacsimultaneously instead of sequentially. Mutually exclusive with the singulartarget_workspacefield (Pydantic XOR validator returns 422 if both set). - New
fanout_max_parallelfield (default 5) caps simultaneous target clones. Tune down for source-side bandwidth pressure or up if your source warehouse can handle the parallelism. - New module
src/clone_fanout.py—run_cross_workspace_fanout(client, config) -> dict. Per-target results aggregate into a single response withmode: "fanout",status: "success" | "partial" | "failed"(success = every target succeeded; partial = some did; failed = none did), per-target detail underper_target, and rolled-upbytes_copied/files_copied/tables_clonedtotals. - Failure-isolation contract: one target failing (auth issue, network blip, mid-clone DEEP CLONE error, same-metastore preflight rejection) does NOT fail other targets. The failure is contained to that target's per_target entry; aggregate goes
partialand the surviving targets land their data normally. This is the central reason fanout is a feature rather than a "for-loop in the caller" — per-target source-side state (share / recipient / shared-catalog) is independent, so isolating failure was always achievable, but rolling it up into one job ID for the operator is what makes this usable. - Router dispatch in
api/routers/clone.pyroutestarget_workspaces(plural) →clone_fanoutjob_type,target_workspace(singular) →clone_cross_workspace, neither →clone. JobManager picks the right entrypoint via the existing job_type dispatch chain. docs/docs/guide/clone.md— "Multi-target fanout" subsection under Cross-workspace migration with the routing table, per-target failure modes, and an example aggregated response payload.
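The aggregate status rule (success when every target succeeded, partial when some did, failed when none did) can be sketched as a small reducer. The function name `aggregate_fanout`, the per-target `status` key, and the choice to sum totals only over successful targets are assumptions for illustration.

```python
def aggregate_fanout(per_target):
    """Roll per-target results up into one fanout response."""
    succeeded = [t for t in per_target if t.get("status") == "success"]
    if per_target and len(succeeded) == len(per_target):
        status = "success"
    elif succeeded:
        status = "partial"
    else:
        status = "failed"
    totals = {
        key: sum(t.get(key, 0) for t in succeeded)
        for key in ("bytes_copied", "files_copied", "tables_cloned")
    }
    return {"mode": "fanout", "status": status, "per_target": per_target, **totals}
```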
Tested
- 10 unit tests in
tests/test_clone_fanout.pycovering the four scenarios the roadmap called out (all-succeed, one-target-connection-failure isolation, one-target-mid-clone-failure isolation, same-metastore-preflight rejection isolated to offending target), plus all-fail → status=failed, single-target degenerate case, zero-targets validation, plural-config-stripping (would otherwise infinite-recurse), max_parallel capping, and a parallel-execution timing assertion (3 × 100ms tasks complete in < 250ms wall clock). - 3 router integration tests in
tests/test_router_clone.pyconfirming/api/cloneacceptstarget_workspaces(200 with fanout-flavoured message), rejects setting both singular + plural (422), and rejectsfanout_max_parallel < 1(422).
Unreleased — Pre-clone source quiesce (quiesce_source: true)
Added
- New
quiesce_sourceopt-in flag onCloneRequestand the YAML config. When true, Clone-Xs snapshots + revokes write privileges (MODIFY,WRITE_VOLUME,CREATE_TABLE,CREATE_VOLUME,CREATE_FUNCTION,CREATE_MATERIALIZED_VIEW,CREATE_MODEL,APPLY_TAG) on the source schemas at clone start, and restores them in a finally block at clone end. Concurrent writes that arrive mid-clone fail withPERMISSION_DENIEDinstead of landing on a half-cloned target. - New module
src/quiesce.py—quiesce_source_schemas(client, source_catalog, schemas) → list[SchemaGrantSnapshot]andrestore_source_grants(client, snapshots). Reads + writes go through the SDK Grants API (client.grants.get/client.grants.update) — no SQL warehouse needed for the quiesce itself. - Wired into both orchestrators —
src/clone_catalog.py(same-workspace) andsrc/clone_cross_workspace.py(cross-workspace). Cross-workspace clones are typically longer-running (Delta Sharing + DEEP CLONE across regions), so they benefit most. Restore runs unconditionally in the existing finally block — no orphaned revocations on partial failure or budget abort. docs/docs/guide/clone.md— "Pre-clone source quiesce" section documenting the snapshot/revoke/restore flow, what stays writable (SELECT, USE_SCHEMA, owners), failure semantics for per-principal failures, and the cost/risk trade-offs.
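The snapshot, revoke, and restore-in-finally flow can be illustrated with a context manager over an in-memory grants map. This is a simplified stand-in: the real module goes through the SDK Grants API, whereas the `quiesced` helper and the `{schema: {principal: set_of_privileges}}` structure below are assumptions made so the example runs without a workspace.

```python
from contextlib import contextmanager

WRITE_PRIVILEGES = {
    "MODIFY", "WRITE_VOLUME", "CREATE_TABLE", "CREATE_VOLUME",
    "CREATE_FUNCTION", "CREATE_MATERIALIZED_VIEW", "CREATE_MODEL", "APPLY_TAG",
}


@contextmanager
def quiesced(grants):
    """Temporarily strip write privileges from an in-memory grants map."""
    # Snapshot first, so restore can put back exactly what was there.
    snapshot = {
        schema: {p: set(privs) for p, privs in principals.items()}
        for schema, principals in grants.items()
    }
    try:
        for principals in grants.values():
            for principal in principals:
                principals[principal] -= WRITE_PRIVILEGES  # revoke write access
        yield snapshot
    finally:
        # Restore runs even when the clone body raises.
        for schema, principals in snapshot.items():
            grants[schema] = {p: set(privs) for p, privs in principals.items()}
```

Wrapping the clone body in `with quiesced(grants):` mirrors the property the release notes call out: read privileges such as SELECT stay in place throughout, and a failed clone still leaves the original grants restored.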
Tested
- 13 unit tests in
tests/test_quiesce.pycovering: only write privileges are revoked (not SELECT/USE_SCHEMA/EXECUTE), CREATE_* privileges are blocked to prevent new objects mid-clone, no-op when no write principals (the roadmap's edge case), dry-run captures snapshot but skips API calls,grants.getfailure leaves schema writable, per-principal revoke failure doesn't crash, restore matches snapshot exactly, restore's per-principal failure is logged not raised, and the round-trip integration test (clone raises → restore still runs). - 1 router integration test confirming
/api/cloneacceptsquiesce_source: true(200, not 422).
Unreleased — Dry-run cost comparison: full clone vs selective re-clone
Added
/estimateAPI now returns aselectivecomparison block when caller passesdestination_catalogAND the target catalog already exists. Block contains:size_bytes/size_gb/monthly_cost_usdfor the drift-only set,tables_to_clone(drifted count),tables_in_sync(skipped count),savings_pct, adrift_breakdown(by reason), and arecommendedboolean (true when savings ≥ 50%, the threshold above which the per-table DESCRIBE HISTORY overhead is worth paying). Caller-side,EstimateRequestgains an optionaldestination_catalogfield; existing callers that omit it see no shape change.compute_selective_estimate(client, warehouse_id, source_catalog, destination_catalog, schemas, source_table_sizes, price_per_gb)helper insrc/cost_estimation.py— reusesfind_drifted_tablesfromsrc/incremental_sync.pyso the comparison tile and the actual SELECTIVE re-clone agree on what's drifted (no skew between the preview and the real run)./cloneStep 4 preview tile renders "Full clone vs selective re-clone" side-by-side (ui/src/components/PreviewPanel.tsxEstimateSection). Tile shows full size · cost vs selective size · cost, with a "Recommended: SELECTIVE" or "Recommended: FULL" badge and a drift breakdown row (never_cloned: 2 · version_drift: 5 · unable_to_compare: 1). Hidden entirely on fresh-target clones (no point comparing against an empty target) and on cross-workspace previews (source client can't read target Delta versions through the workspace boundary).
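The savings calculation behind the recommended flag reduces to a few lines of arithmetic. In this sketch the function name `selective_comparison`, the binary-GB conversion, and the rounding are assumptions; the 50% threshold and the output keys shown are taken from the description above.

```python
def selective_comparison(full_bytes, drifted_bytes, price_per_gb, threshold_pct=50.0):
    """Compare a full clone against re-cloning only the drifted tables."""
    if full_bytes <= 0:
        savings_pct = 0.0
    else:
        savings_pct = round(100.0 * (full_bytes - drifted_bytes) / full_bytes, 1)
    size_gb = drifted_bytes / 1024 ** 3  # assumption: binary gigabytes
    return {
        "size_bytes": drifted_bytes,
        "size_gb": round(size_gb, 2),
        "monthly_cost_usd": round(size_gb * price_per_gb, 2),
        "savings_pct": savings_pct,
        "recommended": savings_pct >= threshold_pct,  # SELECTIVE wins at >= 50%
    }
```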
Tested
- 6 unit tests in tests/test_cost_estimation.py covering: target-missing → None, recommends SELECTIVE on ≥ 50% savings, recommends FULL below threshold, drift_breakdown aggregation across reasons, zero-drift edge case, and resilience when one schema's drift check raises (others still computed).
- 3 integration tests on estimate_clone_cost confirming the selective block is present when the target exists, absent when the target is missing, and absent when the caller doesn't supply destination_catalog.
Unreleased — Selective re-clone (load_type: SELECTIVE)
Added
- Third load_type value: SELECTIVE — alongside FULL and INCREMENTAL on CloneRequest and the clxs clone --load-type CLI flag. New orchestrator src/selective_reclone.py re-clones only tables whose source Delta version has drifted from target. Tables whose source.version == target.version are skipped (the whole point: runtime is proportional to drift, not catalog size). Tables present on source but missing from target count as drifted (reason: never_cloned). Tables Clone-Xs can't read a version from on either side (Parquet/Iceberg sources, transient SDK errors) are treated as drifted (reason: unable_to_compare), which is conservative and cheaper than missing real drift. Tables on target but absent from source are NOT touched: selective re-clone is additive only, never destructive.
- find_drifted_tables(client, warehouse_id, source, dest, schema) helper in src/incremental_sync.py — compares source vs target Delta versions directly via DESCRIBE HISTORY (not the JSON sync_state file the older get_tables_needing_sync used). Works correctly cross-workspace too, since it only reads from the SDK.
- JobManager dispatch routes SELECTIVE to the new orchestrator — same /api/clone endpoint, same audit-trail / run-id wiring, same dry-run plumbing. Existing FULL/INCREMENTAL callers are unaffected (default load_type stays FULL).
- Run summary gains mode: "selective" and total_drifted_tables: N keys so downstream report generators can distinguish a selective run from a regular one. Per-table metrics (bytes_copied, files_copied) and per-format counters (formats: {DELTA: 2, PARQUET: 1}) still aggregate identically, so selective runs keep the benefits of the Tier 1/2 work unchanged.
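The drift rules above reduce to a small classification step. The sketch below is illustrative only: classify_drift and its dictionary inputs are hypothetical stand-ins (the real helper reads versions via DESCRIBE HISTORY), and a version of None is assumed to mean "could not be read". The three reason strings are the ones named in this changelog.

```python
def classify_drift(source_versions, target_versions):
    """Return {table: reason} for every source table that needs a re-clone.

    Both arguments map table name -> Delta version (int), or None when the
    version could not be read. A table absent from target_versions does not
    exist on the target.
    """
    drifted = {}
    for table, src_version in source_versions.items():
        if table not in target_versions:
            drifted[table] = "never_cloned"
            continue
        tgt_version = target_versions[table]
        if src_version is None or tgt_version is None:
            # Conservative: an unreadable version counts as drift.
            drifted[table] = "unable_to_compare"
        elif src_version != tgt_version:
            drifted[table] = "version_drift"
        # Equal versions: in sync, skipped.
    # Tables that exist only on the target are never visited, so they are
    # never touched (additive only).
    return drifted
```

Because the loop iterates over source tables only, target orphans cannot appear in the result, which mirrors the "never destructive" guarantee.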
Tested
- 11 unit tests in tests/test_selective_reclone.py covering: drift detection (never-cloned, version-drift, in-sync, unable-to-compare, target orphans ignored), get_table_current_version edge cases (empty history, garbage version), drift breakdown helper, and the orchestrator (drifted-only invocation of _clone_single_table, no-drift no-op, metrics + format counter aggregation).
- 2 router tests in tests/test_router_clone.py confirming /api/clone accepts load_type=SELECTIVE (200) and rejects unknown values (422).
Unreleased — Mixed-format source support (Delta + Parquet + Iceberg)
Added
- Per-source-format counter on every clone run — clone_tables_in_schema and the cross-workspace orchestrator now emit a formats rollup (e.g. {DELTA: 26, PARQUET: 2, ICEBERG: 1}) alongside bytes_copied / files_copied in the run summary. Clone-Xs has always been format-agnostic at the SQL level (Databricks's CREATE TABLE … CLONE source works for Delta, Parquet, and Iceberg sources registered in UC), but the run summary previously didn't surface the mix. The /clone Step 4 result card now renders a "Source formats:" badge row when more than one format is present in the catalog — useful for in-progress format migrations where you want to confirm your DELTA+PARQUET catalog landed entirely as DELTA on the target.
- Iceberg / Parquet error wrapping — known Databricks CLONE limitations now wrap with an actionable hint pointing at the Databricks Parquet/Iceberg CLONE doc instead of bubbling the raw [DELTA_CLONE_*] error. Covers: Iceberg with partition evolution, Iceberg with truncated decimal partitions on DBR < 13.3, partitioned Parquet referenced by path, and any source path using glob/wildcard patterns. The original Databricks error stays inline below the hint for diagnostics.
- docs/docs/guide/clone.md — mixed-format section under Stage 3 — Tables documenting the format-agnostic CLONE behaviour, the run summary breakdown, and the Databricks-side gotchas Clone-Xs cannot work around.
Tested
- 3 unit tests in tests/test_clone_tables.py covering: per-format counter aggregation across a mixed Delta/Parquet/Iceberg/no-format-tag schema, failed clones excluded from the format counter, and case-insensitive normalisation (parquet / Parquet / PARQUET all rolled up under PARQUET).
- 1 cross-workspace test in tests/test_clone_cross_workspace.py verifying _list_tables emits (name, format) tuples for Delta + Parquet + Iceberg, defaults to DELTA when format is unset, and excludes views.
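The counter behaviour these tests describe can be sketched in a few lines. rollup_formats and its (format, succeeded) input tuples are hypothetical. Treating a missing format tag as DELTA is stated above only for _list_tables, so applying it to the counter is an assumption here.

```python
from collections import Counter


def rollup_formats(results):
    """results: iterable of (source_format_or_None, succeeded) per table."""
    counts = Counter()
    for source_format, succeeded in results:
        if not succeeded:
            continue  # failed clones are excluded from the rollup
        # Assumption: an unset format tag is counted as DELTA.
        counts[(source_format or "DELTA").upper()] += 1
    return dict(counts)
```

Upper-casing before counting is what makes parquet, Parquet, and PARQUET land in the same bucket.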
Unreleased — Browser-side target connections + cross-workspace robustness
Added
- Scheduled cross-workspace clones — src/scheduler.py's run_scheduled_clone now branches on target_workspace: when set, the scheduler routes to run_cross_workspace_clone (Delta Sharing + DEEP CLONE pipeline) instead of the same-workspace clone_catalog. Drift detection (compare_catalogs) is skipped for cross-workspace runs, since it only works within one metastore, and the cross-workspace orchestrator's data_sync_mode (snapshot_once / incremental / force_full) handles re-run semantics directly. Enables genuine "set up DR once, daily incremental refresh runs unattended" workflows.
- Six cross-workspace config fields promoted to Pydantic API models — cleanup_after_clone and prune_share_extras on TargetWorkspace; clone_views, clone_functions, clone_volumes, and volume_max_file_mb on CloneRequest. Fields were already honoured at runtime via config.get(...) (so clxs clone users had them) but were silently dropped by Pydantic v2's extra="ignore" when sent over POST /api/clone. Now first-class on the API too. Defaults match the orchestrator's existing fallbacks (no behavioural change for existing callers).
- Target Workspaces management in /settings — new section under Settings → Target Workspaces lets you save named cross-workspace clone targets (prod-azure, dev-aws, etc.) once and pick them from a dropdown on /clone instead of re-entering host + PAT + warehouse_id every time. Each saved entry shows host, auth method, warehouse, sync mode, and an auto-fetched "Logged in as <user>" line so you can verify the identity at a glance.
- Browser-only credential storage — saved target connections live in localStorage["clxs_target_connections"]. The server is intentionally stateless w.r.t. target creds: clones send full creds inline per request, sourced from the picked localStorage entry. No PATs persist on disk, no yaml file to gitignore, nothing for GitHub push protection to scan.
- Unified Source & Destination card on /clone — collapsed the previous two-card layout (Source & Destination + Target Workspace) into a single card. The "Clone to a different workspace" checkbox lives inside the Source & Destination card; the descriptive subtitle hides once the box is ticked.
- Destination Catalog dropdown queries the target when cross-workspace mode is on — picks from catalogs that actually exist on the target workspace (or + Create New), instead of source-side catalogs that don't.
- "Logged in as" identity surfacing — on Settings → Authentication (source side) and on each saved target connection card. Target side uses a new lightweight POST /target/whoami endpoint that calls client.current_user.me() without touching the warehouse (no cold-start cost).
- Same-metastore preflight check in cross-workspace clone — before any SHARE / RECIPIENT objects are created, Clone-Xs compares source vs target global_metastore_id. If they match, the clone fails fast in 1–2 seconds with "Source and target workspaces are in the same Unity Catalog metastore — Delta Sharing requires distinct metastores. Untick 'Clone to a different workspace' and use the in-metastore clone instead." Eliminates a whole class of confusing failures where CREATE RECIPIENT IF NOT EXISTS silently no-ops because you can't share to your own metastore.
- POST /target/catalogs — new stateless endpoint that takes inline target creds and returns catalog names. Used by the Destination Catalog dropdown when cross-workspace mode is enabled.
- POST /target/whoami — new stateless endpoint that returns the authenticated identity for a given target's creds. Cheap (no warehouse, no metastore lookup), used to populate "Logged in as" without forcing a full Test connection.
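The same-metastore preflight amounts to one comparison made before any sharing objects exist. The sketch below assumes the two global metastore ids have already been fetched as strings; SameMetastoreError and preflight_distinct_metastores are hypothetical names, and the error text is abbreviated rather than the exact message quoted above.

```python
class SameMetastoreError(RuntimeError):
    """Raised when source and target resolve to the same Unity Catalog metastore."""


def preflight_distinct_metastores(source_global_id, target_global_id):
    # Delta Sharing needs two distinct metastores, so fail before creating
    # any SHARE or RECIPIENT objects.
    if source_global_id == target_global_id:
        raise SameMetastoreError(
            "Source and target workspaces are in the same Unity Catalog "
            "metastore; use the in-metastore clone instead."
        )


# Usage: distinct ids pass silently, identical ids abort the run early.
preflight_distinct_metastores("aws:us-east-1:1111", "azure:westeurope:2222")
```

Running the check first means a misconfigured target surfaces as a single clear error instead of a later GRANT or DEEP CLONE failure.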
Fixed
- Recipient reuse-existing-or-create — Databricks Unity Catalog enforces uniqueness on (source_metastore, target_metastore_sharing_id): at most ONE recipient per target metastore from a given source. After the first cross-workspace clone created clone_xs_recipient_<suffix-A> pointing at the target metastore, subsequent clones from the same source to the same target (regardless of dest catalog name, regardless of the recipient name we tried) failed because the target metastore "slot" was already taken. The SQL CREATE RECIPIENT … USING ID … channel via the Statement Execution API was silently swallowing the underlying "already exists with same sharing identifier" error, making each attempt look like a different bug. The fix is two parts:
- Switched recipient creation to the SDK — source_client.recipients.create(...) instead of SQL DDL. Hits a different REST endpoint (/api/2.1/unity-catalog/recipients) that surfaces the real error instead of the silent no-op.
- New _find_recipient_for_target() helper + reuse path in src/clone_cross_workspace.py — before any CREATE, scans existing recipients for one whose data_recipient_global_metastore_id matches the target sharing id. If found, reuses that recipient (logs the swap, updates recipient_name and result.recipient_name so GRANT and audit see the right name). Recipients are pure auth identifiers (one can be GRANTed to many shares), so reusing across (source_catalog, dest_catalog) clone pairs is correct. The share name stays deterministic per pair.
- CREATE RECIPIENT IF NOT EXISTS silently swallowing real errors — Databricks's IF NOT EXISTS returns success even when the create fails for unrelated reasons (cross-region/account constraint, missing entitlement, etc.). Clone-Xs now probes via SHOW RECIPIENTS LIKE first; if the recipient doesn't exist, it runs the SDK recipients.create() (which surfaces underlying errors) instead of SQL DDL. If the post-create visibility probe still can't see the recipient, the clone fails immediately with both metastore IDs and a copy-paste diagnostic SQL, so it no longer proceeds to GRANT and emits the misleading "phantom recipient" message.
- auto_handle_masks retry-on-failure — the upfront _inventory_table_protections parser via DESCRIBE EXTENDED doesn't reliably detect every mask/filter format. The ADD TABLE loop now catches the specific "row level security or column masks" error from Delta Sharing itself, then runs inventory + drop + retry once. If inventory still misses it, falls back to a blind ALTER TABLE ... DROP ROW FILTER. Source-side restoration still runs in the finally block. Fixes the case where tables with row filters (e.g. via ALTER TABLE ... SET ROW FILTER) couldn't be added to the share.
- Force-refresh shared catalog when share grows — CREATE CATALOG ... USING SHARE snapshots the share's table list at mount time and doesn't auto-refresh. When subsequent runs added tables to the share (e.g. one that had a row filter dropped on retry), the target's mounted catalog stayed stale and DEEP CLONE failed with TABLE_OR_VIEW_NOT_FOUND. Clone-Xs now drops + recreates the shared catalog on the target whenever to_add is non-empty. Skipped on unchanged-share re-runs (no churn).
- Function migration — replaced the unsupported SHOW CREATE FUNCTION SQL (which returns [PARSE_SYNTAX_ERROR] Syntax error at or near 'FUNCTION' on Databricks SQL) with a Catalog SDK-based path: client.functions.get(<fqn>) returns FunctionInfo, and a new _build_function_ddl helper reconstructs the DDL from input_params / full_data_type / routine_definition / language. Handles both SQL UDFs (RETURN <expr>) and Python UDFs (LANGUAGE PYTHON AS $$...$$). Catalog references inside the body are rewritten from source to dest. Fixes the case where 100% of functions failed to migrate.
- Volume migration 'NoneType' object is not iterable — internal walk() function in _copy_volume_files does its work via side effects (no yield keyword), but was wrapped with list(walk(...)), which evaluated to list(None) and raised TypeError for every volume. Removed the list() wrapper. Files now actually copy.
- Target SQL warehouse stale-list bug — when the user changed target host or auth method in the Settings dialog, the cached warehouse list and previously-selected warehouse_id from React Query persisted, so the dropdown could show warehouses from a different workspace. Edits to credential fields (host, auth_method, token, client_id, client_secret, profile) now reset the mutation state and clear warehouse_id, forcing a fresh Browse against current creds.
- Target client env-var leakage — WorkspaceClient(host=..., token=...) constructed for the target workspace could fall back to DATABRICKS_HOST / DATABRICKS_CLIENT_ID env vars set during source-workspace login. Now passes explicit auth_type="pat" / "oauth-m2m" to pin the SDK auth chain to the user-selected method.
- /target/validate warehouse check — old endpoint only verified auth + metastore sharing; an invalid warehouse_id would silently slip through and surface as a clone-time failure 30 seconds in. Now calls client.warehouses.get(id=warehouse_id) and returns 400 with a clear error if the warehouse doesn't exist or is invisible. If the warehouse is STOPPED / STOPPING, the endpoint also fires a non-blocking warehouses.start() so it's RUNNING by clone time.
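The volume-migration bug in this list is a general Python pitfall: a function that works through side effects returns None, and list(None) raises TypeError. The sketch below reproduces it with an in-memory tree. The walk helper and the nested-dict "volume" are hypothetical stand-ins for the real _copy_volume_files internals.

```python
def walk(tree, prefix, copied):
    """Record every file path under prefix; works by side effect and returns None."""
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            walk(child, path, copied)  # directory: recurse
        else:
            copied.append(path)  # file: "copy" it


volume = {"raw": {"a.csv": b"1", "b.csv": b"2"}, "readme.txt": b"3"}

# Buggy call shape: wrapping a None-returning function in list().
error = None
try:
    list(walk(volume, "/Volumes/demo", []))
except TypeError as exc:
    error = str(exc)

# Fixed call shape: invoke walk() for its side effects only.
copied_fixed = []
walk(volume, "/Volumes/demo", copied_fixed)
```

The wrapper only makes sense when walk() is a generator; since it contains no yield, dropping list() is the whole fix.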
Removed
- config/clone_config.yaml target_connections section — target connection persistence moved entirely to the browser. Existing yaml entries are migrated via legacy fallback in _load_connections (read-only) on first launch; subsequent saves go to localStorage. The TargetConnection Pydantic model and the /target/connections/* CRUD endpoints (GET/POST/PUT/DELETE/test/catalogs) are gone, replaced by stateless inline-creds endpoints.
- Orphaned TargetWorkspaceForm.tsx — the legacy inline form on /clone is replaced by a compact connection-picker row. The TargetWorkspaceValue type moved into PreviewPanel.tsx (its only remaining user).
Unreleased — Cross-workspace incremental data sync
Added
- Deterministic share/recipient/shared-catalog names in cross-workspace clone — clone_xs_share_<sha1>, clone_xs_recipient_<sha1>, clone_xs_shared_<sha1> derived from (source_host, source_catalog, target_host, dest_catalog, target_metastore_id). Subsequent clones for the same source → target pair reuse the same Delta Sharing objects instead of generating new randomly-suffixed ones each run. Eliminates orphaned clone_xs_*_<random> accumulation and the "Recipient already exists" class of errors on retries.
- Recipient verification on reuse — when an existing recipient is found, its USING ID is checked against the current target's global metastore id. If they don't match, the run fails loudly instead of silently leaking data to the wrong destination.
- Share-membership diff — re-runs only issue ALTER SHARE ADD TABLE for tables that aren't already in the share. Optional prune_share_extras: true config also issues REMOVE TABLE for tables no longer in source.
- data_sync_mode config on target_workspace — three values:
- snapshot_once (default) — CREATE TABLE IF NOT EXISTS … DEEP CLONE. Skips tables that already exist on target. Only catches newly-added tables on re-run. Safest: never overwrites target.
- incremental — CREATE OR REPLACE TABLE … DEEP CLONE. Mirrors source updates into target by leveraging Databricks DEEP CLONE's incremental file diff. ⚠ Overwrites any target-side writes to cloned tables.
- force_full — DROP TABLE IF EXISTS dst; CREATE TABLE dst DEEP CLONE src. Full re-clone every run. For recovery scenarios.
- Non-default modes log a WARNING at run start describing the data-loss implication.
- cleanup_after_clone config on target_workspace — opt-in teardown (default false since deterministic objects are designed to persist between runs). Legacy keep_share flag still honoured for backwards compatibility.
- 3-button Data sync mode picker in TargetWorkspaceForm UI, with inline amber warning when incremental or force_full is selected.
- auto_handle_masks config on target_workspace — when true, Clone-Xs inventories column masks + row filters on each source table via DESCRIBE EXTENDED, drops them so the table can be added to the Delta Share, re-applies them on the target after the clone (rewriting function FQNs to the target catalog), and (for snapshot_once / force_full modes) restores them on source in the finally block. For incremental mode, source masks remain dropped for the lifetime of the sync, because re-applying would break ongoing Delta Sharing reads. Default false.
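Two of the mechanisms above are easy to illustrate: deriving stable object names from the clone pair, and mapping data_sync_mode to SQL. This is a sketch under stated assumptions, not the project's implementation. The "|" join, the 12-character digest truncation, and both function names are hypothetical. The name prefixes and the three statement shapes come from the entries above.

```python
import hashlib


def deterministic_names(source_host, source_catalog, target_host,
                        dest_catalog, target_metastore_id, digest_len=12):
    # Assumption: the five fields are joined with "|" and the SHA-1 hex
    # digest is truncated; the changelog only says the names derive from them.
    key = "|".join([source_host, source_catalog, target_host,
                    dest_catalog, target_metastore_id])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:digest_len]
    return {
        "share": f"clone_xs_share_{digest}",
        "recipient": f"clone_xs_recipient_{digest}",
        "shared_catalog": f"clone_xs_shared_{digest}",
    }


def clone_statements(mode, dst, src):
    """Return the SQL statements one table needs under each data_sync_mode."""
    if mode == "snapshot_once":
        return [f"CREATE TABLE IF NOT EXISTS {dst} DEEP CLONE {src}"]
    if mode == "incremental":
        return [f"CREATE OR REPLACE TABLE {dst} DEEP CLONE {src}"]
    if mode == "force_full":
        return [f"DROP TABLE IF EXISTS {dst}",
                f"CREATE TABLE {dst} DEEP CLONE {src}"]
    raise ValueError(f"unknown data_sync_mode: {mode!r}")
```

Because the digest depends only on the clone pair, a retry computes the same three names and can reuse the existing share and recipient instead of creating new ones.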
Fixed
- View migration target qualification — SHOW CREATE TABLE returns 2-part view names that resolve against the target warehouse's current catalog, not the destination catalog Clone-Xs is writing to. Added _qualify_create_target() to inject the destination catalog so the CREATE target is always 3-part. Fixes [SCHEMA_NOT_FOUND] dbr_xxx.<schema> errors during view migration on cross-workspace clones.
- Function migration — the same 2-part qualification fix is applied to _migrate_functions.
- Audit-trail visibility — JobManager now logs a WARNING (instead of swallowing) when ensure_audit_table fails at job start, and skips the completion-time UPDATE if the start INSERT never happened (was producing a confusing TABLE_OR_VIEW_NOT_FOUND at the end of every job whose audit catalog didn't exist).
- metastore_sharing_id now uses client.metastores.summary() instead of metastores.current() so the returned identifier is the proper <cloud>:<region>:<uuid> global form, not the bare metastore UUID. Fixes INVALID_PARAMETER_VALUE: ... is an invalid id for metastore on CREATE RECIPIENT USING ID.
- LogPanel colouring — WARNING lines whose message body contains the word "failed" no longer get painted red. The colourer now anchors on the log-level prefix.
- Demo generator seasonal-pattern SQL — naive .split(",") on ddl_cols was breaking inside DECIMAL(10,2) type specs and producing malformed INSERT INTO ... SELECT statements. Added a paren-aware splitter (_split_top_level), and the seasonal-pattern INSERT now emits an explicit column list so the SELECT mirrors target column order rather than relying on positional matching.
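A paren-aware splitter of the kind described in the last entry can be sketched as follows. split_top_level is an illustrative stand-in written for this note, not the project's _split_top_level. It tracks parenthesis depth and splits only on commas at depth zero.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray closing paren
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


ddl_cols = "id INT, amount DECIMAL(10,2), name STRING"
naive = ddl_cols.split(",")          # breaks DECIMAL(10,2) into two pieces
columns = split_top_level(ddl_cols)  # keeps each column definition whole
```

On the sample string the naive split yields four fragments, while the depth-tracking version yields the three column definitions.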
v0.11.0 — Cross-Workspace / Cross-Cloud Migration (2026-04-19)
Added
- Cross-workspace catalog migration via Delta Sharing + DEEP CLONE — migrate a catalog from workspace A to workspace B across clouds (AWS ↔ Azure ↔ GCP). Source creates a Delta Share + recipient pointed at the target metastore's global sharing id; target consumes via CREATE CATALOG … USING SHARE and DEEP CLONEs data into target storage. Full scope:
- Schemas + managed/external tables (DEEP CLONE)
- Views + SQL functions (DDL replay with catalog-reference rewrite)
- Volumes + files (Databricks Files API; 500 MB per-file cap)
- Grants, tags, ownership (best-effort replay)
- Target Workspace UI — new TargetWorkspaceForm card on the Clone page with PAT / Service Principal / CLI profile auth, target warehouse picker, Test connection button, and keep-share toggle
- New API endpoint — POST /api/target/validate — verifies target creds and returns the metastore sharing identifier before kicking off a migration
- New config — target_workspace object (host / auth_method / token / client_id / client_secret / profile / warehouse_id / keep_share); clone_views, clone_functions, clone_volumes, volume_max_file_mb flags
- Orchestrator — src/clone_cross_workspace.py with run_cross_workspace_clone() entry point wired into JobManager as job_type=clone_cross_workspace
- Scope Picker — partial-catalog clones from the UI. New ScopePicker component on the Clone page's step 1 with a toggle between "Entire catalog" and "Select schemas + objects"; lazy-loaded schema tree with per-object checkboxes for tables, views, functions, and volumes
- include_objects field on CloneRequest — list of {schema, name, type} records. Router translates into include_schemas + anchored include_tables_regex, so both orchestrators (same-workspace and cross-workspace) honor the selection without a per-type refactor
- New API endpoint — GET /api/catalogs/{catalog}/{schema}/objects returns {tables, views, functions, volumes} for the UI scope tree (SDK-based, no warehouse)
- Preview Panel — step 3 is rebuilt: three scope-summary tiles, multi-format tabs (CLI / YAML / curl) with per-tab copy buttons, rule-based warnings panel (empty scope, DEEP-clone without storage, invalid regex, malformed TTL, parallel_tables=1 on a large scope, etc.), cross-workspace pipeline diagram when target_workspace is set, and inline dry-run results card
- Field tooltips across Operations pages — hover any info icon next to a label on the Clone, Sync, Rollback, Demo Data, DLT, and Advanced Tables pages for a 1-sentence description. Backed by a reusable FieldLabel / FieldLabelSmall / InfoDot component set (ui/src/components/FieldLabel.tsx) and a single root <TooltipProvider> in App.tsx. Every Clone-Options field's hint is also mirrored in the Clone options reference table
- Cost + time estimate on Preview — the Preview step now calls POST /api/estimate on demand and renders a 4-tile summary (table count / total size / est. duration / storage $). Runs DESCRIBE DETAIL on source tables; SHALLOW clones skip the duration estimate.
- Clone diff preview — new "Diff vs existing destination" card in the Preview step calls POST /api/diff and lists new in source, only on destination, and schema-changed tables. Prevents "I thought it was a fresh catalog" foot-guns.
- Runtime guardrails — two new CloneRequest fields: max_duration_min (wall-clock limit in minutes) and max_tables (aborts after N tables touched). Enforced between schemas in clone_catalog; job summary gains aborted: true + abort_reason on trip. Surfaced as inputs in the Clone Options step.
- Named clone snapshots (fork points) — new Operations page /snapshots + endpoints POST/GET/DELETE /api/clone-snapshots. Captures per-table Delta version + size into a dedicated Delta table in the audit catalog. Clone from a snapshot by setting source_snapshot_id on the clone request — resolves to as_of_timestamp so every table clones from the snapshot's captured state. See Clone Snapshots.
- Schema evolution endpoints — POST /api/schema-evolution/detect + /apply + /evolve-catalog. Wraps src/schema_evolution.py to generate ALTER TABLE statements for additive / compatible-widening changes without re-cloning the table. See Advanced Features → Schema evolution.
- Cross-metastore reconciliation — POST /api/reconciliation/cross-metastore spans two WorkspaceClients to verify a cross-workspace clone. Row counts first (cheap); optional SHA-256 checksums (use_checksum: true) over hashable columns catch silent drift. See Advanced Features → Cross-metastore reconciliation.
- Clone signing / provenance — POST /api/provenance/sign/{job_id} + /sign + /verify. HMAC-SHA256 over a canonical manifest (sensitive keys + runtime-nondeterministic fields stripped). Secret via CLONE_XS_SIGNING_SECRET env var; unset → endpoints return {"signed": false, "reason": ...} instead of crypto failure. See Advanced Features → Clone signing.
- AI-suggested config documentation — the existing POST /api/ai/clone-builder endpoint + CloneBuilder UI component are now documented in Advanced Features → AI-suggested config. No code changes; docs only.
- Continuous sync (preview) — POST /api/continuous-sync/plan generates a runnable Structured Streaming job spec (readStream CDF → writeStream) for near-real-time replication. v0.11.0 is plan-only; auto-submit + lifecycle management ship in v0.12.0.
- Streaming / MV data clone (preview) — POST /api/streaming-clone/generate produces a DLT pipeline spec + notebook SQL that rebuilds MV / streaming-table data on the destination (existing Advanced Tables clone migrates only definitions). v0.11.0 is plan-only; auto-create + trigger ship in v0.12.0.
- Catalog-level clone log output — the clone job now emits three new log signals that show up in both the Databricks run view and the Clone-Xs UI log panel:
- Startup summary: Starting clone: 611 tables across 50 schemas → edp_01 (after table pre-count)
- Live Tables counter rendered inline on the Schemas progress bar: Schemas |████| 5/50 [5ok/0fail/0skip] ETA: 2m · Tables 120/611 [115ok/2fail/3skip] — updates live per table, not just per schema
- Per-schema roll-up: Schema bronze complete: 42/45 tables cloned (2 failed, 1 skipped) in 18s — emitted as each schema finishes (silent on metadata-only schemas)
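The clone signing entry in this list (HMAC-SHA256 over a canonical manifest, with a graceful response when the secret is unset) can be sketched with the standard library. The key sets, function names, and the choice to strip only top-level keys are assumptions made for illustration. Only the algorithm, the CLONE_XS_SIGNING_SECRET variable, and the {"signed": false, "reason": ...} shape come from the changelog.

```python
import hashlib
import hmac
import json
import os

# Assumed examples of what gets stripped; the real lists are not given here.
SENSITIVE_KEYS = {"token", "client_secret"}
NONDETERMINISTIC_KEYS = {"started_at", "duration_s"}


def canonical_manifest(summary):
    """Serialise a run summary deterministically, minus stripped keys."""
    drop = SENSITIVE_KEYS | NONDETERMINISTIC_KEYS
    kept = {k: v for k, v in summary.items() if k not in drop}
    return json.dumps(kept, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(summary, secret=None):
    if secret is None:
        secret = os.environ.get("CLONE_XS_SIGNING_SECRET")
    if not secret:
        # Unset secret: report it instead of raising a crypto error.
        return {"signed": False, "reason": "CLONE_XS_SIGNING_SECRET is not set"}
    mac = hmac.new(secret.encode("utf-8"), canonical_manifest(summary), hashlib.sha256)
    return {"signed": True, "signature": mac.hexdigest()}


def verify_manifest(summary, signature, secret):
    expected = hmac.new(secret.encode("utf-8"), canonical_manifest(summary),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)  # constant-time comparison
```

Sorting keys and fixing separators keeps the byte string stable across runs, so the same logical manifest always yields the same signature.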
Changed
- POST /api/clone now routes to the cross-workspace orchestrator when target_workspace is supplied; otherwise runs the existing same-workspace path
- CloneRequest same-catalog-name guard is skipped when target_workspace is set (legitimate: prod → prod-dr with identical catalog names on a different metastore)
- _list_schemas / _list_tables / _list_views / _list_functions in clone_cross_workspace.py now honor include_schemas + include_tables_regex / exclude_tables_regex (matching the same-workspace behavior)
- Old destination_workspace YAML stub in configuration.md renamed to target_workspace and expanded to the full Pydantic model
- Secrets (token, client_secret) in the Preview Panel's YAML + curl output are rendered as <redacted> to avoid copy-paste leaks
Fixed
- Clone page — src == dest guard: inline error + disabled Next button, plus a Pydantic model_validator on CloneRequest
- Clone page — include_tables_regex, exclude_tables_regex, and ttl (^\d+[hdw]$) validated client-side before POST /api/clone
- Clone page — leftover console.warn removed from the 2-second job-poll loop
- Clone page — empty catalog list now surfaces a toast warning instead of silently falling back to a text input
- Clone page — wrapped in a new ErrorBoundary component so render errors show a fallback card instead of a white-screen
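The validation rules in this list are simple enough to show in one function. The actual checks run client-side in the UI and in a Pydantic validator. The Python sketch below mirrors them for illustration only, and validate_clone_form with its parameters is a hypothetical name. The TTL pattern is the one quoted above, and the cross_workspace escape hatch follows the Changed entry about the same-catalog-name guard.

```python
import re

TTL_PATTERN = re.compile(r"^\d+[hdw]$")  # e.g. 12h, 7d, 2w


def validate_clone_form(source, dest, ttl=None, include_tables_regex=None,
                        exclude_tables_regex=None, cross_workspace=False):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    # Identical names are legitimate when cloning to another workspace.
    if source == dest and not cross_workspace:
        errors.append("source and destination catalogs must differ")
    if ttl is not None and not TTL_PATTERN.fullmatch(ttl):
        errors.append(f"ttl {ttl!r} must look like 12h, 7d or 2w")
    for field, pattern in (("include_tables_regex", include_tables_regex),
                           ("exclude_tables_regex", exclude_tables_regex)):
        if pattern:
            try:
                re.compile(pattern)
            except re.error as exc:
                errors.append(f"{field} is not a valid regex: {exc}")
    return errors
```

Returning every error at once, rather than raising on the first, matches a form that shows inline messages next to each field.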
v0.10.4 — Enhanced Presentation Mode (2026-03-31)
Added
- Slide Transitions — smooth fade + slide-up animations between slides with staggered content entry (both live and export)
- Speaker Notes — per-cell notes editor (speech bubble icon in toolbar), notes panel in presentation (N key), persisted in save/load
- Elapsed Timer — running clock in presentation controls bar (live and export)
- Grid/Thumbnail View — press G for 4-column slide overview with click-to-jump
- Light/Dark Theme Toggle — press T to switch between dark and light presentation themes
- Print to PDF — press P to print with @media print styles hiding controls
- Touch/Swipe Navigation — swipe left/right on mobile/tablet
- All 12 Chart Types in Presentation — bar, hbar, line, area, scatter, pie, radar, stacked, composed, funnel, treemap
- Full Table Rendering — removed 20-row limit in presentation, added sticky headers and horizontal scroll
- Keyboard Hints — shown at bottom of presentation screen
- Export Enhancements — HTML export now includes transitions, notes (data-notes attributes), timer, theme toggle, touch/swipe, print support
- Explorer AI Explain — "Explain" button on Schema Breakdown sends catalog stats to AI for structured analysis
- Explorer Caching — stats cached in sessionStorage, last catalog remembered across page navigation
v0.10.3 — Notebook Power Features (2026-03-31)
Added
- Cell Result Export — CSV and JSON download buttons on every SQL cell's results toolbar
- Data Profiler per Cell — "Profile" view mode on cell results with histograms and frequency charts
- Temp View Chaining — "Create View" button creates TEMP VIEW cell_N for cross-cell SQL references
- Import SQL File — load .sql files, auto-splitting by ; into separate cells (comments become markdown)
- Notebook Templates — 5 starter notebooks: Explore Table, Data Quality Check, Schema Comparison, Row Count Audit, Cost Analysis
- Drag-and-Drop Reorder — drag the grip handle on any cell to reorder (in addition to up/down buttons)
- Find Across Cells — Ctrl+F search bar with match highlighting, count, and prev/next navigation
- Cell Execution Timer — live stopwatch while running + "ran Xm ago" relative timestamp after execution
- Undo/Redo — Ctrl+Z / Ctrl+Shift+Z for cell structure changes (add, delete, move, content edit), capped at 50 entries
- Presentation Mode — fullscreen slide-by-slide view with arrow key navigation, progress bar, and slide dots
- Export as HTML Report — standalone HTML document with branded dark theme, syntax-highlighted SQL, results tables, ToC, and execution metadata
- Data Lab Documentation — comprehensive guide page at /guide/data-lab covering SQL Workbench, Notebooks, and Data Profiler
v0.10.2 — Data Lab Enhancements: Notebooks, Profiler & Auto-Viz (2026-03-30)
Added
- SQL Notebooks — multi-cell SQL + Markdown notebook interface for interactive data exploration
- Add, delete, reorder, duplicate cells (SQL or Markdown)
- Run individual cells or "Run All" sequentially
- Each SQL cell has its own results table and chart view with auto-visualization
- Markdown cells with rich rendering (headings, lists, bold, code, links)
- Save/load notebooks (localStorage + backend JSON API)
- Export notebooks as .sql files
- New route at /notebooks with sidebar navigation under Discovery
- Backend CRUD API at /api/notebooks
- Catalog Browser Sidebar — collapsible catalog → schema → table tree; click to insert SELECT * FROM into focused cell
- Execution Counter — Jupyter-style [1], [2], [*] badges on SQL cells showing execution order
- AI Features per Cell — Fix with AI (on error), Explain Results with AI, Generate SQL from natural language prompt
- Parameterized Cells — use {{variable}} syntax in SQL; auto-detected parameter bar with input fields for each variable
- Cell Duplication — one-click clone any cell
- Auto-save — automatic save to localStorage every 30 seconds when changes are detected
- Table of Contents — auto-generated from markdown headings; click to jump to section
- Keyboard Shortcuts — Ctrl+S save, Ctrl+Enter run cell, Shift+Enter run & advance to next, Esc blur
- Output Collapse — toggle to hide/show cell results for long notebooks
- Deep Data Profiler — one-click column-level profiling with distribution charts
- Right-click any table in catalog browser → "Profile Table" for server-side deep profiling
- "Profile" tab on query results profiles via CTE wrapping (no double execution)
- Per-column stats: null count/%, distinct count/%, min, max, avg
- Visual histograms for numeric columns using width_bucket() (Recharts)
- Top-N value frequency bar charts for string/categorical columns
- Summary header with KPI cards: row count, columns, completeness %, type distribution pie
- Backend endpoints: POST /api/profile-table, POST /api/profile-results
- Auto-Visualization — AI-powered chart recommendation engine
- Heuristic engine analyzes column types, cardinality, and naming patterns
- Automatically selects best chart type and axis mappings when results load
- Rules: time + numeric → line, category + value → bar/pie, two numerics → scatter
- "Auto" button in chart controls to re-apply recommendation
- Recommendation reason displayed as badge (e.g., "Time series: date_col over time")
- AI Explain Results — detailed plain-English data narratives
- "Explain" button in toolbar sends column stats + sample to AI (< 5KB payload)
- Returns structured markdown: What This Data Shows, Key Findings, Notable Patterns, Recommendations
- New query_explain and ai_viz_suggest system prompts in AI service
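The Auto-Visualization rules listed in this release (time + numeric → line, category + value → bar/pie, two numerics → scatter) can be sketched as a small rule chain. Everything here is illustrative: recommend_chart, the column dictionaries, and the cut-off of six categories for choosing pie over bar are assumptions, since the changelog does not say how bar and pie are told apart.

```python
def recommend_chart(columns, pie_max_categories=6):
    """columns: list of {"name", "kind", "distinct"}; kind is time, numeric or category."""
    times = [c for c in columns if c["kind"] == "time"]
    numerics = [c for c in columns if c["kind"] == "numeric"]
    categories = [c for c in columns if c["kind"] == "category"]

    if times and numerics:
        x, y = times[0]["name"], numerics[0]["name"]
        return {"chart": "line", "x": x, "y": y,
                "reason": f"Time series: {y} over {x}"}
    if categories and numerics:
        cat = categories[0]
        # Assumption: few distinct categories read better as a pie.
        chart = "pie" if cat["distinct"] <= pie_max_categories else "bar"
        return {"chart": chart, "x": cat["name"], "y": numerics[0]["name"],
                "reason": f"Category breakdown by {cat['name']}"}
    if len(numerics) >= 2:
        return {"chart": "scatter", "x": numerics[0]["name"], "y": numerics[1]["name"],
                "reason": "Two numeric columns"}
    return None  # no rule matched; leave the table view
```

Rule order matters: a result with a time column and a numeric column is treated as a time series even if it also has a category column.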
v0.10.1 — Data Lab, AI Features & Jobs Cloning (2026-03-30)
Added
- SQL Workbench renamed to Data Lab — new name reflecting broader data exploration capabilities
- Data Lab AI Features — 4 AI-powered tools integrated into the Data Lab:
- Fix with AI — when a query fails, click to get AI-corrected SQL with "Apply Fix" button
- Analyze with AI — summarize query results with key findings, patterns, and anomalies
- Explain Plan with AI — plain-English explanation of execution plans with performance concerns and optimization suggestions
- Generate SQL with AI — natural language to SQL via the More menu
- AI Markdown Renderer — all AI responses formatted with headings, bullet points, bold, and inline code
- Databricks LLM Integration — dual-backend AI: Anthropic API (direct) or Databricks Model Serving endpoints
- Settings page: AI Model selection with endpoint discovery, Claude badge, state indicator
- Settings page: Genie Space selection for natural language SQL
- API client sends X-Databricks-Model and X-Databricks-Genie-Space headers automatically
- AI service routes calls through Databricks serving endpoints (OpenAI chat format) or falls back to Anthropic
- AI Assistant page — under Discovery, currently marked "Coming Soon" with feature preview
- Databricks Jobs Cloning — clone job definitions within or across workspaces
- List all workspace jobs with search/filter
- Clone same-workspace and cross-workspace (with host/token)
- Job diff — field-by-field comparison
- Backup/restore — export all job definitions as JSON
- 7 REST API endpoints under /api/jobs/
- Fullscreen button — added to Data Lab embedded mode (browser native fullscreen API)
Changed
- Data Lab (formerly SQL Workbench) — renamed throughout sidebar, header, and component
v0.10.0 — MDM, Portal Expansion & UI Declutter (2026-03-28)
Added
- Master Data Management (MDM) Portal — first open-source Databricks-native MDM. 19 pages covering golden records, entity resolution, stewardship, and hierarchies
- Entity Resolution Engine — 6 match types (exact, Jaro-Winkler, Levenshtein, Soundex, normalized, numeric), configurable blocking strategies, weighted composite scoring
- Golden Records — entity 360 drawer with source records, attribute detail, and visual timeline
- Match & Merge — 5 tabs (Duplicates, Rules, Survivorship, Source Trust, Ingest), match tuning tester, configurable auto-merge/review thresholds
- Data Stewardship — review queue with side-by-side record comparison, bulk approve/reject, SLA timer (overdue/at-risk/on-track), task assignment, comments/notes
- Hierarchy Management — create and browse entity hierarchies
- Industry Templates — Healthcare (Patient MPI), Financial (KYC/AML), Retail (Customer 360), Manufacturing (Supplier MDM) — one-click rule setup
- Reference Data Management — code lists with aliases, cross-system mapping tables
- Entity Relationship Graph — interactive SVG visualization with zoom, filter, detail panel
- Merge History — full audit trail of all merge/split decisions with undo
- DQ Scorecards — per-entity-type accuracy, completeness, and active rate metrics
- Cross-Domain Matching — match across entity types (Customer ↔ Supplier)
- Negative Match Rules — "do not link" pairs with reasons
- Consent Management — GDPR consent matrix (7 consent types per entity)
- Data Profiling — attribute fill rates and distinct value analysis
- MDM Audit Log — unified event log with search, filter, CSV export
- MDM Reports — compliance reports with JSON/Markdown export
- MDM Settings — thresholds, SLA, notifications, retention, defaults
- 6 Delta tables — mdm_entities, mdm_source_records, mdm_match_pairs, mdm_matching_rules, mdm_stewardship_queue, mdm_hierarchies
- 21 REST API endpoints under /api/mdm/
- Databricks Jobs Cloning — clone job definitions within or across workspaces
- List all workspace jobs with search/filter
- Clone job (same workspace) — strips runtime fields, applies name/overrides
- Clone cross-workspace — with destination host/token
- Job diff — field-by-field comparison of two job configs
- Backup/restore — export all job definitions as JSON, import them back
- 7 REST API endpoints under /api/jobs/
- 4 New Portals — Portal Switcher expanded from 4 to 8 portals
- Security — PII Scanner, Compliance, Preflight Checks
- Automation — Pipelines, Templates, Create Job, Clone Jobs, DLT Pipelines
- Infrastructure — Warehouse, Federation, Delta Sharing, Lakehouse Monitor
- MDM — 19 pages (see above)
- Notification badge fix — bell icon now tracks "last seen" timestamp; badge resets to zero when panel is opened instead of always showing 20
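The weighted composite scoring named for the Entity Resolution Engine might look something like the sketch below. It covers only three of the six match types (exact, normalized, Levenshtein), and the function names, rule tuple layout, and threshold defaults are assumptions made for illustration rather than the engine's actual interface.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(kind: str, left: str, right: str) -> float:
    """Return a 0.0-1.0 similarity for one attribute pair."""
    if kind == "exact":
        return 1.0 if left == right else 0.0
    if kind == "normalized":
        return 1.0 if left.strip().lower() == right.strip().lower() else 0.0
    if kind == "levenshtein":
        longest = max(len(left), len(right))
        return 1.0 if longest == 0 else 1.0 - levenshtein(left, right) / longest
    raise ValueError(f"unsupported match type: {kind}")


def composite_score(rules: list[tuple[str, str, float]], a: dict, b: dict) -> float:
    """rules holds (field, match_type, weight); the result is a weighted mean."""
    total_weight = sum(weight for _, _, weight in rules)
    weighted = sum(w * similarity(kind, a[field], b[field]) for field, kind, w in rules)
    return weighted / total_weight


def decide(score: float, auto_merge: float = 0.9, review: float = 0.7) -> str:
    """Hypothetical thresholds: merge automatically, queue for a steward, or ignore."""
    if score >= auto_merge:
        return "auto_merge"
    if score >= review:
        return "review"
    return "no_match"
```

Pairs that land between the two thresholds would be the ones routed to the stewardship review queue.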
Changed
- Dashboard decluttered — stripped from 8 sections to 3: Metrics cards + Alerts + 3 Quick Actions (Clone, Explore, Diff). AI Insights, Catalog Health, Pinned Pairs, and Recent Operations removed from dashboard
- Sidebar reduced — from 33 items to 14 items across 4 sections (Overview, Operations, Discovery, Management). Pages moved to dedicated portals
- Pinned Catalog Pairs moved to Clone page as inline favorites bar
- RTBF & DSAR accessible only through Governance portal (removed from main sidebar)
- RBAC moved to Governance portal
- Cost Estimator & Storage Metrics moved to FinOps portal
- Observability moved to Data Quality portal
- Pipelines, Templates, Create Job moved to Automation portal
- Warehouse, Federation, Delta Sharing, Lakehouse Monitor moved to Infrastructure portal
- Docs site search — added @cmfcmf/docusaurus-search-local for full-text search in dev and production
v0.9.1 — DLT Clone Enhancements (2026-03-28)
Added
- Clone button per pipeline row — visible directly in the Pipelines list, no need to navigate to Detail tab
- Cross-workspace DLT clone — clone pipeline definitions to a different Databricks workspace with destination URL + PAT token
- Clone modal — same-workspace / different-workspace toggle, dry-run preview, inline error display
- Placeholder notebook creation — for serverless/SQL DLT pipelines with no notebook libraries, automatically creates a placeholder notebook in the destination workspace
Fixed
- Library-less pipeline clone — pipelines without notebook libraries (serverless/SQL) now clone successfully by creating a placeholder notebook instead of failing with "libraries must contain at least one element"
- Cross-workspace clone error display — specific error messages for auth failures (401), permission denied (403), and connection errors (502) instead of generic 400
v0.9.0 — Delta Live Tables Management (2026-03-28)
Added
- DLT Pipeline Discovery — browse all DLT pipelines with state, health, creator, and latest update info
- DLT Pipeline Clone — clone pipeline definitions (catalog, libraries, clusters, config) to new pipelines with dry-run preview
- DLT Trigger & Stop — start pipeline runs (incremental or full refresh) and stop running pipelines
- DLT Event Monitoring — view pipeline event logs (errors, warnings, flow progress) via SDK
- DLT Run History — track pipeline update history with status and timing
- DLT Expectation Monitoring — query expectation results from system.lakeflow.pipeline_events system tables
- DLT Lineage Integration — map DLT datasets to Unity Catalog tables by querying target schema's information_schema
- DLT Health Dashboard — aggregate pipeline state (running/failed/idle), health (healthy/unhealthy), and recent events
- DLT UI Page — 3-tab page (Dashboard, Pipelines, Detail) with stat cards, event log, dataset lineage table, clone form
- 10 DLT API Endpoints — full CRUD under /api/dlt/ including trigger, stop, clone, events, updates, lineage, expectations, dashboard
- DLT Documentation — Docusaurus guide with API reference, lineage integration, and expectation monitoring
- 22 DLT Unit Tests — covering discovery, details, events, updates, clone, trigger, stop, dashboard, lineage, expectations
v0.8.1 — Governance Consolidation & Notification Fix (2026-03-28)
Changed
- RTBF & DSAR moved to Governance portal — RTBF and DSAR pages are now accessed under /governance/rtbf and /governance/dsar via the Governance sidebar's Compliance section, instead of appearing as separate items in the main sidebar. Accessible through the Portal Switcher.
- Notification badge fix — the header notification bell now tracks a "last seen" timestamp in localStorage so the badge only shows genuinely new events. Previously it always showed the total count of recent items (typically 20). Opening the panel marks all current notifications as read and resets the badge to zero.
Removed
- RTBF / DSAR from main sidebar — removed as standalone items from the Management section; consolidated under the Governance portal
v0.8.0 — DSAR, Clone Pipelines & Data Observability (2026-03-28)
Added
- DSAR (Data Subject Access Request) — GDPR Article 15 right-of-access workflow. Reuses RTBF's discovery engine to find subject data, then exports as CSV/JSON/Parquet. Full lifecycle: submit, discover, approve, export, deliver, complete. 3 Delta audit tables, 10 API endpoints, 11 CLI commands, 4-tab UI page
- Clone Pipelines — chain multiple operations into reusable workflows. 6 step types (clone, mask, validate, notify, vacuum, custom_sql). 3 failure policies (abort, skip, retry). 4 built-in templates (production-to-dev, clone-and-validate, refresh-dev, compliance-clone). Pipeline builder UI with drag-to-reorder, template gallery, and run history
- Data Observability Dashboard — unified health scoring (0-100) across freshness, volume, anomaly, SLA, and data quality. Health gauge visualization, category breakdown bars, top issues list, trend sparklines. Read-only aggregation from existing Delta tables — no new data collection needed
- Help Page Expansion — 11 tabs covering every portal: Clone & Ops, Data Quality, Governance, FinOps, Discovery, RTBF, DSAR, Pipelines, Observability, Shortcuts, About. Step-by-step guides for each feature
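To make the three Clone Pipelines failure policies concrete, here is a minimal sketch of a step runner. The run_pipeline function, the (name, action, policy) tuple shape, and the retry defaults are hypothetical; only the policy names abort, skip, and retry come from the entry above.

```python
import time
from typing import Callable


def run_pipeline(
    steps: list[tuple[str, Callable[[], None], str]],
    max_retries: int = 2,
    retry_delay: float = 0.0,
) -> list[tuple[str, str]]:
    """Run steps in order and return (step_name, outcome) pairs."""
    results = []
    for name, action, policy in steps:
        attempts = 0
        while True:
            attempts += 1
            try:
                action()
                results.append((name, "ok"))
                break
            except Exception:
                if policy == "retry" and attempts <= max_retries:
                    time.sleep(retry_delay)
                    continue  # try the same step again
                if policy == "skip":
                    results.append((name, "skipped"))
                    break  # move on to the next step
                # "abort", or "retry" with its attempts exhausted
                results.append((name, "failed"))
                return results
    return results
```

A step marked retry that keeps failing falls through to the abort branch in this sketch, so later steps never run against a half-finished pipeline.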
v0.7.0 — RTBF / Right to Be Forgotten (2026-03-28)
Added
- RTBF Engine — complete GDPR Article 17 erasure workflow: submit, discover, approve, execute, VACUUM, verify, certificate
- 3 Deletion Strategies — hard DELETE, anonymize (mask PII columns), pseudonymize (replace identifiers)
- Subject Discovery — finds matching rows across all cloned catalogs using PII detection patterns + information_schema + lineage tracking
- Delta VACUUM Integration — physically removes time-travel history with 0-hour retention for true GDPR compliance
- Verification Engine — re-queries all affected tables to confirm zero rows remain post-deletion
- Compliance Certificates — generates HTML + JSON deletion evidence with full action audit trail, stored in Delta
- 3 Delta Audit Tables — rtbf_requests, rtbf_actions, rtbf_certificates (created via Settings > Initialize All Tables)
- 34 Global Legal Bases — pre-configured privacy regulations from 18 jurisdictions (EU GDPR, UK GDPR, US CCPA/CPRA + 9 state laws, Brazil LGPD, India DPDPA, Japan APPI, China PIPL, and more)
- 16 REST API Endpoints — full lifecycle management under /api/rtbf/ with async job execution
- 12 CLI Subcommands — clxs rtbf submit|discover|impact|approve|execute|vacuum|verify|certificate|list|status|cancel|overdue
- RTBF UI Page — 4-tab page (Dashboard, Submit, Requests, Detail) with workflow visualization, stat cards, confirmation dialogs, dry-run preview, certificate download
- Plugin Hooks — 4 lifecycle hooks: on_rtbf_request, on_rtbf_deletion_start, on_rtbf_deletion_complete, on_rtbf_verification_failed
- Slack/Teams Notifications — alerts on submission, execution, completion, verification failure, deadline warnings
- Deadline Monitor — check_approaching_deadlines() method and /requests/approaching-deadline API endpoint
- Row-Level Masking — new mask_subject_rows() function in masking engine for subject-specific anonymization
- Confirmation Dialogs — destructive actions (Execute, VACUUM, Cancel) require typing confirmation text
- Dry-Run Preview — preview deletion SQL and row counts before committing
- Certificate Download — /certificate/download?format=html|json endpoint with Download buttons in UI
- Compliance Report Integration — RTBF section added to compliance reports (total, completed, overdue, completion rate)
- Navigation — RTBF accessible via Governance portal sidebar (Compliance section) and header search
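The three deletion strategies could translate into SQL along the lines below. This is only a sketch under stated assumptions: build_erasure_sql is a hypothetical helper, the :subject_id and :salt placeholders stand in for whatever parameter binding the SQL client supports, and the NULL-out and sha2 choices are one plausible reading of "anonymize" and "pseudonymize", not the engine's confirmed behavior.

```python
def build_erasure_sql(
    strategy: str,
    table: str,
    id_column: str,
    pii_columns: tuple[str, ...] = (),
) -> str:
    """Return one SQL statement for the chosen strategy.

    Identifiers are interpolated directly, so they must come from trusted
    catalog metadata; the subject value itself is left as a bound parameter.
    """
    where = f"WHERE {id_column} = :subject_id"
    if strategy == "delete":
        # Hard DELETE: remove the subject's rows outright.
        return f"DELETE FROM {table} {where}"
    if strategy == "anonymize":
        # Mask PII columns in place (assumes the columns are nullable).
        if not pii_columns:
            raise ValueError("anonymize needs at least one PII column")
        assignments = ", ".join(f"{col} = NULL" for col in pii_columns)
        return f"UPDATE {table} SET {assignments} {where}"
    if strategy == "pseudonymize":
        # Replace the identifier with a salted hash (assumes a STRING column).
        return (
            f"UPDATE {table} SET {id_column} = "
            f"sha2(concat({id_column}, :salt), 256) {where}"
        )
    raise ValueError(f"unknown strategy: {strategy}")
```

Printing these statements without executing them is essentially what a dry-run preview of the deletion SQL would show.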
v0.6.1 — UI Overhaul, Login Page & Session Persistence (2026-03-25)
Added
- Login Page — dedicated full-screen login page with PAT and Azure CLI auth tabs, shown before main app. Azure wizard: Login → Tenant → Subscription → Workspace selection
- Server-Side Sessions — all login methods (PAT, OAuth, Azure CLI, Service Principal) create server-side sessions with cached WorkspaceClient. Session ID stored in localStorage, sent as X-Clone-Session header. No re-authentication needed after page refresh or browser restart
- Settings Page Redesign — two-panel layout with left sidebar nav + scrollable right content. Sections: Connection, Authentication, Warehouses, Audit, Interface, Performance, Features
- Theme Picker — visual 10-theme grid in Settings (Light, Dark, Midnight, Sunset, High Contrast, Ocean, Forest, Solarized, Rose, Slate) with bi-directional sync to HeaderBar
- Sidebar Collapse — collapsible sidebar with icon-only rail. Toggle at bottom of sidebar + Settings toggle
- Warehouse Start Button — start stopped warehouses directly from Settings with auto-polling for state change
- Portal Switcher — moved to right corner with full keyboard navigation (arrow keys, Escape)
- WCAG 2.1 AA Accessibility — focus-visible outlines, print styles, ARIA tab pattern on login, required field indicators, loading state announcements, reduced-motion support
- Databricks-Style Density — compact typography (18px h1, 13px body), 48px header, tighter card/input/button spacing, 1400px max content width
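The server-side session idea above (a session ID mapped to a cached client and sent back in the X-Clone-Session header) can be sketched in a few lines. SessionStore is a hypothetical in-memory stand-in; the real implementation, its storage, and its expiry rules are not described in this changelog.

```python
import secrets
from typing import Any, Optional


class SessionStore:
    """Maps opaque session IDs to cached client objects (in memory only)."""

    def __init__(self) -> None:
        self._clients: dict[str, Any] = {}

    def create(self, client: Any) -> str:
        # An unguessable token; the browser would keep it in localStorage.
        session_id = secrets.token_urlsafe(32)
        self._clients[session_id] = client
        return session_id

    def resolve(self, headers: dict[str, str]) -> Optional[Any]:
        # Real HTTP headers are case-insensitive; a plain dict is used here
        # purely for illustration.
        return self._clients.get(headers.get("X-Clone-Session", ""))
```

Because the client object lives on the server, a page refresh only needs to resend the header rather than repeat the login flow.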
Changed
- Credential storage — moved from sessionStorage to localStorage (persists across browser restart)
- Dark sidebar colors — hardcoded colors replaced with CSS variables (sidebar-primary, sidebar-accent) for proper theme support
- Typography scale — h1: 24→18px, h2: 20→15px, body: 14→13px, matching Databricks density
- Input height — h-8 → h-7, text-base → text-[13px]
- Card padding — py-4/px-4 → py-3/px-3, rounded-xl → rounded-lg
- Button styling — text-sm → text-[13px], rounded-lg → rounded-md
- Sidebar — default width 208→180px, nav items use 16px icons (was 20px), 13px font, rounded-md highlight (was rounded-r-full pill)
- Page headers — Clone, Reports, Monitor pages migrated to shared PageHeader component with breadcrumbs
- Muted text contrast — bumped from oklch(0.40) to oklch(0.45) for WCAG AA 4.5:1 ratio
Fixed
- Azure CLI browser open — prevented Databricks SDK from opening browser when az CLI not installed. Added shutil.which("az") guard and replaced bare WorkspaceClient() fallback with clear error
- SQL warehouse retry spam — "warehouse not found" and "not a valid endpoint" now fail immediately instead of retrying 3x with backoff. Empty warehouse ID caught before any API call
- Global error toasts — actionable errors (missing warehouse, expired session, auth failure) now show toast notifications automatically from api-client, debounced to avoid spam
- Environment tab removed — removed from Settings UI
Removed
- Environment section from Settings UI (was showing env vars)
v0.5.3 — Demo Data Generator Testing & Hardening
Bug Fixes
- Parameter validation — generate_demo_catalog() now validates all inputs: catalog_name (non-empty, valid identifier), scale_factor (between 0 and 10), batch_size (1000 to 50M), max_workers (1 to 16), date format (YYYY-MM-DD), start before end, valid industry names
- Silent exception logging — 6+ bare except: pass blocks in medallion generation replaced with logger.warning() — failures are now visible in logs
- Audit log insertion — Changed break on first error to continue — remaining audit entries are now inserted even if one fails
- SCD2 atomic swap — Changed non-atomic DROP+RENAME to CREATE OR REPLACE TABLE AS SELECT — original table preserved if operation fails
- Seasonal patterns — Now uses add_months() to actually shift dates into peak months (was duplicating rows without date shift)
- FK regex safety — Added re.escape() and \b word boundary to prevent partial column name matches
- UC Objects metastore fix — client.metastores.get(id) now used instead of .current() for full metastore details; cloud inferred from workspace host
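The FK regex fix is easy to illustrate: a plain substring test treats customer_id as present inside customer_id_old, while an escaped pattern with word boundaries does not. The helper name references_column below is hypothetical; re.escape() and \b are the two pieces named in the entry.

```python
import re


def references_column(sql: str, column: str) -> bool:
    """True only when `column` appears as a whole identifier in `sql`."""
    # re.escape() neutralises regex metacharacters in the column name;
    # \b on both sides rejects matches inside longer identifiers.
    pattern = re.compile(rf"\b{re.escape(column)}\b")
    return pattern.search(sql) is not None


print(references_column("SELECT customer_id FROM orders", "customer_id"))      # True
print(references_column("SELECT customer_id_old FROM orders", "customer_id"))  # False
```

The underscore counts as a word character, which is why the second call finds no boundary after customer_id and returns False.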
New Features
- Referential integrity — FK values now scaled to match actual dimension table sizes at the given scale_factor. JOINs return results instead of empty sets
- Seasonal data patterns — Healthcare (winter peak), Retail (Q4 spike), Energy (summer peak), Education (fall), Insurance (spring) — creates realistic chart distributions
- Business table comments — 26 detailed business descriptions across industries (e.g., "Insurance claims submitted by healthcare providers...")
- CHECK constraints — 32 business rule constraints (e.g., claim_amount >= 0, rating BETWEEN 1 AND 5)
- Grants/permissions — Auto-grants to data_analysts (SELECT) and data_engineers (ALL PRIVILEGES)
- Pre-built clone template — Saves config/demo_clone_{catalog}.json with optimal settings
- Configurable date range — CLI: --start-date, --end-date. API: start_date, end_date fields. UI: date picker inputs
- Progress ETA — UI shows estimated time remaining based on elapsed time and industries completed
- Multi-catalog generation — CLI: --dest-catalog. API: dest_catalog. Auto-clones generated catalog to destination
- 33 unit/integration tests — Full test suite in tests/test_demo_generator.py covering FK ranges, parameter validation, data coverage, generation flow, cleanup
Testing
- 33 tests in tests/test_demo_generator.py covering:
- Parameter validation (invalid catalog names, out-of-range scale factors, bad dates)
- FK referential integrity (value ranges match dimension table sizes)
- Seasonal data coverage (peak months present per industry)
- Full generation flow (end-to-end with mocked SQL execution)
- Cleanup and error handling paths
- Run with: python3 -m pytest tests/test_demo_generator.py -v
v0.5.2 — Demo Data Generator Fixes & Parallel Generation
Bug Fixes
- DELTA_METADATA_CHANGED — Column comments now run sequentially instead of parallel to avoid concurrent metadata conflicts
- PK on nullable columns — ID columns now set to NOT NULL before adding PRIMARY KEY constraint
- Volume CSV export — Changed from external LOCATION (invalid cloud path) to managed sample tables via CTAS
- Row filter syntax — Row filter functions now accept column value as parameter (state_val STRING) instead of referencing column directly
- SCD2 non-deterministic UPDATE — Replaced UPDATE with CTAS + table swap to avoid Databricks INVALID_NON_DETERMINISTIC_EXPRESSIONS error
- Progress bar capped at 100% — Fixed enrichment phase showing >100% progress
New Features
- Parallel medallion generation — Bronze/Silver/Gold schemas now generate in 3 parallel phases across industries instead of sequential per-industry. ~3x faster for multi-industry runs.
- Create UDFs checkbox — New UI checkbox to toggle UDF creation (20 per industry)
- Create Volumes checkbox — New UI checkbox to toggle volume and sample file creation
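One way to read the parallel medallion change is "one phase per layer, with all industries built concurrently inside each phase". The sketch below follows that reading; generate_medallion and the build_layer callback are hypothetical names, and the actual scheduler may partition the work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

LAYERS = ("bronze", "silver", "gold")


def generate_medallion(
    industries: list[str],
    build_layer: Callable[[str, str], None],
    max_workers: int = 4,
) -> None:
    """Run three phases in order; industries run in parallel within a phase."""
    for layer in LAYERS:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # list() drains the iterator, which waits for every worker and
            # re-raises the first exception, so silver never starts before
            # bronze has finished for all industries.
            list(pool.map(lambda industry: build_layer(industry, layer), industries))
```

Keeping the layers sequential preserves the bronze-to-silver-to-gold dependency while still spreading the per-industry work across workers.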
v0.5.1 — Demo Data Generator
Demo Data Generator
- New demo-data CLI command and Web UI page for generating realistic demo catalogs
- 10 industries: Healthcare, Financial, Retail, Telecom, Manufacturing, Energy, Education, Real Estate, Logistics, Insurance
- Each industry generates 20 tables, 20 views, 20 UDFs (200 total of each)
- Medallion architecture: Bronze (raw ingestion), Silver (cleaned), Gold (aggregated) schemas per industry
- Scale factor: 0.01 (~10M rows) to 1.0 (~2B rows) — all data generated server-side via Databricks SQL
- Post-generation enrichment:
- Column comments and Unity Catalog tags on PII tables
- Primary key and foreign key constraints (39 FK relationships)
- Table partitioning by date columns on large fact tables
- Business metadata table properties (owner_team, sla_tier, refresh_frequency, etc.)
- Data quality issues injection (nulls, duplicates, outliers)
- Delta version history via UPDATEs for time travel demos
- Cross-industry views (5 JOINs across industries)
- Managed volumes with sample CSV files (1000 rows per table)
- Column masks on PII columns (email, phone, name)
- Row filters on dimension tables
- SCD2 columns (valid_from, valid_to, is_current) on dimension tables
- OPTIMIZE + Z-ORDER on large fact tables
- Data catalog views (table_inventory, column_inventory, pii_columns)
- Pre-populated audit logs (20 fake clone operations for Dashboard)
- Cleanup command: clxs demo-data --cleanup --catalog demo_source
- API: POST /api/generate/demo-data, DELETE /api/generate/demo-data/:catalog_name
- UI: Template presets (Quick/Sales/Full), generation preview with cost estimate, per-industry progress bars, cleanup button, explore link
v0.5.0 — Plugin System, Schedule Backend, RBAC Enforcement
Preflight UC Permission Checks (ENHANCED)
- Enhanced all permission checks to recognize implicit and inherited Unity Catalog privileges
- dest_manage_permission: Checks ownership first, then catalog-level grants, then schema-level MANAGE grants
- dest_create_table: Recognizes ownership and MANAGE as implying CREATE TABLE; checks schema-level grants
- source_use_catalog: Shows "(owner)" when user owns catalog; displays GRANT command on failure
- create_catalog_permission: Checks metastore-level CREATE CATALOG grant
- Web UI preflight page shows GRANT commands as clickable code blocks (click to copy) with links to UC privileges documentation
Settings & Config — API as Source of Truth (NEW)
- Settings page now loads config from GET /config (backend is the single source of truth, replaces sessionStorage)
- Warehouse selection persists to backend via PATCH /config/warehouse
- Consistent card heights across Settings: CardHeader className="pb-2", text-base titles, h-4 icons
- Auth status endpoint now reflects the actual auth method from the resolved client (pat, cli-profile, service-principal, azure-cli, oauth)
Clone Page — Config from API (ENHANCED)
- Clone page now loads saved config from GET /config on mount (source_catalog, dest_catalog, clone_type, load_type, max_workers, etc.) instead of hardcoded defaults
Warehouse Page — Set as Active (NEW)
- Added "Set as Active" button on warehouse page with green border and ACTIVE badge on the selected warehouse
- New PATCH /config/warehouse API endpoint in api/routers/config.py
- Added patch method to ui/src/lib/api-client.ts
Demo Data Generator Fixes (FIXED)
- Replaced all timestamp_add() calls with dateadd() for Databricks SQL compatibility
- Fixed column comments: now only applies to columns that actually exist in the table DDL
- Fixed sample data export: replaced invalid COPY INTO (load-only) with CREATE OR REPLACE TABLE ... AS SELECT
- Added uc_best_practices parameter for medallion schema naming:
- true (default): shared bronze, silver, gold schemas with industry-prefixed tables
- false: legacy healthcare_bronze, healthcare_silver naming
- Added volume creation before sample data export
- Web UI: New "UC Best Practices Naming" checkbox on demo-data page with link to Microsoft documentation
Plugin System (NEW)
- Full plugin lifecycle: load, enable, disable, and hook execution
- Wired into clone_catalog and sync_catalog operations
- 3 example plugins shipped: logging, optimize, slack-notify
- CLI: clxs plugin list/enable/disable
- API: GET /plugins, POST /plugins/toggle
- 8 hook points available for custom logic (pre-clone, post-clone, pre-sync, post-sync, on-error, on-validate, on-rollback, on-complete)
- State persisted to ~/.clone-xs/plugin_state.json
- Extend ClonePlugin base class to write custom plugins
- Config: plugins: [{path: "plugins/my_plugin.py"}]
Schedule Backend (NEW)
- Persistent schedule storage in ~/.clone-xs/schedules.json
- Full CRUD: list_schedules, create_schedule, pause_schedule, resume_schedule, delete_schedule
- Integrates with Databricks Jobs via create_persistent_job()
- API endpoints: GET /schedule, POST /schedule, POST /schedule/{id}/pause, POST /schedule/{id}/resume, DELETE /schedule/{id}
RBAC Enforcement (ENHANCED)
- RBAC now enforced on clone, sync, diff, and incremental-sync operations (previously clone only)
- Operation-level permissions via allowed_operations field in policy (e.g., clone, sync, diff, *)
- API endpoints for policy management: GET /rbac/policies, POST /rbac/policies, DELETE /rbac/policies
- Policy CRUD functions: list_policies, create_policy, delete_policy
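An operation-level check against the allowed_operations field might be as small as the sketch below. Only the allowed_operations name and the * wildcard come from the entry above; the principals key, the policy dict layout, and both function names are assumptions for illustration.

```python
def is_allowed(policy: dict, operation: str) -> bool:
    """A policy permits an operation if it lists it or lists the wildcard."""
    allowed = policy.get("allowed_operations", [])
    return "*" in allowed or operation in allowed


def check_operation(policies: list[dict], principal: str, operation: str) -> bool:
    """True if any policy covering `principal` permits `operation`."""
    return any(
        principal in policy.get("principals", []) and is_allowed(policy, operation)
        for policy in policies
    )
```

Treating a missing allowed_operations field as an empty list makes the check deny by default, which is usually the safer failure mode for access control.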
CLI Improvements
- --catalog alias added to 16 single-catalog commands
- pii-scan now supports --schema-filter and --table-filter
- state command now accepts --source/--dest CLI args
- impact --threshold now properly wired up
- metrics --format json now outputs machine-readable JSON
- plugin CLI command added (list, enable, disable)
- include_schemas config option now passed through on schema-drift, storage-metrics, profile
PII Detection Enhancements
- Batch insert for scan store: changed from single-row INSERT to multi-row INSERT with 50-row chunks (reduces N SQL calls to ceil(N/50))
- Schema filter and table filter support in Web UI and API
- Web UI has new filter input fields on the PII scan page
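The batching arithmetic above (N single-row inserts becoming ceil(N/50) statements) can be sketched as follows. The function names and the ? placeholder style are assumptions; the actual scan store may build its statements differently.

```python
import math
from typing import Iterator, Sequence


def chunked(rows: Sequence[tuple], size: int = 50) -> Iterator[Sequence[tuple]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_insert_statements(
    table: str,
    columns: Sequence[str],
    rows: Sequence[tuple],
    chunk_size: int = 50,
) -> list[tuple[str, list]]:
    """Return (sql, params) pairs, one multi-row INSERT per chunk."""
    column_list = ", ".join(columns)
    row_placeholder = "(" + ", ".join("?" for _ in columns) + ")"
    statements = []
    for chunk in chunked(rows, chunk_size):
        values = ", ".join(row_placeholder for _ in chunk)
        params = [value for row in chunk for value in row]
        statements.append((f"INSERT INTO {table} ({column_list}) VALUES {values}", params))
    return statements


rows = [(f"col_{i}", "EMAIL") for i in range(120)]
statements = build_insert_statements("pii_detections", ["column_name", "pii_type"], rows)
print(len(statements), math.ceil(len(rows) / 50))  # both are 3
```

With 120 detections the last statement carries the remaining 20 rows, so the round-trip count drops from 120 to 3.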
API Enhancements
- New PATCH /config/warehouse endpoint for setting the active warehouse
- Added patch method to the TypeScript API client
- Auth status (/auth/status) now reports the actual auth method from the resolved Databricks client
Test Coverage
- 25 new test files added covering previously untested modules
- Total tests: 856 (up from 539)
v0.4.1 — CLI Improvements
--catalog Alias
- Added --catalog as an alias for --source on 16 single-catalog commands: stats, storage-metrics, optimize, vacuum, profile, export, search, snapshot, estimate, cost-estimate, dep-graph, usage-analysis, sample, view-deps, pii-scan, state
- Users can now write clxs stats --catalog edp_dev instead of clxs stats --source edp_dev
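With argparse, an alias like this is typically declared by passing both option strings to a single add_argument call and pinning dest so either spelling fills the same attribute. The sketch below assumes the CLI is built on argparse, which this changelog does not state; build_stats_parser is a hypothetical name.

```python
import argparse


def build_stats_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="clxs stats")
    # Both spellings populate args.source, so downstream code is unchanged.
    parser.add_argument(
        "--source",
        "--catalog",
        dest="source",
        required=True,
        help="Catalog to inspect",
    )
    return parser


args = build_stats_parser().parse_args(["--catalog", "edp_dev"])
print(args.source)  # edp_dev
```

Declaring the alias on the same argument, rather than adding a second one, avoids having to reconcile two attributes when both flags are supplied.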
PII Scan Enhancements
- New --schema-filter flag to limit scans to specific schemas (e.g., --schema-filter bronze)
- New --table-filter flag for regex filtering on table names (e.g., --table-filter "customer.*")
Bug Fixes
- state command: added --source/--dest CLI args (previously only read from config and would crash without them)
- impact --threshold: now properly wired to control the high-impact threshold
- metrics --format json: now properly outputs JSON when --format json is specified
Config Passthrough
- include_schemas config option now correctly passed through on schema-drift, storage-metrics, and profile commands
v0.4.0 — PII Detection Overhaul
PII Detection Engine
- Multi-phase detection: column name regex + data value sampling + Unity Catalog tag reading
- Structural validators — Luhn checksum (credit cards), IBAN mod-97, IP octet range validation reduce false positives
- Weighted confidence scoring — numeric 0.0–1.0 scores: column name (0.85), sampling (match rate + validator bonus), UC tags (0.95)
- Cross-column correlation — tables with co-occurring PII types (e.g., name + DOB + address) flagged as identity_cluster with confidence boosts
- 5 new value patterns — IBAN, US passport, Aadhaar, UK NINO, MAC address
- 2 new column patterns — MAC_ADDRESS, VIN
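Two of the structural validators named above are standard, well-documented checks. The sketch below shows how they cut false positives: a 16-digit string only counts as a card number if its Luhn checksum passes, and an IBAN-shaped string only counts if its mod-97 remainder is 1. The function names are illustrative, not the engine's.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """IBAN mod-97: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and require the remainder to equal 1."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))          # True
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # True
```

A regex match followed by a passing checksum is much stronger evidence than the regex alone, which is presumably where the validator bonus in the confidence score comes from.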
Custom Patterns
- User-defined PII patterns via pii_detection config key in YAML
- Disable built-in patterns, add custom column/value patterns, override masking strategies
- Web UI pattern editor with regex tester and enable/disable toggles
Unity Catalog Integration
- Read existing UC column tags (pii_type, sensitive, classification) to enhance detection
- Auto-tag detected PII columns with ALTER TABLE ... ALTER COLUMN ... SET TAGS
- Dry-run mode, configurable tag prefix and minimum confidence threshold
Scan History & Remediation
- Scan results persisted to 3 Delta tables (pii_scans, pii_detections, pii_remediation)
- Compare two scans to see new, removed, and changed detections
- Remediation workflow: detected → reviewed → masked → accepted → false_positive
New API Endpoints
- GET /pii-patterns — effective patterns (built-in + custom)
- GET /pii-scans — scan history
- GET /pii-scans/{id} — scan detail
- GET /pii-scans/diff — compare two scans
- POST /pii-tag — apply UC tags
- POST /pii-remediation — update remediation status
- GET /pii-remediation — list remediation statuses
UI Enhancements
- Tabbed interface: Current Scan / Scan History / Remediation
- Custom Patterns editor (collapsible panel)
- "Apply UC Tags" button with dry-run preview
- Detection method and correlation flags columns in results table
CLI & TUI
- New flags: --read-uc-tags, --save-history, --apply-tags, --tag-prefix
- TUI prompts for UC tag reading and post-scan tagging
Optional NLP
- pip install 'clone-xs[nlp]' enables Microsoft Presidio entity detection
- Maps Presidio entities to Clone-Xs PII types
Bug Fixes
- Fixed result["total_pii_columns"] → result["summary"]["pii_columns_found"] in CLI and TUI
Documentation
- New dedicated PII Detection & Protection guide (15 sections)
- Standalone HTML reference page (PII_Detection_Reference.html)
- Governance page updated with link to new PII guide
v0.3.3
True Delta Rollback with RESTORE TABLE
- Rollback now uses RESTORE TABLE ... TO VERSION AS OF instead of destructive DROP
- Pre-clone Delta versions recorded for each destination table before clone overwrites it
- Three rollback modes: version-based (precise), timestamp-based (fallback), legacy DROP (old logs)
- Tables that existed before clone: RESTORED to pre-clone version
- Tables newly created by clone: DROPped
- Rollback UI shows per-table plan: green "RESTORE to vN" badges vs red "DROP" badges
- clone_started_at timestamp recorded in rollback logs for timestamp-based restore
- New rollback_logs Delta table with full history (schemas_count, tables_count, restored_count, etc.)
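The version-based mode described above boils down to a per-table decision: restore if a pre-clone version was recorded, drop if the clone created the table. A minimal sketch of that planning step follows; plan_rollback and its argument shapes are hypothetical, and the timestamp-based and legacy modes are left out.

```python
from typing import Optional


def plan_rollback(
    pre_clone_versions: dict[str, Optional[int]],
    cloned_tables: list[str],
) -> list[str]:
    """Return one SQL statement per destination table touched by the clone."""
    plan = []
    for table in cloned_tables:
        version = pre_clone_versions.get(table)
        if version is not None:
            # The table existed before the clone: rewind it (version 0 is valid).
            plan.append(f"RESTORE TABLE {table} TO VERSION AS OF {version}")
        else:
            # The clone created the table: removing it undoes the change.
            plan.append(f"DROP TABLE IF EXISTS {table}")
    return plan
```

The resulting list maps directly onto the per-table plan in the UI, with RESTORE entries as the green badges and DROP entries as the red ones.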
Explorer Page Enhancements
- Added Monthly Cost and Yearly Cost estimate cards (8 stat cards total)
- Storage price configurable from Settings (default $0.023/GB/month)
- Currency selection in Settings (USD, EUR, GBP, AUD, CAD, INR, JPY, CHF, SEK, BRL)
- Cost Estimator page now reads price from Settings
- Column usage fallback to information_schema when system tables unavailable
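As a worked example of the cost cards, the arithmetic at the default price is straightforward. The function name and return shape below are assumptions, and currency conversion is ignored.

```python
def estimate_storage_cost(total_gb: float, price_per_gb_month: float = 0.023) -> dict:
    """Monthly cost is size times unit price; yearly is twelve months of that."""
    monthly = total_gb * price_per_gb_month
    return {"monthly": monthly, "yearly": monthly * 12}


estimate = estimate_storage_cost(1000)
print(f"{estimate['monthly']:.2f} per month, {estimate['yearly']:.2f} per year")
# 23.00 per month, 276.00 per year
```

Changing the storage price in Settings would simply swap the second argument, which is why the Cost Estimator page can reuse the same calculation.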
Error Handling Improvements
- /api/column-usage — returns empty result instead of 500 when system tables unavailable
- /api/dependencies/functions — returns empty result instead of 500
- /api/dependencies/views — returns empty result instead of 500
- /api/dependencies/order — returns empty result instead of 500
Template Fixes
- Template API now returns key field (was returning name as dict key)
- Template API now returns full config dict for config badges
- Category filter fixed: schema-only added to Development, fallback inference for unknown keys
v0.3.2
Dashboard Enhancements
- Extended dashboard from 4 to 10 stat cards: added Avg Duration, Tables Cloned, Data Moved, Views Cloned, Volumes Cloned, Week-over-Week trend
- Added 3 new charts: Clone Type Split (DEEP vs SHALLOW donut), Operation Type Split (clone/sync/rollback donut), Peak Usage Hours (bar chart)
- Added 2 insight tables: Top Source Catalogs (bar progress), Active Users (avatar + bar progress)
- Added Catalog Health Score card with per-catalog scoring (0-100) based on failure rates and operation history
- Added Pinned Catalog Pairs — localStorage-based favorites for quick clone access
- Added Notification Center — bell icon in header with recent clone events from Delta tables
- Dashboard now queries all 3 Delta tables (run_logs, clone_operations, clone_metrics) with SQL alias normalization for column name differences
Templates Page Redesign
- Category filter pills (All, Development, Production, Disaster Recovery, Security)
- Unique icon and color per template
- Config detail badges (Permissions, Validate, Rollback, Checksum, PII Masking)
- Expandable "More details" with full long_description for each template
- Click-anywhere-on-card to use template
- Templates now pass ALL config values as URL params to clone page
Clone Page Improvements
- Clone page reads URL query params on mount — template settings (checkboxes, clone type, workers) are now correctly applied
- Auto-populate Storage Location from source catalog's storage root via GET /catalogs/{catalog}/info
Audit Trail Redesign
- Summary stats bar (Total Operations, Succeeded, Failed, Avg Duration)
- Enhanced filters: free-text search, status dropdown, operation type, catalog filter, date range, "Clear all" button
- Expandable entry rows with detail grid (User, Host, Started, Completed, Tables Cloned/Failed, Data Size, Clone Mode, Trigger)
- Log Detail Panel — fetches full execution logs from /audit/{job_id}/logs with color-coded log viewer
- Error message display with mono-font
- Download Full Log as JSON
Cost Estimator Fix
- Fixed field name mismatch between API response and frontend (total_gb vs total_size, monthly_cost_usd vs total_cost, etc.)
- Now shows: Total Size (GB/TB), Tables Scanned, Monthly Cost, Yearly Cost
- Deep vs Shallow comparison cards
- Top 10 Largest Tables with size percentage bars
Page State Persistence (JobContext)
- New React Context (JobContext) that persists scan/operation results across page navigation
- 10 pages updated: PII Scanner, Schema Drift, Preflight, Diff & Compare, Cost Estimator, Profiling, Impact Analysis, Compliance, Monitor, Storage Metrics
- Navigate away and come back — results are preserved
New Delta Table Columns
- clone_operations: added tables_skipped (INT), clone_mode (STRING), trigger (STRING), destination_existed (BOOLEAN)
- run_logs: added tables_cloned (INT), tables_failed (INT), total_size_bytes (BIGINT)
- clone_metrics: added user_name (STRING), status (STRING), job_type (STRING)
- ALTER TABLE ADD COLUMN on init for existing tables
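A minimal sketch of how the ALTER TABLE ADD COLUMN step on init could work: compare the columns a table already has with the ones this release expects, and emit one statement for whatever is missing. The column lists come from the entries above; the function name and the `audit.logs` table prefix in the usage are assumptions, not the project's actual code.

```python
from typing import Optional

# Columns introduced in this release, as listed above.
NEW_COLUMNS = {
    "clone_operations": {
        "tables_skipped": "INT",
        "clone_mode": "STRING",
        "trigger": "STRING",
        "destination_existed": "BOOLEAN",
    },
    "run_logs": {
        "tables_cloned": "INT",
        "tables_failed": "INT",
        "total_size_bytes": "BIGINT",
    },
    "clone_metrics": {
        "user_name": "STRING",
        "status": "STRING",
        "job_type": "STRING",
    },
}


def add_missing_columns_sql(table: str, wanted: dict, existing: set) -> Optional[str]:
    """Return one ALTER TABLE statement for the missing columns, or None."""
    present = {name.lower() for name in existing}
    missing = [(n, t) for n, t in wanted.items() if n.lower() not in present]
    if not missing:
        return None
    # Backtick-quote names so words like `trigger` cannot clash with keywords.
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in missing)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


# Hypothetical fully qualified table name and pre-existing columns.
sql = add_missing_columns_sql(
    "audit.logs.run_logs", NEW_COLUMNS["run_logs"], {"job_id", "tables_cloned"}
)
```

Diffing first keeps init idempotent: a table that is already up to date produces no statement at all.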
Backend Improvements
- New endpoints: GET /notifications, GET /catalog-health
- GET /monitor/metrics now queries all 3 Delta tables with SQL alias normalization
- Metrics enabled by default in config
- Template API now returns full config dict and key field
- Settings page loads audit catalog/schema from YAML config instead of stale sessionStorage
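"SQL alias normalization" can be pictured as follows: each Delta table names equivalent fields differently, so every SELECT aliases its own columns to one canonical set before the results are combined. The three table names come from this changelog; every column name and the `audit.logs` prefix below are hypothetical, and combining with UNION ALL is an assumption about how the results are merged.

```python
# Canonical dashboard column -> actual column name in each table.
# All column names here are made up for illustration.
COLUMN_MAP = {
    "run_logs": {"job_id": "job_id", "status": "status", "started_at": "start_time"},
    "clone_operations": {"job_id": "operation_id", "status": "state", "started_at": "started_at"},
    "clone_metrics": {"job_id": "job_id", "status": "status", "started_at": "ts"},
}


def normalized_select(table: str, schema_prefix: str) -> str:
    """Build a SELECT that renames the table's columns to the canonical names."""
    cols = ", ".join(
        actual if actual == canonical else f"{actual} AS {canonical}"
        for canonical, actual in COLUMN_MAP[table].items()
    )
    return f"SELECT {cols} FROM {schema_prefix}.{table}"


def unified_query(schema_prefix: str) -> str:
    """Stack the normalised SELECTs so downstream code sees a single shape."""
    return "\nUNION ALL\n".join(
        normalized_select(table, schema_prefix) for table in COLUMN_MAP
    )


query = unified_query("audit.logs")
```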
Documentation
- New API Reference page (69+ endpoints across 12 router groups)
- New Web UI Guide (all 33 pages documented)
- New Changelog page
- Updated sidebars.ts and intro.md with links to new docs
- Updated TTL documentation with native Databricks comparison
Docs Site
- Navbar logo: SVG icon only + CSS-rendered text for crisp display
- Increased subtitle readability
- Primary color changed to Clone-Xs red (#E8453C)
v0.3.1
Lineage Enhancements
- Interactive SVG lineage graph with pan/zoom, node highlighting, and curved bezier edges
- Multi-hop tracing up to 5 hops deep with configurable depth slider
- Column-level lineage from system.access.column_lineage
- Notebook/job attribution via entity_type and entity_id fields
- Time range filtering (from/to date pickers)
- JSON and CSV export
- Insights panel: most connected tables, root sources, terminal sinks, top columns by usage, active users
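Multi-hop tracing with a depth limit is, at its core, a bounded breadth-first walk over lineage edges. The sketch below assumes edges are available as `(source_table, target_table)` pairs; the function name and sample table names are illustrative, not taken from the codebase.

```python
from collections import deque


def trace_downstream(edges, start, max_hops=5):
    """Breadth-first walk; returns {table: hop_count} for tables within max_hops."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # depth limit reached, do not expand further
        for nxt in graph.get(node, []):
            if nxt not in hops:  # first visit is the shortest path; also guards cycles
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    del hops[start]
    return hops


# Hypothetical lineage edges, including a cycle back to the start.
edges = [("raw", "bronze"), ("bronze", "silver"), ("silver", "gold"),
         ("raw", "silver"), ("gold", "raw")]
two_hops = trace_downstream(edges, "raw", max_hops=2)
```

Because breadth-first search reaches each table by its shortest path first, the hop count shown per node is the minimum distance from the starting table.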
Explorer Page Major Enhancements
- Catalog Browser — Databricks-style tree sidebar showing all catalogs, schemas, and tables with lazy loading, search filter, expandable tree nodes, hideable via Settings toggle or X button, and resizable via drag
- UC Objects tab — lists all Unity Catalog workspace objects: External Locations, Storage Credentials, Connections, Registered Models (ML), Metastore info, Shares, and Recipients via new GET /uc-objects endpoint
- Views tab — dedicated tab listing all views with column counts
- Functions tab — lists all UDFs across schemas with lazy loading
- Volumes tab — lists volumes with type and path
- PII Detection tab — inline PII scanner within Explorer
- Feature Store tab — auto-detects feature tables by naming convention
- Table Detail Drawer — click any table to open a slide-out panel with columns, properties, owner, storage location, and dates via GET /catalogs/{catalog}/{schema}/{table}/info
- Schema size donut chart and Table type distribution donut on overview
- Top Used Tables card from POST /table-usage endpoint
- Most Used Columns on overview from column usage data
- Schema filter pills — toggle schemas on/off to filter displayed tables
- Quick actions — Preview, Clone, Profile buttons per table row
- Compare shortcut — button to jump to Diff page with current catalog pre-filled
- Export CSV — download all table data as CSV
- Cost estimates — Monthly/Yearly cost cards with configurable currency
Settings Enhancements
- UI Preferences section with toggles for Export Buttons and Catalog Browser visibility
- Currency selector — 10 currencies (USD, EUR, GBP, AUD, CAD, INR, JPY, CHF, SEK, BRL)
- Storage price — configurable $/GB/month with links to Azure Pricing Calculator and Databricks Pricing
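With a configurable $/GB/month storage price, the Monthly and Yearly cost cards reduce to simple arithmetic. The sketch below assumes 1 GB = 1024^3 bytes and a flat per-GB price; the function name and the 0.023 price are illustrative values, not project defaults.

```python
def storage_cost(total_bytes: int, price_per_gb_month: float) -> dict:
    """Estimate monthly and yearly storage cost from a size in bytes."""
    size_gb = total_bytes / 1024**3  # assumption: binary gigabytes
    monthly = size_gb * price_per_gb_month
    return {
        "size_gb": round(size_gb, 2),
        "monthly": round(monthly, 2),
        "yearly": round(monthly * 12, 2),
    }


# Hypothetical catalog: 500 GB at an example price of 0.023 per GB per month.
estimate = storage_cost(500 * 1024**3, 0.023)
```

The result is in whatever currency the price was entered in; this sketch performs no currency conversion.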
Resizable Panels
- Main sidebar, Catalog Browser, Table Detail Drawer, and Lineage Graph all support drag-to-resize with widths persisted in localStorage
- Reusable ResizeHandle component
Column Usage Analytics
- New POST /api/column-usage endpoint querying system.access.column_lineage and system.query.history
- Most frequently used columns with per-user breakdown
- Integrated into both Lineage Insights tab and Explorer page
- Default mode uses information_schema.columns (fast, < 2s); system tables (system.access.column_lineage) only when use_system_tables: true; query history only when include_query_history: true
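The fast/full split above can be read as a small decision over two request flags. The flag names and source tables are the ones listed here; the function itself, and the assumption that the fast source stays in the list when the slower ones are enabled, are illustrative only.

```python
def column_usage_sources(use_system_tables: bool = False,
                         include_query_history: bool = False) -> list:
    """Pick which data sources a column-usage request would touch."""
    sources = ["information_schema.columns"]  # fast default path
    if use_system_tables:
        sources.append("system.access.column_lineage")
    if include_query_history:
        sources.append("system.query.history")
    return sources


# Example request bodies a client might send (shape is an assumption).
fast_request = {}
full_request = {"use_system_tables": True, "include_query_history": True}

fast_sources = column_usage_sources(**fast_request)
full_sources = column_usage_sources(**full_request)
```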
New API Endpoints
- GET /uc-objects — list all UC workspace objects (External Locations, Storage Credentials, Connections, Models, Metastore, Shares, Recipients) via SDK
- POST /table-usage — top used tables by query frequency
- POST /column-usage — optimized with fast/full modes
Create Job Enhancements
- Auto-populated storage location from source catalog's DESCRIBE CATALOG EXTENDED
- Clone-Xs job dropdown (filters by created_by=clone-xs tag) for updating existing jobs
- New GET /api/generate/clone-jobs and GET /api/catalogs/{catalog}/info endpoints
Bug Fixes
- Fixed Audit Trail field name mismatch (rebuilt as expandable card layout)
- Fixed Config Diff API to accept JSON dicts/YAML strings instead of file paths
- Fixed Lineage get_lineage import error with 4-tier data source fallback
- Fixed Impact Analysis function signature mismatch and response field mapping
Changed
- SDK-first metadata access — ~42 SQL warehouse queries replaced with Databricks SDK API calls (client.schemas.list(), client.tables.list(), client.functions.list(), etc.). Metadata browsing (list catalogs, schemas, tables) now works without a running SQL warehouse. SQL fallback preserved for reliability.
- New SDK helpers in src/client.py: list_schemas_sdk, list_tables_sdk, list_views_sdk, list_functions_sdk, list_volumes_sdk, get_table_info_sdk, get_catalog_info_sdk, delete_table_sdk
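The SDK-first pattern with a SQL fallback can be sketched generically: try the SDK call, and only if it raises, run the SQL equivalent. The wrapper below is an illustration under that assumption, not the code in src/client.py; `sdk_first` and the stub functions are hypothetical names.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("metadata")


def sdk_first(sdk_call: Callable[[], T], sql_fallback: Callable[[], T]) -> T:
    """Run the SDK call; fall back to the SQL path if it raises."""
    try:
        return sdk_call()
    except Exception as exc:  # broad on purpose: any SDK failure triggers the fallback
        log.warning("SDK call failed (%s); falling back to SQL", exc)
        return sql_fallback()


# In real use the first callable would wrap an SDK call such as
# client.schemas.list(catalog_name=...), and the second a SHOW SCHEMAS query.
# Stubs stand in for both here so the sketch runs without a workspace.
def failing_sdk():
    raise RuntimeError("SDK unavailable")


def sql_schemas():
    return ["bronze", "silver"]


schemas = sdk_first(failing_sdk, sql_schemas)
```

Passing callables keeps the wrapper independent of both the SDK and the warehouse client, which also makes it straightforward to test.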
Removed
- Schedule page removed from sidebar (scheduling handled by Create Job)
v0.3.0
Dashboard Overhaul
- Added 10 stat cards: Total Clones, Success Rate, Completed, Failed, Avg Duration, Tables Cloned, Data Moved, Views Cloned, Volumes Cloned, Week-over-Week trend
- Added 5 charts: Clone Activity (7 days), Status Breakdown, Clone Type Split, Operation Type Split, Peak Usage Hours
- Added 2 insight tables: Top Source Catalogs, Active Users
- Added Catalog Health Score card with per-catalog scoring
- Added Pinned Catalog Pairs (localStorage-based favorites)
- Added Notification Center bell icon in header with recent clone events
- Dashboard now reads from Delta tables (run_logs, clone_operations) instead of in-memory job store — data persists across API restarts
API Enhancements
- GET /monitor/metrics — now queries Delta tables for comprehensive dashboard stats
- GET /notifications — new endpoint for recent clone events
- GET /catalog-health — new endpoint for per-catalog health scoring
- Enabled metrics_enabled by default in config
v0.2.0
Advanced Cloning
- Data filtering with --where and --table-filter for cloning subsets
- TTL policies for auto-expiring cloned catalogs via Unity Catalog tags
- Plugin system with pre/post-clone hooks and custom plugin directory
- Execution plan preview with console, JSON, HTML, and SQL output formats
- Captured SQL file export for DBA review
Web UI
- 33 pages covering all operations, discovery, analysis, and management
- Multi-step clone wizard with progress tracking
- Real-time WebSocket updates during clone operations
- Dark/light theme toggle
- Command palette search across all pages
v0.1.1
Operations
- Incremental Sync — sync only changed tables using Delta version history
- Multi-Clone — clone one source to multiple destinations in parallel
- Create Databricks Job — schedule persistent clone jobs with cron, retries, and alerts
- Rollback — undo clone operations using Delta time travel RESTORE
- Serverless execution — run clones via serverless notebook jobs
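One way to picture Incremental Sync: if each sync records the source Delta version it last copied, a table needs syncing only when its current version is higher. The sketch below assumes exactly that bookkeeping; the function name and the version dictionaries are hypothetical, and in practice the current versions would come from the tables' Delta history.

```python
def tables_to_sync(current_versions: dict, last_synced: dict) -> list:
    """Return tables whose source Delta version moved past the last synced one."""
    return sorted(
        table
        for table, version in current_versions.items()
        if last_synced.get(table, -1) < version  # never-synced tables always qualify
    )


# Hypothetical version snapshots.
current = {"sales.orders": 42, "sales.customers": 17, "sales.returns": 3}
synced = {"sales.orders": 42, "sales.customers": 15}

pending = tables_to_sync(current, synced)
```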
Discovery & Analysis
- Explorer — browse catalog hierarchy with size metrics
- Diff & Compare — object-level and column-level catalog comparison
- Schema Drift Detection — detect changes between source and destination
- Impact Analysis — blast radius analysis before schema changes
- Dependency Graph — view/function dependency ordering
- PII Scanner — detect personally identifiable information patterns
- Cost Estimator — estimate storage and compute costs
- Data Profiling — column statistics and data quality analysis
- Storage Metrics — per-table ANALYZE TABLE storage breakdown
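Dependency ordering for views and functions is a topological sort: an object can be created only after everything it references exists. A minimal sketch using Python's standard `graphlib` follows; the object names are made up, and how Clone-Xs actually extracts the dependencies is not shown here.

```python
from graphlib import TopologicalSorter


def creation_order(dependencies: dict) -> list:
    """Order objects so each appears after everything it depends on.

    `dependencies` maps an object name to the set of objects it references.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical objects: a report view built on another view and a function.
deps = {
    "v_report": {"v_orders", "f_tax"},
    "v_orders": {"orders"},
    "f_tax": set(),
}
order = creation_order(deps)
```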
v0.1.0
Deployment
- Databricks App — deploy as a native Databricks App with service principal auth
- Desktop App — native macOS/Windows Electron app
- Notebook API — install as wheel package, use from Databricks notebooks
- REST API server — expose all operations as HTTP endpoints
Safety & Governance
- Pre-flight checks — validate connectivity, permissions, and config
- Auto-rollback on validation failure
- Checkpointing — resume long clones from last checkpoint
- RBAC policies — control who can clone what
- Approval workflows — require approval before cloning
- Compliance reports — governance, PII audit, and permission reports
v0.0.2
Core Features
- Deep and shallow Delta Lake cloning
- Schema, table, view, function, and volume replication
- Permission, tag, and constraint copying
- Audit trail logging to Delta tables
- Clone templates (dev, staging, production profiles)
- Scheduled cloning with cron expressions
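For reference, deep and shallow cloning map onto Delta Lake's `CREATE TABLE ... CLONE` statement. The helper below is a sketch of how such a statement could be assembled; the function name and table names are illustrative, and the project's own SQL generation may differ.

```python
def clone_statement(source: str, target: str, deep: bool = True) -> str:
    """Build a Delta Lake clone statement for one table."""
    mode = "DEEP" if deep else "SHALLOW"
    return f"CREATE TABLE IF NOT EXISTS {target} {mode} CLONE {source}"


# Hypothetical fully qualified table names.
deep_sql = clone_statement("prod.sales.orders", "dev.sales.orders")
shallow_sql = clone_statement("prod.sales.orders", "dev.sales.orders", deep=False)
```

A deep clone copies the data files, while a shallow clone only references the source's files, which is why shallow clones are fast but depend on the source remaining available.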
v0.0.1
Initial Release
- CLI tool for Unity Catalog catalog cloning
- Deep clone with full data copy
- Shallow clone with metadata-only references
- Basic progress reporting and error handling
- YAML configuration file support
- Authentication via Personal Access Token