Unstructured Demo Data
The /demo-data page hosts six tabs that generate unstructured
demo corpora — files and per-line records that complement the
structured-catalog generator documented in
Demo Data Generator. They exist for RAG, GenAI,
observability, and code-search demos where the input is a file (PDF,
WAV, log) or a long-form text row, not a typed Delta column.
| Tab | What it generates | Per-type cap | Extra deps |
|---|---|---|---|
| Documents | PDF / DOCX / PPTX / XLSX / EML | 10,000 | clone-xs[documents] |
| Media | PNG / WAV / MP4 | 5,000 | Pillow (images); ffmpeg (video) |
| Knowledge | Markdown wiki, Q&A JSON, JSONL chat | 10,000 | none (pure stdlib + Faker) |
| Logs | NGINX, JSON, syslog, OTel traces | 1,000 files | none |
| Code | Python / JS / Java repos | 50 repos | none |
| Live Capture | Webcam photos / video chunks → Volume + Delta with inline BINARY | per-tab session (no fixed cap) | none (browser MediaRecorder + <canvas>.toBlob()) |
The first five tabs share the same destination model, the same catalog/schema/volume picker, the same industry registry, and the same preview → submit → poll lifecycle. Read Shared architecture once; per-tab sections cover only what's specific.
Live Capture is architecturally different — it's synchronous (no JobManager / no polling), each capture is one HTTP multipart upload the request handler completes before returning, and the bytes arrive from the user's browser webcam rather than being synthesised on the server. See Live Capture below.
Shared architecture
Three destinations
Every tab exposes the same destination radio. Pick the shape that matches the demo:
| Destination | Files written? | Catalog table written? | Use when |
|---|---|---|---|
volume | yes — to <catalog>.<schema>.<volume> | no | Pure file-corpus demo. RAG ingestion lands directly on the Volume path. |
volume_with_catalog (default) | yes | yes — one row per file, with metadata | You want both the Volume (for downstream readers) and a Delta index for SQL discovery. |
direct_table | no | yes — content inline in the table | The bytes-in-Delta shape your demo expects (e.g. embedding pipelines that read content directly). |
The direct_table content column type varies by tab:
- Documents / Media —
content BINARY(raw bytes). - Knowledge / Code —
content STRING(text inline, queryable). - Logs — one row per line (not per file) with
message STRINGattrs MAP<STRING, STRING>.
Catalog / schema / volume picker
All six tabs use the same picker component (ui/src/components/CatalogSchemaVolumePicker.tsx). Each of the three fields renders as a dropdown of existing names with a Custom name… (create new) fallback that swaps in a free-text input for the typed-in name. The picker calls:
GET /api/catalogs→ list of catalogs the workspace user can read.GET /api/catalogs/{catalog}/schemas→ schemas under the chosen catalog. Skipped while the user is still typing a custom catalog name (the catalog doesn't exist yet, so there's nothing to enumerate).GET /api/auth/volumes→ volumes scoped to the chosencatalog.schema. The Volume dropdown shows existing names plus the default name (demo_unstructured) and a Custom name… fallback.
When the user picks (or types) a name that doesn't yet exist, the runner auto-creates it on submit:
CREATE SCHEMA IF NOT EXISTS <catalog>.<schema>;
CREATE VOLUME IF NOT EXISTS <catalog>.<schema>.<volume>;
The picker label flips to "(unused for direct_table)" when the
destination radio is set to direct_table — Volume isn't needed,
but the field stays visible so the layout doesn't shift.
Industry pattern
Every tab defaults to one of ten industries — healthcare,
financial, retail, telecom, manufacturing, energy,
education, real_estate, logistics, insurance — same set the
structured generator uses. Industry drives template selection within
each generator (e.g. the Documents pdf_invoice type renders as
"Medical invoice" for healthcare, "Premium invoice" for insurance,
"Freight invoice" for logistics). The Documents tab additionally
hides types that don't make sense for the chosen industry (e.g.
pdf_lab_report is healthcare-only).
Lifecycle: types → preview → submit → poll
Every tab follows the same four-call lifecycle:
GET /api/generate/demo-{kind}/types— registry + dependency probe. Response includesavailable: booland an optionalunavailable_reason. The UI uses these to render an install banner instead of an error toast when an extra is missing.POST /api/generate/demo-{kind}/preview— pure arithmetic onbytes/type × counts. No warehouse round-trip. Called on every form change so the operator sees an estimate without waiting.POST /api/generate/demo-{kind}— submits the job. Returns{job_id, status: "queued"}immediately.GET /api/clone/{job_id}— same poll endpoint every other long-running job uses. Surfaces progress, per-type counters, and the final summary.
Validation is shared too:
- Catalog / schema / volume must each be a single Unity Catalog identifier (no dotted FQNs). The most common operator mistake is pasting a multi-part prefix into the catalog field — the validator catches it before the warehouse does.
volumeis required when destination isvolumeorvolume_with_catalog; ignored ondirect_table.countskeys must appear intypes(catches stale form state and typos).
Documents tab
Generates a corpus of PDFs, Word/PowerPoint/Excel docs, and .eml
emails. Twenty-nine document types ship in the registry — nine
industry-aware originals plus twenty industry-specific additions
(lab reports, account statements, BOL/customs forms, property
listings, syllabi, …). The picker shows only the types that make
sense for the chosen industry.
Module:
src/demo_documents.py.
Router:
api/routers/demo_documents.py.
UI tab:
ui/src/app/demo-data/DocumentsTab.tsx.
Per-type cap
10,000 files per type. Beyond that the request fails validation; split into multiple smaller runs.
Dependency gate
The [documents] extra (reportlab, python-docx, python-pptx,
openpyxl) is required. The /types endpoint surfaces
available: false with an install hint when the extra isn't
present, and POST /demo-documents returns a structured 503:
{
"error": "dependencies_missing",
"extra": "documents",
"install_command": "pip install clone-xs[documents]",
"reason": "<probe message>"
}
AI mode (realistic narrative content)
When realistic_content: true, narrative text in the generated
documents (clinical notes, invoice descriptions, contract clauses,
cover-letter prose) is drafted by an LLM instead of a template. The
adapter is dual-backend:
- Databricks Model Serving (preferred) — used when the request
carries an
X-Databricks-Model: <endpoint-name>header. The UI's api-client sets this automatically fromlocalStorage.dbx_modelwhenever the user has picked a Model Serving endpoint in Settings. Same pattern the AI assistant uses. - Anthropic API (fallback) — used when the header is absent and
ANTHROPIC_API_KEYis set in the runtime environment.
If neither is configured the runner logs a warning and runs in template-only mode. Spreadsheets ignore the flag (no narrative content).
Token budget — ai_token_budget caps the per-job AI cost.
Default 50,000 tokens (≈ $0.50 on Sonnet at typical max_tokens);
range 0–10,000,000. Accounting is conservative — every call
charges the full requested max_tokens (the underlying SDK doesn't
surface usage), which biases toward stopping early. When the budget
is exhausted, remaining draft() calls return their template
fallback instead of calling the LLM. Set the budget to 0 to
disable AI entirely even when realistic_content=True.
The job summary includes:
{
"ai_backend": "databricks:my-endpoint",
"ai_calls": 427,
"ai_tokens_used": 49600,
"ai_fallbacks": 3
}
Distinctness — content variation
To avoid the "every PDF reads identical" problem, the generators use three small primitives:
_rotate(*variants)—random.choiceover phrasing variants for things like opening sentences and closing salutations._maybe_section(prob)— random optional inclusion of secondary sections (e.g. "Additional Notes", "References") so document length and shape vary.- An expanded
_INDUSTRY_CONTEXTregistry — diagnosis codes, treatment codes, department names, transaction types, store codes, product categories, services across all ten industries — sized large enough that a 10,000-row corpus has visible variety.
These run regardless of AI mode; AI mode adds a fourth variation axis (LLM-drafted narrative) on top.
Media tab
Generates synthetic images, audio, and short video clips. Five
generators ship: img_xray (512×512 grayscale with overlaid
"radiograph" text), img_scan (800×1000 off-white scanned-document
look), img_photo (600×400 stock-photo placeholder with shapes),
audio_voicemail (2-second sine + Faker-generated transcript line),
and video_clip (320×240 H.264 MP4 at 15 fps).
Module:
src/demo_media.py.
Router:
api/routers/demo_media.py.
UI tab:
ui/src/app/demo-data/MediaTab.tsx.
Per-type cap: 5,000 (lower than Documents because media files are bigger).
Dependency gating: Pillow is required for the three image types
and for the voicemail's transcript fallback; ffmpeg is required only
for video_clip. The /types endpoint surfaces both signals
separately:
{
"available": true,
"ffmpeg_available": false,
"unavailable_reason": null
}
When ffmpeg_available is false the UI greys out the Video Clip
checkbox; the other four types remain selectable.
direct_table caveat for video — Delta has a ~16 MB row-size cap
that a busy video_clip run can blow through. The runner doesn't
split or truncate today. For video-heavy demos prefer
volume_with_catalog; for direct-table demos keep the count low.
The job summary includes per-type counters for files written and
per-type failures (e.g. video_clip_failed: 12, reason: ffmpeg_missing).
Knowledge tab
Generates wiki articles, Q&A pairs, and chat threads — the corpora
behind knowledge-base RAG and conversational-AI demos. Three
generators ship: wiki_article (markdown body with YAML frontmatter
and a synthesized topic registry), qa_pair (JSON, one
question/answer per file), chat_thread (JSONL Slack-export-shaped
threads).
Module:
src/demo_knowledge.py.
Router:
api/routers/demo_knowledge.py.
UI tab:
ui/src/app/demo-data/KnowledgeTab.tsx.
Per-type cap: 10,000.
No extra deps — Knowledge is pure stdlib + Faker. The /types
endpoint always returns available: true.
Topic IA — each output file lands in a per-industry
<topic> sub-directory under the type folder, so RAG demos can
filter on topic cleanly:
knowledge/
├── wiki_article/
│ ├── billing/ ← topic
│ │ ├── billing_001.md
│ │ └── …
│ └── compliance/
└── qa_pair/
└── billing/
└── billing_001.json
direct_table content type — STRING (not BINARY) because
knowledge bodies are text and operators want to query them inline:
SELECT topic, content FROM demo_knowledge
WHERE topic = 'billing' AND content LIKE '%refund%';
Logs tab
Generates synthetic log corpora for observability, SIEM, and
anomaly-detection demos. Four generators ship: nginx_access
(combined-log-format with a 24-hour traffic curve peaking at 10
and 16 UTC), app_json (JSON Lines with realistic level mix —
~94% INFO / 5% WARN / 1% ERROR), syslog (RFC 5424 with a
per-industry service registry), and otel_trace (OpenTelemetry
span trees, 3–8 spans per trace with parent_span_id wired).
Module:
src/demo_logs.py.
Router:
api/routers/demo_logs.py.
UI tab:
ui/src/app/demo-data/LogsTab.tsx.
Caps and extra inputs
| Field | Default | Range |
|---|---|---|
| files per type | — | 1–1,000 |
lines_per_file | 1,000 | 1–100,000 |
days_back | 7 | 1–365 |
Files are spread evenly across days_back UTC days with peak-hour
clustering inside each day, so a 7-day corpus produces a realistic
weekly pattern.
direct_table is one row per LINE — the natural shape for log
analytics. The Volume + catalog destinations write one row per file
(file-level metadata); only direct_table decomposes lines:
CREATE OR REPLACE TABLE <fqn> (
log_id STRING,
log_type STRING,
service STRING,
ts TIMESTAMP,
level STRING,
message STRING,
attrs MAP<STRING, STRING>,
generated_at TIMESTAMP
) USING delta;
attrs is the open-ended bag for log-type-specific structure —
nginx writes remote_addr, request_method, status,
response_size; OTel writes trace_id, span_id, parent_span_id,
span_name, attributes_json. Operators can attrs['status'] etc.
without reshaping the table.
Code tab
Generates synthetic source-code repos for code-search and
Copilot-style demos. Three generators ship: python_repo
(src/<pkg>/*.py + tests/test_*.py + README + pyproject.toml),
js_repo (src/*.js + tests/*.test.js + README +
package.json, ES6), java_repo (src/main/java/.../*.java +
src/test/java/.../*Test.java + README + pom.xml).
Module:
src/demo_code.py.
Router:
api/routers/demo_code.py.
UI tab:
ui/src/app/demo-data/CodeTab.tsx.
Per-type cap: 50 — but each "count" is a repo, not a file. A repo is ~25–35 files, so the cap maps to ≈1,500 source files per type. The cap exists because building the per-repo file set has non-trivial cost.
direct_table is one row per source FILE with content STRING
inline. Embeddings work at the file level (not the repo level) so
code-search demos can ingest directly:
SELECT repo_name, file_path, content
FROM demo_code
WHERE language = 'python' AND content LIKE '%def __init__%';
API reference
Every tab exposes the same three endpoints under
/api/generate/demo-{kind} where {kind} ∈
code.
GET /api/generate/demo-{kind}/types
List the registered types and dependency status.
curl $CLXS_HOST/api/generate/demo-documents/types?industry=healthcare
{
"types": [
{"type": "pdf_claim", "category": "PDF", "label": "Healthcare claim form", "extension": "pdf"},
{"type": "pdf_invoice", "category": "PDF", "label": "Medical invoice", "extension": "pdf"},
{"type": "pdf_lab_report","category": "PDF", "label": "Lab report", "extension": "pdf"}
],
"available": true,
"unavailable_reason": null
}
For Documents, pass ?industry=<name> to receive industry-resolved
labels and have industry-incompatible types filtered out. The other
four tabs ignore the parameter.
POST /api/generate/demo-{kind}/preview
Pure-arithmetic estimate. No warehouse round-trip; the UI calls it on every form change.
curl -X POST $CLXS_HOST/api/generate/demo-documents/preview \
-H 'Content-Type: application/json' \
-d '{"types": ["pdf_invoice", "docx_letter"], "counts": {"pdf_invoice": 200, "docx_letter": 50}}'
{
"per_type": [
{"type":"pdf_invoice","category":"PDF", "label":"Invoice", "count":200,"estimated_bytes":3072000,"estimated_seconds":1.2},
{"type":"docx_letter","category":"Word","label":"Business letter", "count":50, "estimated_bytes":768000, "estimated_seconds":0.3}
],
"total_files": 250,
"total_bytes": 3840000,
"estimated_seconds": 1.5,
"unknown_types": []
}
POST /api/generate/demo-{kind}
Submit the job. Returns immediately with {job_id, status: "queued"}. Poll GET /api/clone/{job_id} for progress.
curl -X POST $CLXS_HOST/api/generate/demo-documents \
-H 'Content-Type: application/json' \
-H 'X-Databricks-Model: my-llama-endpoint' \
-d '{
"catalog": "demo_data",
"schema": "unstructured",
"volume": "demo_unstructured",
"destination": "volume_with_catalog",
"industry": "healthcare",
"types": ["pdf_claim", "pdf_lab_report"],
"counts": {"pdf_claim": 100, "pdf_lab_report": 100},
"realistic_content": true,
"ai_token_budget": 100000
}'
The X-Databricks-Model header is Documents-only — the other
four tabs don't draft narrative text, so they don't read it.
Examples
Volume corpus for a RAG demo (Documents)
End-to-end: 500 healthcare claim forms + 500 lab reports, AI-drafted narrative, default token budget, written to a Volume + per-file catalog table.
curl -X POST $CLXS_HOST/api/generate/demo-documents \
-H 'Content-Type: application/json' \
-H 'X-Databricks-Model: my-sonnet-endpoint' \
-d '{
"catalog": "demo_data",
"schema": "rag",
"volume": "claims_corpus",
"destination": "volume_with_catalog",
"industry": "healthcare",
"types": ["pdf_claim", "pdf_lab_report"],
"counts": {"pdf_claim": 500, "pdf_lab_report": 500},
"realistic_content": true
}'
# → {"job_id": "demo-documents-<uuid>", "status": "queued"}
Then point the RAG ingestion at
/Volumes/demo_data/rag/claims_corpus/ and the catalog table at
demo_data.rag.demo_documents.
Direct-table corpus for log analytics (Logs)
50 NGINX access logs × 10,000 lines = 500,000 rows landing one per-line in a single Delta table:
curl -X POST $CLXS_HOST/api/generate/demo-logs \
-H 'Content-Type: application/json' \
-d '{
"catalog": "demo_data",
"schema": "observability",
"destination": "direct_table",
"industry": "retail",
"types": ["nginx_access"],
"counts": {"nginx_access": 50},
"lines_per_file": 10000,
"days_back": 7
}'
Then:
SELECT level, count(*)
FROM demo_data.observability.demo_logs
WHERE log_type = 'nginx_access' AND attrs['status'] LIKE '5%'
GROUP BY level;
UI walkthrough
- Navigate to Operations → Demo Data in the sidebar and pick one of the six unstructured tabs.
- Pick a destination (Volume / Volume + catalog / Direct table). The picker rewires itself automatically.
- Use the catalog / schema / volume picker — pick an existing trio or "Custom name… (create new)" any of the three. The runner creates schemas and volumes on submit if they don't exist.
- Pick an industry (Documents / Knowledge / Logs / Code) — the type checkbox grid relabels and (Documents only) filters.
- Tick the types to generate and the counts for each. The preview line updates as you type.
- (Documents only) Toggle AI mode and adjust the token
budget if you have a Model Serving endpoint or
ANTHROPIC_API_KEYconfigured. - Click Generate — the job submits, the page subscribes to progress, and the toast bar tracks it through completion.
Live Capture tab
Browser webcam → UC Volume + Delta catalog with inline
BINARYbytes.
Live Capture inverts the data flow of the other five tabs: instead of
a synthetic generator on the server building bytes from Pillow /
ffmpeg, the bytes arrive from the user's browser webcam (one HTTP
multipart request per snapshot or video chunk). Each capture is
processed synchronously — uploaded to a Volume and INSERTed into a
single indexed catalog table with the bytes embedded inline as a
BINARY column.
What it produces
Every capture lands as one row in <catalog>.<schema>.<table> (default
demo_capture_catalog) with both a file_path (Volume pointer
for browsable / downloadable bytes) and content BINARY (inline
bytes for SQL-only RAG demos that don't want to round-trip the
Volume). The bytes also exist on the Volume so any tool that prefers
file paths over BLOBs keeps working.
The table schema:
| Column | Type | Notes |
|---|---|---|
capture_id | STRING | UUID hex |
capture_type | STRING | photo or video |
file_path | STRING | /Volumes/<catalog>/<schema>/<volume>/capture/<type>/<YYYY-MM-DD>/<file> |
file_extension | STRING | jpg / webm / mp4 |
size_bytes, width, height, duration_ms | numeric | duration is NULL for photos |
mime_type, industry, captured_at, session_id, submitted_by | metadata | session_id is one-per-tab; submitted_by is best-effort current_user.me() |
caption | STRING | 1 sentence, ≤14 words |
alt_text | STRING | 1 sentence accessibility text, ≤18 words |
summary | STRING | 2–3 sentence scene description |
tags | STRING | 5–8 single-word visual keywords, comma-separated |
detected_text | STRING | OCR of any visible text (signs, screens, whiteboards) |
scene_category | STRING | 1–2 word category (office, lab, outdoor, …) |
content_full | STRING | summary \n\n caption \n\n alt_text \n\n detected_text — queryable RAG projection |
content | BINARY | Raw bytes, inline |
metadata_json | STRING | JSON copy of dimensions / mime / industry / captured_at |
Tables are created with CREATE TABLE IF NOT EXISTS (not OR REPLACE) so captures accumulate across browser sessions. Existing
tables get the newer columns added on next call via ALTER TABLE ADD COLUMN IF NOT EXISTS.
AI mode — one consolidated multimodal call
When AI mode is on and a Databricks Foundation Model is selected
in Settings, every photo capture triggers one multimodal call to
that endpoint asking for all six AI-derived fields (caption,
alt_text, summary, tags, detected_text, scene_category) as
a single JSON blob. The response is parsed locally and any field
missing or malformed falls back to a templated string.
Image bytes are forwarded as base64 inline via the OpenAI-style
image_url content block — the same shape databricks-llama-4-maverick
and databricks-claude-3-7-sonnet accept. Video chunks (webm / mp4)
do not go to the vision endpoint (Llama 4 / Claude vision accept
images, not video); video captures use a metadata-only prompt and
the visual-only fields (detected_text, scene_category) are
forced to "" / "unknown" so SQL aggregates aren't polluted with
hallucinated values.
When AI mode is off (the default) or no Foundation Model is selected, every field uses templated fallbacks and the row still inserts cleanly. No Anthropic API path is exercised by Live Capture — only Databricks Model Serving endpoints listed in your workspace are called.
Description style — Strict vs Permissive
A small segmented control next to the AI mode toggle picks the prompt style:
- Strict (default) — industry-neutral, demographics-neutral.
No gender, age, ethnicity, profession, or industry claims. People
are referred to as
"a person"(or"two people") and only directly-observable features are described (clothing colour, posture, action). Best for accessibility demos and for avoiding the "man-at-desk-in-healthcare-mode → labelled nurse" failure mode. - Permissive — vivid description. Industry priming is back on and the model may describe apparent gender / profession when the scene supports it. Caller has accepted the bias risk.
Defence-in-depth: any unknown style value from the wire (typo, enum
drift) clamps back to strict server-side rather than silently
re-enabling the bias-prone permissive prompt.
Capture modes
| Mode | What happens | Notes |
|---|---|---|
| Take photo | One JPEG via <canvas>.toBlob() per click | Industry default, simplest path |
| Burst photos | Same as Take photo, repeated every N ms | Warning fires under 500 ms (warehouse INSERT load) |
| Record video | MediaRecorder chunks every N ms; each chunk is a separate row | First chunk carries the WebM init/header; subsequent chunks are continuation segments and won't play standalone (concatenate by session_id to reassemble) |
Endpoints
POST /api/capture/init— idempotent volume + table create. Called on tab mount so the first/framedoesn't pay the create cost.POST /api/capture/frame— multipart upload: blob + form fields → Volume upload + INSERT row. Returns the row that was written so the UI can append it to the live "Recent" strip without a follow-up SELECT.GET /api/capture/recent— recent metadata rows for the live UI strip. Never carries the inlineBINARYcontent — the response stays small even when the table has thousands of rows. Filters bysession_idso concurrent browser tabs don't see each other's captures.
UI walkthrough
- Navigate to Operations → Demo Data → Live Capture.
- Pick the catalog / schema / volume trio (a default
demo_unstructuredvolume is created if missing). Optionally override the table name. - Pick an industry — drives templated fallbacks and (Permissive mode only) the AI prompt prime.
- Toggle AI mode on if you want image-grounded captions, summary, tags, OCR, and category. Pick a Foundation Model in Settings if you haven't already.
- Pick Strict or Permissive description style. Strict is the default and is what you want for accessibility / unbiased demos.
- Click Take photo, Burst photos, or Record video. Rows appear in the Recent strip immediately, with the AI summary, scene category, tag chips, and detected text rendered per tile.
SQL — explore captures
-- Most recent captures, with all AI-derived fields
SELECT capture_id, capture_type, scene_category, summary, tags,
detected_text, captured_at, session_id
FROM <catalog>.<schema>.demo_capture_catalog
ORDER BY captured_at DESC
LIMIT 20;
-- Group by scene category (works because Strict mode never
-- pollutes scene_category with hallucinated values on text-only paths)
SELECT scene_category, count(*) AS n
FROM <catalog>.<schema>.demo_capture_catalog
WHERE capture_type = 'photo'
GROUP BY scene_category
ORDER BY n DESC;
-- RAG-style search over the unified content_full projection
SELECT capture_id, summary
FROM <catalog>.<schema>.demo_capture_catalog
WHERE content_full ILIKE '%whiteboard%';
Troubleshooting
- "Internal Server Error" on capture, table existed previously —
the four newer columns (
summary/tags/detected_text/scene_category) need to be added viaALTER TABLE ADD COLUMN IF NOT EXISTS. If the ALTER fails (permission denied, warehouse doesn't support it on that table), the next INSERT fails with "column not found". Check the API log forALTER ADD COLUMN … failedwarnings. Quickest fix: change the table name field to a fresh value, orDROP TABLEand let the next capture recreate it with all 22 columns. - AI says "nurse" when the photo shows a man at a desk — you're on Permissive mode with industry priming on. Switch to Strict in the Description style toggle.
- Video won't play in the notebook — only the first chunk of a
recording session carries the WebM init segment; later chunks are
continuation segments. Concatenate
contentbysession_idordered bycaptured_at, or play the first chunk only.