Continuous Sync
Streaming replication via Structured Streaming jobs — keep replicas fresh without polling. Status: PREVIEW.
Overview
While Incremental Sync is a periodic CDF/version-comparison sweep, Continuous Sync runs as a long-lived Structured Streaming job. Source-side change data feeds into a Spark stream, the stream writes idempotent MERGEs to the destination, and lag between source and replica typically stays under 30 seconds.
Under the hood:
- A Job spec is generated by
src/continuous_sync.py - The runner (
src/continuous_sync_runner.py) executes per-table streams, manages checkpoints, and surfaces lag metrics - The API surface lives in
api/routers/continuous_sync.pyunder/api/continuous-sync
When to use
| Mode | Latency | Cost profile | Use when |
|---|---|---|---|
| Incremental Sync (CDF) | minutes (cron) | warehouse-only on each fire | Most teams. Predictable cost. Easy to reason about. |
| Continuous Sync | seconds | always-on streaming compute | Mirror tables for low-latency dashboards, fraud, freshness-sensitive ML features. |
| DEEP CLONE on schedule | hours | warehouse on each fire | Daily/weekly snapshots; not for live mirrors. |
Configuration
YAML (per-pipeline)
source_catalog: prod
destination_catalog: prod_mirror
schema_name: orders
tables: ["orders", "order_items"] # omit to mirror every table in the schema
checkpoint_volume: /Volumes/clone_audit/streaming/cs_checkpoints
trigger_interval_seconds: 30
allow_schema_evolution: true # pick up new columns without restarting the job
REST submission
curl -X POST $CLXS_HOST/api/continuous-sync \
-H "X-Databricks-Host: $DATABRICKS_HOST" \
-H "X-Databricks-Token: $DATABRICKS_TOKEN" \
-d '{
"source_catalog": "prod",
"destination_catalog": "prod_mirror",
"schema_name": "orders",
"trigger_interval_seconds": 30
}'
The endpoint returns { job_id, run_url } — the run_url opens the underlying Databricks Job in the workspace. From there you control pause/resume/cancel via the standard Jobs UI.
Source-side requirements
- Tables must have Delta Change Data Feed enabled:
ALTER TABLE … SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true'). - Source rows must be uniquely identifiable — either an explicit primary key (preferred) or a composite of columns the runner can hash (
continuous_syncinfers from UC PK constraints first, then falls back toid/ first column withnot null).
The /api/incremental/cdf-check endpoint will tell you whether each table in a schema is ready for continuous sync.
Lifecycle
- Create —
POST /api/continuous-syncbuilds the job spec and submits it. Tags includecreated_by=clone-xs,kind=continuous-sync,source=<src>,destination=<dst>. - Running — the job continuously reads from source CDF and MERGEs to destination. Per-table lag is published as Spark structured-streaming metrics.
- Pause — stop the run from the Jobs UI; the next start resumes from the checkpoint.
- Reattach — if the checkpoint volume is preserved, reattaching after a long pause replays the CDF window the source still retains (24 hours by default — adjust
delta.changeDataFeed.retainCommitsForDays). - Drop —
DELETE /api/continuous-sync/{job_id}cancels the Job and (optionally) cleans up the checkpoint volume.
Failure isolation
Per-table streams run in parallel within the Job. A schema mismatch on one table doesn't kill the others — the failing table's stream is marked as degraded in the per-table status panel and continues to be retried with exponential backoff until either resolved or explicitly dropped from the pipeline spec.
Observability
Continuous Sync writes one row per checkpoint to clone_audit.continuous_sync.checkpoints with:
pipeline_id,source_fqn,dest_fqnlast_committed_version,lag_seconds,rows_mergederror(nullable),degraded(boolean)
Hook this table into your dashboarding stack (or use the Observability page) for SLA alerts.
Related
- Sync — periodic two-way and incremental modes
- Sync Reconciliation — verify replica equivalence
- Data Quality — block cutover on DQ regression