Skip to main content

Continuous Sync

Streaming replication via Structured Streaming jobs — keep replicas fresh without polling. Status: PREVIEW.

Overview

While Incremental Sync is a periodic CDF/version-comparison sweep, Continuous Sync runs as a long-lived Structured Streaming job. Source-side change data feeds into a Spark stream, the stream writes idempotent MERGEs to the destination, and lag between source and replica typically stays under 30 seconds.

Under the hood:

When to use

ModeLatencyCost profileUse when
Incremental Sync (CDF)minutes (cron)warehouse-only on each fireMost teams. Predictable cost. Easy to reason about.
Continuous Syncsecondsalways-on streaming computeMirror tables for low-latency dashboards, fraud, freshness-sensitive ML features.
DEEP CLONE on schedulehourswarehouse on each fireDaily/weekly snapshots; not for live mirrors.

Configuration

YAML (per-pipeline)

source_catalog: prod
destination_catalog: prod_mirror
schema_name: orders
tables: ["orders", "order_items"] # omit to mirror every table in the schema
checkpoint_volume: /Volumes/clone_audit/streaming/cs_checkpoints
trigger_interval_seconds: 30
allow_schema_evolution: true # pick up new columns without restarting the job

REST submission

curl -X POST $CLXS_HOST/api/continuous-sync \
-H "X-Databricks-Host: $DATABRICKS_HOST" \
-H "X-Databricks-Token: $DATABRICKS_TOKEN" \
-d '{
"source_catalog": "prod",
"destination_catalog": "prod_mirror",
"schema_name": "orders",
"trigger_interval_seconds": 30
}'

The endpoint returns { job_id, run_url } — the run_url opens the underlying Databricks Job in the workspace. From there you control pause/resume/cancel via the standard Jobs UI.

Source-side requirements

  • Tables must have Delta Change Data Feed enabled: ALTER TABLE … SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true').
  • Source rows must be uniquely identifiable — either an explicit primary key (preferred) or a composite of columns the runner can hash (continuous_sync infers from UC PK constraints first, then falls back to id / first column with not null).

The /api/incremental/cdf-check endpoint will tell you whether each table in a schema is ready for continuous sync.

Lifecycle

  1. CreatePOST /api/continuous-sync builds the job spec and submits it. Tags include created_by=clone-xs, kind=continuous-sync, source=<src>, destination=<dst>.
  2. Running — the job continuously reads from source CDF and MERGEs to destination. Per-table lag is published as Spark structured-streaming metrics.
  3. Pause — stop the run from the Jobs UI; the next start resumes from the checkpoint.
  4. Reattach — if the checkpoint volume is preserved, reattaching after a long pause replays the CDF window the source still retains (24 hours by default — adjust delta.changeDataFeed.retainCommitsForDays).
  5. DropDELETE /api/continuous-sync/{job_id} cancels the Job and (optionally) cleans up the checkpoint volume.

Failure isolation

Per-table streams run in parallel within the Job. A schema mismatch on one table doesn't kill the others — the failing table's stream is marked as degraded in the per-table status panel and continues to be retried with exponential backoff until either resolved or explicitly dropped from the pipeline spec.

Observability

Continuous Sync writes one row per checkpoint to clone_audit.continuous_sync.checkpoints with:

  • pipeline_id, source_fqn, dest_fqn
  • last_committed_version, lag_seconds, rows_merged
  • error (nullable), degraded (boolean)

Hook this table into your dashboarding stack (or use the Observability page) for SLA alerts.