Engineering Blog · Post #108

The Hub That Doesn't Move Data: Cataloging Integration As a First-Class Concern

From: "we have seven integration surfaces (PAS, Excel rater, document repository, policy generation, internal stores, submission/workflow, external), each with its own connector, its own retry path, and its own audit trail, and ops can't tell you in under an hour whether last night's PAS dispatch backlog is BOR-update-related or note-attach-related."

To: "one catalog table that declares each flow, one append-only exchange-log table that records every dispatch, one trace endpoint that walks correlation_ids across systems, one replay endpoint that re-triggers via existing idempotency keys, and one health summary with green/amber/red bands plus SLA breach notifications, all without replacing any of the existing connectors."

How: 4 new tables, one service with a log_exchange helper that's best-effort and never raises, 4 high-traffic services hooked, and 12 default flows seeded on startup.

The Problem

Each of the seven integration surfaces in the workstation has its own connector. (The list below splits submission and workflow into separate entries, F and G.) They all work. They're all audit-trailed. They're all reliable. The problem isn't that they're broken. The problem is that, as a system, they're invisible to ops.

A. Policy system → Policy System Connector + Pas Message + Pas Outbox + PAS adapter framework (pas_adapters/). Cap #7. Idempotency keys, retry queue, dead letter, ack reconciler. Solid.
B. Excel rater → Excel Workbook + System Rater + Rater Session. Input payload + output capture. Solid.
C. Document repository → Submission Document + Storage Provider + folder/efile/naming engines. Multi-storage backends. Solid.
D. Policy generation → manuscript template engine + uw document gen service. Word/PDF round-trip. Solid.
E. Internal data sources → Insured + Broker + Claim + intelligence panels. Read mostly. Solid.
F. Submission details → Submission Queue + intake + ACORD parsing. Solid.
G. Workflow → Workflow Designer + workflow-engine + route rules + holds. Solid.
H. External sources → 18 enrichment providers + Research Assistant module (5 caps). Solid.

But ops calls Tuesday afternoon: "PAS is failing dispatches; what's stuck?" Engineering opens five tabs — Pas Outbox, Email Outbox, AIDraft Outbox, the SharePoint sync log, the broker-portal-inbound log — and pieces together a picture. It takes 35 minutes. The picture answers Tuesday's question. Wednesday, ops has a different question. 35 minutes again.

Two naïve framings:

Build a new ETL pipeline that aggregates all seven surfaces into a warehouse. Now there's an eighth integration surface with its own pipeline, its own freshness lag, and its own truth divergence. The data warehouse becomes ops' view of what actually happened, but it's hours behind, and troubleshooting needs a real-time view.
Add a global "integration log" table and rewrite all seven connectors to write to it. Now every connector has a hard runtime dependency on the global log; if the log write fails, the connector fails; the integrations get more fragile, not less.

The right framing: the hub doesn't move data and isn't on the critical path. It catalogs flows declaratively, accepts best-effort log writes from connectors that already work, builds a trace by walking correlation_ids across logs, replays through the connectors' existing idempotency keys, and watches SLAs without owning them. If the hub falls over, every connector keeps working. The hub is a read-side observability layer plus a thin write-side trigger for replay.

The InsightUW Approach

graph TD
  subgraph Existing["Existing connectors (unchanged)"]
    PAS["PAS adapter framework + Pas Message / Pas Outbox"]
    Email["Email Outbox + uw email platform service"]
    AID["AIDraft Outbox + uw email ai draft service"]
    DOC["Submission Document + Storage Provider"]
    INT["uw intelligence coordinator + Intelligence Fetch Request"]
    News["uw news coordinator + news adapters"]
  end
  subgraph Hub["Data Integration Hub (new)"]
    Flow[("Data Flow (catalog)")]
    Contract[("Data Contract (versioned schema)")]
    LOG[("Data Exchange Log (append-only audit)")]
    WEB[("Webhook Endpoint (inbound webhook inventory)")]
  end
  subgraph Service["uw data integration service"]
    LX["log exchange (best-effort, never raises)"]
    HS["health summary"]
    TR["trace(correlation id)"]
    RE["replay outbound"]
    SLA["sla check (sweeper)"]
  end
  subgraph UI["UI"]
    Dash["integrations-dashboard /uw/admin/integrations"]
    DET["data-flow-detail /uw/admin/integrations/:flow code"]
    TRV["data-trace-viewer /uw/trace/:correlation id"]
  end
  PAS -.->|try/except: pass| LX
  Email -.->|try/except: pass| LX
  AID -.->|try/except: pass| LX
  DOC -.->|try/except: pass| LX
  INT -.->|try/except: pass| LX
  News -.->|try/except: pass| LX
  LX --> LOG
  Flow --> HS
  LOG --> HS
  LOG --> TR
  LOG --> RE
  RE --> PAS
  HS --> Dash
  HS --> SLA
  TR --> TRV
  LOG --> DET
  Flow --> DET
  Contract --> DET

The seven connectors stay exactly as they were. The hub watches.

Data Flow — declarative catalog

12 default flows seeded on startup: pas.policy_create (critical, sla=60min), pas.bor_update (critical, sla=1440min), pas.note_attach, excel_rater.input_payload (critical), excel_rater.output_capture, sharepoint.doc_upload, doc_intake.email_inbound (critical, sla=60min, webhook), enrichment.duns_lookup, intelligence.pulse_360, intelligence.news_fetch, ai_draft.email_compose, broker_portal.message_in (sla=30min, webhook).

Admins can add rows; the hub picks them up automatically. Adding a new tracked flow = one row.
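A minimal sketch of what "declarative catalog, seeded on startup" could look like. The DataFlow field names and the seed_default_flows helper are assumptions for illustration, not the actual schema. Only three of the twelve defaults are shown, chosen because their direction is stated elsewhere in this post.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DataFlow:
    """One catalog row: a declaration of a flow, not a pipeline (hypothetical shape)."""
    flow_code: str
    direction: str                      # "outbound" or "inbound"
    is_critical: bool = False
    sla_minutes: Optional[int] = None   # None: no SLA is watched for this flow
    is_webhook: bool = False


# Three of the defaults listed above, with the SLA values given in the post.
DEFAULT_FLOWS = [
    DataFlow("pas.policy_create", "outbound", is_critical=True, sla_minutes=60),
    DataFlow("doc_intake.email_inbound", "inbound", is_critical=True,
             sla_minutes=60, is_webhook=True),
    DataFlow("broker_portal.message_in", "inbound", sla_minutes=30, is_webhook=True),
]


def seed_default_flows(catalog: Dict[str, DataFlow]) -> int:
    """Insert missing defaults only, so restarts never overwrite admin-edited rows."""
    added = 0
    for flow in DEFAULT_FLOWS:
        if flow.flow_code not in catalog:
            catalog[flow.flow_code] = flow
            added += 1
    return added
```

Seeding by "insert if missing" keeps startup idempotent: running it a second time adds nothing.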

Data Exchange Log — append-only audit

One row per dispatch. Pointer columns (pas message guid, email message guid, and so on) let the hub navigate back to the source-of-truth row in the originating system without duplicating payload content. The table is append-only: no updates, no deletes.
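As an illustration, an exchange-log row and an append-only store might be modeled like this. The column names (pointer_kind, pointer_guid, log_guid, occurred_at) and the in-memory ExchangeLog class are hypothetical stand-ins for the real table. The point is the shape: a pointer to the source row rather than a payload copy, and no update or delete path.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)   # frozen: a row cannot be mutated once created
class ExchangeLogRow:
    flow_code: str
    direction: str                      # "outbound" or "inbound"
    outcome: str                        # "success", "partial", or "failure"
    correlation_id: str
    pointer_kind: str                   # which system owns the source row, e.g. "pas"
    pointer_guid: str                   # key of the source-of-truth row; no payload copy
    latency_ms: Optional[int] = None
    error_message_summary: Optional[str] = None
    log_guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ExchangeLog:
    """Append-only store: exposes append and read, and nothing else."""

    def __init__(self) -> None:
        self._rows: List[ExchangeLogRow] = []

    def append(self, row: ExchangeLogRow) -> None:
        self._rows.append(row)

    def rows(self) -> Tuple[ExchangeLogRow, ...]:
        return tuple(self._rows)        # read-only snapshot
```

Under this model a replay never edits the failed row; it appends a new row that carries the same correlation_id.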

log exchange — best-effort, never raises

The helper wraps its whole body in a bare except, and that is the point. If the hub's database is down, if the schema is wrong, if the disk is full, the integration that called log_exchange doesn't notice. The hub goes dark; the connectors keep working.

This is what "hub isn't on the critical path" means in code.
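A hedged sketch of a best-effort helper in that spirit. The write_row parameter is a hypothetical stand-in for whatever actually inserts into the exchange log. This version catches Exception rather than using a bare except, so KeyboardInterrupt and SystemExit still propagate; that is a deliberate difference from the bare except described above.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("data_integration")


def log_exchange(write_row: Callable[[Dict[str, Any]], None], **fields: Any) -> bool:
    """Best-effort log write: True if the row was written, False otherwise. Never raises."""
    try:
        write_row(dict(fields))
        return True
    except Exception:
        # Swallow everything: a hub outage must not fail the calling connector.
        try:
            logger.debug("log_exchange dropped a row", exc_info=True)
        except Exception:
            pass
        return False
```

A failing writer shows the contract: the caller gets False back instead of an exception.

```python
def broken_writer(row):
    raise RuntimeError("hub database is down")

log_exchange(broken_writer, flow_code="pas.policy_create", outcome="success")  # returns False
```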

Hooks — four high-traffic services in Phase 1

Each hook is a one-time addition wrapped in try/except. Phase 1 covered the 4 highest-traffic dispatchers; the remaining surfaces (PAS quote/policy/amendment, document upload, SharePoint sync, broker portal inbound) are ~3 lines each. Adding more is incremental.
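To make the "few lines per hook" claim concrete, here is one way a dispatcher could carry the hook. dispatch_with_hook, the send callable, and the in-memory HUB_ROWS list are all illustrative assumptions; the log_exchange here is a stand-in for the hub helper.

```python
import time
from typing import Any, Callable, Dict, List

HUB_ROWS: List[Dict[str, Any]] = []    # stand-in for the exchange-log table


def log_exchange(**fields: Any) -> None:
    """Stand-in for the hub helper."""
    HUB_ROWS.append(fields)


def dispatch_with_hook(send: Callable[[dict], None], message: dict,
                       correlation_id: str) -> None:
    """Existing dispatch logic plus the hook; `send` is the connector's own call."""
    started = time.monotonic()
    outcome = "success"
    try:
        send(message)                   # unchanged connector behaviour
    except Exception:
        outcome = "failure"
        raise                           # the connector's own retry path still sees the error
    finally:
        # The hook itself, wrapped so it can never break the dispatch.
        try:
            log_exchange(flow_code="pas.policy_create", outcome=outcome,
                         correlation_id=correlation_id,
                         latency_ms=int((time.monotonic() - started) * 1000))
        except Exception:
            pass
```

The connector's exception still reaches its caller, so existing retry and dead-letter handling is untouched; the hub only records what happened.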

health summary — what ops asks for daily

The dashboard reads the health summary. Critical flows are starred at the top, each with a green/amber/red pill and p95 latency. A window selector flips between 1h / 6h / 24h / 3d / 7d. One round-trip; no scrolling through five tables.
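A rough sketch of the per-flow arithmetic behind such a summary, using the band thresholds this post gives (green at 99% and above, amber at 95% and above, red below). The function names, the nearest-rank percentile method, and treating an empty window as "idle" are assumptions; the "stale" band is not modeled.

```python
import math
from typing import Dict, List, Optional


def band_for(success_rate: Optional[float]) -> str:
    if success_rate is None:
        return "idle"                   # assumption: no rows in the window
    if success_rate >= 0.99:
        return "green"
    if success_rate >= 0.95:
        return "amber"
    return "red"


def p95(latencies_ms: List[int]) -> Optional[int]:
    """Nearest-rank 95th percentile; None when there are no samples."""
    if not latencies_ms:
        return None
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))   # 1-based rank
    return ordered[rank - 1]


def summarize(rows: List[Dict]) -> Dict:
    """Summarize one flow's rows for an already-filtered time window."""
    total = len(rows)
    success = sum(1 for r in rows if r["outcome"] == "success")
    failure = sum(1 for r in rows if r["outcome"] == "failure")
    rate = success / total if total else None
    latencies = [r["latency_ms"] for r in rows if r.get("latency_ms") is not None]
    return {"total": total, "success": success, "failure": failure,
            "success_rate": rate, "band": band_for(rate), "p95_ms": p95(latencies)}
```

With 47 rows of which 31 succeeded, the rate is 31/47, about 66%, which falls in the red band.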

trace — the cross-system timeline

/uw/trace/sub-2026-ex-0188 returns a chronological timeline — intake → enrichment → rater → quote → PAS push → email — across all 12 flows. Embedded into the submission workspace via <app-data-trace-viewer [correlationId]="submission.guid" [embedded]="true">.
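Stripped to its core, a trace is a filter on correlation_id followed by a chronological sort. The sketch below assumes log rows are dicts with correlation_id and occurred_at keys (hypothetical names), where occurred_at is any value that sorts chronologically.

```python
from typing import Dict, Iterable, List


def trace(rows: Iterable[Dict], correlation_id: str) -> List[Dict]:
    """Return every logged exchange for one correlation_id, oldest first."""
    matching = [r for r in rows if r["correlation_id"] == correlation_id]
    return sorted(matching, key=lambda r: r["occurred_at"])


def format_timeline(rows: List[Dict]) -> str:
    """Render a trace as one line per exchange for quick reading."""
    return "\n".join(
        f'{r["occurred_at"]}  {r["flow_code"]}  {r["outcome"]}' for r in rows)
```

Because every flow writes to the same log, one query covers all systems; no per-connector log needs to be consulted.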

replay outbound — re-trigger via existing idempotency keys

The hub doesn't have its own retry logic. Each connector already has its idempotency strategy: PAS uses a sha256 key; email uses outbox + retry policy; AI drafts have their own idempotency key. Replay re-queues; the connector dedupes; nothing happens twice.

For inbound flows — broker portal inbound, doc intake email inbound — the hub refuses replay with a clear error. The source provider has to re-send; the hub captures whatever lands.
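The replay rules above could be sketched as follows. ReplayRefused, the flows and requeuers mappings, and the pointer_guid key are hypothetical names. Each requeuer stands in for a connector's existing requeue entry point, which is where deduplication actually happens.

```python
from typing import Callable, Dict


class ReplayRefused(Exception):
    """Raised when a log row cannot be replayed from the hub side."""


def replay_outbound(log_row: Dict, flows: Dict[str, str],
                    requeuers: Dict[str, Callable[[str], None]]) -> str:
    """Re-queue one logged dispatch through its connector's existing path.

    flows maps flow_code to direction; requeuers maps flow_code to the
    connector's own requeue callable, which takes the pointer guid.
    """
    flow_code = log_row["flow_code"]
    if flows.get(flow_code) != "outbound":
        raise ReplayRefused(
            f"{flow_code} is not an outbound flow; ask the source provider to re-send")
    requeue = requeuers.get(flow_code)
    if requeue is None:
        raise ReplayRefused(f"no requeue path registered for {flow_code}")
    requeue(log_row["pointer_guid"])    # the connector dedupes via its idempotency key
    return "requeued"
```

The hub holds no retry state of its own here: it looks up the connector's path and hands over the pointer.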

sla check — the SLA breach sweeper

The default sweep interval is DATA_INTEGRATION_SLA_INTERVAL_SECONDS=1800 (30 minutes). Disable the sweeper via DATA_INTEGRATION_SLA_DISABLED=1. The data integration sla breach notification is deduplicated per day: a flow raises at most one breach notification per calendar day.

Notifications land under a new "Integrations" inbox tab.
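A sketch of one sweep with same-day dedup. The two environment variable names come from this post; everything else is assumed, including the breach rule itself (no success inside the SLA window), the function signature, and skipping flows that have never succeeded.

```python
import os
from datetime import date, datetime, timedelta
from typing import Dict, List, Optional, Set, Tuple

SLA_INTERVAL_SECONDS = int(os.environ.get("DATA_INTEGRATION_SLA_INTERVAL_SECONDS", "1800"))
SLA_DISABLED = os.environ.get("DATA_INTEGRATION_SLA_DISABLED") == "1"


def sla_check(flows: Dict[str, int],
              last_success_at: Dict[str, Optional[datetime]],
              already_notified: Set[Tuple[str, date]],
              now: datetime,
              disabled: bool = SLA_DISABLED) -> List[str]:
    """Return flow codes that breach now and have not yet notified today."""
    if disabled:
        return []
    breaches: List[str] = []
    for flow_code, sla_minutes in flows.items():
        last = last_success_at.get(flow_code)
        if last is None:
            continue                    # assumption: never-seen flows are idle, not breached
        if now - last <= timedelta(minutes=sla_minutes):
            continue                    # a success landed inside the SLA window
        key = (flow_code, now.date())
        if key in already_notified:
            continue                    # same-day dedup
        already_notified.add(key)
        breaches.append(flow_code)
    return breaches
```

A scheduler would call sla_check every SLA_INTERVAL_SECONDS and turn each returned flow code into one notification.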

Worked Example: Tuesday Afternoon Triage

2:14 PM. Ops Slack: "PAS dispatches look stuck. What's happening?"

The on-call engineer opens /uw/admin/integrations. The dashboard shows 12 flow rows.

pas.policy_create — critical · last 24h · 47 total · 31 success · 16 failure · success_rate 66% · band RED · p95 4200ms · last_success_at 1h 12m ago.

That's the one.

She clicks the row. data-flow-detail for pas.policy_create loads:

Catalog metadata: outbound, source_system=workstation, target_system=pas:dragon, idempotency_strategy="sha256(connector|policy_guid|type|version)", owner_team=integration_eng.
Recent failures (16):
All 16 land in the last 1h 18m
Error class: all ConnectionError
Error message summary (truncated): "Connection refused: pas-dragon-prod.example.com:443"
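The idempotency_strategy string in the catalog metadata above can be read as a recipe. A minimal sketch of computing such a key follows; the function name, the UTF-8 encoding, and the hex digest format are assumptions, since the post only gives the field order.

```python
import hashlib


def pas_idempotency_key(connector: str, policy_guid: str,
                        msg_type: str, version: int) -> str:
    """sha256 over the pipe-joined fields named in idempotency_strategy."""
    raw = "|".join([connector, policy_guid, msg_type, str(version)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def enqueue_once(outbox: dict, key: str, message: dict) -> bool:
    """Dedupe on the key: a repeated enqueue of the same key is a no-op."""
    if key in outbox:
        return False
    outbox[key] = message
    return True
```

Because the key is derived only from the message's identity, replaying the same message yields the same key, and the outbox can recognize it as a duplicate.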

She doesn't need to open five tabs. The integration with the policy admin system is unhealthy, not the workstation. She pages the policy admin system team, who were rolling out a deploy and now roll it back. Six minutes later, at 2:20 PM, she hits "Run replay" on the most recent failure log row. The hub re-queues via uw_pas_dispatch_service.requeue, and the existing PAS outbox idempotency keys dedupe. None of these messages landed successfully, and the queue is the source of truth, not the failed log. The 16 messages re-dispatch, and 16 successful acks come back over the next 4 minutes.

The dashboard's pas.policy_create row flips back to green at the next refresh. Over the last 24h the success rate now reads 47/47 (100%), counting only the in-window rows once the replay has filtered out the failed ones.

3:12 PM. Different question, same dashboard. A user reports their submission seems to have lost an enrichment row. The on-call hits /uw/trace/sub-2026-ex-0188:

Time	Flow	Direction	Outcome	Pointer	Latency
09:14:02	doc_intake.email_inbound	inbound	success	email:m-9912	—
09:14:11	enrichment.duns_lookup	inbound	success	fetch:f-3344	411ms
09:14:13	intelligence.pulse_360	inbound	success	fetch:f-3345	624ms
09:14:15	intelligence.news_fetch	inbound	partial	fetch:f-3346	8120ms
09:14:21	excel_rater.input_payload	outbound	success	other:r-7711	218ms
09:14:24	excel_rater.output_capture	inbound	success	other:r-7712	412ms
09:14:30	ai_draft.email_compose	outbound	success	ai:d-2201	2110ms
09:15:01	pas.policy_create	outbound	success	pas:p-9981	4101ms

The intelligence.news_fetch row is partial — one of three news adapters failed (error_message_summary: "newsapi rate-limited"), the other two succeeded. The user's missing row was a NewsAPI item that didn't make it on this fetch. The on-call clicks "Replay" on the news_fetch row; the hub re-queues via the news coordinator's existing path; the next pull picks up the missing items.

Both incidents resolved with the dashboard + the trace + the replay. No grep across five logs.

When the hub itself is unavailable

A different scenario: the hub's database is having a slow night. Every connector continues working: PAS dispatches succeed, emails send, drafts generate, news fetches run. Each log_exchange call fails silently (the bare except). The dashboard goes stale; the trace shows fewer rows; the replay button still works on rows already logged, but dispatches made during the outage have no rows to replay.

When the hub recovers, future dispatches log normally. The window of missing visibility is exactly the duration of the hub outage. Connectors lost zero throughput. Operations lost a few hours of observability — recoverable cost.

This is the design's point: the hub's failure mode is less observability, not less integration.

What's Deferred (Phase 2)

Contract editor UI. Today, Data Contract rows are written via publish contract API; no UI. Field mappings and validation rules live in JSON. Phase 2: a contract editor that diffs versions and shows breaking-change risk.
Webhook registry UI. Same shape — Webhook Endpoint is API-managed today. UI for inbound-webhook secrets rotation + per-endpoint health is straightforward.
Replay button polish. Today, "Replay" works for outbound and refuses for inbound. Phase 2: per-flow replay rules (e.g., "replay only the last failure," "replay all in last hour"), batch replay, dry-run.
Universal correlation_id propagation. Today, every hooked path passes correlation id explicitly (usually submission.guid or insured.guid). Phase 3: a request-scoped correlation context that auto-tags every log without each caller threading it through. Reduces hook ceremony from 3 lines to 1.
Contract-validation hook at parse step. Today, contracts document field maps but aren't enforced at runtime. Phase 2: an inbound-payload validator that compares against the active contract version and writes outcome='partial' with field-level errors.
Per-flow alert routing. Today, data integration sla breach lands in the "Integrations" inbox tab. Phase 2: per-flow severity, per-flow recipient list, escalation chains.
Long-tail flow coverage. Phase 1 hooked 4 services. Remaining surfaces — PAS quote/policy/amendment, document upload, SharePoint sync, broker portal inbound — are 3-line hooks each. Mechanical; deferred for incremental rollout.
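For the deferred request-scoped correlation context, one possible shape uses Python's contextvars module. This is a speculative sketch of a design that does not exist yet; correlation_scope, the ROWS list, and this log_exchange variant are all hypothetical.

```python
import contextvars
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

_correlation_id = contextvars.ContextVar("correlation_id", default=None)

ROWS: List[Dict[str, Any]] = []         # stand-in for the exchange-log table


@contextmanager
def correlation_scope(correlation_id: str) -> Iterator[None]:
    """Tag every log written inside the block with this correlation_id."""
    token = _correlation_id.set(correlation_id)
    try:
        yield
    finally:
        _correlation_id.reset(token)    # restore the previous value on exit


def log_exchange(**fields: Any) -> None:
    """Best-effort write that fills correlation_id from context when omitted."""
    try:
        fields.setdefault("correlation_id", _correlation_id.get())
        ROWS.append(fields)
    except Exception:
        pass
```

A request handler would open correlation_scope(submission.guid) once, and hooks further down the call stack could then drop the explicit correlation id argument.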

What This Means for Underwriters and Ops

The hub doesn't move data. It catalogs, observes, traces, and replays. Every existing connector keeps its idempotency, its retry, its dead-letter. The hub watches.
log exchange is best-effort, always. Wrapped in try/except at every call site. If the hub's database is down, the connectors don't notice. The hub goes dark; the integrations keep working.
One row per dispatch, append-only. The exchange log is a fact table. No updates, no deletes; replays write new rows referencing the same correlation_id.
Pointer columns, not payload duplication. The log has pas message guid, email message guid, etc. The source-of-truth row stays in the originating system.
Trace by correlation_id. Submission GUID, insured GUID, broker GUID: whatever was set on the dispatch. The trace endpoint walks across all 12 flows and returns the rows in chronological order.
Replay re-queues via existing idempotency keys. The hub doesn't second-guess the connector. PAS dedupes by sha256 key; email dedupes by outbox key; AI draft has its own. Replays are safe by construction.
Inbound replay is refused with a clear error. The hub can't make a remote provider re-send. The error message tells the on-call to contact the source.
Health bands + SLA sweeper. Green ≥99% / amber ≥95% / red <95% / idle / stale. SLA breach fires once per calendar day per flow. Critical-flow stars on the dashboard.
12 default flows seeded on startup. PAS create / update / note-attach, Excel rater in/out, SharePoint, doc intake, enrichment, Pulse, news, AI draft, broker portal. Adding more = one row.
Contracts are versioned schemas (catalog-only in Phase 1). Field maps + validation rules + sample payload are documented; runtime enforcement is Phase 2.
Embeddable trace viewer. <app-data-trace-viewer [correlationId]="submission.guid" [embedded]="true"> drops onto the submission workspace. UWs see the cross-system timeline for their submission without going to the admin dashboard.
Adding a new dispatch path is about three lines: try: log_exchange(...) except: pass. No new tables, no new infrastructure. The hub picks up the new flow once a Data Flow row exists.

What's Next

Three-tier location matching (blog #109) is a much narrower piece — the algorithm behind YoY exposure comparisons. But it shares a design tension with the hub: how do you keep a deterministic system useful when the data is dirty? The hub's answer is "log everything and let humans replay." The matcher's answer is different.

Want to see how InsightUW gives ops a single dashboard for seven integration surfaces without rebuilding any of them? Request a demo.