02. Ingestion Patterns That Do Not Lie

⏱️ Estimated time: 23 min | Level: intermediate

ELI5 callback: In the car factory, the loading dock sets arrival rhythm, the conveyor belt sets work rhythm, the showroom exposes finished output, the reject bin protects trust, and the manifest explains every move. This file teaches which gate should receive which kind of raw data.

Match the pattern to source behavior

See. Ingestion starts by respecting how the source actually emits change.

Databases record inserts, updates, and deletes in a transaction log; applications surface only the changes their code chooses to emit.

Event streams already speak in append-only language.

SaaS APIs often hide rate limits and pagination traps.

File drops are simple, but not necessarily reliable.

Webhooks are fast, but delivery can fail silently.

So what to do?

First classify the source: database, application, partner, or filesystem.

Then classify frequency: continuous, periodic, or human-triggered.

Then classify trust: can you replay from source or not?

CDC is strong when row changes matter.

Event streaming is strong when business events matter.

API polling is acceptable when freshness needs are modest.

File drops help when partner maturity is low.

Webhooks fit notifications better than bulky datasets.

Simple, no?

Choose by behavior, not by brand logo.

Wrong ingest mode creates pain before transformations even begin.
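
The three classification questions above can be written down as a tiny decision helper. This is only a sketch: `SourceProfile`, `suggest_pattern`, and the specific mapping from source kind to pattern are illustrative assumptions, not rules from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    kind: str         # "database", "application", "partner", or "filesystem"
    frequency: str    # "continuous", "periodic", or "human-triggered"
    replayable: bool  # can history be re-read from the source?


def suggest_pattern(profile: SourceProfile) -> str:
    """Return a starting-point ingest pattern. A heuristic, not a rule."""
    if profile.kind == "database":
        return "cdc"
    if profile.kind == "application":
        # Assumed split: continuous emitters stream, everything else is polled.
        return "event-stream" if profile.frequency == "continuous" else "api-polling"
    if profile.kind in ("partner", "filesystem"):
        return "file-drop"
    raise ValueError(f"unknown source kind: {profile.kind!r}")


def must_keep_raw_copy(profile: SourceProfile) -> bool:
    """If the source cannot be replayed, our landing zone is the only replay."""
    return not profile.replayable
```

For example, `suggest_pattern(SourceProfile("partner", "periodic", False))` gives `"file-drop"`, and `must_keep_raw_copy` flags that source as one where the raw copy is the only history you will ever have.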

Know the common paths clearly

Debezium reads database logs without burdening application code.

Kafka handles fan-out and durable buffering well.

Fivetran trades custom control for fast connector setup.

Custom pullers give flexibility, but they also create maintenance debt.

Polling must track cursors, pagination, and retries carefully.
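
A minimal polling loop might look like the sketch below. The `fetch_page` callable and the page shape (`items`, `next_cursor`) are assumptions standing in for whatever the real API returns; only transient `ConnectionError` failures are retried here.

```python
import time
from typing import Callable, Iterator, List, Optional, Tuple

# Assumed page shape: {"items": [...], "next_cursor": "<token>" or None}
Page = dict


def fetch_with_retry(fetch_page: Callable[[Optional[str]], Page],
                     cursor: Optional[str],
                     max_retries: int,
                     backoff_seconds: float) -> Page:
    """Call fetch_page, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(cursor)
        except ConnectionError:
            if attempt == max_retries:
                raise  # give up loudly instead of skipping the page
            time.sleep(backoff_seconds * (2 ** attempt))


def poll_pages(fetch_page: Callable[[Optional[str]], Page],
               start_cursor: Optional[str] = None,
               max_retries: int = 3,
               backoff_seconds: float = 1.0) -> Iterator[Tuple[List, Optional[str]]]:
    """Yield (items, next_cursor) one page at a time until the API runs out."""
    cursor = start_cursor
    while True:
        page = fetch_with_retry(fetch_page, cursor, max_retries, backoff_seconds)
        # The caller should persist next_cursor only after landing the items,
        # so a crash replays this page instead of losing it.
        yield page["items"], page["next_cursor"]
        cursor = page["next_cursor"]
        if cursor is None:
            return
```

Yielding the cursor alongside each page keeps the "land first, then advance" decision with the caller. The price is a possible duplicate page after a crash, which the consumer must tolerate.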

File ingestion must validate naming, schema, and completeness.
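
As a rough illustration, the three checks can run before a file is accepted. The filename convention, the column list, and the idea that `expected_rows` comes from a partner-supplied manifest are all hypothetical.

```python
import csv
import io
import re
from typing import List

FILENAME_RE = re.compile(r"^orders_(\d{8})\.csv$")     # hypothetical naming rule
EXPECTED_COLUMNS = ["order_id", "amount", "currency"]  # hypothetical schema


def validate_drop(filename: str, content: str, expected_rows: int) -> List[str]:
    """Return a list of problems; an empty list means the file may be landed."""
    problems = []

    # Naming: catches files dropped in the wrong folder or with the wrong date stamp.
    if not FILENAME_RE.match(filename):
        problems.append(f"bad filename: {filename}")

    # Schema: the header must match the agreed columns exactly.
    reader = csv.reader(io.StringIO(content))
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {header}")

    # Completeness: compare data rows against the count the sender declared.
    row_count = sum(1 for row in reader if row)
    if row_count != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {row_count}")

    return problems
```

Returning every problem at once, rather than stopping at the first, gives the partner one complete report per bad file.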

Webhooks need signature checks and idempotent consumers.
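
Both requirements fit in a few lines. The sketch assumes the sender signs the raw body with HMAC-SHA256 and a shared secret, and that each event carries a unique `id` field; real providers differ in header names and signing details. The in-memory set stands in for a durable store.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_hex)


class IdempotentConsumer:
    """Accepts each event id once; retries from the sender become no-ops."""

    def __init__(self) -> None:
        self._seen = set()  # assumption: production would use a durable store
        self.landed = []

    def handle(self, secret: bytes, body: bytes, received_hex: str) -> str:
        if not signature_is_valid(secret, body, received_hex):
            return "rejected"
        event = json.loads(body)
        event_id = event["id"]  # assumed field name
        if event_id in self._seen:
            return "duplicate"
        self._seen.add(event_id)
        self.landed.append(event)
        return "accepted"
```

Note the order: verify the signature against the raw bytes first, and parse the JSON only after the sender is trusted.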

Now watch.

Ingestion is not just moving bytes.

┌───────────┐   ┌─────────┐   ┌──────────────┐
│ DB logs   │──▶│ Debezium│──▶│ Kafka / raw  │
└───────────┘   └─────────┘   └──────────────┘
┌───────────┐   ┌─────────┐
│ SaaS API  │──▶│ Poller  │───────────────▶ same raw zone
└───────────┘   └─────────┘
┌───────────┐──▶ Webhook receiver ────────▶ same raw zone

It is also preserving intent, order, and replay options.

A durable landing zone makes later recovery much easier.

Teams regret skipping raw retention.

Without retention, bad logic becomes permanent data loss.

With retention, you can rebuild after schema mistakes.

Land raw first when feasible.

Normalize later when understanding improves.

Ingest metadata together with payload.

Future debugging will thank you.
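
One way to keep metadata with the payload is to wrap every arrival in an envelope before it touches the raw zone. The field names below are assumptions, not a standard. The payload is base64-encoded so the original bytes survive unchanged.

```python
import base64
import hashlib
from datetime import datetime, timezone
from typing import Optional


def wrap_raw(payload: bytes,
             source: str,
             pattern: str,
             source_id: Optional[str] = None,
             source_ts: Optional[str] = None,
             received_at: Optional[datetime] = None) -> dict:
    """Build a JSON-serializable raw record without altering the payload bytes."""
    received_at = received_at or datetime.now(timezone.utc)
    return {
        "source": source,            # which system sent this
        "pattern": pattern,          # "cdc", "polling", "webhook", "file-drop", ...
        "source_id": source_id,      # the source's own identifier, if it has one
        "source_ts": source_ts,      # the source's own timestamp, kept as sent
        "received_at": received_at.isoformat(),
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "payload_b64": base64.b64encode(payload).decode("ascii"),
    }
```

The checksum lets a later replay prove it is reading the same bytes that arrived, and `received_at` next to `source_ts` makes late or out-of-order arrivals visible.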

Ordering, duplicates, and backpressure are normal

Sources lie more than brochures admit.

APIs resend pages, webhooks retry, and CDC connectors can re-emit events after a restart or rebalance.

So duplicates are normal, not exceptional.

Carry source identifiers whenever possible.

Carry source timestamps whenever possible.

Ordering guarantees vary by partition and protocol.

Do not promise global order unless you truly have it.
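
A small sketch of per-partition thinking: it assumes every event carries a partition key and a sequence number that only increases within that key (both hypothetical). Replays are dropped per key, and nothing is claimed about order across keys.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int, str]  # (partition_key, sequence, payload)


def dedupe_in_order(events: Iterable[Event]) -> Iterator[Event]:
    """Drop events whose sequence was already seen for their partition key."""
    last_seen: Dict[str, int] = {}
    for key, seq, payload in events:
        if seq <= last_seen.get(key, -1):
            # A resend, or an older event arriving late. A real pipeline
            # might route late arrivals for inspection instead of dropping.
            continue
        last_seen[key] = seq
        yield key, seq, payload
```

This also discards a genuinely late event with a lower sequence. Whether that is acceptable depends on the source, so treat it as a decision to make deliberately.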

See.

Backpressure starts at ingress, not only in processing.

If consumers slow down, queues absorb pain for some time.

After that, lag grows and freshness degrades.

You need alarms on lag, error rate, and schema drift.

Reject unknown fields only when producers understand the contract.

Otherwise route them safely for inspection.
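
The strict-versus-lenient choice could be sketched as a routing function. `KNOWN_FIELDS` and the destination names are hypothetical; `strict` stands for "this producer has agreed to the contract".

```python
from typing import Tuple

KNOWN_FIELDS = {"order_id", "amount", "currency"}  # hypothetical contract


def route(record: dict, strict: bool) -> Tuple[str, dict]:
    """Decide where a record goes based on fields outside the contract."""
    unknown = sorted(set(record) - KNOWN_FIELDS)
    if not unknown:
        return "raw", record
    wrapped = {"record": record, "unknown_fields": unknown}
    if strict:
        # The producer knows the contract, so drift is their bug: fail loudly.
        return "rejected", wrapped
    # The producer may not know the contract: keep the data, flag it for review.
    return "quarantine", wrapped
```

Either way the original record is kept intact next to the list of offending fields, so nothing is lost while someone decides whether the contract or the producer should change.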

The fastest pipeline can still be unusable if observability is weak.

So what to do?

Measure ingest lag separately from transform lag.

That separation reduces confused incident response.
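
Keeping the two lags apart only needs three timestamps per record. In this sketch the function names and the five-minute threshold are made up, and it assumes source and platform clocks are roughly in sync.

```python
from datetime import datetime, timedelta

INGEST_LAG_ALERT = timedelta(minutes=5)  # hypothetical paging threshold


def measure_lag(source_ts: datetime,
                landed_ts: datetime,
                transformed_ts: datetime) -> dict:
    """Split end-to-end delay into the ingest part and the transform part."""
    return {
        "ingest_lag": landed_ts - source_ts,          # source emit -> raw zone
        "transform_lag": transformed_ts - landed_ts,  # raw zone -> cleaned output
    }


def ingest_alert(lags: dict) -> bool:
    """True when ingest alone has breached the threshold."""
    return lags["ingest_lag"] > INGEST_LAG_ALERT
```

If data is eight minutes stale and seven of those minutes sit in `ingest_lag`, the incident belongs to whoever owns the connector, not the transformation jobs.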

Choose the path that minimizes surprise

Use CDC when you need deletes, updates, and low source overhead.

Use event streams when applications can publish clear business facts.

Use polling when APIs are the only contract.

Use file drops when partner maturity is low.

Use webhooks for immediate signals, then fetch details if needed.

Blend patterns inside one platform without shame.

The platform should hide variety behind common contracts.

Those contracts include schema, retention, lineage, and ownership.

Think again using the factory analogy.

The loading dock receives parts from many gates, the conveyor belt standardizes movement, the showroom only wants ready data, the reject bin isolates broken arrivals, and the manifest records source truth.

Simple, no?

Your best ingest design minimizes surprise.

It should replay cleanly after source bugs.

It should fail loudly when contracts break.

It should stay boring during peak load.

Do not over-automate unknown sources on day one.

Stabilize one pattern at a time.

Then scale connector count confidently.

Where this lives in the wild

  • CDC is common for operational databases feeding analytics mirrors.
  • SaaS finance tools usually enter through polling or managed connectors.
  • Partner ecosystems still rely heavily on file drops and SFTP exchanges.
  • Event-driven product telemetry often lands through Kafka or Kinesis.

Pause and recall

  • When is CDC better than application events?
  • Why should raw retention exist before complex normalization?
  • What makes webhook ingestion tricky during retries?
  • Why separate ingest lag from transform lag?

Interview Q&A

Q: When would you avoid polling?
A: When freshness is tight or the API rate limits heavily.
Common wrong answer to avoid: Polling is fine for every source.

Q: Why is raw landing valuable?
A: It preserves replay and audit options after downstream mistakes.
Common wrong answer to avoid: Because storage is fashionable.

Q: What is the main risk with webhooks?
A: Delivery may be duplicated, delayed, or silently dropped.
Common wrong answer to avoid: They are basically the same as Kafka.

Q: Why use a schema contract at ingress?
A: It catches producer drift before corruption spreads downstream.
Common wrong answer to avoid: Schemas only matter in the warehouse.

Apply now (5 min)

Pick one source you know: database, SaaS API, webhook, or file drop.

  • List its replay ability, ordering guarantee, and freshness need.
  • Choose an ingest pattern and one fallback plan.
  • Define the raw landing fields you must always capture.
  • Add one alert you would page on immediately.

Bridge. Data arrives. Now it needs cleaning and reshaping. → 03