
Forward-Deployed / Solutions Engineering — Interview Questions

A fast-growing 2026 interview category. As AI products move from demos to enterprise deployments, companies hire forward-deployed engineers (FDEs) — part systems engineer, part solutions architect — who embed with a customer, integrate the product into their stack, and own the outcome. These loops test a different muscle than the RAG/agents/evals files: messy real-world integration, flaky client systems, adoption, and trust under pressure. The senior tell here is almost always the same — diagnose the actual situation before acting, and treat the customer relationship as part of the system you're debugging.

These questions blend technical design with judgment. Answer them like an engineer who has shipped into someone else's production, not like a consultant.


Deployment design

Q: "A logistics company wants an AI agent to automate shipment rerouting. They have SAP data, real-time weather APIs, and 400 warehouse managers on different regional systems. Walk me through your approach."

Tags: senior · common · design · source: FDE Academy 2026 interview guide; forward-deployed engineer loop

Answer outline:
- Resist designing the agent first. Start with the integration surface and the failure cost: rerouting a shipment is an irreversible, money-moving action, so this is a human-in-the-loop system, not full autonomy on day one.
- Map the data plane: SAP is the system of record (batch or event feed for shipment state), weather APIs are real-time signals, the 400 managers are heterogeneous edge systems with no uniform API. The hard part is the 400 managers, not the LLM.
- Phase it: (1) read-only co-pilot that recommends reroutes with reasoning, manager approves; (2) auto-execute low-risk reroutes within guardrails (cost delta < $X, delivery SLA preserved), escalate the rest; (3) widen autonomy as the eval record earns trust.
- Integration strategy for the 400 systems: don't build 400 connectors. Define one canonical event/command schema; adapt at the edge with thin per-region adapters or a webhook contract; tolerate that some regions will be CSV-over-email for a while.
- Reliability: SAP and the manager systems will be offline sometimes, so use idempotent commands, retries with backoff, a durable queue, and a reconciliation job that detects when the agent's view of the world diverged from SAP's.
- Eval: an offline suite over historical disruptions ("would the agent have rerouted correctly?") plus an online guardrail that maintains delivery SLA without overspending on expedited shipping.
- Numbers to drop: "irreversible action → HIL by default", "one canonical schema + edge adapters, not N connectors", "idempotent commands + reconciliation against system of record", "phase autonomy behind an eval record"
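The "one canonical schema + edge adapters" point can be sketched as below. This is a minimal illustration, not a real SAP or warehouse-system interface: `RerouteCommand`, its field names, the two transport labels, and both adapter functions are hypothetical names chosen for the example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class RerouteCommand:
    """Canonical command. Field names are illustrative, not a real SAP schema."""
    command_id: str   # idempotency key, so re-sending the command is safe
    shipment_id: str
    new_route: str


def to_webhook_payload(cmd: RerouteCommand) -> dict:
    """Adapter for regions that expose an HTTP endpoint (hypothetical payload shape)."""
    return {"id": cmd.command_id, "shipment": cmd.shipment_id, "route": cmd.new_route}


def to_csv_row(cmd: RerouteCommand) -> str:
    """Adapter for regions that are still on CSV-over-email."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([cmd.command_id, cmd.shipment_id, cmd.new_route])
    return buf.getvalue()


# One thin adapter per transport type, not one connector per warehouse.
ADAPTERS = {"webhook": to_webhook_payload, "csv_email": to_csv_row}


def render_for_region(cmd: RerouteCommand, transport: str):
    """The core emits one canonical command; only the edge knows regional formats."""
    return ADAPTERS[transport](cmd)
```

The design choice worth saying out loud in the interview: the agent and the queue only ever see `RerouteCommand`, so onboarding a new region means writing one small adapter and leaving the core untouched.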

Common follow-ups:
- "How do you handle a warehouse system that's offline when the agent needs to act?"
- "When do you let the agent act without a human?"
- "How do you prove to the customer it's safe to widen autonomy?"
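For the follow-up about when the agent may act without a human, the phase-2 guardrail (cost delta below a threshold, delivery SLA preserved, everything else escalated) might look like the sketch below. All type and field names are hypothetical, and the threshold stands in for the "$X" that would be agreed with the customer.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"
    NEEDS_APPROVAL = "needs_approval"


@dataclass(frozen=True)
class RerouteProposal:
    shipment_id: str
    cost_delta_usd: float     # extra cost versus the current route
    meets_delivery_sla: bool  # new ETA still inside the promised window
    reasoning: str            # shown to the manager in the approval step


@dataclass(frozen=True)
class Guardrails:
    autonomy_enabled: bool     # False during phase 1 (read-only co-pilot)
    max_cost_delta_usd: float  # the "$X" threshold, an assumption per customer


def route_proposal(p: RerouteProposal, g: Guardrails) -> Decision:
    """Auto-execute only low-risk reroutes; escalate everything else to a human."""
    if not g.autonomy_enabled:
        return Decision.NEEDS_APPROVAL
    if p.meets_delivery_sla and p.cost_delta_usd < g.max_cost_delta_usd:
        return Decision.AUTO_EXECUTE
    return Decision.NEEDS_APPROVAL
```

Because the guardrails are data, widening autonomy becomes a reviewed config change backed by the eval record, with no code deploy involved.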

Traps:
- Designing the LLM agent loop while hand-waving the 400 heterogeneous systems; that integration is the actual engineering.
- Full autonomy on an irreversible, money-moving action from day one.
- No reconciliation, so the agent's world model silently drifts from SAP.
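To make the reconciliation trap concrete, here is a minimal sketch of a job that compares the agent's view with a snapshot from the system of record. It assumes both views can be reduced to a `shipment_id -> route` mapping, which is a simplification; the function name and report keys are hypothetical.

```python
def reconcile(agent_view: dict[str, str], sap_view: dict[str, str]) -> dict[str, list[str]]:
    """Compare shipment_id -> route in both views and report every divergence.

    The system of record (sap_view) wins; the report tells the operator what to repair.
    """
    agent_ids, sap_ids = set(agent_view), set(sap_view)
    return {
        # Shipments SAP knows about that the agent never saw.
        "missing_in_agent": sorted(sap_ids - agent_ids),
        # Shipments the agent believes exist but SAP does not.
        "unknown_to_sap": sorted(agent_ids - sap_ids),
        # Shipments both know about but disagree on.
        "route_mismatch": sorted(
            sid for sid in agent_ids & sap_ids if agent_view[sid] != sap_view[sid]
        ),
    }
```

Running this on a schedule and alerting on any non-empty bucket turns silent drift into a visible, countable signal.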

Related cross-cutting: Production patterns, Architecture choices
Related module: learning/01_ai_engineering/01_agentic_system_design/, learning/01_ai_engineering/19_tool_integration_contracts/, learning/09_career/07_solution_architecture_presales.md


Q: "Walk through how you'd set up a reliable webhook integration with a client system that frequently goes offline."

Tags: senior · common · design · source: FDE Academy 2026 interview guide

Answer outline:
- Assume the client endpoint is unreliable and design for it instead of hoping. The contract is: at-least-once delivery, idempotent receiver, durable retry.
- Outbound (you → client): enqueue events to a durable queue; deliver with exponential backoff + jitter and a max retry window; after exhaustion, move to a dead-letter queue and alert. Sign payloads (HMAC) so the client can verify.
- Idempotency: every event carries a unique id; the client dedups on it, because at-least-once means they will get duplicates. Document this in the contract; don't assume they handle it.
- Inbound (client → you): same discipline, plus return 2xx fast and process async; if you do work synchronously and they time out, they'll retry and you'll double-process.
- Backpressure & recovery: when the client comes back online after an outage, don't stampede them with a backlog; rate-limit replay. Offer a reconciliation/catch-up endpoint (poll "what did I miss since cursor X") as a fallback to pure push.
- Observability: per-client delivery success rate, retry depth, DLQ size, and an alert when a client has been failing for N minutes, because in FDE work, their outage becomes your escalation.
- Numbers to drop: "at-least-once + idempotent receiver", "exp backoff + jitter, then DLQ", "HMAC-signed payloads", "rate-limited replay on recovery", "reconciliation/cursor endpoint as push fallback"
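The outbound half of the contract (HMAC-signed payloads, exponential backoff with jitter, then a dead-letter queue) can be sketched with the standard library alone. The function names, the default base/cap values, and the retry budget of 8 attempts are assumptions for illustration; the "full jitter" variant shown is one common choice among several.

```python
import hashlib
import hmac
import random


def sign_payload(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 hex signature that the client recomputes to verify the sender."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison, so the check does not leak the signature via timing."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def next_action(attempt: int, max_attempts: int = 8) -> str:
    """Once the retry budget is spent, the event moves to the dead-letter queue."""
    return "retry" if attempt < max_attempts else "dead_letter"
```

The jitter matters as much as the exponent: without it, every event queued during an outage retries on the same schedule and arrives at the client in synchronized waves.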

Common follow-ups:
- "The client can't implement idempotency on their side. Now what?" (you dedup outbound, or expose a pull/cursor API instead of push)
- "How do you avoid overwhelming them when they recover?"
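The pull/cursor fallback mentioned in the first follow-up can be sketched as an append-only log with a position-based cursor. This is an in-memory illustration only; `EventLog` and its methods are hypothetical names, and a real deployment would back this with a durable store.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log. The count of events read so far serves as the cursor."""
    events: list = field(default_factory=list)

    def append(self, payload: dict) -> int:
        """Store an event and return the cursor that points just past it."""
        self.events.append(payload)
        return len(self.events)

    def read_since(self, cursor: int, limit: int = 100):
        """Return up to `limit` events after `cursor`, plus the cursor to send next time."""
        batch = self.events[cursor:cursor + limit]
        return batch, cursor + len(batch)
```

With pull, the client controls its own pace and re-reading from an old cursor is harmless, which sidesteps both the duplicate-delivery problem and the stampede on recovery.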

Traps:
- Fire-and-forget webhooks with no retry or DLQ.
- Synchronous processing on the inbound side → timeouts → duplicate processing.
- Replaying a full backlog at full speed the moment the client recovers.
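The inbound trap (synchronous work, timeout, retry, double-processing) is avoided by acknowledging first and deduplicating on the event id. The sketch below uses an in-memory set and queue purely for illustration; `InboundReceiver` is a hypothetical name, and in production the seen-id store would need to be durable and expire old ids.

```python
import queue


class InboundReceiver:
    """Acknowledge fast, process later, and never enqueue the same event twice."""

    def __init__(self) -> None:
        self._seen: set[str] = set()             # illustration only; use a durable store
        self.work: queue.Queue = queue.Queue()   # drained by a background worker

    def handle(self, event_id: str, payload: dict) -> int:
        """Return a 2xx status immediately; duplicates are acknowledged but dropped."""
        if event_id not in self._seen:
            self._seen.add(event_id)
            self.work.put((event_id, payload))
        return 202  # HTTP 202 Accepted: received, processing happens asynchronously
```

Returning the same 2xx for a duplicate is deliberate: the sender only needs to learn that it can stop retrying.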

Related cross-cutting: Production patterns
Related module: learning/06_system_designing/06_event_driven_distributed_systems/, learning/08_infrastructure_tooling/07_sqs/, learning/08_infrastructure_tooling/08_kafka/


Q: "Design a monitoring system for a multi-tenant SaaS deployment where each client has a different SLA."

Tags: senior · common · design · source: FDE Academy 2026 interview guide; multi-tenant ops probe

Answer outline:
- Tenant is a first-class dimension on every metric and every alert: latency, error rate, cost, quality are all per-tenant, never just global. A healthy global p95 can hide one whale tenant breaching their SLA.
- Encode SLAs as data, not code: a per-tenant config of targets (p95 latency, uptime, quality floor) drives alert thresholds, so a premium tenant's 99.9% and a free tenant's 99% use the same pipeline with different numbers.
- Error budgets per tenant: burn-rate alerts fire when a tenant is consuming its monthly budget too fast, not on every blip. This prevents alert fatigue and ties monitoring to the contractual promise.
- Noisy-neighbor detection: track per-tenant resource consumption so one tenant's traffic spike that degrades others is visible and attributable, and enforce quotas/rate-limits to contain it.
- Cost attribution per tenant (tokens, tool calls, retrieval) so an unprofitable tenant is visible before renewal, and so you can prove value at QBRs.
- Reporting: each tenant gets an SLA dashboard / report they can see; internally, a roll-up that ranks tenants by SLA-risk so the FDE team works the right account.
- Numbers to drop: "tenant_id on every metric", "SLA as per-tenant config driving thresholds", "burn-rate alerts not blip alerts", "per-tenant cost attribution + quotas for noisy neighbors"
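"SLA as data" and "burn-rate alerts" combine naturally: the same function serves every tenant and only the config differs. In the sketch below, `TenantSLA` and both functions are hypothetical names, and the paging threshold of 14.4 is an illustrative fast-burn value that would be tuned per contract and alert window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantSLA:
    tenant_id: str
    availability_target: float  # e.g. 0.999 for premium, 0.99 for free


def burn_rate(sla: TenantSLA, failed: int, total: int) -> float:
    """Observed error rate divided by the error rate the SLA allows.

    A value of 1.0 means the tenant is spending its error budget exactly on schedule.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - sla.availability_target
    return (failed / total) / allowed_error_rate


def should_page(sla: TenantSLA, failed: int, total: int, threshold: float = 14.4) -> bool:
    """Page only when the budget is burning fast, not on every blip (threshold is an assumption)."""
    return burn_rate(sla, failed, total) >= threshold
```

As a worked example: a 2% error rate in the window is roughly a 20x burn for a 99.9% tenant and roughly a 2x burn for a 99% tenant, so the same traffic pages for one account and not the other.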

Common follow-ups:
- "How do you stop one tenant's spike from breaching everyone's SLA?"
- "How do you alert without drowning in per-tenant noise?"
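One way to answer the first follow-up is a per-tenant token bucket, so a spike is throttled at the tenant boundary before it reaches shared capacity. This is a minimal sketch with hypothetical names; the clock is passed in explicitly to keep the example deterministic, where a real service would read a monotonic clock.

```python
class TokenBucket:
    """Per-tenant quota: `rate_per_sec` sustained, bursts up to `capacity`."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity   # start full so a quiet tenant can burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant; the limits would come from the same per-tenant config as the SLA.
buckets = {"tenant_a": TokenBucket(rate_per_sec=1.0, capacity=2.0)}
```

Rejected requests should still be counted per tenant, since the rejection rate is itself the noisy-neighbor signal.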

Traps:
- Global-only dashboards: the breaching tenant is invisible in the aggregate.
- Hard-coding SLA thresholds instead of config-per-tenant.
- No cost attribution: you discover an unprofitable account at renewal.

Related cross-cutting: Production patterns
Related module: learning/06_system_designing/10_security_governance_multi_tenancy/, learning/01_ai_engineering/13_prompt_lifecycle_operations/09-multi-tenant-prompts.md


Trust & adoption under pressure

Q: "A healthcare client deployed your AI platform but adoption is at 12% after 90 days. They're blaming the product. What do you do?"

Tags: senior · common · scenario · source: FDE Academy 2026 interview guide; adoption/judgment probe

Answer outline:
- Low adoption is a symptom with many causes. Don't accept "the product is bad" at face value, and don't reflexively defend the product either. Go find which cause it actually is, with data and conversations.
- Instrument first: who are the 12% who do use it, and what do they do? Where do the other 88% drop off: never logged in, tried once and bailed, or use it for one narrow thing? The funnel tells you whether it's access, onboarding, trust, or fit.
- Talk to non-users, not just champions. Common healthcare-specific blockers: clinicians don't trust outputs without provenance, the workflow adds clicks instead of removing them, it's not in the tool they already live in (the EHR), or training never happened.
- Separate product gaps from deployment gaps. Often it's not the model; it's integration friction, missing SSO, a scary UI, or no executive mandate. The FDE's job is to find the real blocker and fix what's fixable now (integration, workflow, training) while routing genuine product gaps back with evidence.
- Re-baseline success with the client: 12% of whom? Some users aren't the target. Define the activation metric that matters and a 30-day plan to move it.
- The judgment tell: you neither blame the customer nor cave; you replace opinion with funnel data and user interviews, then act.
- Numbers to drop: "adoption funnel: invited → activated → habitual", "interview the 88%, not just the champions", "separate product gaps from integration/workflow/trust gaps", "re-baseline the denominator"
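The "instrument first" and "re-baseline the denominator" steps can be sketched as a small funnel computation over per-user session counts. The stage names, the `habitual_min` cutoff of 8 sessions, and the input shape are all assumptions for illustration; the real activation metric is whatever gets agreed with the client.

```python
from collections import Counter

STAGES = ("never_logged_in", "tried_once", "occasional", "habitual")


def funnel_stage(sessions: int, habitual_min: int = 8) -> str:
    """Bucket one user by session count in the window (cutoffs are assumptions)."""
    if sessions == 0:
        return "never_logged_in"
    if sessions == 1:
        return "tried_once"
    if sessions < habitual_min:
        return "occasional"
    return "habitual"


def adoption_funnel(sessions_by_user: dict[str, int], target_users: set[str]) -> dict[str, int]:
    """Count target users per stage. Users outside the target set are excluded,
    which is the "12% of whom?" re-baselining step."""
    counts = Counter(funnel_stage(sessions_by_user.get(u, 0)) for u in target_users)
    return {stage: counts.get(stage, 0) for stage in STAGES}
```

Each bucket points at a different fix: a large `never_logged_in` count suggests access or mandate problems, while a large `tried_once` count suggests trust or workflow friction, which is where the user interviews should start.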

Common follow-ups:
- "What if the data says the product genuinely doesn't fit their workflow?"
- "How do you push back on a customer who's wrong without losing the account?"

Traps:
- Accepting "the product is bad" and forwarding it to engineering with no diagnosis.
- Defending the product instead of investigating.
- Measuring adoption against the wrong denominator.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/18_human_ai_product_experience/, learning/09_career/07_solution_architecture_presales.md


Q: "You deployed a fix two hours ago. The client reports the same problem is back. They're frustrated and losing confidence. What do you do?"

Tags: senior · common · scenario · source: FDE Academy 2026 interview guide; incident-under-pressure probe

Answer outline:
- Two parallel tracks: the technical fix and the relationship. Mishandle either and you lose the account. Acknowledge fast, set a clear next-update time, and stop promising fixes you haven't verified.
- Technical: "same problem is back" after a fix usually means one of three things: (1) the fix didn't actually deploy / wasn't applied to their environment, (2) you fixed a symptom, not the root cause, or (3) it's a different problem with the same surface symptom. Verify the deploy first (is the new version actually running for them?), then pull a fresh trace from the recurrence and compare to the original.
- Don't fix-forward again blindly; that's what burned the trust. If you can't root-cause quickly, consider a mitigation/rollback to a known state while you diagnose, and tell them that's what you're doing and why.
- Relationship: over-communicate. "Here's what we changed, here's why it didn't hold, here's what we're doing now, next update at [time]." Confidence is rebuilt with a predictable cadence and honesty, not with a heroic silent fix.
- After resolution: a short written root-cause + what prevents recurrence (a regression test that reproduces their case). That's what converts a frustrated client back to a confident one.
- The judgment tell: you separate "did the fix deploy" from "was the fix correct," and you manage the human with cadence, not bravado.
- Numbers to drop: "verify the deploy before re-diagnosing", "fresh trace from the recurrence vs the original", "mitigate/rollback to known-good while diagnosing", "scheduled update cadence rebuilds trust", "regression test that reproduces the client's exact case"
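The three-way diagnosis above has a fixed order, and writing it as a function makes that order explicit. This is a sketch only: the function name, the idea of reducing a trace to a comparable "signature" string, and the returned labels are all hypothetical.

```python
def triage_recurrence(
    running_version: str,
    fixed_version: str,
    recurrence_signature: str,
    original_signature: str,
) -> str:
    """Classify a 'same problem is back' report; check the deploy before re-diagnosing."""
    # (1) Is the fix actually running in the client's environment?
    if running_version != fixed_version:
        return "fix_not_deployed"
    # (2) Same failure signature with the fix live: a symptom was fixed, not the root cause.
    if recurrence_signature == original_signature:
        return "root_cause_not_fixed"
    # (3) Fix is live and the failure looks different underneath.
    return "different_problem_same_symptom"
```

Each outcome implies a different message to the client, which is why the check order matters: redeploying is a very different conversation from reopening the root-cause analysis.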

Common follow-ups:
- "How do you decide between rolling back and pushing another fix?"
- "What do you put in writing afterward?"

Traps:
- Another blind fix-forward without confirming the previous one deployed or root-causing.
- Going silent while you debug; the client's confidence erodes in the silence.
- Treating it as purely technical when half the problem is communication.

Related cross-cutting: Production patterns
Related module: learning/01_ai_engineering/05_ai_incident_operations/, learning/02_ai_infrastructure/06_ai_runbooks_oncall/