Resilient Agent Systems

The chapters in this module, in reading order.

| # | Chapter |
|---|---------|
| 00 | Reliability & Failure Management for AI Systems — The Five-Year-Old Version |
| 01 | Failure taxonomy — name the illness before prescribing treatment |
| 02 | Error detection — hear the alarm before the patient crashes |
| 03 | Retry & backoff — repeat carefully, not desperately |
| 04 | Circuit breakers — close the ward before infection spreads |
| 05 | Fallback strategies — send the patient somewhere safer |
| 06 | Graceful degradation — keep the patient stable first |
| 07 | Timeout management — spend time like money |
| 08 | Idempotency & deduplication — do the action once, even if the request shouts twice |
| 09 | Human escalation — know when to call the senior doctor |
| 10 | Cascading failure — one bad step should not become everyone's problem |
| 11 | Chaos testing for AI — practice the bad day before the bad day arrives |
| 12 | Incident response — run the emergency room, not a group panic |
| 13 | Honest admission — what we still do not know how to guarantee |