Resilient Agent Systems

The chapters in this module, in reading order.

#	Chapter
00	Reliability & Failure Management for AI Systems — The Five-Year-Old Version
01	Failure taxonomy — name the illness before prescribing treatment
02	Error detection — hear the alarm before the patient crashes
03	Retry & backoff — repeat carefully, not desperately
04	Circuit breakers — close the ward before infection spreads
05	Fallback strategies — send the patient somewhere safer
06	Graceful degradation — keep the patient stable first
07	Timeout management — spend time like money
08	Idempotency & deduplication — do the action once, even if the request shouts twice
09	Human escalation — know when to call the senior doctor
10	Cascading failure — one bad step should not become everyone's problem
11	Chaos testing for AI — practice the bad day before the bad day arrives
12	Incident response — run the emergency room, not a group panic
13	Honest admission — what we still do not know how to guarantee