03. Cost, quotas, regions: the production surface
~10 min read. AWS in development is forgiving. AWS in production reveals cost surprises, service quotas, region-pair latency, and a long tail of operational subtleties. This chapter is the catalogue.
Builds on: 02-ec2-s3-rds-day-to-day.md.
The previous chapters covered foundations and daily services. This is what production teaches.
1) The cost surprises that fund AWS
AWS bills are full of surprises. The recurring offenders:
NAT Gateway data processing. $0.045/GB processed. A chatty service in a private subnet that downloads gigabytes from S3 through the NAT racks up costs quickly. Fix: a gateway VPC endpoint for S3, which bypasses the NAT and carries no data-processing charge.
Cross-AZ data transfer. $0.01/GB in each direction between AZs in the same region. An app server in 1a calling a database in 1b pays it on every request and response. Fix: AZ-local services where possible; accept the cost where HA requires cross-AZ.
Cross-region data transfer. $0.02-0.09/GB depending on the region pair. A replication pipeline moving terabytes per day adds up. Fix: keep traffic regional where the architecture allows, replicate only the data the secondary region needs, or accept the cost as the price of multi-region.
Public IPv4 addresses. $0.005/hour per public IPv4 (since Feb 2024). For a service with hundreds of public-facing instances, this is significant. Fix: use IPv6, load-balancer-fronted private instances, or NAT.
CloudWatch Logs ingestion. $0.50/GB ingested. Verbose application logs at high RPS add up fast. Fix: log levels (DEBUG off in prod), sample non-error logs, log only what you'll query.
CloudWatch Metrics custom dimensions. Each unique dimension combination is a metric ($0.30/metric/month). High-cardinality custom metrics (per-user, per-tenant) explode the bill. Fix: aggregate at app level; send only meaningful dimensions.
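To make the cardinality effect concrete, here is a minimal sketch that multiplies dimension cardinalities into a worst-case metric count and prices it at the $0.30 figure quoted above. The function name and the example dimension sizes are hypothetical, and real pricing is tiered and region-dependent, so treat the output as an order-of-magnitude estimate only.

```python
from math import prod

# Price quoted in the text; actual pricing is tiered and varies by region.
PRICE_PER_METRIC_MONTH = 0.30


def custom_metric_cost(metric_names: int, dimension_cardinalities: list[int]) -> float:
    """Worst-case monthly cost if every dimension combination is emitted."""
    combinations = prod(dimension_cardinalities)  # prod([]) == 1
    return metric_names * combinations * PRICE_PER_METRIC_MONTH


# 3 metrics x 4 services x 3 environments = 36 billed metrics
aggregated = custom_metric_cost(3, [4, 3])

# Adding a per-tenant dimension with 2,000 tenants multiplies the count by 2,000
per_tenant = custom_metric_cost(3, [4, 3, 2000])

print(f"aggregated: ${aggregated:,.2f}/month")
print(f"per-tenant: ${per_tenant:,.2f}/month")
```

The same three metrics go from roughly ten dollars to tens of thousands per month once a per-tenant dimension is attached, which is why aggregation at the application level is usually the first fix.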
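EBS volumes from terminated instances. Detached EBS volumes keep accruing storage charges. Volumes without delete-on-termination set (the default for non-root volumes) outlive the instance, so check for orphaned volumes after terminating EC2 instances.
Unattached EIPs. An Elastic IP that is allocated but not attached still costs $0.005/hour. Audit periodically and release the ones you do not use.
Provisioned IOPS on RDS. Setting IOPS too high, or choosing io1/io2 storage when gp3 would suffice, is a common overspend. Recheck quarterly.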
S3 versioning without lifecycle. Versioning piles up old versions forever. Configure lifecycle to expire non-current versions after N days.
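The per-unit rates quoted in this section can be turned into a rough monthly estimate. The sketch below is illustrative only: the rates are the ones stated in the text, the 730-hour month and the example volumes are assumptions, and the function name is hypothetical.

```python
# Rates quoted in this section (USD); they vary by region and change over time.
NAT_PER_GB = 0.045
CROSS_AZ_PER_GB = 0.01        # count the GB moved in both directions
PUBLIC_IPV4_PER_HOUR = 0.005
LOGS_INGEST_PER_GB = 0.50
HOURS_PER_MONTH = 730         # assumption: average month length


def monthly_estimate(nat_gb, cross_az_gb, public_ipv4_count, log_gb):
    """Return a per-line-item monthly estimate in USD."""
    return {
        "nat_processing": nat_gb * NAT_PER_GB,
        "cross_az": cross_az_gb * CROSS_AZ_PER_GB,
        "public_ipv4": public_ipv4_count * PUBLIC_IPV4_PER_HOUR * HOURS_PER_MONTH,
        "log_ingestion": log_gb * LOGS_INGEST_PER_GB,
    }


estimate = monthly_estimate(nat_gb=10_000, cross_az_gb=5_000,
                            public_ipv4_count=200, log_gb=2_000)
total = sum(estimate.values())

for item, cost in estimate.items():
    print(f"{item:>15}: ${cost:,.2f}")
print(f"{'total':>15}: ${total:,.2f}")
```

Running numbers like these per service before a launch tends to show which line item deserves a structural fix (an endpoint, a log level, a load balancer) rather than a discount.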
2) Service quotas: the limits you'll hit
AWS has per-account, per-region quotas. Soft quotas can be raised; hard quotas cannot.
Common quotas you hit:
- VPCs per region. Default 5. Easy to raise.
- EIPs per region. Default 5. Often raised.
- Lambda concurrent executions. Default 1000 per region. Raise on request.
- RDS instances per region. Default 40.
- EC2 vCPUs per region per family. Limits horizontal scale.
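- S3 buckets per account. Default was 100 (raisable to 1,000 on request) until November 2024; the default is now 10,000 and can be raised further on request.
- SQS messages per second. Standard queues have nearly unlimited throughput; FIFO queues default to 300 API calls per second per action (3,000 messages per second with batching), with high-throughput mode available for more.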
- Route 53 hosted zones. 500 per account.
- IAM users per account. 5000.
Use the Service Quotas console to monitor utilisation and request increases. Some increases take days to approve, so plan ahead. The pattern: raise staging quotas before integration tests need them, and request production increases early, during capacity planning.
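Monitoring quota utilisation comes down to comparing usage against limits and flagging anything close to the ceiling. The sketch below assumes the usage and limit figures have already been collected (for example from the Service Quotas API and CloudWatch); the dictionary keys, the 80% threshold, and the function name are all hypothetical.

```python
def quotas_needing_attention(usage, limits, threshold=0.8):
    """Return {quota_name: utilisation} for quotas at or above the threshold."""
    flagged = {}
    for name, limit in limits.items():
        used = usage.get(name, 0)
        utilisation = used / limit
        if utilisation >= threshold:
            flagged[name] = round(utilisation, 2)
    return flagged


# Limits mirror the defaults listed above; usage figures are made up.
limits = {"vpcs": 5, "elastic_ips": 5, "lambda_concurrency": 1000, "rds_instances": 40}
usage = {"vpcs": 4, "elastic_ips": 2, "lambda_concurrency": 870, "rds_instances": 12}

flagged = quotas_needing_attention(usage, limits)
print(flagged)
```

A check like this is most useful when it runs per region, since the quotas themselves are regional.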
3) Region selection: the things that vary
Not all AWS regions are equal:
- Service availability. New services launch first in us-east-1, then expand. Some services are not available in all regions for years.
- Instance types. Not all instance types are in all regions; check before relying on a specific family.
- Pricing. Prices vary by region. ap-south-1 (Mumbai) is comparable to us-east-1; eu-west-1 (Ireland) is similar. Some regions are notably more expensive.
- Latency to users. Choose regions near your users.
- Compliance. Data residency (GDPR for EU users, RBI for Indian payments) constrains region choice.
For Indian-market apps: ap-south-1 (Mumbai) for the primary; consider ap-south-2 (Hyderabad, launched 2022) for DR.
For global apps: a region per major user cluster; cross-region replication; latency-based routing via Route 53.
4) Multi-region patterns
Active-passive. Primary region serves all traffic; secondary is warm standby. On failure, manual failover (DNS swap). Cost-efficient; RTO measured in minutes-to-hours.
Active-active. Both regions serve traffic. Route 53 latency-based routing sends each user to the nearest region. Aurora Global Database provides cross-region replication with typically sub-second lag. RTO near-zero, RPO seconds. Operationally complex.
Pilot light. Minimal infrastructure in secondary (RDS replica running, app servers scaled to zero). On failover, scale up app servers. Slower than active-passive but cheaper.
Backup and restore. S3 cross-region replication for data; infrastructure rebuild via Terraform in the secondary on disaster. Cheapest; slowest RTO (hours).
Choose based on RTO/RPO requirements vs. operational and cost tolerance.
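One way to make the RTO/RPO trade-off explicit is to encode it as a small decision function. The thresholds below are hypothetical placeholders chosen to match the rough ordering described above (near-zero, minutes-to-hours, hours); a real team would replace them with its own requirements and cost limits.

```python
def choose_dr_pattern(rto_minutes: float, rpo_minutes: float) -> str:
    """Map recovery targets to one of the four patterns (illustrative thresholds)."""
    if rto_minutes < 5:
        return "active-active"
    if rto_minutes <= 60:
        return "active-passive"
    # Beyond an hour of tolerated downtime, the data-loss budget decides:
    # a running replica (pilot light) keeps RPO low; restoring from backups does not.
    if rto_minutes <= 240 or rpo_minutes < 60:
        return "pilot light"
    return "backup and restore"


for rto, rpo in [(1, 0.1), (30, 5), (180, 60), (720, 5), (720, 1440)]:
    print(f"RTO {rto:>4} min, RPO {rpo:>6} min -> {choose_dr_pattern(rto, rpo)}")
```

Writing the thresholds down, even roughly, forces the conversation about what downtime and data loss the business will actually tolerate.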
5) Disaster recovery: the runbook
Even without multi-region, every production AWS deployment needs a DR runbook:
- Backups. RDS daily snapshots + transaction logs (point-in-time restore). Test restore monthly.
- Configuration as code. Terraform/CloudFormation in version control. Rebuild the entire infrastructure from code.
- Secrets management. SSM Parameter Store + Secrets Manager. Cross-region replication for critical secrets.
- Documentation. Runbook with named owners, escalation paths, decision trees.
- DR drills. Quarterly exercises that test the runbook in a non-production region.
The first time you do a DR exercise, you'll find broken assumptions. The fifth time, your team is competent.
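Part of the backup item in the runbook can be automated: confirm that the newest snapshot is recent enough. The sketch below assumes snapshot timestamps have already been fetched from wherever the backups are listed; the 26-hour allowance (a daily schedule plus slack) and the function name are assumptions.

```python
from datetime import datetime, timedelta, timezone


def latest_backup_is_fresh(snapshot_times, now, max_age=timedelta(hours=26)):
    """True if the newest snapshot is no older than max_age."""
    if not snapshot_times:
        return False  # no backups at all is the worst case
    return now - max(snapshot_times) <= max_age


now = datetime(2026, 1, 10, 6, 0, tzinfo=timezone.utc)
healthy = [
    datetime(2026, 1, 9, 3, 0, tzinfo=timezone.utc),
    datetime(2026, 1, 10, 3, 0, tzinfo=timezone.utc),
]
stale = [datetime(2026, 1, 7, 3, 0, tzinfo=timezone.utc)]

print(latest_backup_is_fresh(healthy, now))
print(latest_backup_is_fresh(stale, now))
```

A freshness check only proves that a snapshot exists; the monthly restore test is still what proves it is usable.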
6) IAM at scale
Past a few accounts, IAM management is itself a workload:
- AWS SSO (now IAM Identity Center). Federated login for humans. Manages permission sets, applies to multiple accounts.
- AWS Organizations. Multi-account structure. SCPs for hard limits.
- AWS Control Tower. Opinionated account factory; sets up Organizations, SSO, baseline accounts.
- Permission boundaries. Cap what developers in production accounts can grant via IAM.
- CloudTrail organization trail. Centralised audit log; activity from every account lands in one bucket.
Without these, IAM grows organically and becomes ungovernable. The standard production setup includes all of the above; the time to set them up is the first month, not the third year.
7) Observability: beyond CloudWatch
CloudWatch is the AWS-native observability surface. For most production, it's not enough alone:
- CloudWatch Logs. Application logs. Costly at scale (~$0.50/GB ingested). For high-volume logs, ship to a cheaper store (S3 + Athena) after a hot window.
- CloudWatch Metrics. System and app metrics. Custom metrics are billed; use sparingly. Embedded Metric Format (EMF) is more efficient than PutMetricData.
- CloudWatch Alarms. Alert on metric thresholds. Pair with composite alarms for "AND" conditions.
- X-Ray. Distributed tracing for AWS-native services. Useful for ECS/Lambda traces.
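For the Embedded Metric Format mentioned in the list above, the idea is that the application writes a structured JSON log line and CloudWatch extracts the metric from it, instead of the application calling PutMetricData. The sketch below builds one such line; the namespace, dimension, and metric names are hypothetical, and the record is kept to the minimal fields of the format.

```python
import json
import time


def emf_record(namespace, service, metric_name, value,
               unit="Milliseconds", timestamp_ms=None):
    """Build one EMF log line with a single metric and a single dimension."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return json.dumps({
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service"]],  # one low-cardinality dimension set
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        "Service": service,     # dimension value
        metric_name: value,     # metric value
    })


line = emf_record("Shop/Checkout", "checkout", "Latency", 123.0,
                  timestamp_ms=1767225600000)
print(line)
```

Printing a line like this to stdout is typically enough in Lambda; elsewhere it usually goes through the CloudWatch agent or a log shipper. Extra high-cardinality fields (a request ID, say) can ride along in the JSON without becoming billed dimensions, as long as they stay out of the Dimensions list.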
Third-party tools that fit on top:
- Datadog, New Relic, Grafana Cloud. Comprehensive APM + logs + traces; integrate with CloudWatch.
- Honeycomb, Lightstep. High-cardinality observability; useful for complex distributed systems.
- OpenTelemetry. Vendor-neutral instrumentation; ships to CloudWatch, third-party tools, or self-hosted backends.
Standard production stack: CloudWatch for AWS service metrics + a third-party APM for application observability + structured logs to S3 for long-term retention.
8) Common production gotchas
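Lambda + VPC cold start. Before 2019, a Lambda function in a VPC provisioned an ENI on cold start, adding 1-10 seconds. Lambda now creates shared ENIs when the function is created or its VPC configuration changes, so most of that penalty is gone. VPC attachment still has a cost: the function needs a NAT Gateway or VPC endpoints to reach the internet or AWS APIs. Use VPC only when necessary; consider Lambda@Edge or non-VPC Lambda for latency-sensitive paths.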
EC2 burst credits. T-family instances earn CPU credits while idle and spend them when busy. A long CPU-heavy load can exhaust the credits and throttle the instance to its baseline. For sustained CPU, use unlimited mode (surplus credits are billed) or move to the C family.
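The credit arithmetic is simple enough to sketch. One CPU credit is one vCPU at 100% for one minute, so sustained load burns credits at a predictable rate. The t3.micro figures used below (2 vCPUs, 12 credits earned per hour, a maximum balance of 288) are quoted from memory of the AWS documentation and should be verified before relying on them; the function name is hypothetical.

```python
def hours_until_throttled(credit_balance, credits_earned_per_hour, vcpus, cpu_utilisation):
    """Hours of sustained load before the credit balance reaches zero (standard mode).

    One CPU credit = one vCPU running at 100% for one minute.
    """
    spent_per_hour = vcpus * cpu_utilisation * 60
    net_burn = spent_per_hour - credits_earned_per_hour
    if net_burn <= 0:
        return float("inf")  # at or below baseline: credits never run out
    return credit_balance / net_burn


# Full balance, both vCPUs pinned at 100%
hours = hours_until_throttled(credit_balance=288, credits_earned_per_hour=12,
                              vcpus=2, cpu_utilisation=1.0)
print(f"throttled after about {hours:.1f} hours")
```

A batch job that pins the CPU for an afternoon will therefore hit the baseline well before it finishes, which is the usual way this gotcha shows up.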
EBS volume types. gp2 is older; gp3 is the modern general-purpose default and cheaper. io1/io2 for high IOPS workloads; rarely needed. Audit existing volumes; migrate to gp3 for cost savings.
RDS storage autoscaling. Enable it. Without it, sudden growth fills the disk; the instance enters the storage-full state and the database stops accepting writes or becomes unavailable until storage is added.
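S3 request rate hot spots. S3 supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix and repartitions automatically, but a sudden burst on one sequential prefix ("2026/01/01/file-001") can return 503 Slow Down while it scales. Spread writes across prefixes, or ramp up gradually and retry with backoff.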
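Spreading writes across prefixes can be done by deriving a short shard prefix from a hash of the key. This is a minimal sketch of that idea, not an AWS-recommended scheme: the shard count of 16, the two-character hex prefix, and the function name are all assumptions.

```python
import hashlib


def spread_key(key: str, shards: int = 16) -> str:
    """Prepend a deterministic shard prefix so writes fan out across prefixes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02x}/{key}"


original = "2026/01/01/file-001"
sharded = spread_key(original)
print(sharded)
```

The prefix is recomputable from the original key, so point reads stay cheap. The trade-off is that listing everything for one date now means listing under every shard prefix.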
Kinesis vs. Kafka. Kinesis is AWS-managed and cheaper at low scale; at high throughput the economics tilt toward MSK (managed Kafka). Below 10 MB/sec, Kinesis is fine; above 100 MB/sec, evaluate MSK or self-managed Kafka.
SNS message ordering. Standard SNS topics give best-effort ordering only. SNS FIFO topics exist but are rarely used. For ordered fan-out, use SNS FIFO + SQS FIFO.
KMS API costs. SSE-KMS on high-volume S3 or SQS workloads can cost more in KMS requests than the underlying service. Audit; enable S3 Bucket Keys to cut KMS calls, or use SSE-S3 / SSE-SQS where compliance allows.
9) Terraform / IaC discipline
Production AWS is managed via IaC, not the console:
Terraform is the standard. Pulumi for those preferring real programming languages. CloudFormation for AWS-native (less popular than Terraform now).
Patterns:
- State in S3 with DynamoDB locking. Multi-developer safety.
- Modules for shared patterns. VPC, ECS, RDS as reusable modules.
- Per-environment workspace. Dev/staging/prod as separate Terraform states.
- CI applies; humans don't. Pull requests; CI runs terraform plan on PR; merge to main triggers terraform apply.
- Drift detection. Periodic checks that the live state matches the IaC.
Without IaC discipline, AWS infrastructure becomes ungovernable. With it, every change is reviewed and versioned.
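A periodic drift check can be as small as running terraform plan with -detailed-exitcode and interpreting the result: exit code 0 means no changes, 2 means the plan is non-empty, and 1 means the plan itself failed. The sketch below wraps that in Python; the function names and the working-directory argument are hypothetical, and it assumes the terraform binary is on PATH and the directory is already initialised.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in sync"
    if code == 2:
        return "drift"  # plan succeeded and contains changes
    return "error"      # plan failed (exit code 1 or anything unexpected)


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and classify the outcome."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)


print(classify_plan_exit_code(2))
```

A non-empty plan means live infrastructure and code disagree, which can be a console change or simply unapplied code. Running the check against the main branch right after an apply keeps the signal specific to drift.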
10) The threaded example: preparing for a 10× traffic surge
A team expects 10× traffic over the next quarter (e-commerce sale, viral product, regulated event). The pre-flight:
Capacity:
1. ECS service: raise max tasks; verify ALB target capacity.
2. RDS: vertical scale to the next size; verify read replica count and lag.
3. ElastiCache: vertical scale; verify connection pool sizing.
4. NAT Gateway: per-AZ NAT to avoid cross-AZ bottleneck.
Quotas:
5. Request EC2 vCPU quota raise (lead time: days).
6. Request Lambda concurrent execution quota raise.
7. Verify RDS instance quota.
8. Verify S3 PUT/GET rate (S3 auto-scales, but pre-warm by ramping up gradually).
Cost:
9. Set AWS Budgets alerts at 50%, 80%, 100% of expected spend.
10. Enable Cost Anomaly Detection for the production account.
Resilience:
11. Run a multi-AZ failover drill on the staging RDS to validate the runbook.
12. Verify backups: confirm the last successful daily snapshot.
13. Document the rollback plan: feature flags, ALB swap, full rollback.
Observability:
14. Pre-create dashboards for the event.
15. Configure alarms: ECS task health, RDS CPU, ALB 5xx, RDS connections, NAT bandwidth.
16. Page-able SLOs: latency p95, error rate.
Communications:
17. Pre-write status-page updates for likely incident classes.
18. Pre-assign incident commanders for the event window.
The pre-flight catches what the live event would expose. Skipping it produces incidents during the highest-attention window.
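The quota items in the pre-flight can be sized with simple arithmetic: multiply current peak usage by the surge factor, add headroom, and compare against the current limit. The sketch below is illustrative; the 25% headroom, the example peaks and limits, and the function names are assumptions.

```python
import math


def required_quota(current_peak, surge_factor=10, headroom=0.25):
    """Quota needed to absorb the surge with some safety margin."""
    return math.ceil(current_peak * surge_factor * (1 + headroom))


def quota_requests(current_peaks, quotas, surge_factor=10, headroom=0.25):
    """Return {quota_name: value_to_request} for quotas that would be exceeded."""
    requests = {}
    for name, peak in current_peaks.items():
        needed = required_quota(peak, surge_factor, headroom)
        if needed > quotas[name]:
            requests[name] = needed
    return requests


peaks = {"ec2_vcpus": 96, "lambda_concurrency": 60, "rds_instances": 3}
quotas = {"ec2_vcpus": 256, "lambda_concurrency": 1000, "rds_instances": 40}

print(quota_requests(peaks, quotas))
```

In this made-up example only the EC2 vCPU quota needs a request, and because that increase can take days, it is the item to file first.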
Operational signals
Healthy. Cost trending with traffic; quota utilisation below thresholds; backups succeeding; DR drills passing; IaC drift near zero.
First degrading metric. Cost growing faster than traffic — indicates a misconfiguration (a missing lifecycle, an over-provisioned RDS, NAT bandwidth on chatty traffic).
Misleading metric. Total monthly spend — masks per-service trends; a service whose cost doubled may still be a small share.
Expert graph. Per-service per-month cost trend; quota utilisation per region; CloudTrail anomaly detection findings; IaC drift report.
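The difference between the misleading metric and the expert graph can be shown with a few numbers. In the sketch below (all figures invented, function name hypothetical), total spend grows by under 7% while one small service doubles.

```python
def cost_growth_by_service(previous, current):
    """Month-over-month growth per service, in percent."""
    growth = {}
    for service, now in current.items():
        before = previous.get(service, 0.0)
        if before > 0:
            growth[service] = round((now - before) / before * 100, 1)
    return growth


previous = {"EC2": 40_000, "RDS": 12_000, "NAT Gateway": 900}
current = {"EC2": 42_000, "RDS": 12_600, "NAT Gateway": 1_800}

total_growth = (sum(current.values()) - sum(previous.values())) / sum(previous.values()) * 100
per_service = cost_growth_by_service(previous, current)

print(f"total: {total_growth:.1f}%")
print(per_service)
```

The per-service view surfaces the NAT Gateway line immediately; the total alone would have looked like ordinary growth.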
Where this appears in production
- Netflix — public Cost Allocation patterns; aggressive use of Spot, RIs, Savings Plans.
- Capital One — strict IAM, SCP-enforced controls; multi-account discipline well documented.
- Stripe — multi-region active-active for payments; sub-second failover via Aurora Global.
- A Bengaluru fintech — quarterly DR drills; runbook updates after each.
- A Mumbai SaaS — moved S3 lifecycle policy after a cost surprise; bill dropped 40%.
- A Pune analytics platform — VPC endpoints for S3 + DynamoDB; NAT cost dropped 70%.
- A Goa-based startup — Terraform-managed; IaC drift detection alerts on every console change.
- A Hyderabad e-commerce — pre-flight checklist for sale events; 10× traffic handled cleanly.
Recall / checkpoint
- What are the most common cost-surprise services?
- What is the difference between a soft and a hard service quota?
- What are the four multi-region patterns and when does each fit?
- What does CloudWatch cost at scale, and how do you control it?
- What is IaC drift and how is it detected?
- What is the pre-flight checklist for a traffic surge?
- Why is IAM Identity Center (SSO) part of the standard production setup?
Interview Q&A
Q1. AWS spending is up 50% with no obvious traffic growth. Walk through diagnosis.
Open Cost Explorer; break down by service. Compare current month to baseline. Common offenders: (1) NAT Gateway costs from a chatty new service: check VPC Flow Logs for the traffic source; add an S3 VPC endpoint if S3 is the destination. (2) CloudWatch Logs ingestion: a new debug log line; reduce log level. (3) Cross-AZ transfer: a new service hitting cross-AZ resources unnecessarily. (4) Unattached EIPs and EBS volumes accumulating. (5) RDS storage growth from missing log rotation. Each has a structural fix and a few-line investigation. Common wrong answer to avoid: "call AWS sales for a discount". Fix the architecture first.
Q2. The team needs to launch in a new region. Walk through the gotchas.
Several. (1) Service availability: not all services are in the new region; check what you need. (2) Instance types: your typical instance family may not be available; have a fallback. (3) Quotas: a new region starts at default quotas; request raises before launch. (4) IAM is global, but KMS keys are regional (unless you use multi-Region keys); replicate or recreate them. (5) S3 bucket names are globally unique; pick a different name or use a region suffix. (6) Pricing varies; budget per region. (7) Latency from existing users may degrade; data transfer between regions costs. Common wrong answer to avoid: "spin it up and ship". Quota and configuration delays often add a week.
Q3. The team's production AWS account is one big mess: long-lived IAM users, console-only changes, no Terraform. Walk through the migration path.
Gradual, not big-bang. Step 1: Set up AWS Organizations + Identity Center (SSO) for new account creation; existing account remains. Step 2: Start importing existing resources into Terraform (terraform import) one at a time, starting with the simplest (S3 buckets, IAM policies). Step 3: As you import, refactor: scope IAM, add tags, fix obvious issues. Step 4: New resources only via Terraform — enforce via CloudTrail alerts on console changes. Step 5: Migrate humans from IAM users to SSO; deprovision long-lived access keys. Each step is bounded; the whole migration takes months but recovers control. Common wrong answer to avoid: "rebuild from scratch" — too risky for production.
Q4. A team's monthly bill jumped 3× after a new feature shipped. Walk through investigation.
Cost Explorer, last-30-days, grouped by service and tag. Identify which service grew. Then drill into that service: which resource is responsible, which feature uses it. Common causes: (1) New service in a private subnet with no VPC endpoint: NAT cost spiked. (2) New high-cardinality CloudWatch metric: metrics count exploded. (3) New S3 prefix without lifecycle: storage grows linearly. (4) New RDS read replicas or instance size bump. The feature owner needs to be in the conversation. Common wrong answer to avoid: "rollback the feature". First understand the cost; the feature may be justified at the new cost level.
Q5. The team's DR plan exists on paper but has never been tested. Walk through the assessment.
Untested DR is theatre. The plan must be exercised: quarterly tabletop drills walk through the steps verbally; semi-annual live drills fail over non-production systems and bring them back. Each drill produces findings: runbook gaps, broken automation, missing access. Fix the findings before the next drill. The first live drill always exposes 5-10 issues; the third drill is fast. Without drills, the first real disaster is the first test. Common wrong answer to avoid: "we have backups". Backups are necessary, not sufficient; restore is the operation, not the artifact.
Q6. The team's production account has no SCPs. Walk through what to add first.
Standard production SCPs: (1) Deny IAM user creation (force SSO-managed roles). (2) Deny disabling CloudTrail. (3) Deny disabling encryption (S3, EBS). (4) Deny IMDSv1 (force v2). (5) Deny resource creation in unauthorised regions. (6) Deny deletion of specific tagged resources (production data stores). (7) Deny *:* policies on roles. SCPs are hard ceilings: they apply to every principal in a member account, including administrators and the root user, but not to the organization's management account. Layer with IAM policies and permission boundaries. Test in audit mode (AWS Config rules) before enforcing. Common wrong answer to avoid: "lock down everything". SCPs are powerful; over-restrictive ones break legitimate ops; iterate.
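Two of the SCPs in that answer (deny unauthorised regions, deny disabling CloudTrail) can be sketched as a policy document built in Python. This is an illustrative starting point, not a vetted policy: the list of exempted global services is deliberately short and would need to be extended for a real organization, and the function name and region list are hypothetical.

```python
import json


def region_and_cloudtrail_scp(allowed_regions):
    """Build an SCP document that denies other regions and CloudTrail tampering."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                # Global services must be exempted or they break; this list is incomplete.
                "NotAction": ["iam:*", "organizations:*", "route53:*",
                              "cloudfront:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
                },
            },
            {
                "Sid": "DenyDisablingCloudTrail",
                "Effect": "Deny",
                "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
                "Resource": "*",
            },
        ],
    }


policy = region_and_cloudtrail_scp(["ap-south-1", "ap-south-2"])
print(json.dumps(policy, indent=2))
```

Attaching a policy like this to a test OU first, and watching CloudTrail for denied calls, is a cheap way to find the legitimate operations it would break.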
Operational memory
This chapter explained AWS's production surface: cost surprises, quotas, multi-region patterns, DR, IAM at scale, observability, and IaC discipline. The important idea is that AWS at scale is a series of structured choices — multi-account, IaC-managed, cost-aware, quota-planned, DR-drilled. Each choice has a cost and a payoff; the production-mature deployment makes them consciously.
You learned to identify cost offenders, manage quotas, choose multi-region patterns, structure DR, set up Organizations + SSO + SCPs, and instrument observability beyond CloudWatch. That completes the AWS production surface.
Carry this diagnostic forward: when AWS surprises you in production, ask which production surface is involved — cost, quota, region, DR, IAM at scale, or IaC. Each has structural defences.
Remember:
- Cost surprises live in NAT, cross-AZ, CloudWatch, EIPs, KMS — audit quarterly.
- Quotas are regional and per-service; raise before you need them.
- Multi-region pattern matches your RTO/RPO and operational tolerance.
- IaC + Org + SSO + SCPs is the production baseline.
- DR is theatre without drills.
Bridge. AWS Core is the foundation of cloud-native infrastructure. The infrastructure tooling track is complete — Docker, Postgres, Redis, Django, nginx, Celery, SQS, Kafka, AWS Core. Each module is one tool; together they cover the operational surface of a modern Python web application running in the cloud.