02. EC2, S3, RDS — the daily compute and storage surface

~11 min read. The three services you'll touch every day in AWS: EC2 for compute, S3 for objects, RDS for databases. Each has its own model, its own knobs, its own gotchas.

Builds on: 01-iam-vpc-account-internals.md.

The IAM-and-VPC foundation is in place. This chapter is the compute and storage you actually use.

1) EC2 — instances, AMIs, instance profiles

An EC2 instance is a virtual machine. You pick:

AMI (Amazon Machine Image). The disk template — OS plus pre-installed software. Use Amazon Linux 2023, Ubuntu LTS, or a hardened image.
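Instance type. Family (general purpose t3, compute c7, memory r7, storage-optimised i4, GPU g5) plus size (micro, large, 2xlarge). Within a family, each step from large to xlarge to 2xlarge doubles vCPU and RAM; the smallest burstable sizes (t3.micro to t3.medium) all have 2 vCPUs and differ in memory and CPU credits.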
VPC + subnet + security group. Where it runs and what it can talk to.
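IAM role (instance profile). What the instance can do in AWS. The instance profile is the container that attaches an IAM role to the instance; the CLI takes the profile's name, not the role's.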
Storage. Root EBS volume, optional additional EBS or instance store.
Key pair (for SSH) or SSM Session Manager. How you'll log in (if at all).

aws ec2 run-instances \
    --image-id ami-0abcdef1234567890 \
    --instance-type t3.medium \
    --subnet-id subnet-abc123 \
    --security-group-ids sg-xyz789 \
    --iam-instance-profile Name=app-role \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=app-1}]'

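The instance starts. If you passed a user-data script (--user-data), cloud-init runs it on first boot. The application itself should start from a systemd unit or equivalent so that it also comes back after a reboot.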

In modern AWS deployments, EC2 instances are usually managed by orchestrators (ECS, EKS, ASG) rather than individually. The pattern: define an autoscaling group with a launch template; the ASG manages instance count and replacement; instances are ephemeral.

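For shell access, use AWS Systems Manager Session Manager instead of SSH key pairs. Session Manager lets authorised users open a shell on the instance through the AWS console or CLI: no SSH key management, no port 22 open, and a full audit trail. It requires the SSM Agent on the instance (preinstalled on Amazon Linux and recent Ubuntu AMIs) and an instance role that allows Systems Manager access, such as the AmazonSSMManagedInstanceCore managed policy. Many teams have dropped direct SSH entirely.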

2) Instance lifecycle and autoscaling

Autoscaling groups (ASG). Define a launch template (the AMI + config); set min/max/desired count; attach scaling policies (CPU > 70% → add instance). ASG replaces failed instances automatically; spreads them across AZs.
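
The scaling rule described above (CPU over 70% adds capacity, bounded by min and max) can be sketched as a small function. This is a simplified proportional model with assumed names (`desired_capacity`, `target_cpu`), not the algorithm AWS actually runs; it only illustrates how the group's bounds clamp whatever the policy asks for.

```python
import math


def desired_capacity(current: int, cpu_percent: float, target_cpu: float,
                     min_size: int, max_size: int) -> int:
    """Hypothetical proportional scaling: size the group so average CPU lands near target."""
    if current <= 0:
        return min_size
    # Example: 4 instances at 90% against a 70% target asks for ceil(4 * 90 / 70) = 6.
    wanted = math.ceil(current * cpu_percent / target_cpu)
    # The group never goes outside its configured bounds.
    return max(min_size, min(max_size, wanted))
```

With `min_size=2` and `max_size=5`, the same 90% reading would be capped at 5 instances, which is why a max set too low shows up as sustained high CPU rather than as an error.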

Spot instances. EC2 capacity AWS isn't using; 60-90% cheaper than on-demand. AWS can reclaim them with 2-minute notice. Use for stateless, fault-tolerant workloads (batch processing, web servers behind a load balancer); not for databases.
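
To act on the 2-minute notice, a process on the instance can poll the instance metadata path `/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled and a small JSON document afterwards. The sketch below covers only the parsing step; the HTTP fetch (with an IMDSv2 token) is omitted, and the function name and sample values are illustrative.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def seconds_until_interruption(body: str, now: datetime) -> Optional[float]:
    """Parse a spot/instance-action document; return seconds left, or None if not a reclaim."""
    notice = json.loads(body)
    if notice.get("action") not in ("terminate", "stop", "hibernate"):
        return None
    # The time field is UTC, formatted like 2026-01-15T08:22:00Z.
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Illustrative sample: an interruption scheduled 90 seconds from "now".
sample = '{"action": "terminate", "time": "2026-01-15T08:22:00Z"}'
now = datetime(2026, 1, 15, 8, 20, 30, tzinfo=timezone.utc)
remaining = seconds_until_interruption(sample, now)
```

A worker would typically stop taking new jobs and deregister from the load balancer as soon as this returns a number.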

Reserved instances / Savings Plans. Commit to 1- or 3-year usage; pay less. Suitable for baseline load. Spiky workloads use on-demand or spot.

The standard production stack: ASG for baseline web/app servers (on-demand or RI) + spot instances for burst capacity + serverless (Lambda, Fargate) for spike-driven workloads.

3) EC2 networking specifics

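Elastic IP (EIP). A static public IPv4 address. Since February 2024 every public IPv4 address is billed at $0.005/hour (about $3.65/month), whether it is attached to a running instance or sitting idle; before that change, an EIP attached to a running instance was free. Don't accumulate unattached EIPs.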

ENI (Elastic Network Interface). A network card attachment. Each instance has one or more. Security groups attach to ENIs, not instances.

Private DNS. Instances get internal DNS like ip-10-0-1-42.ap-south-1.compute.internal. Use this for inter-instance communication.

Public DNS / IP. Optional; assigned in public subnets. Each public IPv4 address has been billed since February 2024 at about $3.65/month ($0.005/hour). To avoid the charge, move to IPv6 or keep instances private behind a load balancer.
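
The per-address charge is small but multiplies with fleet size. A quick worked calculation, assuming the $0.005 per address-hour list price and a 730-hour month (the function name is illustrative):

```python
HOURLY_RATE_USD = 0.005   # public IPv4 charge per address-hour (list price since Feb 2024)
HOURS_PER_MONTH = 730     # average month length commonly used in pricing estimates


def monthly_ipv4_cost(address_count: int) -> float:
    """Monthly charge for a number of public IPv4 addresses, attached or idle."""
    return round(address_count * HOURLY_RATE_USD * HOURS_PER_MONTH, 2)
```

Twenty forgotten addresses come to about $73 a month, which is why idle Elastic IPs are worth auditing.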

4) S3 — buckets, objects, and consistency

S3 is object storage. Buckets are top-level namespaces; objects are key-value pairs.

aws s3 mb s3://my-app-data --region ap-south-1
aws s3 cp file.json s3://my-app-data/orders/2026/01/file.json
aws s3 ls s3://my-app-data/orders/2026/

Key properties:

Globally unique bucket names. my-app-data can be claimed by only one AWS customer worldwide.
Regional storage. Bucket lives in one region; cross-region access is paid traffic.
Strong read-after-write consistency. Since 2020. Read after PUT is consistent.
11 nines of durability (99.999999999%). Objects are replicated across multiple AZs.
Unlimited storage. Effectively no per-bucket size limit.

Storage classes:

Standard. Default. Costs ~$0.025/GB/month.
Standard-IA (Infrequent Access). ~$0.0125/GB/month + per-retrieval cost. For data accessed less than monthly.
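Glacier Instant Retrieval / Flexible Retrieval / Deep Archive. $0.001-$0.004/GB/month plus retrieval cost. Retrieval latency depends on the class: milliseconds for Instant Retrieval, minutes to hours for Flexible Retrieval, and up to 12 hours (standard tier) for Deep Archive.
Intelligent-Tiering. AWS moves each object between frequent-access, infrequent-access, and archive tiers based on observed access, for a small per-object monitoring fee. A sensible default when access patterns are unknown.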

For typical application data, Standard. For logs and backups older than 30 days, lifecycle to Standard-IA or Glacier.
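
To see what tiering is worth, here is a rough storage-only comparison using the approximate per-GB prices quoted above. The 10 TB split between classes is a made-up example, and retrieval, request, and minimum-duration charges are deliberately left out.

```python
# Approximate per-GB monthly prices quoted in this chapter; real prices vary by region.
PRICE_PER_GB = {"STANDARD": 0.025, "STANDARD_IA": 0.0125, "DEEP_ARCHIVE": 0.001}


def monthly_storage_cost(gb_by_class: dict) -> float:
    """Storage-only cost; retrieval, request, and minimum-duration charges are ignored."""
    return round(sum(PRICE_PER_GB[cls] * gb for cls, gb in gb_by_class.items()), 2)


# Hypothetical 10 TB bucket: everything in Standard vs. mostly tiered down.
all_standard = monthly_storage_cost({"STANDARD": 10_000})
tiered = monthly_storage_cost({"STANDARD": 1_000, "STANDARD_IA": 2_000, "DEEP_ARCHIVE": 7_000})
```

Under these assumptions the tiered layout costs roughly a quarter of the all-Standard one; the real saving depends on how often the cold data is actually read.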

5) S3 — security and configuration

Block public access. Every bucket should have all four "Block Public Access" settings enabled unless you specifically need public objects (for example, a bucket serving a static site directly). Misconfigured S3 buckets are one of the most common causes of AWS data leaks.
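
The four settings have fixed names in the S3 API (the `PublicAccessBlockConfiguration` structure). A minimal audit helper, assuming you have already fetched that structure for a bucket; the helper name is illustrative and the fetch itself is omitted:

```python
REQUIRED_SETTINGS = ("BlockPublicAcls", "IgnorePublicAcls",
                     "BlockPublicPolicy", "RestrictPublicBuckets")


def missing_public_access_blocks(config: dict) -> list:
    """Return the Block Public Access settings that are absent or not set to True."""
    return [name for name in REQUIRED_SETTINGS if config.get(name) is not True]
```

An empty list means the bucket is fully locked down; anything else names the exact switch to flip.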

Encryption at rest.

SSE-S3 (default since 2023). AWS-managed key. No cost.
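SSE-KMS. Encrypts with a KMS key, either the AWS-managed aws/s3 key or a customer-managed key. KMS charges per API call, which is significant at high volume; S3 Bucket Keys reduce the number of calls.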
SSE-C. Customer-supplied keys. Rare.

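New buckets have been encrypted with SSE-S3 by default since 2023. If you require a specific method such as SSE-KMS, set it as the bucket's default encryption and enforce it with a bucket policy rather than relying on the default.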

Versioning. Every PUT creates a new version; DELETE creates a delete marker (recoverable). Versioning is the defence against accidental deletes. Cost: pay for each version's storage.

Object lock. Prevent deletion/overwrite for a retention period. Compliance feature (SEC 17a-4, FINRA).

Lifecycle policies. Auto-transition objects to cheaper storage classes after N days; auto-delete after M days. Reduces storage costs dramatically for long-lived buckets.
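
A lifecycle policy is a JSON document. The sketch below builds one in the shape the PutBucketLifecycleConfiguration API accepts; the rule ID, the `logs/` prefix, and the day counts are illustrative assumptions, and the small check is a sanity test of the rule, not an S3 requirement.

```python
import json

# Hypothetical rule for a "logs/" prefix; the day counts are illustrative.
lifecycle = {
    "Rules": [{
        "ID": "logs-tiering",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": 365},
        # On versioned buckets, also expire old versions so they stop accruing cost.
        "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
    }]
}


def transitions_are_ordered(rule: dict) -> bool:
    """Sanity check: each transition happens later than the previous one, and before expiration."""
    days = [t["Days"] for t in rule["Transitions"]]
    return days == sorted(set(days)) and days[-1] < rule["Expiration"]["Days"]


lifecycle_json = json.dumps(lifecycle, indent=2)
```

The resulting JSON can be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration`.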

Bucket policy. Resource-based JSON policy on the bucket. Use for cross-account access or fine-grained control.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
      "Condition": {"Bool": {"aws:SecureTransport": "false"}}
    }
  ]
}

Denies non-HTTPS access. Standard hardening.

6) S3 — performance and patterns

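S3 scales by prefix. Each prefix (the part of the key before the last /) can sustain about 3,500 writes/sec (PUT/COPY/POST/DELETE) and about 5,500 reads/sec (GET/HEAD). S3 partitions prefixes automatically as traffic grows, and since 2018 AWS no longer recommends randomised prefixes as a default. Spreading keys still helps when sustained traffic to one prefix exceeds those rates:

Concentrated: keys like 2026/01/01/file-001.json (a day's writes all land under one date prefix).
Spread: keys with a high-entropy prefix such as f47ac10b/2026/01/01/file.json (a hex prefix distributes load across partitions).

Workloads whose keys are already well distributed need no change. For bursty sequential workloads (thousands of writes per second to one prefix), expect occasional 503 Slow Down responses while S3 repartitions; retry with backoff or spread the keys.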
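
If a hot prefix does need spreading, one common approach is to derive the prefix from a hash of the logical key, so it can be recomputed later without a lookup table. A minimal sketch; the function name and the 4-character prefix length are assumptions:

```python
import hashlib


def spread_key(key: str, prefix_len: int = 4) -> str:
    """Prepend a short, deterministic hash so keys fan out across many prefixes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"
```

The trade-off is that listing "everything for one date" no longer maps to a single prefix, so keep an index elsewhere if you need that query.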

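Multi-part upload. Recommended for files larger than about 100 MB and required above 5 GB, the limit for a single PUT. Chunks upload in parallel and are assembled server-side. The AWS CLI does this automatically; in code, use the SDK's high-level transfer API (TransferManager in the Java SDK, upload_file in boto3).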
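
When driving multipart uploads by hand, the part size has to respect S3's limits: parts of at least 5 MiB (except the last) and at most 10,000 parts per upload. A small planning helper, with an assumed 64 MiB preferred part size and an illustrative name:

```python
import math

MIB = 1024 * 1024
MIN_PART = 5 * MIB          # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000          # S3 maximum number of parts per upload


def plan_parts(size_bytes: int, preferred_part: int = 64 * MIB) -> tuple:
    """Return (part_size, part_count) that fits within S3's multipart limits."""
    # Grow the part size when the preferred size would need more than 10,000 parts.
    part_size = max(MIN_PART, preferred_part, math.ceil(size_bytes / MAX_PARTS))
    return part_size, max(1, math.ceil(size_bytes / part_size))
```

A 1 GiB file splits into 16 parts of 64 MiB; a 1 TiB file forces a larger part size so the count stays within 10,000.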

Presigned URLs. A time-limited URL granting access to an S3 object without IAM credentials. The application generates one server-side, hands it to the client; the client uploads/downloads directly. The standard pattern for user uploads — avoids streaming through the application.

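import boto3

s3 = boto3.client("s3")
user_id = 42  # placeholder; take this from the authenticated session
# The URL lets the client PUT exactly this key for the next hour.
url = s3.generate_presigned_url(
    'put_object',
    Params={'Bucket': 'uploads', 'Key': f'user-{user_id}/photo.jpg'},
    ExpiresIn=3600,
)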

7) RDS — managed relational databases

RDS provides managed Postgres, MySQL, MariaDB, Oracle, SQL Server, and Aurora (Postgres- and MySQL-compatible). AWS handles backups, OS patching, hardware failure recovery.

Engine choice:

Postgres or MySQL on RDS. Standard. Open source. Pay for the instance.
Aurora. AWS-rewritten engine with separated storage. Faster failover (typically under 30 seconds vs. one to two minutes), storage that grows automatically, cross-region replication. More expensive than standard RDS.
Aurora Serverless v2. Scales capacity dynamically with load. Good for spiky or bursty workloads.

Configuration:

aws rds create-db-instance \
    --db-instance-identifier prod-app-db \
    --db-instance-class db.r6g.xlarge \
    --engine postgres \
    --engine-version 16.4 \
    --master-username dbadmin \
    --manage-master-user-password \
    --allocated-storage 100 \
    --storage-type gp3 \
    --multi-az \
    --vpc-security-group-ids sg-db-prod \
    --db-subnet-group-name db-subnet-group \
    --storage-encrypted \
    --backup-retention-period 14 \
    --deletion-protection \
    --copy-tags-to-snapshot \
    --auto-minor-version-upgrade

Four settings that matter:

--multi-az. Synchronous standby in another AZ. Failover typically takes 60-120 seconds. Mandatory for production.
--storage-encrypted. Encryption at rest. Always enable.
--deletion-protection. Prevents accidental deletion. Always enable for production.
--backup-retention-period 14. Daily automated backups retained for 14 days. Point-in-time restore.

8) RDS — connection pooling and Performance Insights

Connection pooling. RDS instances have a max connection limit that scales with instance memory (by default roughly 100 on the smallest classes, up to 5,000 for Postgres). Application servers connecting directly exhaust this quickly. Use:

RDS Proxy. AWS-managed connection pooler. Sits between app and RDS; pools connections. Adds latency (~5ms) but unifies the connection pool.
PgBouncer (self-hosted). Open-source connection pooler. Run as a sidecar or shared instance.

For most production apps with > 10 application instances, connection pooling is required.
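
Whether you have crossed that threshold is simple arithmetic: every worker process can hold a full pool of its own. The sketch below uses assumed names and a 70% headroom figure chosen to match the utilisation target used later in this chapter; pass in the `max_connections` value your instance actually reports.

```python
def total_connections(app_instances: int, workers_per_instance: int, pool_size: int) -> int:
    """Worst-case connections when every worker fills its own pool."""
    return app_instances * workers_per_instance * pool_size


def needs_pooler(app_instances: int, workers_per_instance: int, pool_size: int,
                 max_connections: int, headroom: float = 0.7) -> bool:
    """True when direct connections would exceed the share of max_connections we allow."""
    demand = total_connections(app_instances, workers_per_instance, pool_size)
    return demand > max_connections * headroom
```

For example, 50 tasks with 4 workers and a pool of 5 each could open 1,000 connections, far past a 400-connection limit, while 4 such tasks (80 connections) fit comfortably.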

Performance Insights. RDS feature that tracks database load (average active sessions) and breaks it down by SQL statement, wait event, host, and user, so you can see which queries dominate. Enable on all production RDS instances: the last 7 days of data are free; longer retention is paid.

Slow query log. Postgres log_min_duration_statement = 1000 (log queries > 1s). RDS supports this via parameter groups. Slow query log + Performance Insights together = production query diagnostics.

9) RDS — read replicas and failover

Read replicas. Asynchronous copies of the primary, in the same or another region. Application can route reads to replicas to offload the primary.

aws rds create-db-instance-read-replica \
    --db-instance-identifier prod-app-db-replica-1 \
    --source-db-instance-identifier prod-app-db

Replication lag: typically < 1 second under normal load; can grow under heavy write load. Reads-after-writes can see stale data if the read hits the replica before the write replicates.
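
One way to avoid stale reads after a write is to pin a session's reads to the primary for a short window after it writes. The class below is a minimal sketch of that idea; the class name, the 2-second window, and the string targets are assumptions, and a real application would map them to actual database connections.

```python
import time


class ReadRouter:
    """Send a session's reads to the primary for a short window after it writes."""

    def __init__(self, stickiness_seconds: float = 2.0, clock=time.monotonic):
        self.stickiness_seconds = stickiness_seconds
        self.clock = clock
        self._last_write = {}  # session id -> time of most recent write

    def record_write(self, session_id: str) -> None:
        self._last_write[session_id] = self.clock()

    def target_for_read(self, session_id: str) -> str:
        last = self._last_write.get(session_id)
        if last is not None and self.clock() - last < self.stickiness_seconds:
            return "primary"   # replica may not have the write yet
        return "replica"
```

The window should comfortably exceed the replication lag you observe; sessions that have not written recently keep reading from replicas.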

Multi-AZ failover (standard RDS). On primary failure, AWS promotes the standby and repoints the instance's DNS endpoint at it, typically within one to two minutes. Existing connections break; connection pools see brief errors, then reconnect and resume on the new primary.
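
Application code can ride through that window by retrying with exponential backoff instead of failing the request on the first broken connection. A generic sketch with illustrative defaults; in practice `retry_on` would name your database driver's connection error, and the attempt count should be tuned so the total wait covers the expected failover time.

```python
import time


def call_with_retry(operation, attempts: int = 5, base_delay: float = 0.5,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Run operation(); on a connection error, wait with exponential backoff and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise               # out of attempts: surface the error
            sleep(base_delay * 2 ** attempt)   # 0.5s, 1s, 2s, 4s ...
```

Only retry operations that are safe to repeat; a write that may have committed before the connection dropped needs an idempotency check first.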

Aurora failover. Much faster — 10-30 seconds typically. Aurora's storage layer is shared, so the "promotion" is just routing.

10) The threaded example — a production Django stack on AWS

Account: production
Region: ap-south-1
VPC: 10.0.0.0/16

ECS Fargate tasks (Django app):
  - 4 tasks across 2 AZs
  - In private subnets
  - IAM role: app-role
  - Security group sg-app: inbound 8000 from sg-elb only

Application Load Balancer:
  - In public subnets
  - HTTPS listener with ACM cert
  - Target group routing to ECS tasks
  - Security group sg-elb: inbound 443 from 0.0.0.0/0

RDS Postgres:
  - db.r6g.large, multi-AZ, encrypted
  - Backup retention 14 days
  - Deletion protection on
  - In DB subnets
  - Security group sg-db: inbound 5432 from sg-app only

ElastiCache Redis:
  - Two cache.t3.medium nodes, replication, multi-AZ
  - In DB subnets
  - sg-cache: inbound 6379 from sg-app only

S3 buckets:
  - app-media: user uploads
  - app-static: collectstatic output (CloudFront in front)
  - app-backups: pg_dump outputs from a daily job
  - All with Block Public Access enabled, encryption on, versioning on

CloudWatch:
  - Logs: ECS task logs, ALB access logs, RDS logs
  - Alarms: ALB 5xx, RDS CPU > 80%, ECS task failures, RDS connection count
  - Metrics: Custom app metrics via EMF

Route 53:
  - Hosted zone for example.com
  - A record alias → ALB

ACM:
  - Wildcard cert *.example.com, validated via DNS

Components: ECS (compute), ALB (load balancer), RDS (database), ElastiCache (cache), S3 (storage), CloudWatch (observability), Route 53 (DNS), ACM (TLS). Each is sized, secured, and connected. The Terraform for this is ~500-800 lines.

Operational signals

Healthy. EC2/ECS CPU 30-70%; RDS CPU < 70%; S3 latency p99 < 200ms; connection pool utilisation < 70%; backups succeeding daily.

First degrading metric. RDS connection count climbing toward limit → connection pool exhaustion.

Misleading metric. Total spend without resource breakdown — a single misconfigured resource can dominate the bill.

Expert graph. Per-service cost dashboard; per-RDS query latency; S3 access patterns.

Where this appears in production

Most AWS-native startups — ECS or EKS for compute; RDS Aurora for database; S3 for storage.
A Bengaluru fintech — RDS Aurora multi-region with read replicas; cross-region replication lag typically under a second.
A Mumbai SaaS — ECS Fargate (no EC2 ops); ALB; RDS multi-AZ; standard three-tier.
A Pune analytics platform — S3 lifecycle to Glacier after 30 days; cost dropped 60%.
A Goa-based AI startup — Spot instances for ML training; on-demand for inference.
A Delhi e-commerce — RDS with PgBouncer; connection pool unified across 50+ ECS tasks.
A Hyderabad logistics SaaS — Aurora Serverless v2; capacity scales with daily peak.
A Chennai data platform — multi-part upload + presigned URLs for large dataset ingest.

Recall / checkpoint

What is the difference between an Instance Profile and an IAM role?
When should you use Spot instances?
What does "Block Public Access" do for S3?
What is the difference between SSE-S3 and SSE-KMS for S3?
Why is Multi-AZ standard for production RDS?
What does RDS Proxy solve?
What is a presigned URL and when do you use it?

Interview Q&A

Q1. The team's S3 bucket is publicly accessible by accident. Walk through the response. Immediate: enable all four Block Public Access settings on the bucket. Check the bucket policy and ACLs for explicit public grants; remove them. Then audit: was sensitive data exposed? S3 server access logs and CloudTrail data events, if they were enabled, reveal who downloaded what. If sensitive data was accessed by external IPs, escalate per the incident response plan (notify users, regulators, etc.). Going forward: enforce Block Public Access at account level, and add an SCP that denies s3:PutAccountPublicAccessBlock and s3:PutBucketPublicAccessBlock so nobody can switch the settings off. Common wrong answer to avoid: "rotate the data" — first stop the leak, then assess.

Q2. RDS connections are exhausting. Walk through the response. Diagnose: how many connections is the application opening? Often the answer is "one per worker per pod" — many idle connections. Short-term: increase the RDS instance size (more connections allowed). Medium-term: introduce a connection pooler — RDS Proxy or PgBouncer. The pooler multiplexes app connections onto a smaller real-RDS pool. Long-term: app-level pooling discipline — close connections promptly; use a pool per worker, not per request. Common wrong answer to avoid: "max_connections = 5000" — Postgres struggles past 500-1000 connections; pooling is the structural fix.

Q3. The team's S3 bill is climbing. Walk through the diagnosis. Three common causes. (1) No lifecycle policy: data accumulates in Standard storage forever. Fix: lifecycle to Standard-IA after 30 days, Glacier after 90, delete after 365 if appropriate. (2) Versioning enabled but not pruning old versions: every overwrite adds a version. Fix: lifecycle policy that expires old versions. (3) Cross-region transfer: data accessed from another region pays per-GB egress. Fix: replicate to the access region or co-locate compute. Use AWS Cost Explorer to break down S3 costs by bucket and storage class. Common wrong answer to avoid: "switch to a different storage provider" — fix the lifecycle first.

Q4. The team uses RDS Aurora and a read replica. A new feature reads from the replica; users see stale data. Walk through the response. Replication lag is the cause. Aurora's lag is typically subsecond but not zero. Two fixes. (1) For reads that need read-your-write consistency (just-saved data must be visible), route to the primary. Use the cluster (writer) endpoint for those reads and the reader endpoint for analytics and dashboards. (2) For tolerable staleness, accept it and document the eventual-consistency boundary. The application makes the choice per query. Common wrong answer to avoid: "increase Aurora performance" — doesn't change lag.

Q5. Walk through deploying a Django app on ECS Fargate with proper security. ECS service in private subnets; tasks have ENIs with sg-app (inbound only from sg-elb). ALB in public subnets with sg-elb (inbound 443 from 0.0.0.0/0). RDS in DB subnets with sg-db (inbound 5432 from sg-app only). IAM task role grants S3, SQS, SSM access scoped to specific resources. Secrets (DB password, API keys) in SSM Parameter Store or Secrets Manager, fetched at startup by the task. CloudWatch Logs receives stdout. ALB access logs and CloudTrail capture everything. Each tier of the deployment has explicit reasons for its security boundary. Common wrong answer to avoid: "put everything in a public subnet for convenience" — defeats network isolation.

Q6. The team wants to support 10× current load. Walk through the scaling plan. Tier by tier. (1) ECS tasks: ECS Service Auto Scaling rules on CPU or per-task RPS; raise max tasks. Fargate scales linearly. (2) ALB: scales automatically; no action. (3) RDS: vertical scale (bigger instance) for write capacity; read replicas for read scale. (4) ElastiCache: bigger nodes or more shards. (5) S3: scales automatically. (6) NAT Gateway: ensure one per AZ; a NAT Gateway scales its bandwidth automatically (up to 100 Gbps), unlike a self-managed NAT instance. (7) Watch service quotas — some AWS limits (ENIs, EIPs, etc.) cap horizontal scaling. Validate with load testing; address bottlenecks as they appear. Common wrong answer to avoid: "just scale everything" — without diagnosing the actual bottleneck, you spend on dimensions that don't help.

Operational memory

This chapter explained AWS's three most-used services: EC2 (compute), S3 (objects), RDS (databases). The important idea is that each service has a model — instance lifecycle, object storage classes and consistency, managed database failover — and production maturity comes from understanding the model, not just calling the API.

You learned to provision EC2 with the right IAM and networking, harden S3 buckets, configure RDS for multi-AZ and proper backup, choose storage classes by access pattern, and connect them with the right security groups. That solves the day-to-day surface; cost and quotas come next.

Carry this diagnostic forward: when a service misbehaves, ask which AWS-specific quirk is at fault — IAM, networking, connection limits, storage class, replication lag. Each has a known structural fix.

Remember:

EC2 instances are ephemeral; manage via ASG or container orchestrator.
S3 Block Public Access on by default; encryption on by default.
RDS multi-AZ + encryption + deletion protection + backup retention 14d = production minimum.
Use presigned URLs for user uploads; never proxy uploads through the app.
Connection pooling (RDS Proxy or PgBouncer) is required past ~10 app instances.

Bridge. The compute and storage surface is set. Production has its own surface: cost surprises, service quotas, multi-region patterns, region resilience. The next chapter is that catalogue. → 03-cost-quotas-region-prod-gotchas.md