01. IAM, VPC, account structure: the guard and the road network

~12 min read. Before anything in AWS works, IAM has to say yes. Before traffic flows, VPC has to allow it. These two services underpin every other AWS service. This chapter is their model.

Builds on: 00-eli5.md.

The country-and-cities picture is enough to start. To debug "why can't this Lambda reach this RDS?" or "why is this IAM policy not working?" you need the model.

1) IAM: the identity model

AWS evaluates every API call against IAM. The decision: allow or deny.

The actors:

Root user. The email that created the account. Has full permissions. Never use it for daily work. Set up MFA, lock it away.
IAM user. A named identity with credentials (password and/or access keys). Used by humans.
IAM role. A named identity assumed temporarily. Used by services (EC2, Lambda) and federated humans (SSO).
IAM group. A collection of users for permission management.
Service principal. Internal AWS services (e.g., lambda.amazonaws.com) that can assume roles.

The permissions:

Identity-based policies. Attached to a user, group, or role. JSON document with Allow/Deny on actions and resources.
Resource-based policies. Attached to a resource (S3 bucket, KMS key, SQS queue). Specifies who can access the resource.
Permissions boundaries. A cap on what an identity can be granted. Used to prevent privilege escalation.
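Service Control Policies (SCPs). Account-wide caps managed via AWS Organizations. They never grant permissions; they set the ceiling on what identity policies in member accounts can allow, and they do not apply to the management account.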

The decision algorithm:

1. Default: implicit deny.
2. Evaluate every applicable policy (identity + resource + boundary + SCP).
3. If any Deny matches: final = Deny.
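4. Otherwise, if an identity-based or resource-based policy has a matching Allow, and every SCP and permissions boundary that applies also allows the action: final = Allow. (Cross-account requests need an Allow on both sides.)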
5. Otherwise: implicit deny.

Deny always wins. Adding an Allow doesn't override a Deny.
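The algorithm above can be sketched in a few lines of Python. This is a simplified model, not AWS's evaluator: it covers same-account requests only, ignores resource-based policies, conditions and session policies, and uses `fnmatchcase` as a stand-in for IAM's `*` and `?` wildcards. The function names `matches` and `evaluate` are made up for illustration.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM lets Action and Resource be a single string or a list."""
    return [value] if isinstance(value, str) else list(value)


def matches(statement, action, resource):
    """True if the statement's Action and Resource patterns cover the request."""
    # Action names are case-insensitive in IAM; ARNs are case-sensitive.
    action_hit = any(
        fnmatchcase(action.lower(), pattern.lower())
        for pattern in _as_list(statement["Action"])
    )
    resource_hit = any(
        fnmatchcase(resource, pattern) for pattern in _as_list(statement["Resource"])
    )
    return action_hit and resource_hit


def evaluate(action, resource, identity, boundary=None, scp=None):
    """Return "Allow" or "Deny" for one same-account request.

    Each argument is a list of statements. boundary and scp are caps;
    None means that layer is not in use.
    """
    layers = [identity] + [layer for layer in (boundary, scp) if layer is not None]

    # An explicit Deny in any layer ends the evaluation.
    for layer in layers:
        for statement in layer:
            if statement["Effect"] == "Deny" and matches(statement, action, resource):
                return "Deny"

    # Every layer in use must contain a matching Allow.
    for layer in layers:
        if not any(
            s["Effect"] == "Allow" and matches(s, action, resource) for s in layer
        ):
            return "Deny"  # implicit deny

    return "Allow"


IDENTITY = [
    {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::my-orders-bucket", "arn:aws:s3:::my-orders-bucket/*"],
    },
    {"Effect": "Deny", "Action": "s3:DeleteObject", "Resource": "*"},
]
OBJECT = "arn:aws:s3:::my-orders-bucket/2024/order-1.json"

print(evaluate("s3:GetObject", OBJECT, IDENTITY))     # Allow
print(evaluate("s3:DeleteObject", OBJECT, IDENTITY))  # Deny (explicit)
print(evaluate("s3:PutObject", OBJECT, IDENTITY))     # Deny (implicit)
```

Passing an `scp` list that only allows `ec2:*` turns the first call into a Deny as well, which is the "cap" behaviour: the identity policy's Allow is not enough on its own.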

2) IAM policy structure

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3ReadOrders",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-orders-bucket",
        "arn:aws:s3:::my-orders-bucket/*"
      ]
    },
    {
      "Sid": "DenyDeleteEverything",
      "Effect": "Deny",
      "Action": "s3:DeleteObject",
      "Resource": "*"
    }
  ]
}

Key fields:

Effect — Allow or Deny.
Action — the API operation (s3:GetObject, ec2:RunInstances). Wildcards allowed (s3:*).
Resource — the ARN. Wildcards allowed.
Condition — extra constraints (source IP, time of day, MFA required, request tags).

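Conditions example:

"Condition": {
  "StringEquals": {"aws:PrincipalTag/Team": "payments"},
  "IpAddress": {"aws:SourceIp": ["203.0.113.0/24"]},
  "Bool": {"aws:MultiFactorAuthPresent": "true"}
}

Conditions are among the most underused IAM features. They let you express policies like "this action requires MFA," "only from the corporate VPN," "only by the team tagged X." All operators in a Condition block must match for the statement to apply. Note that aws:SourceIp is compared against the caller's public IP, so it takes public ranges (203.0.113.0/24 here stands in for a corporate egress range); for requests that arrive through a VPC endpoint, use aws:VpcSourceIp with private ranges instead.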
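To make the matching rule concrete, here is a minimal sketch of how a Condition block might be checked against a request context. It assumes only the three operators shown above and models two rules: operators and keys are ANDed, and multiple values for one key are ORed. The `condition_matches` helper, the context dictionary and the 203.0.113.0/24 range (a documentation range standing in for a corporate egress range) are illustrative, not part of any AWS SDK.

```python
import ipaddress


def _as_list(value):
    return [value] if isinstance(value, str) else list(value)


def string_equals(expected, actual):
    return actual in _as_list(expected)


def ip_address(expected, actual):
    addr = ipaddress.ip_address(actual)
    return any(addr in ipaddress.ip_network(cidr) for cidr in _as_list(expected))


def bool_equals(expected, actual):
    return str(actual).lower() == str(expected).lower()


OPERATORS = {
    "StringEquals": string_equals,
    "IpAddress": ip_address,
    "Bool": bool_equals,
}


def condition_matches(condition, context):
    """True only if every key under every operator matches the context."""
    for operator, keys in condition.items():
        check = OPERATORS[operator]
        for key, expected in keys.items():
            # A key missing from the request fails these positive operators.
            if key not in context or not check(expected, context[key]):
                return False
    return True


CONDITION = {
    "StringEquals": {"aws:PrincipalTag/Team": "payments"},
    "IpAddress": {"aws:SourceIp": ["203.0.113.0/24"]},
    "Bool": {"aws:MultiFactorAuthPresent": "true"},
}

request = {
    "aws:PrincipalTag/Team": "payments",
    "aws:SourceIp": "203.0.113.7",
    "aws:MultiFactorAuthPresent": "true",
}
print(condition_matches(CONDITION, request))  # True
print(condition_matches(CONDITION, {**request, "aws:MultiFactorAuthPresent": "false"}))  # False
```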

3) Roles and AssumeRole

Roles are temporary identities. An EC2 instance, a Lambda function, or a federated user assumes a role to get credentials.

EC2 instance with instance profile attached
    ↓ requests credentials
EC2 metadata service (169.254.169.254)
    ↓ returns temporary credentials (short-lived; rotated automatically)
Application code uses credentials in API calls
    ↓ credentials expire; SDK refreshes via metadata

The pattern: never put access keys in code or environment variables on EC2/ECS/EKS/Lambda. Use roles. The SDK fetches and refreshes credentials transparently.

For local development, use AWS IAM Identity Center (formerly AWS SSO) or a tool such as aws-vault. The principle is the same: short-lived credentials instead of long-lived keys sitting in plaintext files.

Cross-account roles. Account A wants to access resources in Account B. Account B creates a role with a trust policy: "Account A can assume this role." Account A grants its own principal permission to call sts:AssumeRole on that role. The principal then assumes the role and gets temporary credentials valid in Account B.

This is the standard pattern for multi-account architectures, central audit accounts, and shared services.
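The two halves of a cross-account setup can be sketched as plain policy documents built in Python. The account IDs (111111111111 for Account A, 222222222222 for Account B) and the role name `audit-read-only` are placeholders, and the MFA condition is an optional hardening step rather than something the pattern requires.

```python
import json


def trust_policy(trusted_account_id, require_mfa=False):
    """Trust policy attached to the role in Account B: who may assume it."""
    statement = {
        "Effect": "Allow",
        # ":root" delegates the decision to Account A's own IAM policies.
        "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
        "Action": "sts:AssumeRole",
    }
    if require_mfa:
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


def caller_policy(role_arn):
    """Identity policy in Account A: lets a principal call AssumeRole on the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sts:AssumeRole", "Resource": role_arn}
        ],
    }


ACCOUNT_A = "111111111111"  # placeholder ID for the calling account
ROLE_ARN = "arn:aws:iam::222222222222:role/audit-read-only"  # placeholder role in B

print(json.dumps(trust_policy(ACCOUNT_A, require_mfa=True), indent=2))
print(json.dumps(caller_policy(ROLE_ARN), indent=2))
```

Both documents are needed: the trust policy alone does not let anyone in Account A assume the role unless Account A also allows the call on its side.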

4) The principle of least privilege

The aspiration: each identity has only the permissions it needs, nothing more.

The reality: easier said than done. Common shortcuts:

AdministratorAccess to humans for "convenience." Avoid for production. Use SSO with role-based access.
Wildcard (*) actions and resources in policies. Tighten incrementally.
Wide trust policies on roles (any principal can assume). Use external IDs and account constraints.

Tools that help:

IAM Access Analyzer. Generates least-privilege policies from CloudTrail history.
IAM Policy Simulator. Test policies before applying.
Permissions boundaries. Cap what developers can grant.
SCP at the org level. Hard ceilings (e.g., deny IAM user creation in production accounts).

Least privilege is a discipline, not a one-time configuration. Review policies quarterly.

5) VPC: the basics

A VPC is your private network in a region. CIDR block (e.g., 10.0.0.0/16) is the IP range.

Subnets. Slices of the CIDR placed in a specific AZ. Public subnets have routes to the Internet Gateway; private subnets don't.

VPC 10.0.0.0/16 in ap-south-1
   ├── Public subnet  10.0.1.0/24  in ap-south-1a   (route to IGW)
   ├── Public subnet  10.0.2.0/24  in ap-south-1b   (route to IGW)
   ├── Private subnet 10.0.11.0/24 in ap-south-1a   (route via NAT)
   ├── Private subnet 10.0.12.0/24 in ap-south-1b   (route via NAT)
   ├── DB subnet      10.0.21.0/24 in ap-south-1a   (no internet route)
   └── DB subnet      10.0.22.0/24 in ap-south-1b   (no internet route)
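A layout like the one above can be sanity-checked with Python's `ipaddress` module before any infrastructure is created. The sketch below checks that every subnet sits inside the VPC range and that no two subnets overlap, then reports usable addresses (AWS reserves five addresses in every subnet). The `validate_layout` helper and the subnet names are illustrative.

```python
import ipaddress
from itertools import combinations

AWS_RESERVED_PER_SUBNET = 5  # first four addresses and the last one


def validate_layout(vpc_cidr, subnets):
    """Return usable address counts per subnet, or raise ValueError."""
    vpc = ipaddress.ip_network(vpc_cidr)
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    for name, net in parsed.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside {vpc}")

    for (name_a, net_a), (name_b, net_b) in combinations(parsed.items(), 2):
        if net_a.overlaps(net_b):
            raise ValueError(f"{name_a} overlaps {name_b}")

    return {
        name: net.num_addresses - AWS_RESERVED_PER_SUBNET
        for name, net in parsed.items()
    }


LAYOUT = {
    "public-1a": "10.0.1.0/24",
    "public-1b": "10.0.2.0/24",
    "private-1a": "10.0.11.0/24",
    "private-1b": "10.0.12.0/24",
    "db-1a": "10.0.21.0/24",
    "db-1b": "10.0.22.0/24",
}

print(validate_layout("10.0.0.0/16", LAYOUT)["public-1a"])  # 251
```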

Route tables. Per-subnet rules for routing traffic. The default route (0.0.0.0/0) goes to:

Public subnet: Internet Gateway (IGW). Traffic flows directly.
Private subnet: NAT Gateway (in a public subnet). Outbound only.
DB subnet: no default route. No internet access in either direction.
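Route selection follows longest-prefix match: the most specific route that contains the destination wins, which is why VPC-internal traffic stays on the `local` route even when a default route exists. A minimal sketch, with the three route tables from the list above written as plain dictionaries (the target labels `igw` and `nat` are shorthand, not real resource IDs):

```python
import ipaddress


def next_hop(route_table, destination_ip):
    """Return the target of the most specific matching route, or None."""
    destination = ipaddress.ip_address(destination_ip)
    matching = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table.items()
        if destination in ipaddress.ip_network(cidr)
    ]
    if not matching:
        return None  # no route: the packet is dropped
    return max(matching, key=lambda route: route[0].prefixlen)[1]


PUBLIC = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw"}
PRIVATE = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat"}
DB = {"10.0.0.0/16": "local"}

print(next_hop(PUBLIC, "10.0.21.5"))      # local
print(next_hop(PRIVATE, "198.51.100.10")) # nat
print(next_hop(DB, "198.51.100.10"))      # None
```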

Internet Gateway. Allows public subnets to reach the internet (and vice versa, with public IPs).

NAT Gateway. Allows private subnet resources to make outbound internet calls. AWS-managed; costs per hour + per GB. Place one per AZ for HA — otherwise the NAT's AZ becomes a single point of failure for outbound traffic.

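VPC Endpoints. Private connections to AWS services (S3, DynamoDB, SQS, etc.) that bypass the internet. Gateway endpoints (S3 and DynamoDB) are route-table entries with no additional charge; interface endpoints (most other services) are ENIs billed per hour and per GB. Both keep high-volume AWS API traffic off the NAT Gateway, which reduces NAT cost and can improve latency.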
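The per-hour plus per-GB billing model makes the effect of an endpoint easy to estimate. The sketch below is arithmetic only: the rates are placeholder values chosen for round numbers, not actual AWS prices, and it assumes the traffic moved off the NAT goes through a gateway endpoint with no data charge. Look up the current rates for your region before relying on any figure.

```python
HOURS_PER_MONTH = 730


def nat_monthly_cost(gateways, gb_processed, hourly_rate, per_gb_rate):
    """Monthly NAT cost: fixed hourly charge per gateway plus data processing."""
    fixed = gateways * HOURS_PER_MONTH * hourly_rate
    variable = gb_processed * per_gb_rate
    return fixed + variable


# Placeholder rates, NOT real prices.
HOURLY_RATE = 0.05
PER_GB_RATE = 0.05

before = nat_monthly_cost(2, 10_000, HOURLY_RATE, PER_GB_RATE)
# Suppose 7,000 GB of that traffic was S3 and now uses a gateway endpoint.
after = nat_monthly_cost(2, 3_000, HOURLY_RATE, PER_GB_RATE)

print(f"before: {before:.2f}, after: {after:.2f}")
print(f"reduction: {100 * (before - after) / before:.0f}%")
```

With these inputs the data-processing term dominates, so moving bulk S3 traffic off the NAT changes the bill far more than removing a gateway would.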

6) Security groups vs. NACLs

Two firewall layers:

Security Groups. Stateful. Attached to ENIs (network interfaces) — EC2, RDS, Lambda in a VPC. Rules are allow-only (no deny). Default-deny inbound, default-allow outbound.

sg-web (security group for web servers):
  Inbound:
    - Port 80 from 0.0.0.0/0
    - Port 443 from 0.0.0.0/0
    - Port 22 from 10.0.0.0/16 (only from inside VPC)
  Outbound:
    - All to all (default)

Security groups are the primary defence. You can reference other security groups in rules:

sg-db (security group for RDS):
  Inbound:
    - Port 5432 from sg-web   # only the web tier can reach the DB

This is the pattern: tier-based security groups; rules reference SG IDs, not IPs.
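The difference between a CIDR source and a security-group source can be sketched as a small rule matcher. This is a toy model of inbound evaluation only, using the `sg-web` and `sg-db` rules from above; the `inbound_allowed` helper and its arguments are made up for illustration.

```python
import ipaddress

SECURITY_GROUPS = {
    "sg-web": [
        {"port": 80, "source": "0.0.0.0/0"},
        {"port": 443, "source": "0.0.0.0/0"},
        {"port": 22, "source": "10.0.0.0/16"},
    ],
    "sg-db": [
        {"port": 5432, "source": "sg-web"},
    ],
}


def inbound_allowed(target_sg, port, source_ip, source_sgs=()):
    """True if any inbound rule on target_sg admits the connection."""
    for rule in SECURITY_GROUPS[target_sg]:
        if rule["port"] != port:
            continue
        source = rule["source"]
        if source.startswith("sg-"):
            # SG reference: the caller's ENI must belong to that group.
            if source in source_sgs:
                return True
        elif ipaddress.ip_address(source_ip) in ipaddress.ip_network(source):
            return True
    return False  # no rule matched: default-deny inbound


# A web server (member of sg-web) connecting to Postgres.
print(inbound_allowed("sg-db", 5432, "10.0.11.25", source_sgs=["sg-web"]))  # True
# Another host in the same subnet that is not in sg-web.
print(inbound_allowed("sg-db", 5432, "10.0.11.99"))  # False
```

The second call is the point of the pattern: being inside the VPC, or even the same subnet, grants nothing. Membership in the referenced group is what counts, so the rule keeps working as instances are replaced and IPs change.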

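Network ACLs (NACLs). Stateless, so return traffic needs its own rule. Attached to subnets. Allow and deny rules are evaluated in numbered order, lowest first, and the first match wins. The default NACL allows all traffic; a custom NACL denies all traffic until you add rules. Useful for coarse subnet-level controls; less commonly tuned.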

For most workloads, security groups are sufficient. NACLs are reserved for specific defence-in-depth scenarios (block known bad IPs at the subnet level).
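NACL evaluation is ordered and stateless, and both properties are easy to get wrong. The sketch below models them: rules are checked by ascending rule number, the first match decides, and an unmatched packet hits the implicit final deny. The rule sets and the blocked range 198.51.100.0/24 are hypothetical.

```python
import ipaddress


def nacl_decision(rules, port, peer_ip):
    """Return "allow" or "deny" using first-match-wins on rule number."""
    peer = ipaddress.ip_address(peer_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if low <= port <= high and peer in ipaddress.ip_network(rule["cidr"]):
            return rule["action"]
    return "deny"  # the implicit final "*" rule


INBOUND = [
    {"number": 100, "action": "allow", "ports": (443, 443), "cidr": "0.0.0.0/0"},
    # Lower number than the allow, so it is evaluated first.
    {"number": 90, "action": "deny", "ports": (0, 65535), "cidr": "198.51.100.0/24"},
]

# Stateless: responses leave on the client's ephemeral port and need their own rule.
OUTBOUND = [
    {"number": 100, "action": "allow", "ports": (1024, 65535), "cidr": "0.0.0.0/0"},
]

print(nacl_decision(INBOUND, 443, "198.51.100.7"))   # deny (rule 90 matches first)
print(nacl_decision(INBOUND, 443, "203.0.113.9"))    # allow
print(nacl_decision(OUTBOUND, 50000, "203.0.113.9")) # allow (return traffic)
```

If the deny were numbered 110 instead of 90, the allow at 100 would match first and the block would never take effect.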

7) The connectivity diagram: what reaches what

                          Internet
                              ▲
                              │ (via IGW)
                          Public Subnet
                            │     │
                            ▼     ▼
                            ELB   EC2 (in public subnet — rare)
                              │
                              │ (private IP)
                              ▼
                          Private Subnet
                            │     │
                            ▼     ▼
                            EC2   ECS/EKS
                              │
                              │ (private IP)
                              ▼
                          DB Subnet
                              │
                              ▼
                              RDS

The standard layout: ELB in public; app servers in private (with NAT for outbound); database in DB subnets (no internet). This is the three-tier VPC pattern.

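For Lambda in a VPC: place in private subnets. Lambda gets ENIs in those subnets; outbound goes via NAT Gateway. The ENIs are provisioned when the function is created or its VPC configuration changes, not on each invocation, so the cold-start penalty that VPC Lambdas once had is now small.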

8) Multi-account structures

Most production AWS deployments use multiple accounts under AWS Organizations:

management account. Org root. Billing consolidation; SCP management. No workloads.
production account. Live customer-facing workloads.
staging account. Pre-production environments.
dev account. Developer sandboxes.
audit / logging account. Centralised CloudTrail, security tools.
shared services account. Common infrastructure (CI, artifact stores, DNS).

Benefits:

Blast radius. A misconfigured IAM policy in dev can't damage production.
Cost attribution. Per-account billing is automatic.
Compliance. Production access is restricted; audit access is read-only.

The pattern is set up via AWS Control Tower or manually via AWS Organizations. Cross-account access uses roles + trust policies.

9) Tagging: the discipline that pays off

Tags are key-value labels on resources. They're free to add; they're invaluable for cost analysis, access control, and operations.

Standard tags:

Environment — production, staging, dev.
Team — owning team.
CostCenter — finance attribution.
Application — which app this resource belongs to.
ManagedBy — terraform, cloudformation, manual.

Tag-based policies:

{
  "Effect": "Deny",
  "Action": "ec2:TerminateInstances",
  "Resource": "*",
  "Condition": {"StringEquals": {"ec2:ResourceTag/Environment": "production"}}
}

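Tag-based cost allocation: once a tag is activated as a cost allocation tag in the Billing console, Cost Explorer can group by it, showing how much each team or environment costs. Without tags, the bill is an undifferentiated wall of line items.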

Tag discipline starts at resource creation — enforce via IaC (Terraform modules require tags), or via SCPs (deny untagged resource creation).
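A tag check of the kind an IaC pipeline might run can be sketched as two small functions: one that reports missing or unexpected tags, and one that groups spend by a tag. The required set mirrors the standard tags listed above; the function names, the resources and the cost figures are made up for illustration.

```python
from collections import defaultdict

REQUIRED_TAGS = {"Environment", "Team", "CostCenter", "Application", "ManagedBy"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev"}


def tag_problems(tags):
    """Return a list of human-readable problems; empty means compliant."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    environment = tags.get("Environment")
    if environment is not None and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected Environment value: {environment}")
    return problems


def cost_by_tag(line_items, key):
    """Sum cost per tag value; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["tags"].get(key, "(untagged)")] += item["cost"]
    return dict(totals)


# Hypothetical bill line items.
BILL = [
    {"cost": 120.0, "tags": {"Team": "payments", "Environment": "production"}},
    {"cost": 30.0, "tags": {"Team": "payments", "Environment": "dev"}},
    {"cost": 45.0, "tags": {}},
]

print(tag_problems({"Environment": "prod", "Team": "payments"}))
print(cost_by_tag(BILL, "Team"))  # {'payments': 150.0, '(untagged)': 45.0}
```

The `(untagged)` bucket is the number to watch: it is the share of the bill nobody can be asked about.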

10) The threaded example: provisioning a three-tier app

A team launches a Django app on AWS. The minimal correct setup:

1. Account: production (separate from dev/staging).
2. VPC: 10.0.0.0/16 in ap-south-1.
3. Subnets:
   - Public: 10.0.1.0/24 (1a), 10.0.2.0/24 (1b) — for ELB.
   - Private: 10.0.11.0/24 (1a), 10.0.12.0/24 (1b) — for ECS tasks.
   - DB: 10.0.21.0/24 (1a), 10.0.22.0/24 (1b) — for RDS.
4. Internet Gateway attached to VPC. Route 0.0.0.0/0 → IGW from public subnets.
5. One NAT Gateway per AZ, each in that AZ's public subnet. Route 0.0.0.0/0 → NAT from the private subnet in the same AZ.
6. Security groups:
   - sg-elb: inbound 443 from 0.0.0.0/0.
   - sg-app: inbound 8000 from sg-elb only.
   - sg-db: inbound 5432 from sg-app only.
7. IAM roles:
   - Role for ECS tasks: S3 read/write to app bucket; CloudWatch logs write; SSM Parameter Store read.
   - Role for RDS: managed by AWS.
8. RDS Postgres in DB subnets, multi-AZ, encryption enabled, deletion protection on.
9. S3 bucket: versioning enabled, encryption enabled, public access blocked.
10. CloudWatch logs for application logs; CloudWatch alarms on ELB 5xx, RDS CPU, ECS task failures.

Each of these is one or two Terraform resources; the whole stack is 200-300 lines of HCL. The discipline is in the choices: private subnets for app and DB; security groups referencing each other (not IPs); IAM roles per workload; encryption everywhere; deletion protection on the database.

Operational signals

Healthy. IAM Access Analyzer findings near zero; VPC Flow Logs show expected traffic patterns; CloudTrail captures all API calls; security groups are tight.

First degrading metric. An IAM policy attached with Action "*" on Resource "*": a sign of expediency over least privilege.

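Misleading metric. Number of IAM users. A high count does not by itself mean the account is unsafe, and a low count does not mean it is safe; many narrowly scoped roles are often safer than a few broadly scoped users.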

Expert graph. SCP compliance dashboard; IAM Access Analyzer findings count; CloudTrail anomaly detection; VPC Flow Logs for unexpected traffic.

Where this appears in production

Most AWS-native startups — Control Tower for multi-account; Terraform for VPC + IAM.
Netflix — well-documented multi-account strategy; per-team accounts; centralised audit.
Capital One (financial services) — strict IAM; SCP-enforced controls; least-privilege at scale.
A Bengaluru fintech — production account isolated from dev; SCP prevents IAM user creation in prod.
A Mumbai SaaS — three-tier VPC pattern; sg-app references sg-elb; security audit pass.
A Pune analytics platform — VPC endpoints for S3 and DynamoDB; NAT cost dropped 70%.
A Goa-based AI startup — IAM roles for everything; no long-lived access keys in code.
A Delhi e-commerce — multi-region VPC with peering; cross-region RDS replicas.

Recall / checkpoint

What is the IAM decision algorithm?
What is the difference between an IAM user and an IAM role?
When are conditions used in IAM policies?
What is the difference between an Internet Gateway and a NAT Gateway?
Why reference security groups (not IPs) in rules?
What is the difference between a security group and a NACL?
What is the typical multi-account structure and why?

Interview Q&A

Q1. An EC2 instance can't read from S3. The IAM policy looks correct. Walk through the diagnosis.

Several layers to check: (1) is the IAM role actually attached to the instance? (aws sts get-caller-identity from the instance reveals which role); (2) does the role's policy allow the action and the specific S3 resource ARN? Wildcards in resources matter: arn:aws:s3:::my-bucket is different from arn:aws:s3:::my-bucket/* (one is the bucket, the other is the objects in it), so s3:ListBucket needs the first and s3:GetObject the second; (3) is there a deny anywhere (bucket policy, permissions boundary, SCP, or a VPC endpoint policy)? (4) is the bucket in a different account? Cross-account access needs an Allow on both sides: the role's identity policy and the bucket policy; (5) is the object encrypted with a KMS key the role cannot use? Common wrong answer to avoid: "make the role admin," which bypasses the diagnosis and creates a security hole.

Q2. A Lambda in a private subnet can't reach the internet. Walk through the fix.

The private subnet needs a route to a NAT Gateway. Check: (1) is the Lambda actually in the private subnet (configured with VPC settings)? (2) does the private subnet's route table have 0.0.0.0/0 → NAT Gateway? (3) is the NAT Gateway in a public subnet with its own route to the Internet Gateway? (4) does the NAT Gateway have an Elastic IP? (5) does the security group on the Lambda allow outbound to the destination? Each step is a possible failure point. Tools: VPC Reachability Analyzer simulates the path. Common wrong answer to avoid: "put Lambda in a public subnet." A public subnet alone doesn't give the function an outbound path, because Lambda's ENIs are not assigned public IPs.

Q3. Walk through the principle of least privilege for an ECS task that processes orders.

Identify the task's required actions: read from orders-bucket, write to processed-bucket, send messages to notifications-queue, read parameters from SSM, write logs to CloudWatch. The role policy allows exactly these actions on these specific resources. No * actions. No * resources. Conditions tighten further (e.g., S3 access only from VPC endpoint). Validate with IAM Access Analyzer to ensure the policy matches actual usage. Common wrong answer to avoid: "give it AmazonS3FullAccess," which is overscoped; one bug or compromise grants more access than needed.

Q4. A team's NAT Gateway costs are surprisingly high. Walk through the diagnosis.

NAT is billed per hour and per GB processed, and the per-GB charge applies to traffic in both directions. High cost usually means high traffic volume through the NAT. Common causes: (1) downloading large datasets repeatedly to ECS/Lambda (use S3 VPC endpoint to bypass NAT); (2) calls to S3 from a private subnet without VPC endpoint (every byte routes through NAT); (3) inter-AZ traffic to a NAT in another AZ (run NAT per AZ to avoid cross-AZ charges); (4) some service making chatty internet calls. Fixes: VPC endpoints for AWS services, per-AZ NAT, identify the chatty caller. Common wrong answer to avoid: "negotiate AWS pricing." The cause is almost always architecture.

Q5. The team wants to give developers production access for a debugging session. Walk through the safe pattern.

Don't grant a long-lived IAM user with production access. Use temporary cross-account role assumption: developer is in dev account; production account has a developer-on-call role that allows AssumeRole from dev with MFA required, time-limited (15 minutes to 1 hour); the role has read-only or limited write access. Developer assumes the role, gets temporary credentials, debugs, credentials expire. CloudTrail logs the assumption and every action. Common wrong answer to avoid: "add the developer to a production IAM group," which is long-lived and weak on auditing.

Q6. A team wants multi-AZ RDS but is unsure of the subnet config. Walk through.

RDS requires a "DB subnet group" containing at least 2 subnets in different AZs. RDS uses this group to place the primary and standby. The subnets should be in private/DB tiers (no internet route). When multi-AZ is enabled, RDS creates a synchronous replica in the other AZ; failover is automatic. The application reaches RDS via its DNS endpoint; AWS updates the DNS to point at whichever instance is primary. The application doesn't need to know which AZ is primary. Common wrong answer to avoid: "place RDS in a public subnet." Never do this; the database should not be internet-reachable.

Operational memory

This chapter explained AWS's two foundational services: IAM (identity and permissions) and VPC (networking). The important idea is that every other AWS service rests on these two; understanding their model is the difference between "I can ship in AWS" and "I can debug AWS."

You learned IAM's decision algorithm, the policy structure, the use of roles vs. users, the VPC's subnet/route/gateway model, the security group pattern, and the multi-account structure. That solves the foundational layer; day-to-day services come next.

Carry this diagnostic forward: when something doesn't work in AWS, the first questions are "does IAM allow it?" and "does the network reach it?" Most production issues live in one of those two layers.

Remember:

Deny wins in IAM; everything else is implicit deny by default.
Use roles for services and SSO for humans; avoid long-lived access keys.
Security groups reference SGs, not IPs; tier-based naming.
Public subnet has IGW route; private has NAT; DB has neither.
Multi-account is a structural defence, not a luxury.

Bridge. The foundations are set. Day-to-day, you work with EC2, S3, and RDS. The next chapter covers the working surface of these three heavily used compute, storage, and database services. → 02-ec2-s3-rds-day-to-day.md