06. Service Mesh and Network Policy: Secure traffic without blind trust
⏱️ Estimated time: 25 min | Level: advanced
ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.
Why east-west traffic needs explicit control
Clusters make service-to-service traffic easy, which also makes mistakes easy. By default every pod can reach every other pod, which is convenient until one compromised workload moves sideways. Keep the analogy close: the dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? Traffic policy should say who may talk, over which protocol and port, and with which identity. Connectivity without policy becomes accidental trust. Now watch.
traffic concerns
┌────────────┐    ┌────────────┐    ┌────────────┐
│   caller   │ -> │   policy   │ -> │   callee   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
   identity    |  allow or deny  |  encrypted path
The network path needs a contract.
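To make "moves sideways" concrete, here is a minimal sketch that compares how far a compromised workload could travel on a flat network versus one with an explicit allow-list. The service names and the allow-list are hypothetical, and the model is deliberately pessimistic: it assumes every service the attacker reaches can be used as the next stepping stone.

```python
from collections import deque

# Hypothetical services, used only for illustration.
SERVICES = ["frontend", "orders", "billing", "reports"]


def flat_network(services):
    """Default reachability: every service may call every other service."""
    return {s: {t for t in services if t != s} for s in services}


# Explicit allow-list (assumed): caller -> callees it may contact.
ALLOW_LIST = {
    "frontend": {"orders"},
    "orders": {"billing"},
    "billing": set(),
    "reports": {"orders"},
}


def blast_radius(edges, compromised):
    """Return every service reachable from a compromised workload (BFS)."""
    seen = set()
    queue = deque([compromised])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, set()):
            if nxt != compromised and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(blast_radius(flat_network(SERVICES), "reports")))
# ['billing', 'frontend', 'orders']
print(sorted(blast_radius(ALLOW_LIST, "reports")))
# ['billing', 'orders']
print(sorted(blast_radius(ALLOW_LIST, "billing")))
# []
```

The allow-list does not remove all risk, but it shrinks the set of next hops, and it makes that set something you can read and review.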
NetworkPolicy sets basic packet-level boundaries
NetworkPolicy controls which pods may send or receive traffic, provided the cluster's network plugin enforces it. It usually works at Layer 3 and Layer 4 (IP addresses, protocols, and ports), not deep HTTP logic. Each policy selects target pods by label and then lists the allowed peers and ports. One default-deny policy per namespace can change that namespace's security posture fast. Start by denying broadly, then add precise allows. Now watch.
policy model
┌────────────┐    ┌────────────┐    ┌────────────┐
│ target pods│ -> │ peer rules │ -> │   ports    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 selected set  |   who may talk  |   which lanes
Once a policy selects a pod, traffic that no rule allows is dropped.
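The selection model can be sketched in a few lines. This is a simplified illustration of ingress evaluation, not the Kubernetes API: `IngressPolicy`, `matches`, and `ingress_allowed` are hypothetical names, and each allow rule pairs one peer selector with one port, whereas real policies carry separate peer and port lists.

```python
from dataclasses import dataclass


def matches(selector, labels):
    """True when every key/value in the selector appears in the labels.

    An empty selector matches every pod.
    """
    return all(labels.get(key) == value for key, value in selector.items())


@dataclass
class IngressPolicy:
    target: dict       # selects the pods this policy protects
    allow_from: list   # (peer_selector, port) pairs; empty list allows nothing


def ingress_allowed(policies, src_labels, dst_labels, port):
    """Policies are additive: one matching allow rule is enough."""
    applicable = [p for p in policies if matches(p.target, dst_labels)]
    if not applicable:
        return True  # a pod selected by no policy is not isolated
    return any(
        matches(peer, src_labels) and port == allowed_port
        for policy in applicable
        for peer, allowed_port in policy.allow_from
    )


policies = [
    IngressPolicy(target={}, allow_from=[]),  # default deny for every pod
    IngressPolicy(target={"app": "orders"},
                  allow_from=[({"app": "frontend"}, 8080)]),
]
frontend, orders, batch = {"app": "frontend"}, {"app": "orders"}, {"app": "batch"}

print(ingress_allowed(policies, frontend, orders, 8080))  # True
print(ingress_allowed(policies, batch, orders, 8080))     # False
print(ingress_allowed(policies, frontend, orders, 22))    # False
print(ingress_allowed([], batch, orders, 8080))           # True (no policy at all)
# A mislabeled pod inherits the trust of the label it carries.
print(ingress_allowed(policies, {"app": "frontend", "team": "batch"}, orders, 8080))  # True
```

The last call is the reason label hygiene is a security concern: the policy trusts the label, not the intent behind it.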
Service mesh adds identity, encryption, and richer traffic control
A service mesh usually inserts a data-plane proxy beside each workload. Meshes such as Istio and Linkerd then manage identity, certificates, and traffic policy from a central control plane. Mutual TLS encrypts the connection and lets each side verify the other's certificate, so both workloads prove who they are. That extra control comes with latency, complexity, and debugging overhead. A mesh is powerful, but it is never free. Now watch.
mesh pattern
┌────────────┐    ┌────────────┐    ┌────────────┐
│    app     │ -> │  sidecar   │ -> │    peer    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
 local proxy   |   mTLS policy   |   remote proxy
Identity follows the workload, not the IP.
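A small sketch of what "identity, not IP" means for authorization. It assumes the mutual TLS handshake has already verified the peer certificate and extracted an identity string from it. The SPIFFE-style identities, the `AUTHZ` table, and the `authorize` helper are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    peer_identity: str  # taken from the peer's verified certificate (assumed)
    source_ip: str      # changes whenever the pod is rescheduled


# Hypothetical policy: callee identity -> caller identities allowed to connect.
AUTHZ = {
    "spiffe://cluster.local/ns/shop/sa/payments": {
        "spiffe://cluster.local/ns/shop/sa/checkout",
    },
}


def authorize(callee_identity, conn):
    """Allow only when the verified caller identity is on the callee's list.

    The source IP is never consulted.
    """
    return conn.peer_identity in AUTHZ.get(callee_identity, set())


payments = "spiffe://cluster.local/ns/shop/sa/payments"
checkout = "spiffe://cluster.local/ns/shop/sa/checkout"
reports = "spiffe://cluster.local/ns/shop/sa/reports"

print(authorize(payments, Connection(checkout, "10.0.1.7")))   # True
print(authorize(payments, Connection(checkout, "10.0.9.42")))  # True, new IP, same identity
print(authorize(payments, Connection(reports, "10.0.1.7")))    # False, reused IP, wrong identity
```

A rule keyed on IP would have made the opposite call in the last two cases, which is why pod churn breaks IP-based trust.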
Traffic splitting helps canaries and safe migrations
Meshes shine when you need percentage-based routing and richer release control. You can send five percent of traffic to a canary before a full cutover. Header-based routing also supports tenant, region, or feature experiments. These knobs help only when metrics and rollback actions are ready. Smart routing without observability is just sophisticated guessing. Now watch.
split traffic
┌────────────┐    ┌────────────┐    ┌────────────┐
│  baseline  │ -> │   canary   │ -> │  metrics   │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  95 percent   |    5 percent    |  decision loop
Routing percentages need outcome checks.
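The split and the outcome check can be sketched together. This is an illustration rather than any mesh's implementation: the sketch hashes a request id so the same request always lands on the same side, while real proxies may instead pick randomly per request. The `x-canary` header name, the error-rate threshold, and the minimum sample size are assumptions.

```python
import hashlib


def bucket(request_id):
    """Map a request id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_id, headers, canary_percent=5):
    """Header override first, then the percentage split."""
    if headers.get("x-canary") == "always":  # hypothetical header
        return "canary"
    return "canary" if bucket(request_id) < canary_percent else "baseline"


def decide(baseline_errors, baseline_total, canary_errors, canary_total,
           max_extra_error_rate=0.01, min_samples=100):
    """Compare error rates and return 'wait', 'rollback', or 'promote'."""
    if canary_total < min_samples:
        return "wait"  # too little canary traffic to judge
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    if canary_rate - baseline_rate > max_extra_error_rate:
        return "rollback"
    return "promote"


sample = [route(f"req-{i}", {}) for i in range(10_000)]
print(sample.count("canary") / len(sample))        # close to 0.05
print(route("req-1", {"x-canary": "always"}))       # canary
print(decide(10, 1900, 1, 100))                     # promote
print(decide(10, 1900, 5, 100))                     # rollback
print(decide(0, 100, 0, 10))                        # wait
```

The `decide` step is the part teams tend to skip. Without it, the five percent is only a smaller audience for the same unverified release.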
Operational simplicity still wins more often than fancy policy
Many teams need only NetworkPolicy plus clean ingress and egress rules. A full mesh is justified once workload identity and traffic-control needs outgrow what NetworkPolicy can express. The real skill is knowing when not to add another control plane. Security should become clearer after tooling, not foggier. Choose the smallest control surface that closes the real risk. Now watch.
tool choice
┌────────────┐    ┌────────────┐    ┌────────────┐
│    none    │ -> │   policy   │ -> │    mesh    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
simple cluster |  need isolation |  need identity
Security maturity should grow step by step.
Where this lives in the wild
- Platform teams start with namespace default-deny policies before touching service mesh.
- Banks use mTLS and strong workload identity for sensitive east-west calls.
- Canary releases often rely on mesh or gateway traffic splitting with SLO checks.
- Multi-team clusters need clear egress rules so compromised pods cannot wander freely.
Pause and recall
- Why is internal service traffic risky even inside one cluster?
- What can NetworkPolicy control, and what can it not express well?
- What extra value does a service mesh add beyond basic networking?
- Why can retries and traffic splitting still create failures?
Interview Q&A
Q: Why start with NetworkPolicy before adding a full mesh? A: It closes obvious lateral-movement risks with less operational overhead. Many teams need basic isolation long before they need rich traffic shaping. Common wrong answer to avoid: “Because meshes are obsolete now.”
Q: Why is mTLS more than just encryption? A: It also proves workload identity on both sides of the connection. That identity can then drive stronger authorization decisions. Common wrong answer to avoid: “Because it compresses traffic better.”
Q: Why can retries be dangerous? A: Retries can multiply pressure on an already failing downstream service. Without budgets and visibility, they hide problems while making them worse. Common wrong answer to avoid: “Because retries always increase latency.”
Q: Why does labeling quality matter for security policy? A: Policies select workloads by labels, so wrong labels become wrong trust boundaries. Careless metadata can open paths you believed were closed. Common wrong answer to avoid: “Because labels are only for dashboards.”
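The retry answer above mentions budgets. Here is a minimal sketch of one, plus the arithmetic behind retry amplification. The `RetryBudget` class and its default ratio are assumptions chosen for illustration. Some meshes expose a comparable setting, but this does not reproduce any particular product's behavior.

```python
class RetryBudget:
    """Permit retries only while they stay under a fraction of original requests."""

    def __init__(self, ratio=0.2, min_retries=1):
        self.ratio = ratio              # retries allowed per original request
        self.min_retries = min_retries  # floor so low-traffic callers can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        """Spend one unit of budget if any is left."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def worst_case_amplification(attempts_per_hop, hops):
    """Upper bound on calls hitting the deepest service for one user request."""
    return attempts_per_hop ** hops


budget = RetryBudget(ratio=0.2)
for _ in range(10):
    budget.record_request()

print([budget.can_retry() for _ in range(3)])  # [True, True, False]
print(worst_case_amplification(3, 3))          # 27
```

Three attempts at each of three hops can turn one request into 27 calls against a service that is already struggling. A budget caps that multiplier regardless of how many callers retry at once.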
Apply now (5 min)
Take three services: web, payments, and metrics. Write one default-deny rule and then the minimum allow rules. Now decide whether you also need mTLS or traffic splitting. List one failure the mesh could introduce during rollout. Finally, note the first log or event source you would inspect.
Bridge. Traffic secured. But how does the port grow and shrink with demand? → 07 (07-autoscaling-and-capacity.md)