06. Service Mesh and Network Policy: Secure traffic without blind trust

⏱️ Estimated time: 25 min | Level: advanced

ELI5 callback: Think of a busy shipping port. The dock manager must place every container on the right ship. Heavy ML work needs a cargo crane, and port security keeps lanes and permissions clean.

Why east-west traffic needs explicit control

Clusters make service-to-service traffic easy, which also makes mistakes easy. Default reachability is convenient until one compromised workload moves sideways. Keep the analogy close. The dock manager reads the manifest, the container carries one workload unit, the ship offers capacity, the cargo crane handles ML-heavy lifts, and port security blocks unsafe access. Simple, no? Traffic policy should say who may talk, how, and with which identity. See. Connectivity without policy becomes accidental trust. Now watch.

traffic concerns
┌────────────┐    ┌────────────┐    ┌────────────┐
│ caller     │ -> │ policy     │ -> │ callee     │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  identity | allow or deny | encrypted path
The network path needs a contract.

North-south security handles entry traffic from outside the cluster. East-west security handles service calls inside the cluster boundary. Identity, encryption, and authorization are related but not identical. Many incidents start with over-broad internal trust assumptions. Policy is easier when labels and namespaces reflect real boundaries. Start with the simplest rule set that blocks obvious sideways movement. So what to do? Map trust boundaries before picking tooling. Group workloads by sensitivity and communication need. Keep namespace design aligned with team and risk boundaries. Review default-allow traffic assumptions explicitly.
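The boundary-mapping step above can be sketched as a small flow inventory. This is a minimal illustration, not a tool: the workload names, namespaces, sensitivity tiers, and the list of required flows are all hypothetical. It shows two checks: flows seen in traffic that nobody declared, and declared flows that enter a high-sensitivity workload from a lower tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    namespace: str
    sensitivity: str  # hypothetical tiers: "low", "medium", "high"


# Hypothetical inventory; names and tiers are illustrative only.
WORKLOADS = {
    "web": Workload("web", "shop", "low"),
    "payments": Workload("payments", "payments", "high"),
    "metrics": Workload("metrics", "observability", "medium"),
}

# Flows the teams say they actually need, as (caller, callee) pairs.
REQUIRED_FLOWS = {("web", "payments"), ("metrics", "web"), ("metrics", "payments")}


def unreviewed_flows(observed):
    """Return observed flows that nobody declared as required."""
    return sorted(set(observed) - REQUIRED_FLOWS)


def crosses_into_high(flow):
    """True when a flow enters a high-sensitivity workload from a lower tier."""
    caller, callee = (WORKLOADS[name] for name in flow)
    return callee.sensitivity == "high" and caller.sensitivity != "high"


# Hypothetical traffic sample, e.g. collected from flow logs.
observed = [("web", "payments"), ("web", "metrics"), ("metrics", "payments")]

print("undeclared:", unreviewed_flows(observed))
print("enter high tier:", [f for f in sorted(REQUIRED_FLOWS) if crosses_into_high(f)])
```

Undeclared flows are candidates for denial. Flows into the high tier are the ones worth the strictest review.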

NetworkPolicy sets basic packet-level boundaries

NetworkPolicy controls which pods may send or receive traffic. It works at Layer 3 and Layer 4, not deep HTTP logic. Policies select target pods and then list allowed peers and ports. Pods that no policy selects accept all traffic, and policies are namespaced, so one default-deny policy per namespace can change the security posture fast. See. Start denying broadly, then add precise allows. Now watch.

policy model
┌────────────┐    ┌────────────┐    ┌────────────┐
│ target pods│ -> │ peer rules │ -> │ ports      │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  selected set | who may talk | which lanes
Once a policy selects a pod, traffic that no rule allows is dropped.
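The target, peer, and port model can be made concrete by building the manifests as plain data. The sketch below uses the `networking.k8s.io/v1` NetworkPolicy fields. The namespace `shop`, the `app` labels, the policy names, and port 8443 are assumptions chosen for illustration. The output is JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def default_deny(namespace):
    """Select every pod in the namespace and allow nothing in either direction."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so nothing is allowed
        },
    }


def allow_ingress(name, namespace, target_labels, peer_labels, port):
    """Allow TCP traffic on one port from labeled peers to labeled targets."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # A bare podSelector matches peers in the policy's own namespace.
                    "from": [{"podSelector": {"matchLabels": peer_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical namespace, labels, and port.
deny = default_deny("shop")
allow = allow_ingress("allow-web-to-payments", "shop", {"app": "payments"}, {"app": "web"}, 8443)

print(json.dumps(deny, indent=2))
print(json.dumps(allow, indent=2))
```

Policies are additive, so the allow rule opens one lane on top of the deny baseline rather than replacing it.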

Ingress rules describe incoming peers allowed to reach selected pods. Egress rules describe where selected pods may initiate traffic outward. Enforcement depends on a network plugin (CNI) that implements NetworkPolicy; without one, the API accepts policies but nothing enforces them. DNS, metrics, and control-plane calls are easy to block by accident. Labels become security handles, so sloppy labeling becomes risky. Policy tests matter because humans guess wrong about traffic paths. So what to do? Create default-deny policies per namespace as a baseline. Allow DNS and observability paths deliberately. Version policies with the application code they protect. Test both allowed and denied flows during reviews.
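Testing allowed and denied flows is easier with a mental model you can run. The following is a deliberately simplified evaluator for ingress within one namespace, not the real enforcement logic of any network plugin. The policy shape (`target`, `allow`, `peers`, `ports`) is a hypothetical reduction of the manifest fields, and it ignores namespaces, IP blocks, and protocols.

```python
def matches(selector, labels):
    """A selector matches when every key/value pair is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies, src_labels, dst_labels, port):
    """Decide whether src may reach dst on port under additive allow rules."""
    selecting = [p for p in policies if matches(p["target"], dst_labels)]
    if not selecting:
        return True  # no policy selects the pod, so it is not isolated
    return any(
        matches(peer, src_labels) and port in rule["ports"]
        for policy in selecting
        for rule in policy["allow"]
        for peer in rule["peers"]
    )


# Hypothetical rule set: a default deny plus one allow for web -> payments.
POLICIES = [
    {"target": {}, "allow": []},  # selects all pods, allows nothing
    {"target": {"app": "payments"}, "allow": [{"peers": [{"app": "web"}], "ports": [8443]}]},
]

WEB, PAYMENTS, METRICS = {"app": "web"}, {"app": "payments"}, {"app": "metrics"}

print(ingress_allowed(POLICIES, WEB, PAYMENTS, 8443))      # intended flow
print(ingress_allowed(POLICIES, METRICS, PAYMENTS, 8443))  # sideways movement
print(ingress_allowed(POLICIES, WEB, PAYMENTS, 22))        # wrong port
```

Note that the metrics flow is denied here. If scraping is required, it needs its own explicit allow, which is exactly the kind of path that gets blocked by accident.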

Service mesh adds identity, encryption, and richer traffic control

A service mesh usually inserts a data-plane proxy beside each workload. Istio and Linkerd then manage identity, certificates, and traffic policy centrally. Mutual TLS encrypts the connection and lets each side verify the other's identity. That extra control comes with latency, complexity, and debugging overhead. See. A mesh is powerful, but it is never free. Now watch.

mesh pattern
┌────────────┐    ┌────────────┐    ┌────────────┐
│ app        │ -> │ sidecar    │ -> │ peer       │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  local proxy | mTLS policy | remote proxy
Identity follows the workload, not the IP.
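To see what "identity, not IP" means in practice, here is a minimal sketch of identity-based authorization. It assumes workload identities arrive as SPIFFE-style URIs of the form `spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>`, the shape Istio documents. The trust domain, namespaces, service accounts, and the allow table are hypothetical, and a real mesh makes this decision in the proxy after verifying the peer certificate.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_identity(uri):
    """Parse a SPIFFE-style URI; reject anything that does not fit the assumed shape."""
    parsed = urlparse(uri)
    parts = parsed.path.strip("/").split("/")
    if parsed.scheme != "spiffe" or len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError(f"unexpected identity format: {uri!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])


# Hypothetical allow table: callee (namespace, service account) -> permitted callers.
ALLOWED_CALLERS = {
    ("payments", "payments"): {("shop", "web")},
}


def is_authorized(caller_uri, callee_uri):
    """Allow only callers listed for the callee; the source IP is never consulted."""
    caller, callee = parse_identity(caller_uri), parse_identity(callee_uri)
    if caller.trust_domain != callee.trust_domain:
        return False
    permitted = ALLOWED_CALLERS.get((callee.namespace, callee.service_account), set())
    return (caller.namespace, caller.service_account) in permitted


WEB = "spiffe://cluster.local/ns/shop/sa/web"
METRICS = "spiffe://cluster.local/ns/observability/sa/metrics"
PAYMENTS = "spiffe://cluster.local/ns/payments/sa/payments"

print(is_authorized(WEB, PAYMENTS))
print(is_authorized(METRICS, PAYMENTS))
```

A rescheduled pod gets a new IP but keeps its service account, so the decision stays the same.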

mTLS means both caller and callee present trusted certificates. Service identities are often tied to service accounts or workload names. The mesh can enforce authorization after identity is established. Retries and timeouts become safer when centralized and observable. But proxy injection, certificate rotation, and version skew need care. Keep the business value clear before adding this much machinery. So what to do? Use a mesh where policy and observability needs justify it. Track added latency from proxy hops before promising gains. Automate certificate rotation and expiry alerts. Document how teams debug calls with sidecars in the path.
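An expiry alert can be as simple as comparing a certificate's not-after time with the clock. The sketch below assumes the expiry timestamps have already been read from the certificates by some other means. The seven-day warning window and the status names are hypothetical, and should be shortened to fit meshes that issue short-lived certificates.

```python
from datetime import datetime, timedelta, timezone


def expiry_status(not_after, now, warn_within=timedelta(days=7)):
    """Classify a certificate by the time left before its not-after timestamp."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= warn_within:
        return "rotate-soon"
    return "ok"


# Fixed clock so the example is reproducible; real checks would use datetime.now(timezone.utc).
now = datetime(2024, 1, 1, tzinfo=timezone.utc)

# Hypothetical workloads and expiry times.
certs = {
    "web": now + timedelta(days=30),
    "payments": now + timedelta(days=3),
    "metrics": now - timedelta(hours=1),
}

for workload, not_after in certs.items():
    print(workload, expiry_status(not_after, now))
```

Alerting on "rotate-soon" rather than "expired" is the point: the first is a ticket, the second is an outage.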

Traffic splitting helps canaries and safe migrations

Meshes shine when you need percentage-based routing and richer release control. You can send five percent of traffic to a canary before a full cutover. Header-based routing also supports tenant, region, or feature experiments. These knobs help only when metrics and rollback actions are ready. See. Smart routing without observability is just sophisticated guessing. Now watch.

split traffic
┌────────────┐    ┌────────────┐    ┌────────────┐
│ baseline   │ -> │ canary     │ -> │ metrics    │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  95 percent | 5 percent | decision loop
Routing percentages need outcome checks.
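The 95/5 split plus decision loop can be sketched in a few lines. This is an illustration, not how any particular mesh routes: it hashes a request ID into 100 buckets so the example is reproducible, whereas a proxy would typically pick by weight per request. The function names, the `max_extra` error allowance, and the request IDs are hypothetical.

```python
import hashlib


def pick_backend(request_id, canary_percent=5):
    """Map a request ID to a stable bucket in [0, 100) and route by weight."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "baseline"


def canary_healthy(baseline_error_rate, canary_error_rate, max_extra=0.01):
    """Outcome check: the canary may not exceed baseline errors by more than max_extra."""
    return canary_error_rate <= baseline_error_rate + max_extra


# Route a batch of hypothetical request IDs and measure the observed share.
routes = [pick_backend(f"req-{i}") for i in range(10_000)]
canary_share = routes.count("canary") / len(routes)
print(f"canary share: {canary_share:.3f}")

# Decide whether to raise the percentage, using made-up error rates.
print("promote" if canary_healthy(0.010, 0.015) else "roll back")
```

The success threshold is fixed before the rollout starts, so the decision is a comparison rather than a debate.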

Traffic splitting should pair with latency, error, and saturation metrics. Retries can hide failures while amplifying load on weak backends. Circuit breaking protects the caller, but it can mask deeper issues. Timeout budgets must match user expectations and downstream reality. Canary policy belongs with release policy, not as a separate hobby. For simple systems, an ingress controller may already be enough. So what to do? Set one success threshold before flipping percentages upward. Cap retries so failing backends are not hammered harder. Record which policy changed when debugging surprise regressions. Prefer one traffic management layer per path where possible.
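Why retries amplify load is plain arithmetic. If each layer in a call chain retries up to r times, the deepest service can see up to (1 + r) to the power of the chain depth attempts per user request. The sketch below shows that bound and a capped retry loop. The `Flaky` backend is a hypothetical stand-in, and a real client would also add backoff with jitter and an overall time budget.

```python
def worst_case_attempts(retries_per_layer, layers):
    """Upper bound on attempts at the deepest service for one user request."""
    return (1 + retries_per_layer) ** layers


def call_with_retries(operation, max_retries=2):
    """Call operation, retrying on ConnectionError at most max_retries times."""
    attempts = 0
    last_error = None
    while attempts <= max_retries:
        attempts += 1
        try:
            return operation(), attempts
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


class Flaky:
    """Hypothetical backend that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("backend unavailable")
        return "ok"


# Three layers, each retrying three times: up to 4 * 4 * 4 attempts downstream.
print(worst_case_attempts(3, 3))

# A backend that recovers on the third call succeeds within the cap.
print(call_with_retries(Flaky(2), max_retries=2))
```

Capping retries at one layer, ideally the one closest to the user, keeps the exponent at one.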

Operational simplicity still wins more often than fancy policy

Many teams need only NetworkPolicy plus clean ingress and egress rules. A full mesh is justified when identity and traffic control needs grow. The real skill is knowing when not to add another control plane. Security should become clearer after tooling, not foggier. See. Choose the smallest control surface that closes the real risk. Now watch.

tool choice
┌────────────┐    ┌────────────┐    ┌────────────┐
│ none       │ -> │ policy     │ -> │ mesh       │
└─────┬──────┘    └─────┬──────┘    └─────┬──────┘
      │                 │                 │
      v                 v                 v
  simple cluster | need isolation | need identity
Security maturity should grow step by step.

Start with namespace boundaries and default-deny networking. Add mesh features when you need mTLS, rich authz, or traffic shaping. Train developers on failure modes introduced by proxies and policies. Keep runbooks for certificate failures and blocked traffic. Auditing policy drift matters as much as creating the first rules. Simple systems fail in fewer places and teach faster. So what to do? Name one owner for network policy and mesh standards. Review denied-flow logs so policy mistakes surface early. Minimize overlapping traffic control layers across the same path. Revisit whether the mesh still earns its operational cost.
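Auditing drift comes down to comparing what version control says should exist with what the cluster reports. Here is a minimal sketch, assuming both sides have already been loaded into dictionaries keyed by policy name. The policy names and specs are hypothetical, and fetching the live set is left out on purpose.

```python
def policy_drift(desired, live):
    """Compare two name -> spec mappings and report how they differ."""
    desired_names, live_names = set(desired), set(live)
    return {
        "missing": sorted(desired_names - live_names),      # in git, not in the cluster
        "unexpected": sorted(live_names - desired_names),   # in the cluster, not in git
        "changed": sorted(
            name for name in desired_names & live_names if desired[name] != live[name]
        ),
    }


# Hypothetical desired state, as stored alongside the application code.
desired = {
    "default-deny-all": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    "allow-web-to-payments": {"podSelector": {"matchLabels": {"app": "payments"}}},
}

# Hypothetical live state: one policy was edited by hand, one was added ad hoc.
live = {
    "default-deny-all": {"podSelector": {}, "policyTypes": ["Ingress"]},
    "debug-allow-all": {"podSelector": {}},
}

print(policy_drift(desired, live))
```

An "unexpected" allow rule is often the most interesting finding, because it usually started life as a temporary debugging step.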

Where this lives in the wild

Platform teams start with namespace default-deny policies before touching service mesh.
Banks use mTLS and strong workload identity for sensitive east-west calls.
Canary releases often rely on mesh or gateway traffic splitting with SLO checks.
Multi-team clusters need clear egress rules so compromised pods cannot wander freely.

Pause and recall

Why is internal service traffic risky even inside one cluster?
What can NetworkPolicy control, and what can it not express well?
What extra value does a service mesh add beyond basic networking?
Why can retries and traffic splitting still create failures?

Interview Q&A

Q: Why start with NetworkPolicy before adding a full mesh? A: It closes obvious lateral-movement risks with less operational overhead. Many teams need basic isolation long before they need rich traffic shaping. Common wrong answer to avoid: “Because meshes are obsolete now.”

Q: Why is mTLS more than just encryption? A: It also proves workload identity on both sides of the connection. That identity can then drive stronger authorization decisions. Common wrong answer to avoid: “Because it compresses traffic better.”

Q: Why can retries be dangerous? A: Retries can multiply pressure on an already failing downstream service. Without budgets and visibility, they hide problems while making them worse. Common wrong answer to avoid: “Because retries always increase latency.”

Q: Why does labeling quality matter for security policy? A: Policies select workloads by labels, so wrong labels become wrong trust boundaries. Careless metadata can open paths you believed were closed. Common wrong answer to avoid: “Because labels are only for dashboards.”

Apply now (5 min)

Take three services: web, payments, and metrics. Write one default-deny rule and then the minimum allow rules. Now decide whether you also need mTLS or traffic splitting. List one failure the mesh could introduce during rollout. Finally, note the first log or event source you would inspect.

Bridge. Traffic secured. But how does the port grow and shrink with demand? → 07-autoscaling-and-capacity.md