Handling 200k Requests Per Second Surges with Zalando SRE Manager Johannes Boumans
In this episode, Johannes Boumans, Engineering Manager in Zalando’s SRE team, shares how Lounge by Zalando handles daily surges of up to 200,000 requests per second. He discusses the shift from monoliths to microservices, the “you build it, you run it” model, SRE champions, and the trade-offs behind reliability, fairness, and cost. From bot defense to chaos engineering, it’s a deep dive into scaling one of Europe’s largest e-commerce platforms.
---
Johannes Boumans is an Engineering Manager in the SRE organization at Zalando, where he leads reliability efforts for Lounge by Zalando, the company’s off-price shopping destination. Over nearly 10 years at Zalando, he has grown from product support into SRE leadership and now supports 25 engineering teams in building resilient, fair, and scalable systems. He is passionate about the “you build it, you run it” philosophy and champions practices like chaos engineering, predictive scaling, and bot defense to keep systems reliable.
This podcast is hosted by José Quaresma, researched by Joseph Thwaites and produced by Perseu Mandillo.
Chapters
00:00 – Intro
01:28 – Zalando: Europe's leading fashion destination
02:42 – The company’s rapid tech evolution since 2008
03:41 – From one team to 25: Johannes’ journey
05:48 – How the SRE champions model works
08:00 – What reliability really means at Zalando
09:27 – From monolith to full DevOps accountability
11:32 – What makes Lounge by Zalando unique
12:50 – Dealing with massive daily traffic spikes
14:05 – Predictive scaling and real-time cost control
17:15 – First-come, first-served: fairness at scale
22:11 – Solving the challenges of limited inventory
25:09 – Combating bots with layered protections
27:12 – Trade-offs: performance vs. experience
29:38 – Why Lounge doesn’t have a search function
31:17 – Advice for engineering managers facing traffic surges
34:25 – Chaos testing in production—including turning off zones
35:53 – Scaling advice for daily vs. seasonal peaks
37:55 – Evaluating virtual waiting rooms for fairness
39:30 – Book & mindset recommendations for engineers
41:43 – Scalability is… balance, cost, and confidence
© Queue-it, 2025