Death by Uptime
We hit a new (and disturbing!) failure mode recently when a production rack that had been up for several months saw every (!) compute sled's service processor become simultaneously unresponsive. Bryan and Adam were joined by the members of the Oxide team who debugged the vexing issue -- and reached its surprising root cause.
In addition to Bryan Cantrill and Adam Leventhal, we were joined by Oxide colleagues Cliff Biffle, Matt Keeter, and Will Chandler.
Previously, on Oxide and Friends:
- OxF s05e03 – Holistic Engineering with Robert Mustacchi
- OxF s04e14 – Rebooting a datacenter: A decade later
- OxF s01e26 – The Pragmatism of Hubris
- OxF s05e20 – Debugger-Driven Development (omdb)
- OxF s05e07 – Transparency in Hardware/Software Interfaces
- OxF s05e31 – Futurelock
- OxF s05e33 – A Grown-up ZFS Data Corruption Bug
Some of the topics we hit on, in the order that we hit them:
- hubris #2304: STM32H7 Ethernet driver stops yielding CPU after many packets
- gist — Summarizing the Hubris side of investigations
- Matt's blog: Hunting a spooky ethernet driver bug
If we got something wrong or missed something, please file a PR! Our next show will likely be on Monday at 5p Pacific Time on our Discord server; stay tuned to our Mastodon feeds for details, or subscribe to this calendar. We'd love to have you join us, as we always love to hear from new speakers!