
breadchris | hack the planet

March 15, 2024

hack the planet | failure is not an option, it happens

Hey y'all,

Every year, I run a cyber security competition for high schoolers. Yesterday was the 7th iteration. Months of planning go into designing the story and putting together digital evidence for the competitors to try to solve a murder mystery. It was a special event for me because the organizers who built it for my high school self stoked my passion for security.

Unfortunately, it did not go as planned this year. After I gave the introduction talk to 350 students, the site got DoSed by the concurrent load of everyone accessing it at once. OK, easy fix: my app is deployed on kubes, so I'll just increase the CPU and RAM of the pod. Error: unschedulable. Investigating why, I saw that I was using low-cost budget nodes, which were too small to support the requested resources.

cheap, but very small

Most of what I know comes from the book I read, which was incredibly useful for getting started, and now I was being tested under pressure. If I had been following the book, I would have tried to balance the load by increasing the number of pods for my app. My heat-of-the-moment solution was to create a new node pool with bigger nodes, which is not wrong but is more involved. I needed to move the pods in the current node pool over to the new one, but I made a mistake. Instead of draining the node pool, I tried to delete it. It technically would have worked, but Postgres's volume claim was preventing the node from being removed. I am now reminded of the warnings I have read about running a database in kubes. Watching the time go by, I got flustered and started on other deployment options.

the message I missed: "pod has unbound immediate PersistentVolumeClaims"
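To make the difference between "bigger pod" and "more pods" concrete, here is a minimal sketch of the fit check, assuming a toy first-fit model and made-up node sizes. It is not how the real Kubernetes scheduler works (that also weighs taints, affinity, system pods, and more), and `Node`, `schedule`, and `small_pool` are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_millis: int  # allocatable CPU in millicores (assumed value)
    mem_mib: int     # allocatable memory in MiB (assumed value)
    pods: list = field(default_factory=list)  # (cpu_millis, mem_mib) per placed pod

    def free(self):
        """Return the (cpu, memory) still unclaimed on this node."""
        used_cpu = sum(p[0] for p in self.pods)
        used_mem = sum(p[1] for p in self.pods)
        return self.cpu_millis - used_cpu, self.mem_mib - used_mem


def schedule(nodes, replicas, cpu_millis, mem_mib):
    """Toy first-fit placement: return how many replicas found a node."""
    placed = 0
    for _ in range(replicas):
        for node in nodes:
            free_cpu, free_mem = node.free()
            if cpu_millis <= free_cpu and mem_mib <= free_mem:
                node.pods.append((cpu_millis, mem_mib))
                placed += 1
                break
    return placed


def small_pool():
    # Hypothetical budget pool: three nodes, each a bit under 1 vCPU / 2 GiB.
    return [Node(f"node-{i}", 940, 1800) for i in range(3)]


# Vertical scaling: one pod requesting 2 vCPU / 4 GiB fits on no node.
print(schedule(small_pool(), replicas=1, cpu_millis=2000, mem_mib=4096))  # 0

# Horizontal scaling: six pods at 400m / 512 MiB spread across the same pool.
print(schedule(small_pool(), replicas=6, cpu_millis=400, mem_mib=512))  # 6
```

In this toy model the single large request is "unschedulable" because no individual node has room for it, while the same pool happily holds several small replicas. That is the intuition behind scaling out before reaching for bigger nodes.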

My other solutions were a frantic rabbit hole, battling the strict requirements of the network the students were using. I set up a Compute Engine VM quickly, but it had no HTTPS. I needed a load balancer to get HTTPS, but that was too confusing, so I used ngrok. ngrok's domain was blocked, so I set up a custom domain. Finally, with everything seemingly working, I realized my fatal flaw. My computer started to heat up as I remembered that ngrok routes connections through your computer. The silly patchwork of infrastructure resulted in something that "kind of worked."

trying to set up HTTPS in real time is not easy

It pains me to write this out and think through how simple the fix should have been. I got tunnel vision in the moment and failed to think clearly through the available options.

The silver lining here is that I was able to give an engaging post-mortem presentation to the students at the end of the day. I wanted to demonstrate to them how to address failure: take ownership and be transparent. It led to some pretty insightful questions, which made me happy.

I still have the competition running if you are interested in checking it out. If you have questions about it or about infra things, let me know. I would love to help ya out :)

Thank you,

Chris


https://breadchris.com