04/18/11
Whew! What an exciting night! Just a quick recap...
So last night we delivered ~900 trunks, so around 5400 baby Meeroos. (That's not even 25% of the beta testers.) About 15-20 mins into a mass birthing frenzy, people started reporting 502 errors. Aaaargh.. Looking at the server, I was baffled because it was using no RAM and no CPU. I couldn't even see that anything was happening. I check the DB. 1000s of DB queries were flying by, so SOMETHING had to be happening.. Sure enough, in the DB there are 1000s of new babies appearing. Lots of reports of strange behaviors and errors coming in from group chat and from the CSRs. Eeeeeek.
I pull the plug on the trunk delivery system. I had retooled it earlier yesterday, and it worked TOO well. The trunks were flying out at an astonishing rate, which was only going to make the load on the server worse as people kept adding more and more Meeroos to the grid.
Well, turns out the web server had a hard cap on the number of fastcgi processes, and it was way too low. I'm kicking myself wondering why this issue didn't appear in the internal load testing I did last weekend. Grrrrrr.. but I don't have time to figure out what that was all about.
Fire up tons more fastcgi processes and reboot the web server, but the mad dash is already over, really, so it won't matter until the next wave.
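For anyone wondering how a server can throw 502s while sitting at no CPU and no RAM, here's a rough toy model of a process cap, not the actual server setup. The names (`CappedWorkerPool`, `simulate_burst`) and the numbers are made up for illustration, and it assumes a request that can't get a free fastcgi process is rejected right away; a real web server may queue requests for a while before giving up.

```python
import threading


class CappedWorkerPool:
    """Toy stand-in for a web server with a fixed number of FastCGI processes."""

    def __init__(self, max_workers):
        # One semaphore slot per allowed process (the "hard cap").
        self._slots = threading.BoundedSemaphore(max_workers)

    def try_acquire(self):
        # Non-blocking: returns False when every process is already busy.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def simulate_burst(max_workers, concurrent_requests):
    """Return one status code per request arriving in the same instant."""
    pool = CappedWorkerPool(max_workers)
    statuses = []
    held = 0
    for _ in range(concurrent_requests):
        if pool.try_acquire():
            held += 1
            statuses.append(200)   # got a process, request is served
        else:
            statuses.append(502)   # assumption: no free process means Bad Gateway
    # The burst is over; hand every process back to the pool.
    for _ in range(held):
        pool.release()
    return statuses


if __name__ == "__main__":
    low_cap = simulate_burst(max_workers=4, concurrent_requests=10)
    high_cap = simulate_burst(max_workers=16, concurrent_requests=10)
    print("cap 4: ", low_cap.count(502), "of 10 requests rejected")
    print("cap 16:", high_cap.count(502), "of 10 requests rejected")
```

The point of the sketch is that the rejections come from running out of slots, not from running out of machine. If the busy processes are mostly waiting on something else, the box can look idle while requests bounce off the cap.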
The CSRs are getting swamped with IMs and NCs of course. We try to keep our spirits up by cutting up on Skype while we sort through the mess. It seemed overwhelming last night, but looking back there were probably only a few hundred out of the 4500+ that went out that were broken, and the vast majority of those were easily fixed by a REBUILD.
- Tiger's blog