October 6 server outage - Post Mortem

Below is the post-mortem report for the October 6 outage.



Our server went down at 13:30 GMT because of a heavily trafficked image (http://beeimg.com/view/h8150456397/). According to the referral logs, the image was linked from several LiveJournal blogs. Pingdom recorded the outage at 13:31 GMT (http://stats.pingdom.com/pdc110r2vx7j/889337/2014/10) and notified me by SMS and email at 13:36 GMT. I went online right away and started investigating the issue (https://twitter.com/beeimg/status/519120005695537152).



Normally the whole server is monitored, and if Apache crashes it is restarted automatically. During this outage the whole server went down under the load, not just Apache, which left that monitoring mechanism useless. The server was still returning pings, so it was reachable over the network, but it was under very high load. I hard-rebooted the server at 13:50 GMT from the DigitalOcean control panel, and it was back online at 13:52 GMT. In total the server was inaccessible for about 22 minutes (https://twitter.com/beeimg/status/519123971133169664).
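The failure mode described above, where a watchdog running on the affected machine can restart Apache but cannot help once the whole machine is overloaded, can be illustrated with a minimal sketch. This is not the monitoring actually used on the server; the function names, the threshold, and the action labels are all hypothetical.

```python
import os


def current_load() -> float:
    """Return the 1-minute load average (Unix only)."""
    return os.getloadavg()[0]


def choose_action(apache_running: bool, load_1min: float, cpu_count: int,
                  overload_factor: float = 4.0) -> str:
    """Decide what a watchdog should do (hypothetical policy).

    overload_factor is an assumed threshold: a load above
    overload_factor * cpu_count is treated as "the whole box is stuck".
    """
    if load_1min > overload_factor * cpu_count:
        # An on-box restart of Apache is unlikely to run in time here;
        # the decision has to come from outside the machine.
        return "external-reboot"
    if not apache_running:
        # The ordinary case the existing monitoring already handled.
        return "restart-apache"
    return "ok"
```

The point of the sketch is the first branch: when the machine itself is overloaded, the check has to run somewhere else, which is effectively what the Pingdom alert and the manual reboot did during this outage.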



To bring the server load under control, I went to the Cloudflare control panel and proxied beeimg.com traffic through Cloudflare for a brief time (https://twitter.com/beeimg/status/519124680318664705). During that time I edited the back end to redirect images to the Cloudflare-powered domain. As a result, most image requests were sent to cdn.beeimg.com, which returned a 404 error because of recent back-end updates. I quickly applied a patch on the server, but Cloudflare had already cached the 404 response, and most users kept seeing it for nearly 2 hours.
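One common way to reduce the risk of a CDN holding on to an error response is to send an explicit Cache-Control header with it. The sketch below assumes the CDN honors the origin's Cache-Control header; the function name and the default lifetime are hypothetical, and this is not the patch that was applied during the outage.

```python
def cache_headers(status_code: int, max_age_seconds: int = 86400) -> dict:
    """Pick a Cache-Control header based on the response status.

    Successful responses may be cached for max_age_seconds (an assumed
    default of one day); everything else is marked as not storable so a
    temporary error is not served from cache after the origin is fixed.
    """
    if 200 <= status_code < 300:
        return {"Cache-Control": f"public, max-age={max_age_seconds}"}
    return {"Cache-Control": "no-store"}
```

For example, `cache_headers(404)` yields `{"Cache-Control": "no-store"}`, while `cache_headers(200)` allows caching for a day.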



We had been switching to Cloudflare HTTPS over the past few days and were migrating the CDN domain to i.beeimg.com to serve the Cloudflare-cached content. After more testing and new code to balance the load, the issue was officially announced as solved around 17:30 GMT (https://twitter.com/beeimg/status/519185562486710272).



We will keep monitoring the server closely and will add more load-balancing mechanisms in upcoming updates to the site. We have also added a new header to monitor the server load, which will be explained in another blog article.
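Until that article is published, here is only a rough idea of what a load-reporting response header could look like. The header name `X-Server-Load` and the function below are hypothetical and are not the header actually added to the site.

```python
import os
from typing import Optional, Tuple


def load_header(load_1min: Optional[float] = None) -> Tuple[str, str]:
    """Build a (name, value) pair reporting the 1-minute load average.

    If no value is passed in, the current load is read with
    os.getloadavg(), which is available on Unix systems only.
    """
    if load_1min is None:
        load_1min = os.getloadavg()[0]
    # "X-Server-Load" is an illustrative name, not the site's real header.
    return ("X-Server-Load", f"{load_1min:.2f}")
```

A header like this lets an external monitor read the load from any normal HTTP response, without needing shell access to the machine.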



Thanks for reading :)
-Admin
