“Shit always comes in threes,” I remember a colleague saying halfway through last week. Well, guess what? This week it was on discount and came in eights!
Developers at our company share responsibility for the operations of our live environment. We have a (usually) competent hosting partner who helps us with the hardware, the networking and the operating systems. From there on in, it’s up to us: we manage all of the installation, configuration and maintenance of the middleware and of our own applications. In order to be available 24×7, we have a weekly rotating support shift. Usually this means you keep your phone about you, do everything you always do and get a little extra cash at the end of the month. Last week took a different road. Get ready for a long story, or, while you still can, skip to the part where I put down my lessons of the week.
Tuesday, too early, 6:00 I wake up because of the phone, find out that the base station is ringing but our portable handset is empty, so I fly out of bed and run downstairs to try and catch the call on the other handset. Too late, just missed it. I check the caller ID and find out that it is our hosting partner, let’s call them Earthgoal for now. Knowing that Earthgoal only calls when something is wrong, I turn on my laptop, put on a sweater and call back. Our biggest site is throwing errors left and right and is basically dead for all intents and purposes, it just doesn’t realize it yet. I’m not quite awake, but I try to shake off my fogginess and start my investigation round. Simply restoring backups doesn’t always do the trick, so I had to find out whether it would help here. After a long time I realize I am getting nowhere, so I decide to call a colleague out of bed to brighten up his day and see if he can help me out. We investigate further and luckily find a cause.
It turned out that the batch process that refreshes the data on our site had somehow managed to corrupt our search engine, and that our sites were erroring because of that. The batch process has a lot of failsafes, so this was not an easy thing to accomplish, but somehow it happened. We started a rollback to the data of the day before and went on to shower and get to work. Of course, when we got to work, we realized that we had kind of screwed up: the rollback had wiped out all of the corrupt data and all of the logging. We asked our search engine partner, let’s call them FrankJumper, for help, but FrankJumper couldn’t figure it out without anything to look at. DOH! This was going to happen again; all we had to do was wait for it.
Tuesday, quite inconvenient, 20:43 Of course right when all the guests come in and the celebration really starts at my father-in-law’s birthday, I get a call from Earthgoal. Although they are always quite friendly, being from soft-spoken Belgium and all, it is not really a pleasant phone call to receive. The guy on the phone tells me that one of our webservers is down. Of course we are redundant on many levels, but having a server out is really not a good thing for performance. Time for action!
I hadn’t really counted on being called, so I had left my laptop at home. I try to do a bit of an investigation on my phone, but get annoyed quickly. We only live about a kilometer from my in-laws, so I walk home to my trusty laptop. I take a look at the server, and yeah, it is quite down. I restart it, it comes back up and after checking the logs I determine that everything is running smoothly again. Time to go join the celebration.
Ten minutes later I realized that it really wasn’t my sharpest day. I had restarted the server, but had forgotten to obtain an autopsy kit. We do of course store our log files, but I didn’t have a heap dump or anything else for the other developers to analyze in the morning. DOH! This was going to happen again; all we had to do was wait for it.
Wednesday Nothing. Really, nothing!
Thursday, making up for a calm day, 6:00 As I drag myself from the bed, kicking and screaming, I know it must be Earthgoal calling. This time the portable still has juice, and when I pick up, the gentle voice on the other end of the line does indeed carry a Belgian accent. He announces that one of our FrankJumper servers has died and asks whether he can help out. I’m already at my laptop, so I ask him for the details. After discussing the issue with him for a bit, I come to a completely unrelated conclusion: none of our sites work. They’re dead, Jim, they’re all dead!
Our sites are all completely siloed; the only cross-cutting concern is basically our network infrastructure. This same network had had scheduled maintenance done at 0:00 the night before. I already know it, but the friendly Belgian voice still attempts to find fault with our sites. I check a few things for and with him; then he goes quiet, informs me that he is going to ring up some colleagues, and hangs up.
My FrankJumper server problem of course still remained. Earthgoal hadn’t called for nothing, so I checked whether this time we still had working FrankJumpers, and luckily we did. Our batch process had not yet gotten to the point where all of our servers were refreshed with corrupt data, so it was safe to kill it. I left our sites running at little over half strength so that the FrankJumper crew would have something to do an autopsy on.
After I had taken a shower and had some breakfast, Earthgoal called. They had found the problem and fixed it: their maintenance had taken all of our servers out of the load balancer. I informed everybody that we were back online. We had been down for a little under 8 hours because someone forgot to check whether everything still worked. Amazing!
Thursday, DOH, 9:00 I head for work and get there after a freezing ride on my bicycle and a train journey. At work one of my colleagues informs me that one of our sites is still down and asks whether I know what is up. We investigate together, but quickly arrive at the conclusion that the problem of the morning had never been fixed for this site.
Earthgoal had “forgotten” a set of servers, and I had only checked two of our six sites. Big mistake: that site had been down for another hour and a half without any decent reason, all because both Earthgoal and I had forgotten to check properly.
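Checking every site after a fix is mechanical enough to script. As a minimal sketch (the site list and the probe below are made-up stand-ins, not our real setup):

```python
# Sketch of the "verify everything" step: after a fix, probe every site
# and report the ones that are still down. Both the site URLs and the
# probe are illustrative.

def verify_sites(sites, probe):
    """Return the sites for which `probe` reports failure.

    `probe` is any callable taking a URL and returning True when the site
    responds correctly -- e.g. an HTTP GET that checks for a 200.
    """
    return [url for url in sites if not probe(url)]


# Demo with a stubbed probe: pretend site3 is the one that got "forgotten".
SITES = ["https://site%d.example.com/" % i for i in range(1, 7)]
still_down = verify_sites(SITES, lambda url: "site3" not in url)
print("still down:", still_down)  # -> ['https://site3.example.com/']
```

A real probe would do an HTTP request with a short timeout; the point is that the list of six sites lives in one place, so nobody has to remember it at 6:00 in the morning.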
That day, however, the guys from FrankJumper did their autopsy and found the cause of the problem that I had had such a great time with on Tuesday and Thursday morning. It turned out that our batch process was calling the wrong script after it had sent fresh data to the FrankJumper search servers. This script worked most of the time, but sometimes, just sometimes, it would not work properly. Don’t ask me why you would want such a script in the first place, but at least now we knew which script to call. The lovely thing was that we had never run into this problem in two years of using this same script. Go figure, of course it decides to die on my week.
Thursday, Thanks!, 23:00 Being a little paranoid about our batch process after having had such a wonderful run with it, I check on its progress in the evening before going to bed. My intention, of course, is to make sure that I can have a quiet night. As I open up my mail, however, I see error reports for each of the different runs of the batch process; it turns out that the batch process has already collapsed completely for all of our sites. This batch process is also completely siloed, so there aren’t many things that can cause this. I start investigating, but what I find simply doesn’t make any sense: the process collapsed on data that is completely corrupt in one of our core databases. The database itself doesn’t enforce referential integrity (performance is a b***h), so somehow there is data referencing records that aren’t there at all. I send an email to my data colleagues and go to bed, certain that my next day is going to be another great day filled with data problems.
One of my data colleagues, let’s call him Pilbert, had made a slight mistake in one of his new transformations. Pilbert had taken the wrong ID field when inserting fresh data, and of course we have no quality assurance whatsoever in this department. He fixed it quickly and was of course quite sorry, but the damage was already done. I had spent another hour and a half working when I should have been sleeping, and our sites went without fresh data for yet another day.
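When the database won’t enforce referential integrity for you, a cheap anti-join run before (or by) the batch process can catch exactly this kind of dangling ID before it spreads. A sketch using an in-memory SQLite database with made-up table names, since our real schema is not something I can show here:

```python
# Hedged sketch: detect rows that reference records which aren't there.
# The schema (products/listings) is invented purely for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE listings (id INTEGER PRIMARY KEY, product_id INTEGER);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    -- listing 11 points at product 99, which does not exist:
    INSERT INTO listings VALUES (10, 1), (11, 99), (12, 2);
""")

# Anti-join: listings whose product_id matches no product row.
orphans = conn.execute("""
    SELECT l.id, l.product_id
    FROM listings l
    LEFT JOIN products p ON p.id = l.product_id
    WHERE p.id IS NULL
""").fetchall()

print("orphaned listings:", orphans)  # -> [(11, 99)]
```

A check like this, failing loudly before any data is shipped to the sites, would have turned Pilbert’s wrong ID field into a five-minute fix instead of a midnight investigation.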
Saturday, right when you need it, 17:30 Earthgoal calls, right as my wife is about to leave for work. She helps me a bit, but she is already late, so she has to leave me with our two great but very young kids and the problem I have just been handed. One of our web servers is down, and it looks like what I encountered earlier this week. I decide to do it right this time. With a crying baby in my arms I find the wiki pages that help me collect the data that I need, and then I restart our servers. I write up an issue about the situation and attach the details. Nothing more to do on the work front, so it’s time to start cleaning up the chaos at home.
Sunday, interrupted dinner, 18:00 Does it never end? Right during dinner, my phone rings and I have to restart another web server. I repeat my steps of the day before and save myself about half an hour this time. Knowing the road is really a lot easier than navigating it for the first time.
Lessons I take away
I had eight incidents in one week’s time, which after analysis trace back to five different root causes:
- Software going funky (×2). FrankJumper suddenly started dying after working well for over two years.
- Corrupt data. Bad data in your databases can cause really weird behavior.
- Network changes. The load balancer config did not work for our situation.
- Carelessness. Neither Earthgoal nor I checked all of the sites, so one of them never got fixed.
- Unidentified (×3). We still don’t know exactly what caused our web server crashes. We’ve come a long way, but we’re not there yet.
From these incidents, their causes and my, sometimes poor, way of managing them, I think I can learn the following. Many people have said these things before, but sometimes you just have to feel the pain before you realize the value of good advice:
- Always make sure you can do an autopsy. Whenever you take action to restore your site to a working situation, make really sure that you have the data needed to figure out the problem later on.
- Automate or follow a script. A colleague of mine said it before, but now I realize it too: it pays off to have a script. When you get called in the early morning you’re really not as focused as normal, so you want dummy instructions to save yourself from making mistakes. The best option is of course to have recovery automated, but that is not financially attractive in all cases.
- Don’t trust easily. Testing is important in development, but it is even more important in live situations. When you think you have the situation resolved, take some time to verify that it really is resolved.
- Rollbacks should be quick. When you’re really in trouble and you need to roll back your data or software, you really don’t want to be down for too long. Think about your rollback plan up front and make sure it’s fast. The first time FrankJumper was having problems, we had our biggest site down for about 6 hours. 😦
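The autopsy lesson in particular is easy to script. A hedged sketch: the paths below are throwaway demo directories, and a real version for JVM servers would also trigger a heap dump (for example via jmap) before the restart:

```python
# Sketch of an "autopsy kit" collector: before restarting anything,
# snapshot the evidence into a timestamped per-incident directory.
# All paths here are demo directories so the sketch runs anywhere.

import pathlib
import shutil
import tempfile
import time

def collect_autopsy(log_dir, dest_root):
    """Copy everything in `log_dir` into a timestamped incident folder."""
    incident = pathlib.Path(dest_root) / time.strftime("incident-%Y%m%d-%H%M%S")
    shutil.copytree(log_dir, incident)
    return incident

# Demo on throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    logs = pathlib.Path(tmp) / "logs"
    logs.mkdir()
    (logs / "server.log").write_text("OutOfMemoryError at 20:43\n")
    kit = collect_autopsy(logs, tmp)
    print(sorted(p.name for p in kit.iterdir()))  # -> ['server.log']
```

Running something like this before the restart is exactly the half hour I saved myself between Saturday and Sunday: the steps are written down once, and the evidence survives the fix.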
I hope my story helps others to see the value of this advice, at least I can now see why 🙂