To Hotfix or To Rollback? That is the question

We’ve all been there before: you just broke something. Maybe, if you’re particularly unlucky, you broke something in production. This is a pretty scary moment. I’ve written about that scenario when everything goes wrong and it’s a big, big problem. But whether it’s a big problem or a little one, it’s still a broken site and you have to fix it ASAP. And here enters one of the most challenging questions in our industry: should we hotfix whatever broke in hopes of righting the ship, or should we roll back to a known working state?

Unfortunately, there isn’t always a clear and concise answer to this question. But my hope in this post is to give you some pointers and tips to help you make a real-time determination. Because at the end of the day, every time something breaks the situation is going to be just a little different, and you’ll have to triage the issue and make a case-by-case determination. Every single time. This is, by the way, why I talk so much about testing and devops best practices. If you have to break something, it’s much better to break it before prod. But let’s assume for the purposes of this article that prod is broken. What now?

Evaluating the Problem

Let’s set the stage: you’ve just deployed something into your production environment and something went wrong. Maybe the deployment itself failed. Maybe the deployment succeeded, but now that you’re spot testing your site you’ve found a problem (or worse, found a page that’s throwing a fatal error).

I’ve been in this situation many times (despite my best efforts) and I can tell you from experience, there will be panic. Your number one job at this moment is to try and contain that panic, because the absolute worst thing you can do in this moment is react blindly. In fact, I would go so far as to say that a knee-jerk reaction here (meaning you just start making changes) could actually make things worse.

Your number two job is to try and understand exactly what is going on. Now, troubleshooting and debugging something that is “really” broken can be quite a time-consuming process (and may require additional tooling that you don’t have access to in your production environment). I’m not suggesting you dive deep. But I am suggesting that you take at least a few minutes to understand exactly what that fatal error is.

Recently, I attempted to update the Govcon site to Drupal 9.3.x. I went to do my production deployment, and it failed. Now, thankfully, in this specific case the website itself was seemingly alright. This is terrific news, but not a get out of jail free card (because a failed deployment still likely means something is very wrong with the site, even if it isn’t manifesting with a bunch of PHP errors).

When I dug into the failed deployment, I discovered that a fairly massive update hook attempted to run (see system_post_update_sort_all_config) and my deploy bombed out with:

Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 20480 bytes) in /mnt/www/html/capitalcamp/docroot/core/lib/Drupal/Core/Database/Query/Condition.php on line 94
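The two numbers in that message are raw byte counts, which are easier to reason about once converted. Here is a minimal sketch in Python; the helper name parse_memory_error is hypothetical, and the regular expression assumes PHP’s usual “Allowed memory size … exhausted” wording:

```python
import re

# Assumes PHP's usual out-of-memory fatal error wording.
MEMORY_ERROR = re.compile(
    r"Allowed memory size of (\d+) bytes exhausted "
    r"\(tried to allocate (\d+) bytes\)"
)


def parse_memory_error(message):
    """Return (limit in MB, requested allocation in KB), or None if no match."""
    match = MEMORY_ERROR.search(message)
    if match is None:
        return None
    limit_bytes, requested_bytes = (int(group) for group in match.groups())
    return limit_bytes / 1024 ** 2, requested_bytes / 1024


log_line = (
    "Fatal error: Allowed memory size of 536870912 bytes exhausted "
    "(tried to allocate 20480 bytes) in "
    "/mnt/www/html/capitalcamp/docroot/core/lib/Drupal/Core/Database/Query/Condition.php "
    "on line 94"
)

print(parse_memory_error(log_line))  # (512.0, 20.0)
```

In other words, the process was already at a 512 MB limit when it asked for another 20 KB, so the failing line in Condition.php is most likely just where the memory finally ran out, not necessarily the code that used it all.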

With a little help from some of my colleagues, I chased down the open D.O issue: https://www.drupal.org/node/3254403. Now I knew what the problem was. I just had to figure out how I wanted to fix it. In this specific instance, I tried a few things:

  1. Since the deploy failed during database updates, I checked to see if the site was responding to drush commands (it was), so I tried running the updates manually with drush. I hit the same error.

  2. Reading the error message, I discovered that the error was occurring because the PHP process ran out of memory. So, I tried upping my PHP memory limit in production (I went as far as I could, up to 1 GB) but I didn’t get a different result. I didn’t dig much into this at the time, but later discovered that the PHP memory changes made via the Acquia Cloud dashboard don’t actually impact CLI commands (this has to be done separately, as described here, which required a deployment).

  3. Given that my “hotfix” solutions (basically just trying to redo the updates) failed, I had to make a choice: roll back the site to 9.2.x OR apply the patch from the D.O issue and attempt to re-update to 9.3.x.

I opted to attempt the patch. So, I opened a new pull request that patched Drupal core and deployed this all the way out to production (after my CI succeeded, of course). And, thankfully all went smoothly. The deploy succeeded, the database updates completed, all good!

The bottom line is that if I had just immediately rolled back to 9.2.x, the next time I updated to 9.3.x I likely would have run into this issue again (unless I waited for 9.3.1 to be released in a few weeks). In taking some time to actually read what was going on and triage the issue, I actually saved myself work in the long run. Sure, my site was in a damaged state for an additional 5-10 minutes (not great) but… overall, I believe the impact to the site was less severe than doing multiple passes. I also believe that the risk of a core patch (in this case) was significantly lower than other situations where I might have had to do custom development or something else.

This final point cannot be overstated: if there hadn’t been a D.O issue with a solution, I probably would have rolled back. In this case, I had what appeared to be a ready-made solution that I could drop in and take advantage of. Had I not found this, I would potentially have had to fix the issue myself, report it on D.O, etc. I only knew what was going on because I took the time to look at the deployment logs (in this case), find the error, Google the error, talk to my colleagues about it, etc. “Deployment failed” is the same sort of vague error analysis as “my car won’t start.” There are an awful lot of reasons why that might be the case!
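The reasoning in this section can be condensed into a rough decision sketch. This is only an illustration in Python: the Incident fields, the choose_response function, the order of the checks, and the 30-minute budget in the example are all assumptions made for the sake of the sketch, not a formal rule.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    cause_understood: bool      # have you read the logs and found the actual error?
    known_fix_available: bool   # e.g. a patch attached to an existing issue
    fix_passed_ci: bool         # the fix has been tested outside production
    minutes_elapsed: int        # time since the problem was noticed
    triage_budget_minutes: int  # agreed limit before rolling back (hypothetical)


def choose_response(incident: Incident) -> str:
    """Suggest a next step; a sketch of the heuristic, not a policy."""
    # Out of time: stop investigating and restore the known working state.
    if incident.minutes_elapsed >= incident.triage_budget_minutes:
        return "roll back"
    # Never change anything before you know what actually failed.
    if not incident.cause_understood:
        return "keep triaging"
    # No ready-made solution means custom work, which is too risky right now.
    if not incident.known_fix_available:
        return "roll back"
    # A fix exists, but it must go through the normal pipeline first.
    if not incident.fix_passed_ci:
        return "test the fix first"
    return "hotfix"


# Roughly the situation described above: error identified, patch found,
# CI green, about ten minutes in (the 30-minute budget is illustrative).
example = Incident(True, True, True, minutes_elapsed=10, triage_budget_minutes=30)
print(choose_response(example))  # hotfix
```

Putting the time budget first is a deliberate choice in this sketch: once the agreed window has passed, rolling back wins even if a fix looks close.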

Friendly Reminders

Now you might be thinking to yourself… your production site was in a bad way for 10+ minutes? It was! And it wasn’t a big deal, for a number of reasons. Primarily because the Govcon site isn’t high-traffic unless the conference is going on (and we obviously would not be doing a Drupal core update like this in the middle of the conference). BUT even if your site gets more traffic than mine (or is more critical), you should be doing deployments at a time when a site being down for 10ish minutes isn’t going to be the end of the world. Obviously, a big part of devops best practices (which I write about frequently) is to try and ensure that bad things don’t happen. But bad things happen. And you should be prepared for them.

A few things I strongly recommend:

  1. Simulate deployments in non-production environments before you do the prod deploy. This means you should copy the current production database into another environment and do the exact deploy you plan to do. If you blow that up? Who cares. Fix the issues and simulate again until you are confident things will work, THEN do prod.

  2. Back up everything prior to your production deployment just in case you do have to roll back.

  3. Ensure that a knowledgeable Drupalist is participating in the deployment. While I hope you don’t need this person, if something does go wrong, a novice (or non-)Drupalist will really struggle to identify the criticality of a problem.

  4. Have a deployment plan in place. This should include key things like:

    1. When are deployments allowable

    2. What sort of testing should be performed before and after a deployment

    3. Who should be notified if a deployment goes wrong

    4. What is the acceptable time-frame to research and triage a problem before rollback

    5. etc.

  5. Don’t hotfix something without extensive testing (which you may not have time to do right now). A bad hotfix might make the problem worse, not better. When in doubt, roll back!

  6. Don’t Panic!

The last one I cannot stress enough. Despite our best practices, efforts, and hopes, you will eventually break a site if you are doing deployments. It’s going to happen. However, the more you prepare for it to happen, both emotionally and skill-wise, the easier it will be to navigate when it finally does happen. I still get freaked out when it happens, but when I was first starting out as a freelancer I was terrified at the implications. I’ve had a lot of practice (and experience) at this point, but it’s still something that impacts me to this day.
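Part of that preparation is writing the deployment plan from item 4 down somewhere concrete, so nobody has to remember it under pressure. As a rough sketch of what that could look like in Python (the DeploymentPlan class, its field names, and every value below are hypothetical placeholders, not recommendations):

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class DeploymentPlan:
    allowed_weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    window_start: time           # earliest time of day a deploy may begin
    window_end: time             # deploys must begin before this time
    checks_before: tuple         # testing to perform before the deployment
    checks_after: tuple          # testing to perform after the deployment
    notify_on_failure: tuple     # who to tell if the deployment goes wrong
    triage_budget_minutes: int   # how long to research before rolling back

    def allows(self, moment: datetime) -> bool:
        """Return True if a deployment may begin at the given moment."""
        in_day = moment.weekday() in self.allowed_weekdays
        in_hours = self.window_start <= moment.time() < self.window_end
        return in_day and in_hours


# Placeholder values; pick your own based on traffic and staffing.
plan = DeploymentPlan(
    allowed_weekdays=frozenset({0, 1, 2, 3}),  # Monday through Thursday
    window_start=time(9, 0),
    window_end=time(15, 0),
    checks_before=("simulate the deploy on a copy of the prod database",
                   "back up everything"),
    checks_after=("spot test key pages", "review the deployment logs"),
    notify_on_failure=("site owner", "on-call developer"),
    triage_budget_minutes=30,
)

print(plan.allows(datetime(2022, 1, 4, 10, 0)))   # Tuesday morning: True
print(plan.allows(datetime(2022, 1, 7, 16, 30)))  # Friday afternoon: False
```

Whether this lives in code, a wiki page, or a checklist matters much less than having the answers agreed on before something breaks.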

In Conclusion

When you do inevitably break a site in prod, take deep breaths. Take some time to review the logs and come up with a game plan that will get you back to a working site based on the data you gather. It’s OK to roll back! And it’s definitely safer to roll back than it is to try and hotfix something and make it worse.

Finally, I hope you’ll walk away from this article and think about how you can bolster your devops process to even further reduce the likelihood of problems occurring in the first place. Good luck!

Photo by Andrew E Weber on StockSnap
