Drupal 9: Fun With Update Hooks

OK, let’s get honest for a few minutes: mistakes happen when you build websites. Even with the best intentions, planning, architecture, etc., sooner or later you’re going to do a thing in a way that ultimately pans out to be “wrong.” Thankfully, Drupal has a mechanism built in to handle exactly this sort of thing: the update system. This article is about a recent problem I discovered on the Drupal GovCon site and how I’m fixing it using an update hook in a custom Drupal 9 module.

Background

The current incarnation of the Drupal GovCon website was actually built for the 2017 conference (hence why our repo URL still says https://github.com/Drupal4Gov/Drupal-GovCon-2017). However, the site has been repurposed and reused for each conference since then, and it runs the most current version of Drupal today (9.1.2 as of this writing).

Having said that, when we built that site back in 2016 / 2017, we were still building a brand-new site for each conference. So while we built something “good” for the 2017 conference, we didn’t build something intended to hang around and be reused year after year. As a result, I (and the rest of the web team) have gradually been making improvements to the things we did (and/or have done since) that don’t scale or age well.

This week I found another fun thing…

We have a static archive of the 2016 and 2015 versions of the site. Since we recently added a session archive, I wanted to go ahead and port all of the old sessions from those years over (so that every conference’s sessions lived in the archive). This was going great! Until I realized how wonky our session track taxonomy had gotten. Just look at this:

[Screenshot: the full Session Track term list, cluttered with near-duplicate terms from multiple years]

Now, usually I do some Views Entity Reference magic to hide that entire list (so, as an end user, you only see the current year’s tracks). But I disabled that temporarily to move over the 2016 sessions and… wowza. We set it up this way to try to preserve tracks on a year-to-year basis, but ultimately we’ve decided it’s more trouble than it’s worth (especially now that we have a single, unified archive).

So the problem: we have a bunch (in some cases 3+) of taxonomy terms that are actually the “same session track” despite living in multiple categories. And we have a ton of sessions (hundreds), so going through and manually retagging all of them sounds… miserable.

Finding a Solution

I want to start this solution discussion with a major disclaimer. This is a one-time revamp of content from an information architecture that has burst its seams. It is not intended to be reused or to be overly stable for long-term application. Because of that, I’m going to “migrate” the session data using an update hook instead of writing an “actual migration.”

If this was something that needed to be done on an ongoing basis, had more data, etc. then I would have approached it very differently.

The first thing I did to start unraveling this mess was to build a quick table of the taxonomy terms from the Session Track vocabulary. This gave me a quick look at which terms we had and their years, and I also pulled out the taxonomy term id for each.

[Screenshot: spreadsheet of Session Track terms with their years and term ids, plus the hand-added canonical and parent columns]

Columns D and E (canonical and parent) were added manually by me as I went through and mapped out the old terms to their new “canonical” terms (meaning, the term that would survive the migration).

The next thing I needed to do was write a basic script that would:

  • find all nodes associated with one or more terms

  • change the node’s field data from the old term to the new canonical term

  • update the node

  • delete the old term(s)

You can see my pull request for this here.

As I said in my disclaimer above, this is a one-time change, so I felt an update hook was more appropriate than writing an actual migration. Let’s dig into the code!

What Is an Update Hook?

If you haven’t written an update hook before, the TLDR is this:

  • defined in a module’s .install file (the module must be enabled for its updates to run)

  • executes arbitrary PHP code (basically does whatever you want it to do)

  • executed via drush updb or the Drupal admin UI (update.php)
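
If you’ve never written one, here’s a minimal sketch of the shape (mymodule is a placeholder for your own custom module):

<?php

// In mymodule.install. The function name is the module name plus
// "_update_" plus a schema number; 9001 here is just an example.

/**
 * Description of the update; drush shows this text before running it.
 */
function mymodule_update_9001() {
  // Arbitrary PHP goes here. Drupal records 9001 as "done" after one run.
  \Drupal::logger('mymodule')->notice('Ran update 9001.');
}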

The trick with update hooks is that you can only run them from an enabled module. This is one of the big reasons I always strongly recommend having a “core” (or similarly named) module for your site/platform that is always enabled. You don’t have to do much development in this module (it shouldn’t be a catch-all for your junk), but by always having at least one module that’s enabled everywhere, you know you have a place to run updates from!

Doing This Migration

The first thing that needed to happen in my case was to take my spreadsheet of terms and map out the changes in data. Again, a strong disclaimer: since I only had a handful of terms to deal with, I just did this by hand. If you have significantly more data than I did, I would suggest taking more repeatable steps (like adding relationships in Drupal itself on the taxonomy terms and then traversing those relationships during your update instead of mapping things by hand).
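
If you’re curious what that repeatable route might look like, here’s a hedged sketch. It assumes a hypothetical entity reference field (field_canonical_term) added to each old Session Track term, pointing at the term that should survive, and a vocabulary machine name of session_track; neither exists on the real site.

$term_storage = \Drupal::entityTypeManager()->getStorage('taxonomy_term');
// Find every term that has been pointed at a canonical term.
$tids = $term_storage->getQuery()
  ->condition('vid', 'session_track')
  ->exists('field_canonical_term')
  ->accessCheck(FALSE)
  ->execute();
// Fold them into the same parent/children map I built by hand below.
$map = [];
foreach ($term_storage->loadMultiple($tids) as $old_term) {
  $parent = (int) $old_term->get('field_canonical_term')->target_id;
  $map[$parent]['parent'] = $parent;
  $map[$parent]['children'][] = (int) $old_term->id();
}
$terms = array_values($map);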

For me, this mapping looked something like this:

  $terms = [
    0 => [
      'parent' => 156,
      'children' => [46, 66, 381],
    ],
    1 => [
      'parent' => 311,
      'children' => [51, 56, 176],
    ],
    2 => [
      'parent' => 171,
      'children' => [61],
    ],
    3 => [
      'parent' => 161,
      'children' => [81],
    ],
    4 => [
      'parent' => 166,
      'children' => [91, 386],
    ],
    5 => [
      'parent' => 306,
      'children' => [181],
    ],
  ];

Next, I needed to iterate through this array (of arrays) and find the nodes associated with each ‘children’ array. I did this like so:

foreach ($terms as $term) {
  $nids = _get_content_by_term($term['children']);
}

The helper function in this case accepts the array of term ids and does an entity query:

/**
 * A helper function to load all nodes tagged with any of the provided tids.
 *
 * @param array $tids
 *   An array of term ids.
 *
 * @return array
 *   Returns an array of node ids.
 */
function _get_content_by_term(array $tids) {
  $nids = [];
  // Grab the node storage once instead of re-fetching it on every pass.
  $node_storage = \Drupal::entityTypeManager()->getStorage('node');
  foreach ($tids as $tid) {
    $node_results = $node_storage->getQuery()
      ->condition('type', 'session')
      ->condition('field_session_track', $tid)
      ->accessCheck(FALSE)
      ->execute();
    // Merge this term's node ids into the running list, keeping keys.
    $nids += $node_results;
  }
  return $nids;
}

It finds all the nodes that have been tagged with my term ids and returns an array of node ids. Basically, this function transforms a list of terms into a list of nodes. It’s basic, but it’s useful!

The next step is to actually get at the nodes I just found so I can use the entity API to muck about with the data. WARNING: this is potentially quite destructive and dangerous. I strongly recommend testing the crap out of this before you let it anywhere near real data. If you mess this up, you could really corrupt your data!

// Requires `use Drupal\node\Entity\Node;` at the top of the .install file.
$nodes = Node::loadMultiple($nids);
foreach ($nodes as $node) {
  // Swap the old track for the surviving canonical term, then save.
  $node->set('field_session_track', $term['parent']);
  $node->save();
}

So, in this case, we take all of the nodes we found in the helper function and load them into the $nodes variable. Then we iterate through what is now a bunch of node objects (assuming we found any nodes at all), set the taxonomy term field (in my case a field called Session Track, hence the field_session_track machine name) to the new canonical / parent term I want to keep, and save the node.

My last step is to remove the old child terms, because once I’ve moved the nodes off of them, I don’t need them anymore!

// Requires `use Drupal\taxonomy\Entity\Term;` at the top of the .install file.
foreach ($term['children'] as $child) {
  // Term::load() returns NULL if the term is already gone, so guard the delete.
  if ($delete = Term::load($child)) {
    $delete->delete();
  }
}

Note: I’ve split up the logic of this a bit to talk through it. Have a look at the actual pull request to see it in place.
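
That said, here’s a rough sketch of how the pieces could hang together in one hook. This is my own assembly for illustration, not the literal PR code, and the module name and update number are placeholders:

use Drupal\node\Entity\Node;
use Drupal\taxonomy\Entity\Term;

/**
 * Collapse duplicate Session Track terms into their canonical terms.
 */
function mymodule_update_9002() {
  // The hand-built mapping from earlier in the article.
  $terms = [
    ['parent' => 156, 'children' => [46, 66, 381]],
    // ...and the rest of the mapping.
  ];
  foreach ($terms as $term) {
    // Re-tag every session that points at an old child term.
    $nids = _get_content_by_term($term['children']);
    foreach (Node::loadMultiple($nids) as $node) {
      $node->set('field_session_track', $term['parent']);
      $node->save();
    }
    // Then retire the old terms.
    foreach ($term['children'] as $child) {
      if ($delete = Term::load($child)) {
        $delete->delete();
      }
    }
  }
}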

The last step is testing. I didn’t write automated tests for this because, again, it’s a one-off. BUT I did do a lot of manual testing. I used my production session archive to filter down to the categories I was getting rid of (e.g. DevOps 2017) and compared against the NEW category for the same year (DevOps, Performance, Security, and Privacy). Since the NEW category wasn’t used for any 2017 content, if I did my job right, filtering on the NEW category + 2017 should return 0 results before the migration, and after the migration its result count should match the OLD category + 2017 on production.
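
If you want to spot-check those counts from the command line instead of the UI, here’s a small sketch (term id 156 is just an example from my mapping; run it with something like drush php:eval):

// Count the sessions currently tagged with a given track term.
$count = \Drupal::entityTypeManager()->getStorage('node')->getQuery()
  ->condition('type', 'session')
  ->condition('field_session_track', 156)
  ->accessCheck(FALSE)
  ->count()
  ->execute();
print $count;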

You can see the resulting session track select box post migration as well:

[Screenshot: the Session Track select box after the migration, with the consolidated term list]

Note: I haven’t gone through and re-tagged the years just yet, so obviously 2016 and some of the 2017 data looks to be missing (it’s really bundled up under the new canonical terms, as discussed).

In Conclusion

One of the reasons I push so hard to do architecture right “the first time” isn’t because you can’t fix something that goes wrong. We do it all the time! There are options for that. There are mechanisms for that.

The thing is, “fixing it” is usually more time-consuming and costly than just doing it right the first time. And imagine, if you will, that thousands or millions of pieces of content were miscategorized, not just a few hundred. To be clear, my solution will work with millions of nodes just the same way it does with hundreds. But like I said earlier, that hand mapping of the terms absolutely does not scale. The other really significant thing that doesn’t scale here is the way I’m changing the nodes.

$nodes = Node::loadMultiple($nids);

This line of code looks pretty benign. But I cannot adequately explain how many times I’ve seen this (or a line like it) tank a production website. Why? Well, it all comes down to how many things are in that $nids array. For me, I know that there aren’t “that many.” I also know that this code isn’t ever going to run “at run time” (it will run exactly once, during a database update on a deployment). But if I tried to load a thousand or tens of thousands of nodes into memory all at once? Or if I did this at run time while lots of people were visiting my site? Wowza, that would be bad!
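
The good news: if you ever do have that much data, update hooks can batch themselves via the $sandbox parameter, so only a slice of nodes is ever in memory. A rough sketch, reusing the helper and example ids from above (the hook name and batch size are placeholders):

function mymodule_update_9003(&$sandbox) {
  $node_storage = \Drupal::entityTypeManager()->getStorage('node');
  if (!isset($sandbox['nids'])) {
    // Collect the ids once; ids are cheap, fully loaded nodes are not.
    $sandbox['nids'] = array_values(_get_content_by_term([46, 66, 381]));
    $sandbox['total'] = count($sandbox['nids']);
  }
  // Load and save only 50 nodes per pass.
  $batch = array_splice($sandbox['nids'], 0, 50);
  foreach ($node_storage->loadMultiple($batch) as $node) {
    $node->set('field_session_track', 156);
    $node->save();
  }
  // Drupal keeps re-invoking the hook until #finished reaches 1.
  $sandbox['#finished'] = $sandbox['total']
    ? ($sandbox['total'] - count($sandbox['nids'])) / $sandbox['total']
    : 1;
}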

So! My big takeaway here is that if you make a mistake, it’s OK! We all do it. You can totally change whatever you did to “fix” that mistake (a big part of building software is the evolution of your code). Just make sure that the solution you’re building actually fits your scenario.
