Deleting a lot of nodes at a time in Drupal for First Parish in Cambridge

Shai Sachs's picture

Update, 19 Dec 2007: Today I discovered an exciting new Drupal module, Views bulk operations, which should make this task much easier and less error-prone. The module is still in beta, but it looks quite promising! Thanks very much to the folks at Dev Bee for bringing this module to my attention.

If you've spent any appreciable amount of time maintaining Drupal sites, you've probably had an unfortunate snafu that required some behind-the-scenes plumbing work that can't be done using the traditional Drupal tools. You've probably been extremely frustrated with the paucity of tools available for doing things like deleting a lot of nodes at once to clean up the mess left by an unruly contributed module, and hoping for some kind of quick fix. That was the situation I found myself in today, and while I have some thoughts about deleting a lot of nodes in Drupal, I unfortunately don't have a quick fix.

One of my volunteer projects is helping out with the website for the First Parish in Cambridge, Unitarian Universalist. That's my wife's church, and I drop by there myself every few months for special occasions.

Like many churches, First Parish Cambridge has certain events which occur periodically - weekly services, routine group meetings, choir rehearsals, etc. When I set up the site originally together with the rest of the volunteer team, we figured that the event repeat module would do the trick. Well, it did - far, far too well. For some reason, we recently discovered that the event repeat module had created something like 64,000 events on the church calendar, about 55,000 of which were in June 2008. Talk about an active congregation!

We haven't yet pinned down the exact cause of the snafu, but it looks like it has something to do with the fact that many of the periodic events were set up with a June 2008 end date. Presumably, there's some bug in event repeat which loops for a while on the end date under the wrong circumstances.

At any rate, tonight my task was to clear out the thousands and thousans of extra events that have piled up in the system, in order to make upgrading to Drupal 5.3 a little less painful. In technical terms, I decided to delete every event whose start date was June 2008 or later (there were several thousands of events stretching all the way to 2012, in addition to the 55,000 in June 2008.)

The first thing I did was backup the database and folder structure. That's always important when undertaking these kinds of mammoth database reorganizations.

My first approach was straightforward enough. I created a bit of code which queried the database for events with start dates past June 2008, and deleted them using Drupal's node_delete function. Then I slipped that code into a simple page, set the filter to PHP, and let it run. (By the way - may I caution anyone whose reading this to take this step only with the utmost caution. Leaving a bit of code in your Drupal system like this untended is extraordinarily dangerous. Delete the page as soon as you're done with it!) Here was the code for the page:

set_time_limit(0);

$res = db_query("select n.nid from node n inner join event e on n.nid = e.nid
where from_unixtime(e.event_start) >= '2008-06-01' limit 1000");

$output = '';
while ($arr = db_fetch_array($res)) {
  $nid = (int) $arr['nid'];
  if ($nid <= 0) { continue; }
  node_delete($nid);
  $output .= "deleted node $nid
\n"; } print $output;

I added the limit clause in the SQL in order to avoid potentially choking my database server with what I imagined to be a database-intensive node_delete call. Unfortunately, the limit was too generous by an order of 5 or so. The database choked after about 200-300 nodes. With 64,000 nodes to delete, I'd have to run this same script 320 times to get everything cleared out of the system! That wouldn't be a lot of fun, and it'd almost certainly incur the wrath of my hosting provider.

My next approach was to see if I could delete a lot of nodes at once. After all, a statement like:

DELETE FROM node WHERE nid >= 10000 AND nid <= 30000

is much faster than 20,000 iterations of a statement like:

DELETE FROM node WHERE nid = 10000

Unfortunately, Drupal just doesn't give you an option to effectively execute the first kind of statement; there's no "mass delete" operation in the node API.

To recreate my own one-off "mass delete" operation, I did the following.

  1. Determined the set of function calls induced by a single node_delete. Like many node operations, node_delete consists of a couple of simple SQL statements, adorned with calls to node_invoke and node_invoke_nodeapi. These two functions are themselves pretty simple: they find every module which contains a function whose name indicates that it should be invoked, and then they call those functions. In the case of node_invoke, deducing exactly which function names will be used is a little tricky. I cheated by inserting "print" statements before each function invocation and looking at the trace that resulted when I created a page whose body was
    node_delete($some_nid);

    Of course, $some_nid was chosen to be the nid for a node which I had already identified as a node I wanted to delete.

  2. Converted these function calls into SQL DELETE statements. Once I knew the names of functions invoked during a node_delete, I traced through each function, looking for SQL delete statements, as well as any Drupal functions which might indicate that some other change needed to be made to the Drupal system. Fortunately, the events I was working with didn't have attached files, and didn't have other kinds of complicating factors that might have required the execution of PHP code on a per-node basis. As a result, I was able to compile a list of simple DELETE statements which, taken together, would recreate the deletion of a single node; there was no need for extra computation (though circumstances may vary in your own Drupal installation.) By the way, note that I could have extracted the exact same SQL statements more simply with the Devel module. However, I simply didn't have the time to install it.
  3. Identified large ranges of nodes to delete This is a critical piece, since my MySQL database doesn't support sub-selects, and in any event sub-selects are horribly inefficient. I wanted to write SQL queries like
    DELETE FROM node WHERE nid >= 10000 and nid <= 30000

    in order to delete a lot of nodes quickly. There was a little black magic here, as I had to essentially guess what the data looks like, but the nature of the problem (event repeat iterating thousands of times upon a single node creation) gave me a hint that there would be long sequences of nodes which all were events past June 2008. What I ended up doing was grouping the list of nodes by the thousands of node ids, and within each group of a thousand, by node type. The SQL query I wrote was:

    SELECT floor(nid/1000), type count(*) FROM node
    GROUP BY floor(nid/1000), type

    That helped me identify large ranges of nodes which were almost entirely made of events - ranges as large as 10,000 nodes at a stretch.

  4. Verified that these ranges contained only offending nodes (i.e., nodes with start date >= 2008-06-01) When deleting large amounts of data, this kind of verification is absolutely critical! SQL makes deleting a lot of data disastrously easy. Even though I had everything backed up, I didn't want to get halfway through my deletions only to find that I had made some crucial mistake early on. My solution was to double-verify. First, I verified that there were no nodes which were events, and no events with start dates before June 1, 2008, in my range. Then, I verified that all of the nodes in the group were events with start dates after June 1, 2008. These two tests are logically equivalent, and if I perfectly reproduced the tests each time, I should always get the same answer. However, I was worried about typing mistakes and forgetful recreations of templated SQL queries; double-verification guaranteed that I'd have to make the same mistake twice before deleting a lot of good data.

    Each test was actually a pair of SQL statements which were executed simultaneously; both statements were expected to give the same result. If they didn't, something was wrong.

    This was the first test; it was expected to return 0 for both queries:

    select count(*) from event
    where from_unixtime(event_start) < '2008-06-01'
    and nid >= 10000 and nid <= 30000;
    
    select count(*) from node n
    where type != 'flexinode-10'
    and n.nid >= 10000 and n.nid <= 10000;

    This was the second test; it was expected to return a number roughly (but not necessarily exactly) equivalent to the size of the range, and both queries were expected to return the same number:

    select count(*) from node n inner join event e on n.nid = e.nid
    where n.type='flexinode-10' and from_unixtime(e.event_start) >= '2008-06-01'
    and n.nid >= 10000 and n.nid <= 30000;
    
    select count(*) from node n
    where n.nid >= 10000 and n.nid <= 30000;
  5. Executed the SQL DELETE statements against these ranges This was the easy (and fun) part. Once I determined that a range of node ids like, say, 10,000 - 30,000 was safe for deletion, I executed the SQL statements I identified in step 2 against that range. I just replaced queries like
    DELETE FROM node WHERE nid=123

    with

    DELETE FROM node WHERE nid >= 10000 and nid <= 30000

    It's very satisfying to delete 20,0000 rows with a single command, somehow
    .

  6. Sanity-checked the new database Before overwriting the live site database with the database I had hacked apart offline, I wanted to make sure that I hadn't accidentally deleted any good data. I took a look at the number of events for each month before June 2008, for both the live and development databases. (Incidentally, Excel is extremely useful for this! Just copy the two SQL result sets side-by-side into a spreadsheet, use Tools -> Text to columns align the MySQL data into cells, and do row-by-row comparisons with Fill down.) Everything turned out the same, so I went ahead and transfered the database back to the live site.

This database intervention was actually just the first step in upgrading the site. I wanted to get the size of the database reduced before running a Drupal update script or a Flexinode-to-CCK converter (yep, this old site was built without CCK. Whoops.) I succeeded, incidentally - the SQL transcript for the old database was 140 MB, and the transcript for the new database was "only" 30 MB. Mission accomplished!

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

More information about formatting options