This is part of our blog series about behind-the-scenes things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users.
Now that Sherlock’s old storage system has been retired, we can finally tell that story. It all happened in 2016.

Or: How we replaced more than 1 PB of hard drives, while continuing to serve files to unsuspecting users.

TL;DR: The parallel filesystem in Stanford’s largest HPC cluster has been affected by frequent and repeated hard-drive failures since its early days. A defect was identified that affected all of the 360 disks used in 6 different disk arrays. A major swap operation was planned to replace the defective drives. Multiple hardware disasters piled up to make matters worse, but in the end, all of the initial disks were replaced, while retaining 1.5 PB of user data intact, and keeping the filesystem online the whole time.

History and context

Once upon a time, in a not so far away datacenter…

We, the Stanford Research Computing Center, manage many high-performance computing and storage systems at Stanford. In 2013, in an effort to centralize resources and advance computational research, a new HPC cluster, Sherlock, was deployed. To provide best-in-class computing resources to all faculty and facilitate research in all fields, this campus-wide cluster features a high-performance, Lustre-based parallel filesystem.

This filesystem, called /scratch, was designed to provide high-performance storage for temporary files during simulations. Initially made of three I/O cells, the filesystem had been designed to be easily expanded with more hardware as demand and utilization grew. Each I/O cell was comprised of:

  • 2x object storage servers,
  • 2x disk arrays, with:
    • dual RAID controllers,
    • 5 drawers of 12 disks each,
    • 60 4TB SAS disks total.

MD3260

Each disk array was configured with 6x 10-disk RAID6 LUNs and, with every SAS path being redundant, the two OSS servers could act as a high-availability pair. This is a pretty common Lustre setup.

Close to a petabyte in size, this filesystem quickly became the go-to solution for many researchers who didn’t really have any other option to store and compute against their often large data sets. Over time, the filesystem was expanded several times and eventually more than doubled in size:

             # disk arrays   # OSTs   # disks   size
initially                6       36       360   1.1 PB
ultimately              18      108      1080   3.4 PB
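
The sizes in the table can be cross-checked against the hardware layout described earlier. This is a back-of-the-envelope sketch, assuming the quoted sizes are usable capacity after RAID6 parity, decimal units (1 PB = 1000 TB), and one OST per LUN; the function name is only illustrative.

```python
# Back-of-the-envelope capacity check for the figures in the table.
DISK_TB = 4            # 4TB SAS disks
DISKS_PER_LUN = 10     # 10-disk RAID6 LUNs
PARITY_PER_LUN = 2     # RAID6 keeps two disks' worth of parity per LUN
LUNS_PER_ARRAY = 6     # assumption: one OST per LUN
DISKS_PER_ARRAY = 60

def usable_pb(arrays: int) -> float:
    """Usable capacity in PB, ignoring filesystem formatting overhead."""
    data_disks = arrays * LUNS_PER_ARRAY * (DISKS_PER_LUN - PARITY_PER_LUN)
    return data_disks * DISK_TB / 1000  # decimal TB -> PB

for label, arrays in (("initially", 6), ("ultimately", 18)):
    print(f"{label}: {arrays * LUNS_PER_ARRAY} OSTs, "
          f"{arrays * DISKS_PER_ARRAY} disks, ~{usable_pb(arrays):.2f} PB usable")
```

Under those assumptions the computed values land close to the 1.1 PB and 3.4 PB quoted in the table, and each LUN holds about 32 TB of data.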

As the filesystem grew, it ended up containing close to 380 million inodes (that is, filesystem entries, like files, directories or links). Please keep that in mind: it turns out to be an important factor in the events that followed.

The initial issue

All was fine and dandy in storage land, and we had our share of failing disks, like everybody. We were replacing them as they failed, sending them back to our vendor, and getting new ones in return. Datacenter business as usual.

Except, a lot of disks were failing. Like, really a lot, as in one every other day.

We eventually came to the conclusion that our system had been installed with a batch of disks with shorter-than-average lifespans. They were all from the same disk vendor, manufactured around the same date. But we didn’t worry too much.

Until the day when 2 disks failed within 3 hours of each other. In the same disk array. In. The. Same. LUN.

To give some context, one failed drive in a 10-disk RAID6 array is no big deal: data can be reconstructed from the 9 remaining physical disks without any problem. If by any chance one of those remaining disks suffers from a problem and data cannot be read from it, there is still enough redundancy to reconstruct the missing data and all is well.

RAID6 8+2

A single drive failure is handled quite transparently by the disk array:

  • it emits an alert,
  • you replace the failed disk,
  • it detects the drive has been replaced,
  • it starts rebuilding it from data and parity on the other disks of the LUN,
  • about 24 hours later, you have a brand new LUN, all shiny and happy again.

But two failed disks, on the other hand, that’s pretty much like a Russian roulette session: you may be lucky and pull it off, but there’s a good chance you won’t. While the LUN misses 2 disks, there is no redundancy left to reconstruct the data. Meaning that any read error on any of the remaining 8 disks will lead to data loss as the controller won’t be able to reconstruct anything. And worse, any bit flip during reads will go completely unnoticed, as there is no parity left to check the data. Which means that you can potentially be reconstructing completely random garbage on your drives.
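
The role of parity can be illustrated with a small sketch. It is a deliberate simplification: RAID6 typically keeps two independent syndromes per stripe, while only a single XOR parity is modeled here, and the block contents and sizes are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Eight data blocks, as in an 8+2 LUN (only one XOR parity is modeled here).
data = [bytes([i] * 4) for i in range(1, 9)]
parity = xor_blocks(data)

# One block lost: XOR of the survivors and the parity gives it back.
lost = 3
survivors = [block for i, block in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])

# A silent bit flip is only detectable while parity is still available.
flipped = list(data)
flipped[5] = bytes([flipped[5][0] ^ 0x01]) + flipped[5][1:]
parity_mismatch = xor_blocks(flipped) != parity

print("rebuilt block matches the lost one:", rebuilt == data[lost])
print("bit flip detected thanks to parity:", parity_mismatch)
```

Once every redundant block has been consumed by failed disks, neither the rebuild nor the consistency check in this sketch is possible anymore, which is the situation a LUN with two missing disks is in.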

Given that, it didn’t take us long to pick up the phone and call our vendor.

They confirmed our initial findings that in our initial set of 6 disk arrays, over the course of 2 years, we had already replaced about 60 disks out of 360. At a rate of 5-10 failures per month. Way higher than expected.
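
To put those two figures on the same scale, they can be read as annualized failure rates. This is a minimal sketch, assuming a constant population of 360 drives and a simple linear annualization; both simplifications are ours for the sake of the example, not the vendor's method.

```python
POPULATION = 360  # drives in the initial 6 disk arrays

def annualized_failure_rate(failures: float, months: float) -> float:
    """Failures per drive per year, as a fraction of the population."""
    return failures / POPULATION * (12 / months)

overall = annualized_failure_rate(60, 24)  # about 60 drives replaced in 2 years
low = annualized_failure_rate(5, 1)        # 5 failures per month
high = annualized_failure_rate(10, 1)      # 10 failures per month

print(f"over the 2 years: {overall:.1%} per year")
print(f"at 5-10 per month: {low:.1%} to {high:.1%} per year")
```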

The LUN rebuild eventually completed fine, without any problem, but that double failure acted as a serious warning. So we started thinking about ways to solve our problem. And that’s when the sticky cheese hit the fan…

When problems pile up

Three days after the double failure, we had an even more important hardware event: one drawer in another disk array misbehaved, reporting itself as degraded, and 6 disks failed in that same drawer over the course of a few minutes. A 7th disk was evicted a few hours later, leaving 2 LUNs without any parity in that single array. Joy all over. In a few minutes, the situation we were dreading a few days earlier had just happened twice in the same array. We were a disk away from losing serious amounts of data (we’re talking 30TB per LUN). And as past experience proved, those disks were not of the most reliable kind…

We got our vendor to dispatch a replacement drawer to us under the terms of our H+4 support contract. Except they didn’t have any replacement drawer in stock that they could get to us in 4 hours. So they overnight’d it and we got the replacement drawer the following day.

We diligently replaced the drawer, and the rebuild started on those 7 drives in the disk array. Which, yes, means that one LUN was rebuilding without any redundancy. Like the one from the other disk array the week before. And as everyone probably guessed, things didn’t go that well the second time: that LUN stayed degraded, even though all the rebuild operations had completed and all the physical disks were reported as "optimal". It turned out that the interface and the internal controller state disagreed on the status of a drive. On our vendor’s suggestion, we replaced that drive, a new rebuild started, and then abruptly stopped mid-course: the state of the LUN was still "degraded".

And then, we had the sensible yet completely foolish idea of calling vendor support on a weekend.

Hilarity and data loss ensued.

Never trust your hardware vendor support on weekends

We were in a situation where a LUN was degraded, and a recently failed drive had just failed to rebuild, yet was showing up as “optimal” in the management interface. The vendor support technician then had the brilliant idea of forcefully “reviving” that drive. Which had the immediate effect of putting back online a drive that had been only partially reconstructed, i.e. on which 100% of the data had to be considered bit waste.
And the LUN stayed in that state, serving ridiculously out-of-sync, inaccurate and pretty much random data to our Lustre OSS servers for about 15 minutes. Fifteen minutes. Nine hundred full-size seconds. A lot of bad things can (and did) happen in 900 seconds.

Luckily, the Lustre filesystem quickly realized it was being lied to, so it did the only sane thing to do: it blocked all I/O and set that device read-only. Of course, some filesystem-level corruption happened in the process.

We had to bring that storage target down and check it multiple times with fsck to restore the consistency of its data structures. About 1,500 corrupted entries were found, detached from the local filesystem map and stored in the lost+found directory. That means that all those 1,500 objects, which were previously part of files, were now orphaned from the filesystem, as it had no way of knowing which files they belonged to anymore. So it tossed them in lost+found, as it couldn’t do much else with them.

And on our cluster, users trying to access those files were kindly greeted with an error message which, as error messages sometimes do, had no obvious connection to the matter at hand: cannot allocate memory.

With (much better) support from our filesystem vendor, we were able to recover a vast majority of those 1,500 files, and re-attach them to the filesystem, where they originally were. For Lustre admins, the magic word here is ll_recover_lost_found_objs.

So in the end, we “only” lost 29 files in the battle. We contacted each one of the owners to let them know about the tragic fate of their files, and most of them barely flinched, their typical response being: "Oh yeah, I know that’s temporary storage anyway, let me upload a new copy of that file from my local machine".

We know, we’re blessed with terrific users.

The tablecloth trick

Now, this was just for starters: we hadn’t really had a chance to tackle the real issue yet. We were merely absorbing the fallout of that initial drawer failure, but we hadn’t done anything to address the high failure rate of our disk drives.

Our hardware vendor, well aware of the underlying reliability issue, as the same scenario had happened in other places too, kindly agreed to replace all of our remaining original disks. That is, about 300 of them:

disk array   already replaced HDDs   total HDDs   HDDs to replace
DA00                            16           60                44
DA01                            15           60                45
DA02                            14           60                46
DA03                            13           60                47
DA04                             8           60                52
DA05                            15           60                45
total                           81          360               279

The strategy devised by that same vendor was:

"We’ll send you a whole new disk array, filled with new disks, and you’ll replicate your existing data there".

To which we replied:

“Uh, sorry, that won’t work. You see, those arrays are part of a larger Lustre filesystem, we can’t really replicate data from one to another without a downtime. And we would need a downtime long enough to allow us to copy 240TB of data. Six times, 'cause you know, we have six arrays. Oh, and our users don’t like downtimes.”

So we had to find another way.

Our preference was to minimize manipulations on the filesystem and keep it online as much as possible during this big disk replacement operation. So we leaned toward the path of least resistance, and let the RAID controllers do what they do best: compute parities and write data. So we ended up removing each one of those bad disks, one at a time, replacing it with a new disk, and letting the controller rebuild the LUN.

Each rebuild operation took about 24 hours, so obviously, replacing ~300 disks one at a time wasn’t such a thrilling idea: assuming somebody would be around 24/7 to swap in a new drive as soon as the previous one finished, that would make the whole operation last almost a full year. Not very practical.

So we settled on doing them in batches, replacing one disk in each of the 36 LUNs in each batch. That would allow the RAID controllers to rebuild several LUNs in parallel, and cut the overall length of the operation. Instead of 300 sequential 24-hour rebuilds, we would only need 5 waves of disk replacements, which shouldn’t take more than a couple of weeks total.

Should we mention the fact that our adored vendor mentioned that, since we were using RAID6, if we wanted to speed things even more, we could potentially consider replacing two drives at a time in each LUN, but that they wouldn’t recommend it? Nah, right, we shouldn’t.

Remove the disks, keep the data

So they went away and shipped us new disks. That’s where the “tablecloth trick” analogy is fully realized: we were indeed removing disk drives from our filesystem, while keeping the data intact, and inserting new disks underneath to replace them. Which would really be like pulling the tablecloth, putting a new one in place, and keeping the dishes intact.

tablecloth

But you know, things never go as planned, and while we started replacing that first batch of disks, we realized that those unreliable drives? Well, they were really unreliable.

When things go south

No fewer than five additional disks failed during that same first wave of rebuilds. Four of them in the same array (DA00). To make things worse, in one of those LUNs, one additional disk failed during the rebuild and then, unreadable sectors were encountered on a 3rd disk. Which led to data loss and a corrupted LUN.

We contacted our vendor, who basically said: "LUN is lost, restore from backup". Ha ha! Of course, we have backups for a 3PB Lustre filesystem, and we can restore an individual OST without wreaking complete havoc in the rest of the filesystem’s coherency. For some reason, our vendor support recommended deleting the LUN, recreating it, and letting the Lustre filesystem re-populate the data. We are still trying to understand what they meant.

On the bright side, they engaged our software vendor, to provide some more assistance at the filesystem level and devise a recovery strategy. We had one of our own rolling already, and it turned out it was about the same.

Relocating files

Since we still had access to the LUN, our approach was to migrate all the files out of that LUN as quickly as possible and relocate them on other OSTs in Lustre, re-initialize the LUN at the RAID level, and then reformat it and re-insert it in the filesystem. Or, more precisely:

  1. deactivate the OST on MDT to avoid new object creation,
  2. use lfs_migrate to relocate files out of that OST, using either Robinhood or the results of lfs find to identify files residing on that OST (the former can be out of date, the latter was quite slow),
  3. make sure the OST was empty (lfs find again),
  4. disable the OST on clients, so they didn’t use it anymore,
  5. reactivate the OST on the MDS to clear up orphaned objects (while the OST is disconnected from the MDT, file relocations are not synchronized to the OST, so objects are orphaned there and take up space unnecessarily),
  6. backup the OST configuration (so it could be recreated with the same parameters, including its index),
  7. reinitialize the LUN in the disk array, and retain its configuration (most importantly its WWID),
  8. reformat the OST with Lustre,
  9. restore the OST configuration (especially its index),
  10. reactivate the OST.
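
Step 2 is where most of the time goes. As a minimal sketch, the file list from lfs find can be piped straight into lfs_migrate. The Python helper below only builds that shell pipeline and does not run anything; the helper name, the mount point and the OST index are hypothetical values for illustration, and the options should be checked against the Lustre version in use.

```python
import shlex

def migration_pipeline(mount_point: str, ost_index: int) -> str:
    """Build (but do not run) a shell pipeline that lists the files with
    objects on one OST and hands them over to lfs_migrate.

    mount_point and ost_index are illustrative values, not Sherlock's.
    """
    find_cmd = ["lfs", "find", "--ost", str(ost_index), mount_point]
    migrate_cmd = ["lfs_migrate", "-y"]  # -y: do not prompt for confirmation
    return " | ".join(shlex.join(cmd) for cmd in (find_cmd, migrate_cmd))

print(migration_pipeline("/scratch", 12))
```

The same lfs find invocation, run again once the migration is over, covers step 3: an empty result means no file has objects left on that OST.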

What can go wrong in a 10-step procedure? Turns out, it kind of all stopped at step 1.

Making sure nobody writes files to a LUN anymore

In order to be able to migrate all the files from an OST, you need to make sure that nobody can write new files to it anymore. How could you empty an OST if new files keep being created on it?
There are several approaches to this, but it took us a few tries to get it right.

First, you can try to ‘deactivate’ the OST by making it read-only on the MDT. It means that users can still read the existing files on the OST, but the MDT won’t consider it for new file creations. Sounds great, except for one detail: when you do this, the OST is disconnected from the MDT, so inodes occupied by files that are being migrated are not reported as freed up to the MDT. The consequence is that the MDT still thinks that the inodes are in-use, and you end up in a de-synchronized state, with orphaned inodes on your OST. Not good.

So you need, at some point, to reconnect your OST to the MDT. Except as soon as you do this, new files get created on it, and you need to deactivate the OST, migrate them again, and bam, new orphan inodes again. Back to square one.

Another method is to mark the OST as "degraded", which is precisely designed to handle such cases: OSTs undergoing maintenance, or rebuilding RAIDs, during which period the OST shouldn’t be used to create new files. So, we went ahead and marked our OST as "degraded". Until we realized that files were still being created on it. It turns out that this was because of some uneven usage in our OSTs (they were added to the filesystem over time, so they were not all filled to the same level): if utilization is too unbalanced among OSTs, the Lustre QOS allocator will ignore the “degraded” flag on OSTs, and prioritize rebalancing usage over obeying OST degradation flags.

Our top-notch filesystem vendor support suggested an internal setting to set on the OST (fail_loc=0x229, don’t try this at home) to artificially mark the OST as "out-of-space", which would carry both benefits of leaving it connected to the MDT for inode cleanup, and preventing new file creation there. Unfortunately, this setting had the unexpected side effect of making load spike on the MDS, practically rendering the whole filesystem unusable.

So we ended up deciding to temporarily sacrifice good load balancing across OSTs, and disabled the QOS allocator. This allowed us to mark our OST as "degraded", keep it connected to the MDT so inodes associated with migrated files would effectively be cleaned, while preventing new file creation. This worked great.
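
The interaction between the "degraded" flag and the QOS allocator can be illustrated with a toy model. This is not Lustre’s actual allocation code: the function name, the imbalance threshold and the selection rules are assumptions that merely mimic the behavior described above.

```python
from dataclasses import dataclass

@dataclass
class Ost:
    name: str
    used_fraction: float   # 0.0 (empty) to 1.0 (full)
    degraded: bool = False

def candidate_osts(osts, qos_enabled=True, imbalance_threshold=0.2):
    """Return the OSTs a toy allocator would consider for new files.

    Toy rules (assumptions, mimicking the behavior described in the text):
    - balanced usage, or QOS disabled: only non-degraded OSTs are offered;
    - unbalanced usage with QOS enabled: the emptiest OSTs are favored
      and the degraded flag is ignored.
    """
    usage = [o.used_fraction for o in osts]
    unbalanced = max(usage) - min(usage) > imbalance_threshold
    if qos_enabled and unbalanced:
        emptiest = min(usage)
        return [o.name for o in osts
                if o.used_fraction - emptiest <= imbalance_threshold]
    return [o.name for o in osts if not o.degraded]

osts = [
    Ost("OST0000", 0.85),
    Ost("OST0001", 0.80),
    Ost("OST0002", 0.30, degraded=True),  # the OST being emptied
]
print(candidate_osts(osts, qos_enabled=True))    # degraded OST still offered
print(candidate_osts(osts, qos_enabled=False))   # degraded OST skipped
```

In this model, turning the QOS behavior off trades balanced usage for a guarantee that the degraded flag is honored, which is the trade-off we accepted.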

lfs_migrates

We let our migration complete, and at the end both OSTs were completely empty, devoid of any file.

Zombie LUN

Because any good story needs zombies.

Once we had finished emptying our OSTs, we then needed to fix them at the RAID level. Because, remember, everything went to hell after multiple disk failures during a LUN rebuild. Meaning that in their current state, those two LUNs were unusable and had to be re-initialized. We had high hopes we would be able to do this from the disk array management tools. Unfortunately, our hardware vendor didn’t think it would be possible, and strongly recommended destroying the LUN and rebuilding it with the same disks.

The problem with that approach is that this would have generated a different identifier for our LUNs, meaning we would have had to change the configuration of our multipath layer, and more importantly, swap old WWIDs with the new ones in our Lustre management tool. Which is not supported.

Thing is, we’re kind of stubborn. And we didn’t want to change WWIDs. So we looked for a way to re-initialize those LUNs in-place. Sure enough, failing multiple drives in the LUN rendered it inoperable. And nothing in the GUI seemed to be possible from there, besides "calling support for assistance". And you know, we tried that before, so no thanks we’ll pass.

Finally, exploring the CLI options, we found one (revive diskGroup) that did exactly what we were looking for: after replacing the 2 defective disks (which made the LUN fail), we revived it from the CLI, and it happily sprang to life again. With all its parameters intact, so from the servers’ point of view, it was like nothing ever happened.

Restore Lustre

So all that was left to do was to reformat the OSTs and restore the parameters we had backed up before failing and reviving the LUNs.

Wrap up

Everything was a smooth ride from there. While working on repairing our two failed OSTs, we were continuously replacing those ~300 defective hard drives, one per LUN at a time, and monitoring the rebuild processes. At any given time, we had something like 36 LUNs rebuilding (6 arrays, 6 LUNs each) to maximize the throughput.

Disk replacement

Our hardware vendor was sending us replacement drives in batches, and we’ve been replacing 1 disk in each LUN pretty much every day for about 3 weeks.
We built a tool to follow the replacements and select the next disks to replace (obviously placement was important, as we didn’t want to remove multiple disks from the same LUN). The tool showed the number of disks left to replace and the status of current rebuilds and, when possible, selected the next disks to replace by making them blink in the disk arrays.
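
The selection rule at the heart of such a tool can be sketched in a few lines. The original tool is not shown here, so the data structures, the LUN and slot names, and the function name below are all hypothetical.

```python
def next_batch(pending, rebuilding):
    """Pick at most one disk to replace per LUN, skipping LUNs that are
    still rebuilding.

    pending:    dict mapping a LUN name to the list of old disks left in it
    rebuilding: set of LUN names with a rebuild in progress
    (Both structures are hypothetical; the original tool is not shown.)
    """
    batch = {}
    for lun, disks in sorted(pending.items()):
        if lun in rebuilding or not disks:
            continue
        batch[lun] = disks[0]  # never more than one disk out of a LUN
    return batch

pending = {
    "DA00-LUN0": ["drawer1-slot3", "drawer2-slot7"],
    "DA00-LUN1": ["drawer1-slot4"],
    "DA01-LUN0": [],
}
print(next_batch(pending, rebuilding={"DA00-LUN1"}))
print(sum(len(disks) for disks in pending.values()), "disks left to replace")
```

Keying the batch by LUN makes the one-disk-per-LUN constraint hold by construction, whatever order the disks are listed in.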

The end

Just because that’s how lucky we are, another drawer failed during the last rounds of disk replacements. It took an extra few days to get a replacement on site and install it. Fortunately, no unreadable sectors were encountered during the recovery.

It took a few more days to clear out the remaining drawer and controller errors and to make sure that everything was stable and in running order. The official end of the operation was declared on May 17th, 2016, about 4 months after the initial double-disk failure.

We definitely learned a lot in the process, way more than we could ever have dared to ask. And it was quite the adventure, the kind that we hope will not happen again. But considering all that happened, we’re very glad the damage was limited to a handful of files and didn’t have a much broader impact.