Sherlock changelog

Tracking NFS problems down to the SFP level

2021-02-05T18:20:00Z

This is part of our technical blog series about things that happen behind-the-scenes on Sherlock, and which are part of our ongoing effort to keep it up and running in the best possible conditions for our beloved users.

For quite a long time, we've been chasing down an annoying NFS timeout issue that seemed to only affect Sherlock 3.0 nodes.

That issue would impact both login and compute nodes, both NFSv4 user mounts (like $HOME and $GROUP_HOME) and NFSv3 system-level mounts (like the one providing software installations), and occur at random times, on random nodes. It was not widespread enough to be causing real damage, but from time to time, an NFS mount would hang and block I/O for a job, or freeze a login session. When that happened, the node would still be able to ping all of the NFS servers' IP addresses, even remount the same NFS file system with the exact same options in another mount point, and no discernible network issue was apparent on the nodes. Sometimes, the stuck mounts would come back to life on their own, sometimes they would stay hanging forever.

Is it load? Is it the kernel? Is it the CPU?

It kind of looked like it could be correlated with load, and mostly appear when multiple jobs were doing NFS I/O on a given node, but we never found conclusive proof that it was the case. The only distinguishable element was that the issue was only observed on Sherlock 3.0 nodes and never affected older Sherlock 1.0/2.0 nodes. So we started suspecting something about the kernel NFS client, maybe some oddity with AMD Rome CPUs: after all, they were all quite new, and the nodes had many more cores than the previous generation. So maybe they had more trouble handling the parallel operations, ended up with a deadlock or something.

But still, all the Sherlock nodes are using the same kernel, and only the Sherlock 3.0 nodes were affected, so it appeared unlikely to be a kernel issue.

The NFS servers maybe?

We then started looking at the NFS servers. Last December’s maintenance was actually an attempt at resolving those timeout issues, even though it proved fruitless in that respect. We got in touch with vendor support to explore possible explanations, but nothing came out of it and our support case went nowhere. Plus, if the NFS servers were at fault, it would likely have affected all Sherlock nodes, not just a subset.

It’s the NFS client parameters! Or is it?

So back to the NFS client: we started looking at the NFS client mount parameters. The petazillion web hits about "nfs timeout" didn't really help in that matter, but in the process we found pretty interesting discussions about read/write sizes and read-ahead. We tried tweaking all of those parameters left and right, deployed various configs on the compute nodes (A/B testing FTW!), but the timeout still happened.

The lead

In the end, what gave us a promising lead was an article found on the GRNET blog that explains how the authors tracked down a defective QSFP that was causing issues in their Ceph cluster. Well, it didn't take long to realize that there was a similar issue between those Sherlock nodes and the NFS servers. Packet loss was definitely involved.

The tricky part, as described in the blog post, is that the packet loss only manifested itself when using large ICMP packets, close to the MTU upper limit. When using regular packet size, no problem was apparent.

For instance, this regular ping didn't show any loss:

# ping -c 50 10.16.90.1 | grep loss
50 packets transmitted, 50 received, 0% packet loss, time 538ms

But when cranking up the packet size:

# ping -s 8972 -c 50 10.16.90.1 | grep loss
50 packets transmitted, 36 received, 28% packet loss, time 539ms
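
The 8972-byte payload is not arbitrary. Assuming the storage network runs with a 9000-byte jumbo-frame MTU (the MTU is not stated explicitly here, so that value is an assumption), it is the largest ICMP echo payload that still fits in a single unfragmented IPv4 packet once the IPv4 header (20 bytes, without options) and the ICMP header (8 bytes) are subtracted. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(max_ping_payload(9000))  # 8972 (assumed jumbo-frame MTU)
print(max_ping_payload(1500))  # 1472 (standard Ethernet MTU)
```

Probing at that size exercises full-size frames, which is where this kind of loss showed up.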

What was even funnier is that not all Sherlock 3.0 nodes were experiencing loss to the same NFS server nodes. For instance, from one client node, there was packet loss to just one of the NFS servers:

client1# clush -Lw 10.16.90.[1-8] --worker=exec ping -s 8972 -M do -c 10 -q %h | grep loss
10.16.90.1: 10 packets transmitted, 10 received, 0% packet loss, time 195ms
10.16.90.2: 10 packets transmitted, 8 received, 20% packet loss, time 260ms
10.16.90.3: 10 packets transmitted, 10 received, 0% packet loss, time 193ms
10.16.90.4: 10 packets transmitted, 10 received, 0% packet loss, time 260ms
10.16.90.5: 10 packets transmitted, 10 received, 0% packet loss, time 200ms
10.16.90.6: 10 packets transmitted, 10 received, 0% packet loss, time 264ms
10.16.90.7: 10 packets transmitted, 10 received, 0% packet loss, time 196ms
10.16.90.8: 10 packets transmitted, 10 received, 0% packet loss, time 194ms

But from another client, sitting right next to it, there was no loss to that server, but packets were dropped to another one instead:

client2# clush -Lw 10.16.90.[1-8] --worker=exec ping -s 8972 -M do -c 10 -q %h | grep loss
10.16.90.1: 10 packets transmitted, 8 received, 20% packet loss, time 190ms
10.16.90.2: 10 packets transmitted, 10 received, 0% packet loss, time 198ms
10.16.90.3: 10 packets transmitted, 10 received, 0% packet loss, time 210ms
10.16.90.4: 10 packets transmitted, 10 received, 0% packet loss, time 197ms
10.16.90.5: 10 packets transmitted, 10 received, 0% packet loss, time 196ms
10.16.90.6: 10 packets transmitted, 10 received, 0% packet loss, time 243ms
10.16.90.7: 10 packets transmitted, 10 received, 0% packet loss, time 201ms
10.16.90.8: 10 packets transmitted, 10 received, 0% packet loss, time 213ms

The link

That all started to sound like a faulty stack link, or a problem in one of the LACP links between the different switch stacks (Sherlock's and the NFS servers’). We didn't find anything obviously out-of-place in reviewing the switches’ configuration, so we went back to the switches’ documentation to try to understand how to check counters and identify bad links (which gave us the opportunity to mumble about documentation that is not in sync with actual commands, but that’s another topic...).

So we dumped the hardware counters for each link involved in the NFS connections, and on a switch, on the NFS client’s side, there was this:

te1/45 Ingress FCSDrops : 0
te1/46 Ingress FCSDrops : 0
te2-45 Ingress FCSDrops : 0
te2-46 Ingress FCSDrops : 0
te3-45 Ingress FCSDrops : 0
te3-46 Ingress FCSDrops : 1064263014
te4-45 Ingress FCSDrops : 0
te4-46 Ingress FCSDrops : 0

Something standing out, maybe?

In more detail:

#show interfaces te3/46
TenGigabitEthernet 3/46 is up, line protocol is up
Port is part of Port-channel 98
[...]
Input Statistics:
18533249104 packets, 35813681434965 bytes
[...]
1064299255 CRC, 0 overrun, 0 discarded

The CRC number indicates the number of CRC failures, packets which failed checksum validation. All the other ports on the switch were at 0. So clearly something was off with that port.
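
With many ports to review, spotting the odd one out is easier with a small filter. The sketch below parses a counter dump in the same textual layout as the excerpt above and keeps only the ports with a non-zero count. The function name and the regular expression are illustrative assumptions, not part of any switch tooling:

```python
import re

# Counter dump copied from the excerpt above.
COUNTER_DUMP = """\
te1/45 Ingress FCSDrops : 0
te1/46 Ingress FCSDrops : 0
te2-45 Ingress FCSDrops : 0
te2-46 Ingress FCSDrops : 0
te3-45 Ingress FCSDrops : 0
te3-46 Ingress FCSDrops : 1064263014
te4-45 Ingress FCSDrops : 0
te4-46 Ingress FCSDrops : 0
"""

# "<port> Ingress FCSDrops : <count>"
LINE_RE = re.compile(r"^(\S+)\s+Ingress FCSDrops\s*:\s*(\d+)$")


def ports_with_fcs_drops(dump: str) -> dict[str, int]:
    """Return {port: drop_count} for every port with a non-zero counter."""
    drops = {}
    for line in dump.splitlines():
        match = LINE_RE.match(line.strip())
        if match and int(match.group(2)) > 0:
            drops[match.group(1)] = int(match.group(2))
    return drops


print(ports_with_fcs_drops(COUNTER_DUMP))  # {'te3-46': 1064263014}
```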

The culprit: a faulty SFP!

We decided to try to shut that port down (after all, it's just 1 port out of an 8-port LACP link), and immediately, all the packet loss disappeared.

So we replaced the optical transceiver in that port, hoping that swapping that SFP would resolve the CRC failure problem. After re-enabling the link, the number of dropped packets seemed to have decreased. But the drops had not totally disappeared…

The real culprit: the other SFP

Thinking a little more about it, since the errors were actually Ingress FCSDrops on the switch, it didn’t seem completely unreasonable to consider that those frames were received by the switch already corrupted, and thus, that they would have been mangled by either the transceiver on the other end of the link, or maybe in-flight by a damaged cable. So maybe we had been pointing fingers at an SFP, and maybe it was innocent… 😁

We checked the switch’s port on the NFS server’s side, and the checksum errors and drop counts were all at 0. We replaced that SFP anyway, just to see, and this time, bingo: no more CRC errors on the other side.

Which led us to the following decision tree:

if a port has RX/receiving/ingress errors, it’s probably not its fault, and the issue is most likely with its peer at the other end of the link,
if a port has TX/transmitting/egress errors, it’s probably the source of the problem,
if both ports at each end of a given link have errors, the cable is probably at fault.
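
The three rules above can be written down as a small function. This is only a sketch of the rule of thumb as stated, with hypothetical names (`PortCounters`, `likely_culprit`), and it is no substitute for actually checking both ends of a link:

```python
from dataclasses import dataclass


@dataclass
class PortCounters:
    name: str
    rx_errors: int = 0  # ingress errors (CRC/FCS failures on received frames)
    tx_errors: int = 0  # egress errors

    @property
    def has_errors(self) -> bool:
        return self.rx_errors > 0 or self.tx_errors > 0


def likely_culprit(a: PortCounters, b: PortCounters) -> str:
    """Apply the rule of thumb to the two ports at each end of one link."""
    if a.has_errors and b.has_errors:
        return "cable"  # errors on both ends: suspect the cable
    for port, peer in ((a, b), (b, a)):
        if port.tx_errors > 0:
            return port.name  # TX errors: the port itself is the likely source
        if port.rx_errors > 0:
            return peer.name  # RX errors: the peer most likely sent bad frames
    return "none"


# The situation in this post: RX errors on the client-side switch port only.
client_side = PortCounters("client-side port", rx_errors=1064299255)
nfs_side = PortCounters("NFS-side port")
print(likely_culprit(client_side, nfs_side))  # NFS-side port
```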

By the way, if you’re wondering, here’s what an SFP looks like:

TL;DR

We had seemingly random NFS timeout issues. They turned out to be caused by a defective SFP that was eventually identified through the port error counter of the switch at the other end of the link.

There's probably a lesson to be learned here, and we were almost disappointed that DNS was not involved (because it's always DNS), but in the end, we were glad to finally find a rational explanation for those timeouts. And since that SFP replacement, not a single NFS timeout has been logged.

Sherlock is hard at work against COVID-19

2020-04-14T15:04:00.001Z

About a month ago, we announced that we were dedicating a portion of Sherlock’s computing resources to research projects around COVID-19.

Since then, more than 15 PIs and research groups have reached out to share their projects, and their time-critical need for more computing resources in the global pandemic context. All of these projects have been granted access to Sherlock’s dedicated COVID-19 resources, and those research teams have been hard at work ever since.

As with pretty much everything that runs on Sherlock, it’s been amazing to see the breadth and variety of those research projects. They touch a stunningly wide variety of subjects, covering many of the biological, social and economic aspects of this disease.

From exploring CRISPR-based treatments, analyzing the SARS-CoV-2 genome and the virus-host interactions, to modeling the COVID-19 epidemiological spread in different locations, studying cellphone location data to understand contact patterns, improving statistical models of the spread and helping policy makers to make informed decisions, to modeling alternative ventilator designs, the research projects that now benefit from dedicated resources on Sherlock come from many of the Stanford Schools, from Medicine and Engineering to Humanities and Sciences.

To continue supporting this vast and collaborative endeavor, we’d like to renew our call and encourage more of our user community working on COVID-19 to reach out if they need more resources. With the generous contributions of the School of Humanities and Sciences, we’ve even been able to increase the amount of computing power dedicated to those critical projects.

Again, Sherlock owners: if you have dedicated compute nodes on Sherlock that you’d like to contribute to this essential computing effort, please let us know, and we’ll be happy to add them to the dedicated resources pool, so all of these projects around coronavirus could benefit from more computing power.

And if your lab is not doing COVID-19 research, don’t worry. Our primary mission is still to support research at Stanford, and research continues. So if you need assistance with computation or data challenges while your lab is closed, please feel free to fill out this short survey and we’ll try our best to help.

Adventures in storage

2019-12-03T23:30:00.001Z

This is part of our blog series about behind-the-scenes things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users.
Now that Sherlock’s old storage system has been retired, we can finally tell that story. It all happened in 2016.

Or: How we replaced more than 1 PB of hard drives, while continuing to serve files to unsuspecting users.

TL;DR: The parallel filesystem in Stanford’s largest HPC cluster has been affected by frequent and repeated hard-drive failures since its early days. A defect was identified that affected all of the 360 disks used in 6 different disk arrays. A major swap operation was planned to replace the defective drives. Multiple hardware disasters piled up to make matters worse, but in the end, all of the initial disks were replaced, while retaining 1.5 PB of user data intact, and keeping the filesystem online the whole time.

History and context

Once upon a time, in a not so far away datacenter…

We, the Stanford Research Computing Center, manage many high-performance computing and storage systems at Stanford. In 2013, in an effort to centralize resources and advance computational research, a new HPC cluster, Sherlock, was deployed. To provide best-in-class computing resources to all faculty and facilitate research in all fields, this campus-wide cluster features a high-performance, Lustre-based parallel filesystem.

This filesystem, called /scratch, was designed to provide high-performance storage for temporary files during simulations. Initially made of three I/O cells, the filesystem had been designed to be easily expanded with more hardware as demand and utilization grew. Each I/O cell was comprised of:

2x object storage servers,
2x disk arrays, with:
- dual RAID controllers,
- 5 drawers of 12 disks each,
- 60 4TB SAS disks total.

Each disk array was configured with 6x 10-disk RAID6 LUNs, and every SAS path being redundant, the two OSS servers could act as a high-availability pair. This is a pretty common Lustre setup.

Close to a petabyte in size, this filesystem quickly became the go-to solution for many researchers who didn’t really have any other option to store and compute against their often large data sets. Over time, the filesystem was expanded several times and eventually more than doubled in size:

	# disk arrays	# OSTs	# disks	size
initially	6	36	360	1.1 PB
ultimately	18	108	1080	3.4 PB
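
Those sizes follow from the layout described above. Here is a back-of-the-envelope sketch, assuming decimal terabytes, one OST per LUN, and ignoring filesystem and formatting overhead (so the figures are upper bounds, and the function names are ours):

```python
DISK_TB = 4              # 4TB SAS disks
DISKS_PER_LUN = 10       # 10-disk RAID6 LUNs
RAID6_PARITY_DISKS = 2   # RAID6 spends two disks' worth of capacity on parity


def lun_usable_tb() -> int:
    """Raw usable capacity of a single RAID6 LUN, in TB."""
    return (DISKS_PER_LUN - RAID6_PARITY_DISKS) * DISK_TB


def filesystem_usable_pb(num_luns: int) -> float:
    """Raw usable capacity of the whole filesystem, in PB."""
    return num_luns * lun_usable_tb() / 1000


print(lun_usable_tb())            # 32
print(filesystem_usable_pb(36))   # 1.152
print(filesystem_usable_pb(108))  # 3.456
```

Both results line up with the roughly 1.1 PB and 3.4 PB listed in the table.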

As the filesystem grew, it ended up containing close to 380 million inodes (that is, filesystem entries, like files, directories or links). Please keep that in mind: it turns out that’s an important factor for the following events.

The initial issue

All was fine and dandy in storage land, and we had our share of failing disks, like everybody. We were replacing them as they failed, sending them back to our vendor, and getting new ones in return. Datacenter business as usual.

Except, a lot of disks were failing. Like, really a lot, as in one every other day.

We eventually came to the conclusion that our system had been installed with a batch of disks with shorter-than-average lifespans. They were all from the same disk vendor, manufactured around the same date. But we didn’t worry too much.

Until the day when 2 disks failed within 3 hours of each other. In the same disk array. In. The. Same. LUN.

To give some context, one failed drive in a 10-disk RAID6 array is no big deal: data can be reconstructed from the 9 remaining physical disks without any problem. If by any chance one of those remaining disks suffers from a problem and data cannot be read from it, there is still enough redundancy to reconstruct the missing data and all is well.

A single drive failure is handled quite transparently by the disk array:

it emits an alert,
you replace the failed disk,
it detects the drive has been replaced,
it starts rebuilding it from data and parity on the other disks of the LUN,
about 24 hours later, you have a brand new LUN, all shiny and happy again.

But two failed disks, on the other hand, that’s pretty much like a Russian roulette session: you may be lucky and pull it off, but there’s a good chance you won’t. While the LUN misses 2 disks, there is no redundancy left to reconstruct the data. Meaning that any read error on any of the remaining 8 disks will lead to data loss as the controller won’t be able to reconstruct anything. And worse, any bit flip during reads will go completely unnoticed, as there is no parity left to check the data. Which means that you can potentially be reconstructing completely random garbage on your drives.

Given that, it didn’t take us long to pick up the phone and call our vendor.

They confirmed our initial findings that in our initial set of 6 disk arrays, over the course of 2 years, we had already replaced about 60 disks out of 360. At a rate of 5-10 failures per month. Way higher than expected.

The LUN rebuild eventually completed fine, without any problem, but that double-failure acted as a serious warning. So we started thinking about ways to solve our problem. And that’s when the sticky cheese hit the fan…

When problems pile up

Three days after the double failure, we had an even more important hardware event: one drawer in another disk array misbehaved, reporting itself as degraded, and 6 disks failed in that same drawer over the course of a few minutes. A 7th disk was evicted a few hours later, and left 2 LUNs without any parity in that single array. Joy all over. In a few minutes, the situation we were dreading a few days earlier just happened twice in the same array. We were a disk away from losing serious amounts of data (we’re talking 30TB per LUN). And as past experience proved, those disks were not of the most reliable kind…

We got our vendor to dispatch a replacement drawer to us under the terms of our H+4 support contract. Except they didn’t have any replacement drawer in stock that they could get to us in 4 hours. So they overnight’d it and we got the replacement drawer the following day.

We diligently replaced the drawer and the rebuild started on those 7 drives in the disk array. Which, yes, means that one LUN was rebuilding without any redundancy. Like the one from the other disk array the week before. And as everyone probably guessed, things didn’t go that well the second time: that LUN stayed degraded, despite all the rebuild operations being done and all the physical disks’ states being "optimal". It turned out that the interface and the internal controller state disagreed on the status of a drive. On our vendor’s suggestion, we replaced that drive, a new rebuild started, and then abruptly stopped mid-course: the state of the LUN was still "degraded".

And then, we had the sensible yet completely foolish idea of calling vendor support on a week-end.

Hilarity and data loss ensued.

Never trust your hardware vendor support on week-ends

We were in a situation where a LUN was degraded, and a recently failed drive had just failed to rebuild, yet was showing up as “optimal” in the management interface. The vendor support technician then had the brilliant idea of forcefully “reviving” that drive. Which had the immediate effect of putting back online a drive that had been partially reconstructed, i.e. on which 100% of the data had to be considered bit waste.
And the LUN stayed in that state, serving ridiculously out-of-sync, inaccurate and pretty much random data to our Lustre OSS servers for about 15 minutes. Fifteen minutes. Nine hundred full-size seconds. A lot of bad things can (and did) happen in 900 seconds.

Luckily, the Lustre filesystem quickly realized it was lied to, so it did the only sane thing to do, blocked all I/O and set that device read-only. Of course, some filesystem-level corruption happened during the process.

We had to bring that storage target down and check it multiple times with fsck to restore its data structure consistency. About 1,500 corrupted entries were found, detached from the local filesystem map and stored in the lost+found directory. That means that all those 1,500 objects, which were previously part of files, were now orphaned from the filesystem, as it had no way of knowing which file they belonged to anymore. So it tossed them in lost+found as it couldn’t do much else with them.

And on our cluster, users trying to access those files were kindly greeted with an error message, which, as error messages sometimes are, was unintuitively related to the matter at hand: cannot allocate memory.

With (much better) support from our filesystem vendor, we were able to recover a vast majority of those 1,500 files, and re-attach them to the filesystem, where they originally were. For Lustre admins, the magic word here is ll_recover_lost_found_objs.

So in the end, we “only” lost 29 files in the battle. We contacted each one of the owners to let them know about the tragic fate of their files, and most of them barely flinched, their typical response being: "Oh yeah, I know that’s temporary storage anyway, let me upload a new copy of that file from my local machine".

We know, we’re blessed with terrific users.

The tablecloth trick

Now, this was just starters, we hadn’t really had a chance to tackle the real issue yet. We were merely absorbing the fallout of that initial drawer failure, but we hadn’t done anything to address the high failure rate of our disk drives.

Our hardware vendor, well aware of the underlying reliability issue, as the same scenario had happened in other places too, kindly agreed to replace all of our remaining original disks. That is, about 300 of them:

disk array	HDDs already replaced	total HDDs	HDDs to replace
DA00	16	60	44
DA01	15	60	45
DA02	14	60	46
DA03	13	60	47
DA04	8	60	52
DA05	15	60	45
total	81	360	279

The strategy devised by that same vendor was:

"We’ll send you a whole new disk array, filled with new disks, and you’ll replicate your existing data there".

To which we replied:

“Uh, sorry, that won’t work. You see, those arrays are part of a larger Lustre filesystem, we can’t really replicate data from one to another without a downtime. And we would need a downtime long enough to allow us to copy 240TB of data. Six times, 'cause you know, we have six arrays. Oh, and our users don’t like downtimes.”

So we had to find another way.

Our preference was to minimize manipulations on the filesystem and keep it online as much as possible during this big disk replacement operation. So we leaned toward the path of least resistance, and let the RAID controllers do what they do best: compute parities and write data. So we ended up removing each one of those bad disks, one at a time, replacing it with a new disk, and let the controller rebuild the LUN.

Each rebuild operation took about 24 hours, so obviously, replacing ~300 disks one at a time wasn’t such a thrilling idea: assuming somebody would be around 24/7 to swap a new drive as soon as the previous one finished, that would make the whole operation last almost a full year. Not very practical.

So we settled on doing them in batches, replacing one disk in each of the 36 LUNs in each batch. That would allow the RAID controllers to rebuild several LUNs in parallel, and cut the overall length of the operation. Instead of 300 sequential 24-hour rebuilds, we would only need 5 waves of disk replacements, which shouldn’t take more than a couple weeks total.
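
To make the trade-off concrete, here is a rough sketch of the arithmetic. It assumes every rebuild takes the full 24 hours and that LUNs rebuild fully in parallel. The per-LUN counts in the example are made up for illustration and are not our actual inventory:

```python
REBUILD_HOURS = 24  # approximate duration of one LUN rebuild


def sequential_days(num_disks: int) -> float:
    """Duration when disks are swapped strictly one after another."""
    return num_disks * REBUILD_HOURS / 24


def waves_needed(disks_to_replace_per_lun: list[int]) -> int:
    """With one disk per LUN per wave, the busiest LUN sets the wave count."""
    return max(disks_to_replace_per_lun, default=0)


def batched_days(disks_to_replace_per_lun: list[int]) -> float:
    """Lower bound on the duration when all LUNs rebuild in parallel."""
    return waves_needed(disks_to_replace_per_lun) * REBUILD_HOURS / 24


print(sequential_days(300))     # 300.0

# Hypothetical example: three LUNs with 2, 4 and 3 disks left to replace.
print(waves_needed([2, 4, 3]))  # 4
print(batched_days([2, 4, 3]))  # 4.0
```

The batched duration no longer depends on the total number of disks, only on how many have to be replaced in the busiest LUN.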

Should we mention the fact that our adored vendor mentioned that, since we were using RAID6, if we wanted to speed things even more, we could potentially consider replacing two drives at a time in each LUN, but that they wouldn’t recommend it? Nah, right, we shouldn’t.

Remove the disks, keep the data

So they went away and shipped us new disks. That’s where the “tablecloth trick” analogy is fully realized: we were indeed removing disk drives from our filesystem, while keeping the data intact, and inserting new disks underneath to replace them. Which would really be like pulling the tablecloth, putting a new one in place, and keeping the dishes intact.

But you know, things never go as planned, and while we started replacing that first batch of disks, we realized that those unreliable drives? Well, they were really unreliable.

When things go south

No less than five additional disks failed during that same first wave of rebuilds. Four of them in the same array (DA00). To make things worse, in one of those LUNs, one additional disk failed during the rebuild and then, unreadable sectors were encountered on a 3rd disk. Which led to data loss and a corrupted LUN.

We contacted our vendor, which basically said: "LUN is lost, restore from backup". Ha ha! Of course, we have backups for a 3PB Lustre filesystem, and we can restore an individual OST without wreaking complete havoc in the rest of the filesystem’s coherency. For some reason, our vendor support recommended deleting the LUN, recreating it, and letting the Lustre file system re-populate the data. We are still trying to understand what they meant.

On the bright side, they engaged our software vendor, to provide some more assistance at the filesystem level and devise a recovery strategy. We had one of our own rolling already, and it turned out it was about the same.

Relocating files

Since we still had access to the LUN, our approach was to migrate all the files out of that LUN as quickly as possible and relocate them on other OSTs in Lustre, re-initialize the LUN at the RAID level, and then reformat it and re-insert it in the filesystem. Or, more precisely:

deactivate the OST on MDT to avoid new object creation,
use lfs_migrate to relocate files out of that OST, using either Robinhood or the results of lfs find to identify files residing on that OST (the former can be out of date, the latter was quite slow),
make sure the OST was empty (lfs find again),
disable the OST on clients, so they didn’t use it anymore,
reactivate the OST on the MDS to clear up orphaned objects (while the OST is disconnected from the MDT, file relocations are not synchronized to the OST, so objects are orphaned there and take up space unnecessarily),
backup the OST configuration (so it could be recreated with the same parameters, including its index),
reinitialize the LUN in the disk array, and retain its configuration (most importantly its WWID),
reformat the OST with Lustre,
restore the OST configuration (especially its index),
reactivate the OST.

What can go wrong in a 10-step procedure? Turns out, it kind of all stopped at step 1.

Making sure nobody writes files to a LUN anymore

In order to be able to migrate all the files from an OST, you need to make sure that nobody can write new files to it anymore. How could you empty an OST if new files keep being created on it?
There are several approaches to this, but it took us some tries to get it right where we wanted it to be.

First, you can try to ‘deactivate’ the OST by making it read-only on the MDT. It means that users can still read the existing files on the OST, but the MDT won’t consider it for new file creations. Sounds great, except for one detail: when you do this, the OST is disconnected from the MDT, so inodes occupied by files that are being migrated are not reported as freed up to the MDT. The consequence is that the MDT still thinks that the inodes are in-use, and you end up in a de-synchronized state, with orphaned inodes on your OST. Not good.

So you need, at some point, to reconnect your OST to the MDT. Except as soon as you do this, new files get created on it, and you need to deactivate the OST, migrate them again, and bam, new orphan inodes again. Back to square one.

Another method is to mark the OST as "degraded", which is precisely designed to handle such cases: OSTs undergoing maintenance, or rebuilding RAIDs, during which period the OST shouldn’t be used to create new files. So, we went ahead and marked our OST as "degraded". Until we realized that files were still created on it. It turns out that this was because of some uneven usage in our OSTs (they were added to the filesystem over time, so they were not all filled at the same level): if there’s too much unbalanced utilization among OSTs, the Lustre QOS allocator will ignore the “degraded” flag on OSTs, and prioritize rebalancing usage over obeying OST degradation flags.

Our top-notch filesystem vendor support suggested an internal setting to set on the OST (fail_loc=0x229, don’t try this at home) to artificially mark the OST as "out-of-space", which would carry both benefits of leaving it connected to the MDT for inode cleanup, and preventing new file creation there. Unfortunately, this setting had the unexpected side effect of making load spike on the MDS, practically rendering the whole filesystem unusable.

So we ended up deciding to temporarily sacrifice good load balancing across OSTs, and disabled the QOS allocator. This allowed us to mark our OST as "degraded", keep it connected to the MDT so inodes associated with migrated files would effectively be cleaned, while preventing new file creation. This worked great.
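
The interaction between the "degraded" flag and the QOS allocator can be pictured with a toy model. This is emphatically not Lustre's actual allocation algorithm: the `Ost` class, the `candidate_osts` function and the 0.2 imbalance threshold are all made-up illustrations of the behavior described above.

```python
from dataclasses import dataclass


@dataclass
class Ost:
    name: str
    free_fraction: float    # 0.0 (full) to 1.0 (empty)
    degraded: bool = False


def candidate_osts(osts: list[Ost], qos_enabled: bool,
                   imbalance_threshold: float = 0.2) -> list[str]:
    """Names of the OSTs eligible for new files in this toy model."""
    free = [ost.free_fraction for ost in osts]
    imbalanced = (max(free) - min(free)) > imbalance_threshold
    if qos_enabled and imbalanced:
        # Rebalancing takes priority: the "degraded" flag is ignored.
        return [ost.name for ost in osts]
    # Otherwise the "degraded" flag is honored.
    return [ost.name for ost in osts if not ost.degraded]


osts = [Ost("OST0000", free_fraction=0.9, degraded=True),
        Ost("OST0001", free_fraction=0.3)]
print(candidate_osts(osts, qos_enabled=True))   # ['OST0000', 'OST0001']
print(candidate_osts(osts, qos_enabled=False))  # ['OST0001']
```

In this model, an OST that is being emptied looks more and more attractive to a space-balancing allocator, which is why the flag only stuck once the allocator was out of the picture.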

We let our migration complete, and at the end both OSTs were completely empty, devoid of any file.

Zombie LUN

Because any good story needs zombies.

Once we had finished emptying our OSTs, we then needed to fix them at the RAID level. Because, remember, everything went to hell after multiple disk failures during a LUN rebuild. Meaning that in their current state, those two LUNs were unusable and had to be re-initialized. We had good hopes we would be able to do this from the disk array management tools. Unfortunately, our hardware vendor didn’t think it would be possible, and strongly recommended to destroy the LUN and to rebuild it with the same disks.

The problem with that approach is that this would have generated a different identifier for our LUNs, meaning we would have had to change the configuration of our multipath layer, and more importantly, swap old WWIDs with the new ones in our Lustre management tool. Which is not supported.

Thing is, we’re kind of stubborn. And we didn’t want to change WWIDs. So we looked for a way to re-initialize those LUNs in-place. Sure enough, failing multiple drives in the LUN rendered it inoperable. And nothing in the GUI seemed to be possible from there, besides "calling support for assistance". And you know, we tried that before, so no thanks we’ll pass.

Finally, exploring the CLI options, we found one (revive diskGroup) that did exactly what we were looking for: after replacing the 2 defective disks (which made the LUN fail), we revived it from the CLI, and it happily sprung to life again. With all its parameters intact, so from the servers’ point of view, it was like nothing ever happened.

Restore Lustre

So all that was left to do was to reformat the OSTs and restore the parameters we had backed up before failing and reviving the LUNs.

Wrap up

Everything was a smooth ride from there. While working on repairing our two failed OSTs, we were continuously replacing those ~300 defective hard drives, one at a time, and monitoring the rebuild processes. At any given time, we had something like 36 LUNs rebuilding (6 arrays, 6 LUNs each) to maximize the throughput.

Disk replacement

Our hardware vendor was sending us replacement drives in batches, and we’ve been replacing 1 disk in each LUN pretty much every day for about 3 weeks.
We built a tool to follow the replacements and select the next disks to replace (obviously placement was important as we didn’t want to remove multiple disks from the same LUN). The tool allowed us to see the number of disks left to replace, the status of current rebuilds, and when possible, selected the next disks to replace by making them blink in the disk arrays.
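
The selection logic of such a tool can be sketched in a few lines. This is not the actual tool, only an illustration of the constraint it enforced (never more than one disk out of a given LUN at a time). The function name and the LUN and disk identifiers below are hypothetical:

```python
def next_disks_to_replace(pending: dict[str, list[str]],
                          rebuilding: set[str]) -> dict[str, str]:
    """Pick at most one disk per LUN, skipping LUNs that are still rebuilding.

    pending:    LUN name -> original disks still to be replaced in that LUN
    rebuilding: names of the LUNs with a rebuild currently in progress
    """
    return {
        lun: disks[0]
        for lun, disks in sorted(pending.items())
        if disks and lun not in rebuilding
    }


def disks_left(pending: dict[str, list[str]]) -> int:
    """Total number of disks left to replace across all LUNs."""
    return sum(len(disks) for disks in pending.values())


pending = {
    "DA00-LUN0": ["drawer1-slot3", "drawer4-slot7"],
    "DA00-LUN1": ["drawer2-slot1"],
    "DA01-LUN0": [],
}
print(disks_left(pending))                                      # 3
print(next_disks_to_replace(pending, rebuilding={"DA00-LUN1"}))
# {'DA00-LUN0': 'drawer1-slot3'}
```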

The end

Just because that’s how lucky we are, another drawer failed during the last rounds of disk replacements. It took an extra few days to get a replacement on site and replace it. Fortunately, no unreadable sectors were encountered during the recovery.

It took a few more days to clear out the remaining drawer and controller errors and to make sure that everything was stable and in running order. The official end of the operation was declared on May 17th, 2016, about 4 months after the initial double-disk failure.

We definitely learned a lot in the process, way more than we could ever have dared to ask. And it was quite the adventure, the kind that we hope will not happen again. But considering all that happened, we’re very glad the damage was limited to a handful of files and didn’t have a much broader impact.

When setting an environment variable gives you a 40x speedup

2019-04-26T19:42:00.001Z

Today, we’d like to share some of our recent work on Sherlock that allowed a pretty significant speedup when listing files in directories with a lot of entries.

Unlike our usual announcements, this post is more of a behind-the-scenes account of things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users. We hope to have more of this in the future.

Listing many files takes time

It all started with a support question, from a user reporting a usability problem: ls was taking several minutes to list the contents of a directory with 15,000+ entries on $SCRATCH.

Having thousands of files in a single directory is usually not very file system-friendly, and definitely not recommended. The user knew this already and admitted that wasn’t great, but when he mentioned his laptop was 1,000x faster than Sherlock to list this directory’s contents, of course, it stung. So we looked deeper.

Because `ls` is nice

We looked at what ls actually does to list the contents of a directory, and why it was taking so long to list files. On most modern distributions, ls is aliased to ls --color=auto by default. Which is nice, because everybody likes 🌈.

But those pretty colors come at a price: for each and every file it displays, ls needs to get information about a file’s type, its permissions, flags, extended attributes and the like, to choose the appropriate color to display.

One easy solution to our problem would have been to disable colored output in ls altogether, but imagine the uproar. There is no way we could have taken 🌈 away from users, we’re not monsters.

So we looked deeper. ls does coloring through the LS_COLORS environment variable, which is set by dircolors(1), based on a dir_colors(5) configuration file. Yes, that’s right: an executable reads a config file to produce an environment variable that is in turn used by ls.^[1]

🤯
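
To look at each link of that chain on your own machine, something like the following should do. It's only a sketch that assumes the GNU coreutils version of dircolors; the sed and cut filters are just there to keep the output short.

```shell
# Stage 1: the built-in database, in dir_colors(5) syntax (keyword + color code).
dircolors -p | grep -v '^#' | sed -n '1,5p'

# Stage 2: the same database, rendered as Bourne-shell code that sets LS_COLORS.
dircolors -b | cut -c 1-60

# Stage 3: evaluate that shell code, so that ls can pick the variable up.
eval "$(dircolors -b)"
echo "LS_COLORS now holds ${#LS_COLORS} characters"
```

Note that dircolors honors the TERM lines of its database, so on a terminal type it doesn't know about, the resulting LS_COLORS can be empty.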

Let’s dive in

To be able to determine which of those specific coloring schemes was responsible for the slowdowns, we created an experimental environment:

$ mkdir $SCRATCH/dont
$ touch $SCRATCH/dont/{1..10000} # don't try this at home!
$ time ls --color=always $SCRATCH/dont | wc -l
10000
real 0m12.758s
user 0m0.104s
sys 0m0.699s

12.7s for 10,000 files, not great. 🐌

BTW, we need the --color=always flag, because, although it’s aliased to ls --color=auto, ls detects when it’s not attached to a terminal (like when piped to something or with its output redirected), and then turns off coloring when set to auto. Smart guy.
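
If you want to see that terminal detection in action, here is a small sketch, assuming GNU ls. It counts the output lines that contain an escape character when ls is piped; the scratch directory and the pinned LS_COLORS value are only there to make the demo self-contained.

```shell
# Scratch directory with one sub-directory, so there is something to colorize.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/subdir"

# A literal ESC character: every color sequence starts with one.
esc=$(printf '\033')

# LS_COLORS is pinned to a single directory color (an arbitrary choice), so
# the result doesn't depend on the caller's environment.
auto_hits=$(LS_COLORS='di=01;34' ls --color=auto "$demo_dir" | grep -c "$esc" || true)
always_hits=$(LS_COLORS='di=01;34' ls --color=always "$demo_dir" | grep -c "$esc" || true)

echo "auto, piped:   $auto_hits line(s) with escape codes"
echo "always, piped: $always_hits line(s) with escape codes"

rm -r "$demo_dir"
```

With auto, the pipe is not a terminal and the count should be zero; with always, the directory name comes out wrapped in color codes.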

So, what’s taking so much time? Equipped with our strace-fu, we looked:

$ strace -c ls --color=always $SCRATCH/dont | wc -l
10000
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
44.21 0.186617 19 10000 lstat
42.60 0.179807 18 10000 10000 getxattr
12.19 0.051438 5 10000 capget
0.71 0.003002 38 80 getdents
0.07 0.000305 10 30 mmap
0.05 0.000217 12 18 mprotect
0.03 0.000135 14 10 read
0.03 0.000123 11 11 open
0.02 0.000082 6 14 close
[...]

Wow: 10,000 calls to lstat(), 10,000 calls to getxattr() (which all fail by the way, because the attributes that ls is looking for don’t exist in our environment), 10,000 calls to capget().

Can do better for sure.

File capabilities? Nah

Following advice from a 10+ year-old bug, we tried disabling file capability checking:

$ eval "$(dircolors -b | sed 's/ca=[^:]*:/ca=:/')"
$ time strace -c ls --color=always $SCRATCH/dont | wc -l
10000
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
98.95 0.423443 42 10000 lstat
0.78 0.003353 42 80 getdents
0.04 0.000188 10 18 mprotect
0.04 0.000181 6 30 mmap
0.02 0.000085 9 10 read
0.02 0.000084 28 3 mremap
0.02 0.000077 7 11 open
0.02 0.000066 5 14 close
[...]
------ ----------- ----------- --------- --------- ----------------
100.00 0.427920 10221 6 total
real 0m8.160s
user 0m0.115s
sys 0m0.961s

Woohoo, down to 8s! We got rid of all those expensive getxattr() calls, and capget() calls are gone too, 👍.

We still have all those pesky lstat(), though…

How many colors is too many colors?

So we took a look at LS_COLORS in more detail.

The first attempt was to simply unset that variable:

$ echo $LS_COLORS
rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.axa=00;36:*.oga=00;36:*.spx=00;36:*.xspf=00;36:
$ unset LS_COLORS
$ echo $LS_COLORS
$ time ls --color=always $SCRATCH/dont | wc -l
10000
real 0m13.037s
user 0m0.077s
sys 0m1.092s

Whaaaaat!?! Still 13s?

It turns out that when the LS_COLORS environment variable is not defined, or when just one of its <type>=color: elements is missing, ls falls back to its embedded database and uses colors anyway. So if you want to disable coloring for a specific file type, you need to override it explicitly, with <type>=: in LS_COLORS, or <type> 00 in the DIR_COLORS file.
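
As an illustration of such an override, here is a small helper that rewrites an LS_COLORS string so that the given types are explicitly set to 00, and leaves every other entry alone. The disable_ls_color_types name is made up for this sketch, it is not something that ships with coreutils.

```shell
# Usage: disable_ls_color_types "$LS_COLORS" <type>...
# Prints a copy of the given LS_COLORS string in which each listed type
# (two-letter codes such as ow or tw) is explicitly overridden with 00.
disable_ls_color_types() {
    colors=$1
    shift
    for t in "$@"; do
        # Drop any existing entry for this type (extension entries such as
        # *.tar=... start with '*' and are never matched)...
        colors=$(printf '%s' "$colors" | tr ':' '\n' | grep -v "^${t}=" | paste -sd: -)
        # ...then append the explicit override.
        colors="${colors:+$colors:}${t}=00"
    done
    printf '%s\n' "$colors"
}
```

For instance, export LS_COLORS="$(disable_ls_color_types "$LS_COLORS" ow tw)" would stop other-writable and sticky directories from getting their own colors, while keeping the rest of the variable as it was.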

After a lot of trial and error, we narrowed it down to this:

EXEC 00
SETUID 00
SETGID 00
CAPABILITY 00

which translates to

LS_COLORS='ex=00:su=00:sg=00:ca=00:'

In normal people speak, that means: don’t colorize files based on their file capabilities, setuid/setgid bits or executable flag.
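
The translation from one form to the other doesn't have to be done by hand: dircolors can do it. A minimal sketch, assuming GNU dircolors and using a throwaway temporary file rather than any real system-wide DIR_COLORS path:

```shell
# Write the four overrides to a scratch file, in dir_colors(5) syntax.
dircolors_file=$(mktemp)
cat > "$dircolors_file" <<'EOF'
EXEC 00
SETUID 00
SETGID 00
CAPABILITY 00
EOF

# dircolors -b turns the file into Bourne-shell code that sets and exports
# LS_COLORS; eval applies it to the current shell.
eval "$(dircolors -b "$dircolors_file")"
rm -f "$dircolors_file"

echo "$LS_COLORS"
```

Since the file has no TERM lines, the entries apply whatever the terminal type is.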

Let `ls` fly

And if you don’t do any of those checks, then the lstat() calls disappear, and now, boom 🚀:

$ export LS_COLORS='ex=00:su=00:sg=00:ca=00:'
$ time strace -c ls --color=always $SCRATCH/dont | wc -l
10000
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
63.02 0.002865 36 80 getdents
8.10 0.000368 12 30 mmap
5.72 0.000260 14 18 mprotect
3.72 0.000169 15 11 open
2.79 0.000127 13 10 read
[...]
------ ----------- ----------- --------- --------- ----------------
100.00 0.004546 221 6 total
real 0m0.337s
user 0m0.032s
sys 0m0.029s

0.3s to list 10,000 files, track record. 🏁

This is on Sherlock

From 13s with the default settings, to 0.3s with a small LS_COLORS tweak, that’s a 40x speedup right there, for the cheap price of not having setuid/setgid or executable files colorized differently.

Of course, this is now set up on Sherlock, for every user’s benefit.

But if you want all of your colors back, no worries, you can simply revert to the distribution defaults with:

$ unset LS_COLORS

But then, if you have directories with many many files, be sure to have some coffee handy while ls is doing its thing.

[1] Also, if you didn’t know about doors, well, dir_colors has got you covered no matter what. If you really wonder, the file type is do.