urn:noticeable:projects:bYyIewUV308AvkMztxixSherlock changelogwww.sherlock.stanford.edu2024-02-08T00:29:40.623ZCopyright © SherlockNoticeablehttps://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/newspages/GtmOI32wuOUPBTrHaeki/01h55ta3gs1vmdhtqqtjmk7m4z-header-logo.pnghttps://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/newspages/GtmOI32wuOUPBTrHaeki/01h55ta3gs1vmdhtqqtjmk7m4z-header-logo.png#8c1515urn:noticeable:publications:VKxO5IXJlMStQurJnpwv2024-02-07T23:49:24.699Z2024-02-08T00:29:40.623ZSherlock goes full flashWhat could be more frustrating than anxiously waiting for your computing job to finish? Slow I/O that makes it take even longer is certainly high on the list. But not anymore! Fir, Sherlock’s scratch file system, has just undergone a major<p>What could be more frustrating than anxiously waiting for your computing job to finish? Slow I/O that makes it take even longer is certainly high on the list. But not anymore! <a href="https://news.sherlock.stanford.edu/publications/a-new-scratch?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Fir"><strong>Fir</strong></a><strong>,</strong> <strong>Sherlock’s scratch file system, has just undergone a major tech face-lift: it’s now</strong> <strong>a 10 PB all-flash storage system, providing an aggregate bandwidth of</strong> <strong>400 GB/sec</strong> (and &gt;800 kIOPS). Bringing Sherlock’s high-performance parallel scratch file system into the era of flash storage was not just a routine maintenance task, but a significant leap into the future of HPC and AI computing.</p><h2>But first, a little bit of context </h2><p>Traditionally, High-Performance Computing clusters face a challenge when dealing with modern, data-intensive applications. 
Existing HPC storage systems, long designed with spinning disks to provide efficient and parallel sequential read/write operations, often become bottlenecks for modern workloads generated by AI/ML or CryoEM applications. Those demand substantial data storage and processing capabilities, putting a strain on traditional systems.</p><p>So to accommodate those new needs and future evolution of the HPC I/O landscape, we at Stanford Research Computing, with the generous support of the <a href="https://doresearch.stanford.edu/who-we-are/office-vice-provost-and-dean-research?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Office of the Stanford VPDoR">Vice Provost and Dean of Research</a>, have been hard at work for over two years, revamping Sherlock's scratch with an all-flash system. </p><p>And it was not just a matter of taking delivery of a new turn-key system. As most things we do, it was done entirely in-house: from the original vendor-agnostic design, upgrade plan, budget requests, procurement, gradual in-place hardware replacement at the Stanford Research Computing Facility (SRCF), deployment and validation, performance benchmarks, to the final production stages, all of those steps were performed with minimum disruption for all Sherlock users.</p><h2>The technical details</h2><p>The <code>/scratch</code> file system on Sherlock is using <a href="https://wiki.lustre.org/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Lustre">Lustre</a>, an open-source, parallel file system that supports many requirements of leadership class HPC environments. 
And as you probably know by now, Stanford Research Computing loves <a href="https://github.com/stanford-rc?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="open source">open source</a>! We actively contribute to the Lustre community and are a proud member of <a href="https://opensfs.org/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="OpenSFS">OpenSFS</a>, a non-profit industry organization that supports vendor-neutral development and promotion of Lustre.</p><p>In Lustre, file metadata and data are stored separately, with Object Storage Servers (OSS) serving file data on the network. Each OSS pair and associated storage devices forms an I/O cell, and Sherlock's scratch has just bid farewell to its old HDD-based I/O cells. In their place, new flash-based I/O cells have taken the stage, each equipped with 96 x 15.35TB SSDs, delivering mind-blowing performance.</p><p>Sherlock’s <code>/scratch</code> has 8 I/O cells and the goal was to replace every one of them. 
Our new I/O cell has 2 OSS with Infiniband HDR at 200Gb/s (or 25GB/s) connected to 4 storage chassis, each with 24 x 15.35TB SSD (dual-attached 12Gb/s SAS), as pictured below:</p><p><span style="color: #000000;"></span></p><figure><img src="https://lh7-us.googleusercontent.com/gI-D9jEmQeMntz4clh3TNYF60Q6Xep5cMcwQqHL3TGX_9H7L0m_6MgjDlPfSQrUtSBsh5l9bVa8Nddamm4BHzsQwk1S5Q5s9Wq_i8wdGGcXXnOD5wW_kqTJDQXjdwGEb7VYN1gSNPHccCYBc9iEzgTM" alt="" height="284" loading="lazy" title="" width="562"></figure><br><br>Of course, you can’t just replace each individual rotating hard-drive with a SSD, there are some infrastructure changes required, and some reconfiguration needed. The upgrade, executed between January 2023 and January 2024, was a seamless transition. Old HDD-based I/O cells were gracefully retired, one by one, while flash-based ones progressively replaced them, gradually boosting performance for all Sherlock users throughout the year.<br><span style="color: #000000;"><figure><img src="https://lh7-us.googleusercontent.com/B7lwfOxhKxKc-kDeQZkZ63exdm99PnDvete7-03-wD3906KQ_BaUOAGpzuNRa1nrZ_UdcCz_XcPusFZGA60zH6xWSMR60WDz-C6q-qg2BetwYGf1Ytpevnr0Hg5cN9kVPnEVRkeRRfqJBXje3AvmAXo" alt="" height="332" loading="lazy" title="" width="472"></figure></span><br>All of those replacements happened while the file system was up and running, serving data to the thousands of computing jobs that run on Sherlock every day. Driven by our commitment to minimize disruptions to users, our top priority was to ensure uninterrupted access to data throughout the upgrade. Data migration is never fun, and we wanted to avoid having to ask users to manually transfer their files to a new, separate storage system. 
This is why we developed and <a href="https://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commit;h=1121816c4a4e1bb2ef097c4a9802362181c43800&amp;utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="del_ost commit">contributed</a> a new feature in Lustre, which allowed us to seamlessly remove existing storage devices from the file system, before the new flash drives could be added. More technical details about the upgrade have been <a href="http://www.eofs.eu/wp-content/uploads/2024/02/2.5-stanfordrc_s_thiell.pdf?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="presentation slides">presented</a> during the <a href="https://www.eofs.eu/index.php/events/lad-22/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="LAD'22">LAD’22</a> conference.<p></p><p><strong>Today, we are happy to announce that the upgrade is officially complete, and Sherlock stands proud with a whopping 9,824 TB of solid-state storage in production. No more spinning disks in sight!</strong></p><h2>Key benefits</h2><p>For users, the immediately visible benefits are quicker access to their files, faster data transfers, shorter job execution times for I/O intensive applications. 
More specifically, every key metric has been improved:</p><ul><li><p>IOPS: over <strong>100x</strong> (results may vary, see below)</p></li><li><p>Backend bandwidth: <strong>6x</strong> (128 GB/s to 768 GB/s)</p></li><li><p>Frontend bandwidth: <strong>2x</strong> (200 GB/s to 400 GB/s)</p></li><li><p>Usable volume: <strong>1.6x</strong> (6.1 PB to 9.8 PB)<br></p></li></ul><p>In terms of measured improvement, the graph below shows the impact of moving to full-flash storage for reading data from 1, 8 and 16 compute nodes, compared to the previous <code>/scratch</code> file system: </p><p><span style="color: #000000;"></span></p><figure><img src="https://lh7-us.googleusercontent.com/a1wBmS1DW--_SfmLz5iyYRChlTp8MSuE7VKNKinX2nBgzb6iRiNeiSqa5zuXQrTvN1YztMqTLBVPdc_gqA1lrqOpQh7ZA1FzsNdS4VToP_okzXIhbWdzS2rWtUD33joDAaFV4m7eSMQp6DB8se6PY_Y" alt="" height="387" loading="lazy" title="" width="624"></figure><p></p><p>And we even tried to replicate the I/O patterns of <a href="https://github.com/google-deepmind/alphafold?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="AlphaFold">AlphaFold</a>, a well-known AI model to predict protein structure, and the benefits are quite significant, with up to 125x speedups in some cases:</p><p><span style="color: #000000;"></span></p><figure><img src="https://lh7-us.googleusercontent.com/4qvJD4MDJwjdlyKLcE4F24ZaaqanbQHjS1CkxPVWvzBKHphgLLAfa0QoepWrbOYOtwLFnYLrwLHTyS1NatKDItsDI63mlC1mxhac6RSFKSHCLyiEOykLBnHw7ziqM5uQ0VTVmmLd5BPPJpNF6bNUN70" alt="" height="335" loading="lazy" title="" width="624"></figure><br><br>This upgrade is a major improvement that will benefit all Sherlock users, and Sherlock’s enhanced I/O capabilities will allow them to approach data-intensive tasks with unprecedented efficiency. 
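<p>Those headline figures are easy to sanity-check against the hardware described above (8 I/O cells, 2 OSS per cell with one 25 GB/s HDR link each, 96 × 15.35 TB SSDs per cell). A quick back-of-the-envelope check, as a shell sketch:</p>

```shell
#!/bin/sh
# Sanity-check of the published numbers, using the hardware figures from the post.
cells=8           # I/O cells in /scratch
oss_per_cell=2    # each OSS has one HDR InfiniBand link: 200 Gb/s = 25 GB/s
ssds_per_cell=96  # 4 chassis x 24 SSDs
tb_per_ssd=15.35

frontend_gbs=$(( cells * oss_per_cell * 25 ))
raw_tb=$(awk "BEGIN { print $cells * $ssds_per_cell * $tb_per_ssd }")
echo "aggregate frontend bandwidth: ${frontend_gbs} GB/s"  # matches the 400 GB/s figure
echo "raw flash capacity: ${raw_tb} TB"
```

<p>The gap between that raw figure (~11.8 PB) and the 9,824 TB in production is presumably RAID parity and formatting overhead.</p>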
We hope it will help support the ever-increasing computing needs of the Stanford research community, and enable even more breakthroughs and discoveries. <p></p><p>As usual, if you have any question or comment, please don’t hesitate to reach out to Research Computing at <a href="mailto:[email protected]" rel="noopener nofollow" target="_blank" title="[email protected]">[email protected]</a>. 🚀🔧<br><br></p>Stéphane Thiell & Kilian Cavalotti[email protected]urn:noticeable:publications:P3xY1hwDWMe8tR48vPEj2021-02-05T18:20:00Z2021-02-05T20:10:21.673ZTracking NFS problems down to the SFP levelWhen NFS problems turn out to be... not NFS problems at all.<blockquote><p><em>This is part of our </em><a href="https://news.sherlock.stanford.edu/labels/blog?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock blog series">technical blog series</a><em> about things that happen behind-the-scenes on Sherlock, and which are part of our ongoing effort to keep it up and running in the best possible conditions for our beloved users.</em></p></blockquote><p><strong>For quite a long time, we've been chasing down an annoying </strong><a href="https://en.wikipedia.org/wiki/Network_File_System?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="NFS"><strong>NFS</strong></a><strong> timeout issue that seemed to only affect </strong><a 
href="https://news.sherlock.stanford.edu/posts/sherlock-3-0-is-here?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock 3.0"><strong>Sherlock 3.0</strong></a><strong> nodes.</strong></p><p>That issue would impact both login and compute nodes, both NFSv4 user mounts (like <code><a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage#home" rel="noopener nofollow" target="_blank" title="$HOME">$HOME</a></code> and <code><a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage#group_home" rel="noopener nofollow" target="_blank" title="$GROUP_HOME">$GROUP_HOME</a></code>) and NFSv3 system-level mounts (like the one providing <a href="https://www.sherlock.stanford.edu/docs/software/list/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="software">software installations</a>), and occur at random times, on random nodes. It was not widespread enough to be causing real damage, but from time to time, a NFS mount would hang and block I/O for a job, or freeze a login session. 
When that happened, the node would still be able to ping all of the NFS servers' IP addresses, even remount the same NFS file system with the exact same options in another mount point, and no discernable network issue was apparent on the nodes. Sometimes, the stuck mounts would come back to life on their own, sometimes they would stay hanging forever.</p><h2>Is it load? Is it the kernel? Is it the CPU?</h2><p>It kind of looked like it could be correlated with <a href="https://en.wikipedia.org/wiki/Load_(computing)?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="load">load</a>, and mostly appear when multiple jobs were doing NFS I/O on a given node, but we never found conclusive proof that it was the case. The only distinguishable element was that the issue was only observed on Sherlock 3.0 nodes and never affected older Sherlock 1.0/2.0 nodes. So we started suspecting something about the kernel NFS client, maybe some oddity with AMD Rome CPUs: after all, they were all quite new, and the nodes had many more cores than the previous generation. So maybe they had more trouble handling the parallel operations, ended up with a deadlock or something.</p><p>But still, all the Sherlock nodes are using the same kernel, and only the Sherlock 3.0 nodes were affected, so it appeared unlikely to be a kernel issue. </p><h2>The NFS servers maybe?</h2><p>We then started looking at the NFS servers. Last December’s maintenance was actually an attempt at resolving those timeout issues, even though it proved fruitless in that aspect. We got in touch with vendor support to explore possible explanations, but nothing came out of it and our support case went nowhere. 
Plus, if the NFS servers were at fault, it would likely have affected all Sherlock nodes, not just a subset.</p><h2>It’s the NFS client parameters! Or is it?</h2><p>So back to the NFS client, we've started looking at the NFS client mount parameters. The <a href="https://www.google.com/search?q=nfs+timeout&amp;&amp;utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">petazillion web hits</a> about "nfs timeout" didn't really help in that matter, but in the process we found pretty interesting <a href="https://lore.kernel.org/linux-nfs/[email protected]/T/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">discussions</a> about read/write sizes and read-ahead. 
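</p><p>For reference, all of those client-side knobs surface in the mount options, so they can be checked per-mount. A small sketch, parsing a hypothetical <code>/proc/mounts</code> entry (the server name and values here are made up for illustration):</p>

```shell
#!/bin/sh
# Hypothetical /proc/mounts entry, just to show where the NFS client knobs live.
entry='nfssrv:/users /home/users nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,timeo=600,retrans=2 0 0'
opts=$(echo "$entry" | awk '{print $4}')               # 4th field holds the mount options
rsize=$(echo "$opts" | tr ',' '\n' | awk -F= '$1 == "rsize" {print $2}')
timeo=$(echo "$opts" | tr ',' '\n' | awk -F= '$1 == "timeo" {print $2}')
echo "rsize=${rsize} timeo=${timeo}"  # timeo is in tenths of a second, so 600 = 60s
```

<p>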
We tried tweaking all of those parameters left and right, deployed various configs on the compute nodes (A/B testing FTW!), but the timeout still happened.</p><h2>The lead</h2><p>In the end, what gave us a promising lead was an <a href="https://blog.noc.grnet.gr/2018/08/29/a-performance-story-how-a-faulty-qsfp-crippled-a-whole-ceph-cluster/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">article</a> found on the <a href="https://grnet.gr/en/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">GRNET</a> <a href="https://blog.noc.grnet.gr/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">blog</a> that explain how the authors tracked down a defective <a href="https://en.wikipedia.org/wiki/Small_form-factor_pluggable_transceiver?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="QSFP">QSFP</a> that was causing issues in their <a href="https://en.wikipedia.org/wiki/Ceph_(software)?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" 
target="_blank" title="Ceph">Ceph</a> cluster. Well, it didn't take long to realize that there was a similar issue between those Sherlock nodes and the NFS servers. Packet loss was definitely involved.</p><p>The tricky part, as described in the blog post, is that the packet loss only manifested itself when using large <a href="https://en.wikipedia.org/wiki/Internet_Control_Message_Protocol?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">ICMP</a> packets, close to the <a href="https://en.wikipedia.org/wiki/Maximum_transmission_unit?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">MTU</a> upper limit. When using regular packet sizes, no problem was apparent.</p><p>For instance, this regular ping didn't show any loss:</p><pre><code># ping -c 50 | grep loss
50 packets transmitted, 50 received, 0% packet loss, time 538ms</code></pre><p>But when cranking up the packet size:</p><pre><code># ping -s 8972 -c 50 | grep loss
50 packets transmitted, 36 received, 28% packet loss, time 539ms</code></pre><p>What was even funnier is that not all Sherlock 3.0 nodes were experiencing loss to the same NFS server nodes. 
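</p><p>The <code>-s 8972</code> payload size is not arbitrary: add the 8-byte ICMP header and the 20-byte IPv4 header, and you land exactly on a 9000-byte jumbo-frame MTU, so the probe exercises the largest frames the link should carry (and <code>-M do</code>, used in the tests below, forbids fragmentation so the probe really travels as a single full-size frame):</p>

```shell
#!/bin/sh
# Why -s 8972: ICMP payload + 8-byte ICMP header + 20-byte IPv4 header
# must add up to the interface MTU (9000 bytes with jumbo frames) for the
# largest possible unfragmented probe.
mtu=9000
icmp_hdr=8
ipv4_hdr=20
payload=$(( mtu - icmp_hdr - ipv4_hdr ))
echo "largest unfragmented ICMP payload: ${payload} bytes"
```

<p>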
For instance, from one client node, there was packet loss to just one of the NFS servers:</p><pre><code>client1# clush -Lw 10.16.90.[1-8] --worker=exec ping -s 8972 -M do -c 10 -q %h | grep loss
10 packets transmitted, 10 received, 0% packet loss, time 195ms
10 packets transmitted, 8 received, 20% packet loss, time 260ms
10 packets transmitted, 10 received, 0% packet loss, time 193ms
10 packets transmitted, 10 received, 0% packet loss, time 260ms
10 packets transmitted, 10 received, 0% packet loss, time 200ms
10 packets transmitted, 10 received, 0% packet loss, time 264ms
10 packets transmitted, 10 received, 0% packet loss, time 196ms
10 packets transmitted, 10 received, 0% packet loss, time 194ms</code></pre><p>But from another client, sitting right next to it, no loss to that server, but packets dropped to another one instead:</p><pre><code>client2# clush -Lw 10.16.90.[1-8] --worker=exec ping -s 8972 -M do -c 10 -q %h | grep loss
10 packets transmitted, 8 received, 20% packet loss, time 190ms
10 packets transmitted, 10 received, 0% packet loss, time 198ms
10 packets transmitted, 10 received, 0% packet loss, time 210ms
10 packets transmitted, 10 received, 0% packet loss, time 197ms
10 packets transmitted, 10 received, 0% packet loss, time 196ms
10 packets transmitted, 10 received, 0% packet loss, time 243ms
10 packets transmitted, 10 received, 0% packet loss, time 201ms
10 packets transmitted, 10 received, 0% packet loss, time 213ms</code></pre><h2>The link</h2><p>That all started to sound like a faulty stack link, or a problem in one of the <a href="https://en.wikipedia.org/wiki/Link_aggregation?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage#Link_Aggregation_Control_Protocol" rel="noopener nofollow" target="_blank" title="LACP">LACP</a> links between the different switch stacks 
(Sherlock's and the NFS servers’). We didn't find anything obviously out-of-place in reviewing the switches’ configuration, so we went back to the switches’ documentation to try to understand how to check counters and identify bad links (which gave us the opportunity to mumble about documentation that is not in sync with actual commands, but that’s another topic...).</p><p>So we dumped the hardware counters for each link involved in the NFS connections, and on a switch, on the NFS client’s side, there was this:</p><pre><code>te1/45 Ingress FCSDrops : 0
te1/46 Ingress FCSDrops : 0
te2/45 Ingress FCSDrops : 0
te2/46 Ingress FCSDrops : 0
te3/45 Ingress FCSDrops : 0
te3/46 Ingress FCSDrops : 1064263014
te4/45 Ingress FCSDrops : 0
te4/46 Ingress FCSDrops : 0</code></pre><p>Something standing out, maybe?</p><p>In more detail:</p><pre><code>#show interfaces te3/46
TenGigabitEthernet 3/46 is up, line protocol is up
Port is part of Port-channel 98
[...]
Input Statistics:
     18533249104 packets, 35813681434965 bytes
     [...]
     1064299255 CRC, 0 overrun, 0 discarded</code></pre><p>The CRC number indicates the number of CRC <em>failures</em>, packets which failed checksum validation. All the other ports on the switch were at 0. So clearly something was off with that port. </p><h2>The culprit: a faulty SFP!</h2><p>We decided to try to shut that port down (after all, it's just 1 port out of an 8-port LACP link), and immediately, all the packet loss disappeared.</p><p>So we replaced the optical transceiver in that port, hoping that swapping that SFP would resolve the CRC failure problem. After re-enabling the link, the number of dropped packets seemed to have decreased. 
But it did not totally disappear…</p><h2>The <em>real</em> culprit: the <em>other</em> SFP</h2><p>Thinking a little more about it, since the errors were actually <strong>Ingress</strong> FCSDrops on the switch, it didn’t seem completely unreasonable to consider that those frames were received by the switch already corrupted, and thus, that they would have been mangled by either the transceiver on the other end of the link, or maybe in-flight by a damaged cable. So maybe we’ve been pointing fingers at an SFP, and maybe it was innocent… 😁<br><br>We checked the switch’s port on the NFS server’s side, and the checksum errors and drop counts were all at 0. We replaced that SFP anyway, just to see, and this time, bingo: no more CRC errors on the other side. </p><p>Which led us to the following decision tree:</p><ul><li><p>if a port has RX/receiving/ingress errors, it’s probably <strong>not</strong> its fault, and the issue is most likely with its peer at the other end of the link,</p></li><li><p>if a port has TX/transmitting/egress errors, it’s probably the source of the problem,</p></li><li><p>if both ports at each end of a given link have errors, the cable is probably at fault.</p></li></ul><p>By the way, if you’re wondering, here’s what an SFP looks like:<br></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/P3xY1hwDWMe8tR48vPEj/01h55ta3gs1drsabkq2fschedq-image.jpg" alt="" loading="lazy" title=""></figure><p></p><h1>TL;DR</h1><p><strong>We had seemingly random NFS timeout issues. 
They turned out to be caused by a defective SFP, that was eventually identified through the port error counter of the switch at the other end of the link.</strong></p><p>There's probably a lesson to be learned here, and we were almost disappointed that DNS was not involved (because <a href="https://isitdns.com?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="it's always DNS">it's always DNS</a>), but in the end, we were glad to finally find a rational explanation to those timeouts. And since that SFP replacement, not a single NFS timeout has been logged.<br><br></p>Kilian Cavalotti[email protected]urn:noticeable:publications:PPxLNPT4SMCJbp0ORhd72020-11-13T00:12:00.001Z2020-11-13T02:45:02.640ZSherlock factsEver wondered how many compute nodes is Sherlock made of? Or how many users are using it? Or how many Infiniband cables link it all together? Well, wonder no more: head to the Sherlock facts page and see for yourself! > hint: there are...<p>Ever wondered how many compute nodes is Sherlock made of? Or how many users are using it? 
Or how many Infiniband cables link it all together?</p> <p>Well, wonder no more: head to the <a href="https://www.sherlock.stanford.edu/docs/overview/tech/facts?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-facts&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.PPxLNPT4SMCJbp0ORhd7&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock facts</a> page and see for yourself!</p> <blockquote> <p><em>hint</em>: there are <strong>a lot</strong> of cables :)</p> </blockquote> <p>And if you’re tired of seeing the same old specs from two years ago, we’ve updated the <a href="https://www.sherlock.stanford.edu/docs/overview/tech/specs?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-facts&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.PPxLNPT4SMCJbp0ORhd7&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock tech specs</a> page too!</p> <p>To make sure those numbers never fall behind again and continue to offer an accurate representation of Sherlock’s resources, they will automatically be updated each time something changes on the cluster.</p> <p>As usual, don’t hesitate to <a href="mailto:[email protected]" target="_blank" rel="noopener">reach out</a> if you have any questions or comments!</p> Kilian Cavalotti[email protected]urn:noticeable:publications:VMnlU6zZGRceQom9u1Wh2019-12-03T23:30:00.001Z2019-12-04T21:49:36.010ZAdventures in storage_This is part of our blog series about behind-the-scenes things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users. 
Now that Sherlock's old storage system has been retired, we...<blockquote> <p><em>This is part of our blog series about behind-the-scenes things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users.<br> Now that <a href="https://news.sherlock.stanford.edu/posts/a-new-scratch-is-here?utm_source=noticeable&amp;utm_campaign=sherlock.adventures-in-storage&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VMnlU6zZGRceQom9u1Wh&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock’s old storage system has been retired</a>, we can finally tell that story. It all happened in 2016.</em></p> </blockquote> <p>Or: <em>How we replaced more than 1 PB of hard drives, while continuing to serve files to unsuspecting users.</em></p> <p><strong>TL;DR:</strong> The parallel filesystem in <a href="//www.sherlock.stanford.edu" target="_blank" rel="noopener">Stanford’s largest HPC cluster</a> has been affected by frequent and repeated hard-drive failures since its early days. A defect was identified that affected all of the 360 disks used in 6 different disk arrays. A major swap operation was planned to replace the defective drives. Multiple hardware disasters piled up to make matters worse, but in the end, all of the initial disks were replaced, while retaining 1.5 PB of user data intact, and keeping the filesystem online the whole time.</p> <h2>History and context</h2> <p><em>Once upon a time, in a not so far away datacenter…</em></p> <p>We, <a href="srcc.stanford.edu" target="_blank" rel="noopener">Stanford Research Computing Center</a>, manage many high-performance computing and storage systems at Stanford. In 2013, in a effort to centralize resources and advance computational research, a new HPC cluster, <a href="//www.sherlock.stanford.edu" target="_blank" rel="noopener">Sherlock</a>, has been deployed. 
To provide best-in-class computing resources to all faculty and facilitate research in all fields, this campus-wide cluster features a high-performance, <a href="http://lustre.org?utm_source=noticeable&amp;utm_campaign=sherlock.adventures-in-storage&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VMnlU6zZGRceQom9u1Wh&amp;utm_medium=newspage" target="_blank" rel="noopener">Lustre</a>-based parallel filesystem.</p> <p>This filesystem, called <code>/scratch</code>, was designed to provide high-performance storage for temporary files during simulations. Initially made of three I/O cells, the filesystem had been designed to be easily expanded with more hardware as demand and utilization grew. Each I/O cell was comprised of:</p> <ul> <li>2x object storage servers,</li> <li>2x disk arrays, with: <ul> <li>dual RAID controllers,</li> <li>5 drawers of 12 disks each,</li> <li>60 4TB SAS disks total.</li> </ul></li> </ul> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/VMnlU6zZGRceQom9u1Wh/01h55ta3gs0cjq7j0mcby72khd-image.jpg" alt="MD3260"></p> <p>Each disk array was configured with 6x 10-disk RAID6 LUNs, and every SAS path being redundant, the two OSS servers could act as a high-availability pair. This is a pretty common Lustre setup.</p> <p>Close to a petabyte in size, this filesystem quickly became the go-to solution for many researchers who didn’t really have any other option to store and compute against their often large data sets. 
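</p> <p>That figure follows directly from the layout above: counting only the 8 data disks in each 10-disk RAID6 LUN, the initial deployment works out as follows (usable space, ignoring filesystem overhead):</p>

```shell
#!/bin/sh
# Initial usable /scratch capacity: 3 I/O cells x 2 arrays x 6 RAID6 LUNs,
# each LUN keeping 8 data disks out of 10 (RAID6 = 2 parity), with 4 TB disks.
cells=3
arrays_per_cell=2
luns_per_array=6
data_disks_per_lun=8
tb_per_disk=4
usable_tb=$(( cells * arrays_per_cell * luns_per_array * data_disks_per_lun * tb_per_disk ))
echo "initial usable capacity: ${usable_tb} TB"
```

<p>Which is the 1.1 PB of the initial deployment.</p> <p>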
Over time, the filesystem was expanded several times and eventually more than tripled in size:</p> <table> <thead> <tr><th></th><th style="text-align:right"># disk arrays</th><th style="text-align:right"># OSTs</th><th style="text-align:right"># disks</th><th style="text-align:right">size</th></tr> </thead> <tbody> <tr><td><strong>initially</strong></td><td style="text-align:right">6</td><td style="text-align:right">36</td><td style="text-align:right">360</td><td style="text-align:right">1.1 PB</td></tr> <tr><td><strong>ultimately</strong></td><td style="text-align:right">18</td><td style="text-align:right">108</td><td style="text-align:right">1080</td><td style="text-align:right">3.4 PB</td></tr> </tbody> </table> <p>As the filesystem grew, it ended up containing close to 380 million inodes (that is, filesystem entries, like files, directories or links). Keep that number in mind: it turns out to be an important factor in the events that follow.</p> <h2>The initial issue</h2> <p>All was fine and dandy in storage land, and we had our share of failing disks, as everybody does. We were replacing them as they failed, sending them back to our vendor, and getting new ones in return. Datacenter business as usual.</p> <p>Except, a lot of disks were failing. Like, really <em>a lot</em>, as in one every other day.</p> <p>We eventually came to the conclusion that our system had been installed with a batch of disks with shorter-than-average lifespans. They were all from the same disk vendor, manufactured around the same date. But we didn’t worry too much.</p> <p>Until the day when 2 disks failed within 3 hours of each other. In the same disk array. <strong>In. The. Same. 
<a href="https://en.wikipedia.org/wiki/Logical_unit_number?utm_source=noticeable&amp;utm_campaign=sherlock.adventures-in-storage&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VMnlU6zZGRceQom9u1Wh&amp;utm_medium=newspage" target="_blank" rel="noopener">LUN</a>.</strong></p> <p>To give some context, <strong>one</strong> failed drive in a 10-disk RAID6 array is no big deal: data can be reconstructed from the 9 remaining physical disks without any problem. If by any chance one of those remaining disks suffers from a problem and data cannot be read from it, there is still enough redundancy to reconstruct the missing data and all is well.<img src="http://www.dcig.com/wp-content/uploads/images/Blog_RAID6.jpg" alt="RAID6 8+2"></p> <p>A single drive failure is handled quite transparently by the disk array:</p> <ul> <li>it emits an alert,</li> <li>you replace the failed disk,</li> <li>it detects the drive has been replaced,</li> <li>it starts rebuilding it from data and parity on the other disks of the LUN,</li> <li>about 24 hours later, you have a brand new LUN, all shiny and happy again.</li> </ul> <p>But <strong>two</strong> failed disks, on the other hand, that’s pretty much like a Russian roulette session: you may be lucky and pull it off, but there’s a good chance you won’t. While the LUN is missing 2 disks, there is no redundancy left to reconstruct the data. Meaning that any read error on any of the remaining 8 disks will lead to data loss, as the controller won’t be able to reconstruct anything. And worse, any bit flip during reads will go completely unnoticed, as there is no parity left to check the data. 
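</p> <p>The failure-tolerance arithmetic above can be captured in a couple of lines. This is only a toy counting model, not actual RAID6 Reed-Solomon math:</p>

```python
# Toy model of the counting argument for a 10-disk RAID6 LUN
# (8 data + 2 parity). Just bookkeeping, not real erasure coding.
DISKS, PARITY = 10, 2

def recoverable(failed: int, read_errors: int = 0) -> bool:
    """Data survives as long as the number of unavailable devices
    (failed disks plus disks hitting read errors) stays within parity."""
    return failed + read_errors <= PARITY

assert recoverable(1)          # 1 failure: rebuild from the other 9
assert recoverable(2)          # 2 failures: readable, zero margin left
assert not recoverable(2, 1)   # 2 failures + 1 read error: data loss

# Parity left over to *detect* bit flips while 2 disks are down:
print(PARITY - 2)  # 0: reads can't even be verified anymore
```

<p>With two disks down, any further error is both unrecoverable and undetectable.</p> <p>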
Which means that you can potentially be reconstructing completely random garbage on your drives.</p> <p>Given that, it didn’t take us long to pick up the phone and call our vendor.</p> <p>They confirmed our findings: in our initial set of 6 disk arrays, over the course of 2 years, we had already replaced about 60 disks out of 360. At a rate of 5-10 failures per month. Way higher than expected.</p> <p>The LUN rebuild eventually completed fine, without any problem, but that double-failure acted as a serious warning. So we started thinking about ways to solve our problem. And that’s when the sticky cheese hit the fan…</p> <h3>When problems pile up</h3> <p>Three days after the double failure, we had an even more serious hardware event: one drawer in another disk array misbehaved, reporting itself as degraded, and 6 disks failed in that same drawer over the course of a few minutes. A 7th disk was evicted a few hours later, and left 2 LUNs without any parity in that single array. Joy all over. In a few minutes, the situation we were dreading a few days earlier just happened twice in the same array. We were a disk away from losing serious amounts of data (we’re talking 30TB per LUN). And as past experience proved, those disks were not of the most reliable kind…</p> <p>We got our vendor to dispatch a replacement drawer to us under the terms of our H+4 support contract. Except they didn’t have any replacement drawer in stock that they could get to us in 4 hours. So they overnight’d it and we got the replacement drawer the following day.</p> <p>We diligently replaced the drawer and rebuild started on those 7 drives in the disk array. Which, yes, means that one LUN was rebuilding without any redundancy. Like the one from the other disk array the week before. And as everyone probably guessed, things didn’t go that well the second time: that LUN stayed degraded, despite all the rebuild operations being done and all the physical disks’ states being "optimal". 
Turned out the interface and the internal controller state disagreed on the status of a drive. On our vendor’s suggestion, we replaced that drive, a new rebuild started, and then abruptly stopped mid-course: the state of the LUN was still "degraded".</p> <p>And then, we had the sensible yet completely foolish idea of calling vendor support on a week-end.</p> <p>Hilarity and data loss ensued.</p> <h3>Never trust your hardware vendor support on week-ends</h3> <p>We were in a situation where a LUN was degraded, and a recently failed drive had just failed to rebuild, yet was showing up as “optimal” in the management interface. The vendor support technician then had the brilliant idea of forcefully “reviving” that drive. Which had the immediate effect of putting back online a drive that had been partially reconstructed, <em>i.e.</em> on which 100% of the data had to be considered bit waste.<br> And the LUN stayed in that state, serving ridiculously out-of-sync, inaccurate and pretty much random data to our Lustre OSS servers for about 15 minutes. Fifteen minutes. Nine hundred full-size seconds. A lot of bad things can (and did) happen in 900 seconds.</p> <p>Luckily, the Lustre filesystem quickly realized it was being lied to, so it did the only sane thing to do: it blocked all I/O and set that device read-only. Of course, some filesystem-level corruption happened during the process.</p> <p>We had to bring that storage target down and check it multiple times with <code>fsck</code> to restore its data structure consistency. About 1,500 corrupted entries were found, detached from the local filesystem map and stored in the <code>lost+found</code> directory. That means that all those 1,500 objects, which were previously part of files, were now orphaned from the filesystem, as it had no way of knowing what file they belonged to anymore. 
So it tossed them in <code>lost+found</code> as it couldn’t do much else with them.</p> <p>And on our cluster, users trying to access those files were kindly greeted with an error message which, as error messages sometimes go, bore no intuitive relation to the matter at hand: <code>cannot allocate memory</code>.</p> <p>With (much better) support from our filesystem vendor, we were able to recover a vast majority of those 1,500 files, and re-attach them to the filesystem, where they originally were. For Lustre admins, the magic word here is <code>ll_recover_lost_found_objs</code>.</p> <p>So in the end, we “only” lost 29 files in the battle. We contacted each one of the owners to let them know about the tragic fate of their files, and most of them barely flinched, their typical response being: "Oh yeah, I know that’s temporary storage anyway, let me upload a new copy of that file from my local machine".</p> <p>We know, we’re blessed with terrific users.</p> <h2>The tablecloth trick</h2> <p>Now, this was just the starter; we hadn’t really had a chance to tackle the real issue yet. We were merely absorbing the fallout of that initial drawer failure, but we hadn’t done anything to address the high failure rate of our disk drives.</p> <p>Our hardware vendor, well aware of the underlying reliability issue, as the same scenario had happened at other sites too, kindly agreed to replace all of our remaining original disks. 
That is, about 300 of them:</p> <table> <thead> <tr><th style="text-align:right">disk array</th><th style="text-align:right">HDDs already replaced</th><th style="text-align:right">total HDDs</th><th style="text-align:right">HDDs to replace</th></tr> </thead> <tbody> <tr><td style="text-align:right">DA00</td><td style="text-align:right">16</td><td style="text-align:right">60</td><td style="text-align:right">44</td></tr> <tr><td style="text-align:right">DA01</td><td style="text-align:right">15</td><td style="text-align:right">60</td><td style="text-align:right">45</td></tr> <tr><td style="text-align:right">DA02</td><td style="text-align:right">14</td><td style="text-align:right">60</td><td style="text-align:right">46</td></tr> <tr><td style="text-align:right">DA03</td><td style="text-align:right">13</td><td style="text-align:right">60</td><td style="text-align:right">47</td></tr> <tr><td style="text-align:right">DA04</td><td style="text-align:right">8</td><td style="text-align:right">60</td><td style="text-align:right">52</td></tr> <tr><td style="text-align:right">DA05</td><td style="text-align:right">15</td><td style="text-align:right">60</td><td style="text-align:right">45</td></tr> <tr><td style="text-align:right"><strong>total</strong></td><td style="text-align:right"><strong>81</strong></td><td style="text-align:right"><strong>360</strong></td><td style="text-align:right"><strong>279</strong></td></tr> </tbody> </table> <p>The strategy devised by that same vendor was:</p> <blockquote> <p>"We’ll send you a whole new disk array, filled with new disks, and you’ll replicate your existing data there".</p> </blockquote> <p>To which we replied:</p> <blockquote> <p>“Uh, sorry, that won’t work. You see, those arrays are part of a larger Lustre filesystem, we can’t really replicate data from one to another without a downtime. And we would need a downtime long enough to allow us to copy 240TB of data. Six times, 'cause you know, we have six arrays. 
Oh, and our users don’t like downtimes.”</p> </blockquote> <p>So we had to find another way.</p> <p>Our preference was to minimize manipulations on the filesystem and keep it online as much as possible during this big disk replacement operation. So we leaned toward the path of least resistance, and let the RAID controllers do what they do best: compute parities and write data. We ended up removing each one of those bad disks, one at a time, replacing it with a new disk, and letting the controller rebuild the LUN.</p> <p>Each rebuild operation took about 24 hours, so obviously, replacing ~300 disks one at a time wasn’t such a thrilling idea: assuming somebody would be around 24/7 to swap in a new drive as soon as the previous one finished, the whole operation would last almost a full year. Not very practical.</p> <p>So we settled on doing them in batches, replacing one disk in each of the 36 LUNs in each batch. That would allow the RAID controllers to rebuild several LUNs in parallel, and cut the overall length of the operation. Instead of 300 sequential 24-hour rebuilds, we would only need 5 waves of disk replacements, which shouldn’t take more than a couple weeks total.</p> <p>Should we mention that our adored vendor suggested that, since we were using RAID6, we could potentially speed things up even more by replacing two drives at a time in each LUN, but that they wouldn’t recommend it? Nah, right, we shouldn’t.</p> <h3>Remove the disks, keep the data</h3> <p>So they went away and shipped us new disks. That’s where the “tablecloth trick” analogy is fully realized: we were indeed removing disk drives from our filesystem, while keeping the data intact, and inserting new disks underneath to replace them. 
Which would really be like pulling the tablecloth, putting a new one in place, and keeping the dishes intact.</p> <p><img src="http://67.media.tumblr.com/8b26967a16e1cf8e193b02c86c29f2e0/tumblr_inline_o91qsbH6pq1raprkq_500.gif" alt="tablecloth"></p> <p>But you know, things never go as planned, and while we started replacing that first batch of disks, we realized that those unreliable drives? Well, they were really unreliable.</p> <h3>When things go south</h3> <p>No less than five additional disks failed during that same first wave of rebuilds. Four of them in the same array (<code>DA00</code>). To make things worse, in one of those LUNs, one additional disk failed during the rebuild and then, unreadable sectors were encountered on a 3rd disk. Which led to data loss and a corrupted LUN.</p> <p>We contacted our vendor, which basically said: "LUN is lost, restore from backup". Ha ha! Of course we have backups of a 3PB Lustre filesystem, and of course we can restore an individual OST without wreaking complete havoc on the rest of the filesystem’s coherency. For some reason, our vendor support recommended deleting the LUN, recreating it, and letting the Lustre filesystem re-populate the data. We are still trying to understand what they meant.</p> <p>On the bright side, they engaged our software vendor to provide some more assistance at the filesystem level and devise a recovery strategy. We had one of our own rolling already, and it turned out to be about the same.</p> <h3>Relocating files</h3> <p>Since we still had access to the LUN, our approach was to migrate all the files out of that LUN as quickly as possible and relocate them on other OSTs in Lustre, re-initialize the LUN at the RAID level, and then reformat it and re-insert it into the filesystem. 
Or, more precisely:</p> <ol> <li>deactivate the OST on MDT to avoid new object creation,</li> <li>use <code>lfs_migrate</code> to relocate files out of that OST, using either Robinhood or the results of <code>lfs find</code> to identify files residing on that OST (the former can be out of date, the latter was quite slow),</li> <li>make sure the OST was empty (<code>lfs find</code> again),</li> <li>disable the OST on clients, so they didn’t use it anymore,</li> <li>reactivate the OST on the MDS to clear up orphaned objects (while the OST is disconnected from the MDT, file relocations are not synchronized to the OST, so objects are orphaned there and take up space unnecessarily),</li> <li>backup the OST configuration (so it could be recreated with the same parameters, including its index),</li> <li>reinitialize the LUN in the disk array, and retain its configuration (most importantly its WWID),</li> <li>reformat the OST with Lustre,</li> <li>restore the OST configuration (especially its index),</li> <li>reactivate the OST.</li> </ol> <p>What can go wrong in a 10-step procedure? Turns out, it kind of all stopped at step 1.</p> <h4>Making sure nobody writes files to a LUN anymore</h4> <p>In order to be able to migrate all the files from an OST, you need to make sure that nobody can write new files to it anymore. How could you empty an OST if new files keep being created on it?<br> There are several approaches to this, and it took us a few tries to get it exactly where we wanted it.</p> <p>First, you can try to ‘deactivate’ the OST by making it read-only on the MDT. It means that users can still read the existing files on the OST, but the MDT won’t consider it for new file creations. Sounds great, except for one detail: when you do this, the OST is disconnected from the MDT, so inodes occupied by files that are being migrated are not reported as freed up to the MDT. 
The consequence is that the MDT still thinks that the inodes are in-use, and you end up in a de-synchronized state, with orphaned inodes on your OST. Not good.</p> <p>So you need, at some point, to reconnect your OST to the MDT. Except as soon as you do this, new files get created on it, and you need to deactivate the OST, migrate them again, and bam, new orphan inodes again. Back to square one.</p> <p>Another method is to mark the OST as "degraded", which is precisely designed to handle such cases: an OST undergoing maintenance, or rebuilding its RAID, during which period the OST shouldn’t be used to create new files. So, we went ahead and marked our OST as "degraded". Until we realized that files were still being created on it. It turns out that this was because of some uneven usage across our OSTs (they were added to the filesystem over time, so they were not all filled to the same level): if there’s too much unbalanced utilization among OSTs, the Lustre QOS allocator will ignore the “degraded” flag on OSTs, and prioritize rebalancing usage over obeying OST degradation flags.</p> <p>Our top-notch filesystem vendor support suggested an internal setting to set on the OST (<code>fail_loc=0x229</code>, don’t try this at home) to artificially mark the OST as "out-of-space", which would carry both benefits of leaving it connected to the MDT for inode cleanup, and preventing new file creation there. Unfortunately, this setting had the unexpected side effect of making load spike on the MDS, practically rendering the whole filesystem unusable.</p> <p>So we ended up deciding to temporarily sacrifice good load balancing across OSTs, and disabled the QOS allocator. This allowed us to mark our OST as "degraded", keep it connected to the MDT so inodes associated with migrated files would effectively be cleaned, while preventing new file creation. 
This worked great.</p> <p><img src="https://cloud.githubusercontent.com/assets/186807/17631228/14e610e8-6078-11e6-814d-f874d07ef1ff.png" alt="lfs_migrates"></p> <p>We let our migration complete, and at the end both OSTs were completely empty, devoid of any file.</p> <h3>Zombie LUN</h3> <p><em>Because any good story needs zombies.</em></p> <p>Once we had finished emptying our OSTs, we then needed to fix them at the RAID level. Because, remember, everything went to hell after multiple disk failures during a LUN rebuild. Meaning that in their current state, those two LUNs were unusable and had to be re-initialized. We had good hopes we would be able to do this from the disk array management tools. Unfortunately, our hardware vendor didn’t think it would be possible, and strongly recommended destroying the LUN and rebuilding it with the same disks.</p> <p>The problem with that approach is that it would have generated a different identifier for our LUNs, meaning we would have had to change the configuration of our multipath layer, and more importantly, swap old WWIDs with the new ones in our Lustre management tool. Which is not supported.</p> <p>Thing is, we’re kind of stubborn. And we didn’t want to change WWIDs. So we looked for a way to re-initialize those LUNs in-place. Sure enough, failing multiple drives in the LUN rendered it inoperable. And nothing seemed possible from the GUI at that point, besides "calling support for assistance". And you know, we tried that before, so no thanks, we’ll pass.</p> <p>Finally, exploring the CLI options, we found one (<code>revive diskGroup</code>) that did exactly what we were looking for: after replacing the 2 defective disks (which had made the LUN fail), we revived it from the CLI, and it happily sprang to life again. 
With all its parameters intact, so from the servers’ point of view, it was like nothing ever happened.</p> <h3>Restore Lustre</h3> <p>So, all that was left to do was to reformat the OSTs and restore the parameters we had backed up before failing and reviving the LUNs.</p> <h2>Wrap up</h2> <p>Everything was a smooth ride from there. While working on repairing our two failed OSTs, we were continuously replacing those ~300 defective hard drives, one at a time, and monitoring the rebuild processes. At any given time, we had something like 36 LUNs rebuilding (6 arrays, 6 LUNs each) to maximize the throughput.</p> <h3>Disk replacement</h3> <p>Our hardware vendor was sending us replacement drives in batches, and we replaced 1 disk in each LUN pretty much every day for about 3 weeks.<br> We built a tool to follow the replacements and select the next disks to replace (obviously placement was important, as we didn’t want to remove multiple disks from the same LUN). The tool allowed us to see the number of disks left to replace and the status of current rebuilds, and, when possible, it selected the next disks to replace by making them blink in the disk arrays.</p> <h3>The end</h3> <p>Just because that’s how lucky we are, another drawer failed during the last rounds of disk replacements. It took an extra few days to get a replacement on site and install it. Fortunately, no unreadable sectors were encountered during the recovery.</p> <p>It took a few more days to clear out the remaining drawer and controller errors and to make sure that everything was stable and in running order. The official end of the operation was declared on May 17th, 2016, about 4 months after the initial double-disk failure.</p> <p>We definitely learned a lot in the process, way more than we could ever have dared to ask for. And it was quite the adventure, the kind that we hope will not happen again. 
But considering all that happened, we’re very glad the damage was limited to a handful of files and didn’t have a much broader impact.</p> <p><em>Kilian Cavalotti</em></p> <h1>A newer, faster and better /scratch</h1> <p><em>2019-12-03</em></p> <p>As <a href="https://news.sherlock.stanford.edu/posts/more-scratch-space-for-everyone?utm_source=noticeable&amp;utm_campaign=sherlock.a-new-scratch&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.OYMZ9enkjo02jJ1V2vtK&amp;utm_medium=newspage" target="_blank" rel="noopener">we just announced</a>, Sherlock now features a brand new storage system for <code>/scratch</code>. But what was the old system, what does the new one look like, and how did the move happen? Read on to find out!</p> <h2>The old</h2> <p>Since its early days, Sherlock ran its <code>/scratch</code> filesystem on a storage system that was donated by <a href="//www.intel.com" target="_blank" rel="noopener">Intel</a> and <a href="//www.dell.com" target="_blank" rel="noopener">Dell</a>.</p> <p>Dubbed <em>Regal</em>, it was one of the key components of the Sherlock cluster when we started it in early 2014, with an initial footprint of about 100 compute nodes. Its very existence allowed us to scale the cluster to more than 1,500 nodes today, almost entirely through faculty and PI contributions to its condominium model. That’s a 15x growth in 5 years, and adoption has been spectacular.</p> <p>Regal was initially just over 1PB when it was deployed in May 2013, which was quite substantial at the time. 
And similarly to the compute part of the cluster, its modular design allowed us to expand it to over 3PB with contributions from individual research groups.</p> <p>We had a number of adventures with that system, including a <a href="https://news.sherlock.stanford.edu/posts/adventures-in-storage?utm_source=noticeable&amp;utm_campaign=sherlock.a-new-scratch&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.OYMZ9enkjo02jJ1V2vtK&amp;utm_medium=newspage" target="_blank" rel="noopener">large-scale disk replacement operation</a>, where we replaced about a petabyte of hard drives in production, while continuing to serve files to users; or a literal drawer explosion in one of the disk arrays!</p> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/OYMZ9enkjo02jJ1V2vtK/01h55ta3gsktzvgnjvd176d3yj-image.jpg" alt="kaboom"></p> <p>It’s been fun, and again, invaluable to our users.</p> <p>But the time has come to retire it, and replace it with a newer, faster and better solution, to accommodate the ever-growing storage needs of Sherlock’s ever-growing community.</p> <h2>The new</h2> <p>This year, we stood up a completely new and separate <code>/scratch</code> filesystem for Sherlock.</p> <p>Nicknamed <em>Fir</em> (we like trees), this new storage system features:</p> <ul> <li>multiple metadata servers and faster metadata storage for better responsiveness with interactive operations,</li> <li>faster object storage servers,</li> <li>a faster backend interconnect, for lower latency operations across storage servers,</li> <li>more and faster storage routers to provide more bandwidth from Sherlock to <code>/scratch</code>,</li> <li>more space to share amongst all Sherlock users,</li> <li>a newer version of <a 
href="http://lustre.org?utm_source=noticeable&amp;utm_campaign=sherlock.a-new-scratch&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.OYMZ9enkjo02jJ1V2vtK&amp;utm_medium=newspage" target="_blank" rel="noopener">Lustre</a> which provides: <ul> <li>improved client performance,</li> <li>dynamic file striping to automatically adapt file layout and I/O performance to match a file’s size,</li> <li>and <a href="http://wiki.lustre.org/Lustre_2.12.0_Changelog?utm_source=noticeable&amp;utm_campaign=sherlock.a-new-scratch&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.OYMZ9enkjo02jJ1V2vtK&amp;utm_medium=newspage" target="_blank" rel="noopener">much more</a>!</li> </ul></li> </ul> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/OYMZ9enkjo02jJ1V2vtK/01h55ta3gs7981c7jwnsydp3eq-image.jpg" alt="new"></p> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/OYMZ9enkjo02jJ1V2vtK/01h55ta3gswz2f897qg6s6ha96-image.png" alt="logo-io500.png"> And not to brag, but Fir has been ranked #15 in the <a href="//www.vi4io.org/io500/list/19-11/10node" target="_blank" rel="noopener">IO-500 list of the fastest storage systems in the world</a>, in the 10-node challenge category, released at <a href="//sc19.supercomputing.org" target="_blank" rel="noopener">SC’19</a>. 
So yes, it’s decently fast.</p> <h2>The migration</h2> <p>Now, usually, when a new filesystem is made available on a computing system, there are two approaches:</p> <p>One is making the new system available under a new mount point (like <code>/scratch2</code>) and telling users: “here’s the new filesystem, the old one will go away soon, you have until next Monday to get your files there and update all your scripts.”<br> This usually results in a lot of I/O traffic going on at once from all the users rushing to copy their data to the new space, potential mistakes, confusion, and in the end, a lot of frustration, additional work and unnecessary stress for everyone. Not good.</p> <p>The other one is for sysadmins to copy all of the existing data from the old system to the new one in the background, in several passes, and then schedule a (usually long) downtime to run a last synchronization pass and substitute the new filesystem for the old one under the same mount point (<code>/scratch</code>).<br> This also puts significant load on the filesystems while the synchronization passes are running, taking I/O resources away from legitimate user jobs; it’s usually a very long process; and in the end it carries old and abandoned files over to the new storage system, wasting precious space. Not optimal either.</p> <p>So we decided to take another route, and devised a new scheme. We spent some time (and fun!) 
designing and developing a new kind of overlay layer, to bridge the gap between Regal and Fir, and to transparently migrate user data from one to the other.</p> <p>We (aptly) named this layer <code>migratefs</code> and open-sourced it at:<br> <a href="https://github.com/stanford-rc/fuse-migratefs?utm_source=noticeable&amp;utm_campaign=sherlock.a-new-scratch&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.OYMZ9enkjo02jJ1V2vtK&amp;utm_medium=newspage" target="_blank" rel="noopener">https://github.com/stanford-rc/fuse-migratefs</a>.</p> <p><img src="https://docs.google.com/drawings/d/e/2PACX-1vT0p9txFKOVS9GazuZFIfolJp0ksmlXNlb0MsjyR_F3rPNtdXEe3ho25lpW55sNKk_NHmc0WyErQnCA/pub?w=484&amp;h=195" alt="migratefs"></p> <p>The main idea of <code>migratefs</code> is to take advantage of user activity to:</p> <ol> <li>distribute the data transfer tasks across all of the cluster nodes, to reduce the overall migration time,</li> <li>only migrate data that is actively in use, and leave older files that are never accessed nor modified on the old storage system, resulting in a new storage system that only stores relevant data,</li> <li>migrate all the user data transparently, without any downtime.</li> </ol> <p>So over the last few months, all of the active user data on Regal has been seamlessly migrated to Fir, without users having to modify any of their job scripts, and all without a downtime.</p> <p>Which is why if you’re using <code>$SCRATCH</code> or <code>$GROUP_SCRATCH</code> today, you are actively using the new storage system, and all your active data is there already, ready to be used in your compute jobs.</p> <h2>Next steps</h2> <p>Now, Regal has been emptied of all of its data and has been retired. 
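</p> <p>For the curious, the copy-on-access idea at the heart of <code>migratefs</code> can be modeled in a few lines. This is only a toy sketch using two plain directories, not the actual FUSE implementation:</p>

```python
# Toy model of the migratefs policy: reads fall through from the new
# tier to the old one, and any modification migrates the file to the
# new tier. The real migratefs is a FUSE overlay filesystem
# (github.com/stanford-rc/fuse-migratefs); this only illustrates
# the tiering logic.
from pathlib import Path

class TieredScratch:
    def __init__(self, old_tier: Path, new_tier: Path):
        self.old, self.new = old_tier, new_tier

    def read(self, name: str) -> bytes:
        path = self.new / name
        if not path.exists():
            path = self.old / name   # fall back to the old tier
        return path.read_bytes()

    def write(self, name: str, data: bytes) -> None:
        # Writing migrates the file: it now lives on the new tier only.
        old_path = self.old / name
        if old_path.exists():
            old_path.unlink()
        (self.new / name).write_bytes(data)
```

<p>Files that jobs never touch simply stay behind on the old tier, which is why abandoned data never made it to Fir and the old system could be drained by normal user activity alone.</p> <p>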
It’s currently being un-racked to make room for future Sherlock developments. And stay tuned, because… <em>epic</em> changes are coming!</p> <p><em>Kilian Cavalotti</em></p> <h1>More scratch space for everyone!</h1> <p><em>2019-12-03</em></p> <p>Today, we’re super excited to announce several major changes to the <code>/scratch</code> filesystem on Sherlock.</p> <h2>What’s <code>/scratch</code> already?</h2> <p><code>/scratch</code> is Sherlock’s temporary, parallel and high-performance filesystem. It’s available from all the compute nodes in the cluster, and is aimed at storing temporary data, like raw job output, intermediate files, or unprocessed results.</p> <p>All the details about <code>/scratch</code> can be found in the Sherlock documentation, at <a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.more-scratch-space-for-everyone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.IARfWlFT8IPjMRyh74t1&amp;utm_medium=newspage" target="_blank" rel="noopener">https://www.sherlock.stanford.edu/docs/storage/filesystems/#scratch</a></p> <h2>A brand new storage system</h2> <p>So first of all, Sherlock’s <code>/scratch</code> now uses a brand new underlying storage system: it’s newer, faster and better than the old system in many ways that are described in much more detail <a href="https://news.sherlock.stanford.edu/posts/a-new-scratch?utm_source=noticeable&amp;utm_campaign=sherlock.more-scratch-space-for-everyone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.IARfWlFT8IPjMRyh74t1&amp;utm_medium=newspage" target="_blank" rel="noopener">in 
this other post</a>.</p> <p>But to sum it up, using newer and faster hardware, the new <code>/scratch</code> storage system is twice as large, dramatically accelerates small-file access and metadata operations, and enables new filesystem features for better overall performance.</p> <p>If you’d like to take advantage of that new system and are wondering what you need to do to benefit from its improved performance, the answer is pretty simple: nothing! Your data is already there: if you’re using <code>$SCRATCH</code> or <code>$GROUP_SCRATCH</code> today, you don’t have to do anything, you’re already using the new storage system.</p> <p>How did that happen? You can read all about it in <a href="https://news.sherlock.stanford.edu/posts/a-new-scratch-is-here?utm_source=noticeable&amp;utm_campaign=sherlock.more-scratch-space-for-everyone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.IARfWlFT8IPjMRyh74t1&amp;utm_medium=newspage" target="_blank" rel="noopener">that post I mentioned above</a>.</p> <h2>More space for everyone!</h2> <p>Now, some things don’t change, but others do. We’re really excited to announce that starting today, every user on Sherlock gets <del>twice</del> <del>thrice</del> 🎉✨ <strong>five times</strong>✨🎈 the amount of storage that was previously offered.</p> <p>Yep, that’s right, starting today, <strong>every user</strong> on Sherlock gets <strong>100TB</strong> in <code>$SCRATCH</code>. And because sharing is caring, <strong>each group</strong> gets an additional <strong>100TB</strong> to share data in <code>$GROUP_SCRATCH</code>.</p> <p>But wait, there’s more.</p> <p>Because we know ownership-based user and group quotas were confusing at times, we’re moving away from them and adopting a new, directory-based quota system. That means that all the files under a given <code>$SCRATCH</code> directory, and only those, will be accounted for in the quota usage, no matter which user or group owns them. 
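</p> <p>The difference between the two accounting schemes boils down to grouping by owner versus grouping by location. A small sketch, with made-up paths and file sizes:</p>

```python
# Ownership-based vs. directory-based quota accounting, on a made-up
# file list: (path, owner, size in TB).
files = [
    ("/scratch/users/jane/sim.out",   "jane", 40),
    ("/scratch/groups/lab/shared.h5", "jane", 30),
    ("/scratch/groups/lab/ref.fa",    "joe",  20),
]

# Old scheme: everything jane owns counts against her user quota,
# wherever it lives, so her shared file is counted against both her
# quota and the group's.
jane_old = sum(size for _, owner, size in files if owner == "jane")

# New scheme: only the file's location matters.
jane_new = sum(size for path, _, size in files
               if path.startswith("/scratch/users/jane/"))
group_new = sum(size for path, _, size in files
                if path.startswith("/scratch/groups/lab/"))

print(jane_old, jane_new, group_new)  # 70 40 50
```

<p>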
This will make finding files that count towards a given quota much easier.</p> <p>Previously, with ownership-based accounting, a user with data in both her own <code>$SCRATCH</code> folder and in <code>$GROUP_SCRATCH</code> would see the sum of all those files’ sizes counted against both her user quota and the group quota. Plus, the group quota was <em>de facto</em> acting as a cap for all the users in the same group, which penalized groups with more members.</p> <p>Now, data in a user’s <code>$SCRATCH</code> and <code>$GROUP_SCRATCH</code> are accounted for independently, and they’re cumulative. Meaning that no matter how many members a group has, each user will be able to use the same amount of storage, and won’t be impacted by what others in the group use.</p> <p>Here’s what things look like, more visually (and to scale!):<br> <img src="https://docs.google.com/drawings/d/e/2PACX-1vQrfxu9oTcz6Ilbta3X1BTF9fGMlQLul77ftTpbRFvwLGnrwhNlIjRUvqDcfYiSA80ARN6rtCkU1lFW/pub?w=1317&amp;h=881" alt="scratch quotas"></p> <ul> <li>before, individual ownership-based user quotas (in blue) were limited by the overarching group quota (in purple).</li> <li>now, each user can use up to their quota limit, without being impacted by others, <em>and</em> an additional 100TB is available for the group to share data among group members.</li> </ul> <p>So not only have individual quota values been increased, but the change in quota type also means that the cumulative usable space in <code>/scratch</code> for each group will be much larger than before.</p> <h2>A new retention period</h2> <p>With that increase in space, we’re also updating the retention period on <code>/scratch</code> to 90 days.
And because we don’t want to affect files created less than 3 months ago, this change will not take effect immediately.</p> <p><strong>Starting Feb. 3, 2020, all files stored in <code>/scratch</code> that have not been modified in the last 90 days will automatically be deleted from the filesystem.</strong></p> <p>This aligns with the practice of the vast majority of other computing centers, and is a way to emphasize the temporary nature of the filesystem: <code>/scratch</code> is really designed to store temporary data, and to provide high-performance throughput for parallel I/O.</p> <p>For long-term storage of research data, we always recommend using <a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.more-scratch-space-for-everyone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.IARfWlFT8IPjMRyh74t1&amp;utm_medium=newspage#oak" target="_blank" rel="noopener">Oak</a>, which is also directly available from all compute nodes on Sherlock (you’ll find all the details about Oak at <a href="https://oak-storage.stanford.edu?utm_source=noticeable&amp;utm_campaign=sherlock.more-scratch-space-for-everyone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.IARfWlFT8IPjMRyh74t1&amp;utm_medium=newspage" target="_blank" rel="noopener">https://oak-storage.stanford.edu</a>). Data can freely be moved between <code>/scratch</code> and Oak at very high throughput rates.
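<p>Since the purge is based on modification time, one simple way to get ahead of it is to archive anything untouched for 90+ days to Oak before it expires. This is only a hedged sketch, not an official tool: the paths below are illustrative placeholders, and on Sherlock the destination would be your group’s actual Oak directory.</p>

```shell
# Copy scratch files not modified in the last 90 days (i.e. the ones a
# 90-day purge would remove) over to Oak. Paths are placeholders for
# illustration; substitute your own scratch and Oak directories.
SCRATCH_DIR="${SCRATCH:-/tmp/demo_scratch}"
OAK_DIR="${OAK_DIR:-/tmp/demo_oak}"
mkdir -p "$SCRATCH_DIR" "$OAK_DIR"
cd "$SCRATCH_DIR"
# -mtime +90 matches files last modified more than 90 days ago
find . -type f -mtime +90 -print0 |
  rsync -a --files-from=- --from0 . "$OAK_DIR/"
```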
We can suggest optimized solutions for this, so please don’t hesitate to reach out if you have any questions.</p> <h2>TL;DR</h2> <p>Today, we’re announcing:</p> <ol> <li>a brand new storage system for <code>/scratch</code> on Sherlock</li> <li>a quota increase to 100TB for each user in <code>$SCRATCH</code> and each group in <code>$GROUP_SCRATCH</code></li> <li>the move to directory-based quotas for easier accounting of space utilization, and for allowing each user to reach their <code>$SCRATCH</code> quota</li> <li>a new 90-day retention period for all files in <code>/scratch</code>, starting Feb. 3, 2020</li> </ol> <p>All those changes have been reflected in the documentation at <a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.more-scratch-space-for-everyone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.IARfWlFT8IPjMRyh74t1&amp;utm_medium=newspage" target="_blank" rel="noopener">https://www.sherlock.stanford.edu/docs/storage/filesystems/</a></p> <p>We hope those changes will open up new possibilities for computing on Sherlock, by allowing larger datasets to be stored and larger simulations to be run.</p> <p>As usual, if you have any questions or comments, please don’t hesitate to let us know at <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a>.</p> Kilian Cavalotti[email protected]urn:noticeable:publications:axlbWNAaFoNduDkqnp5i2019-06-13T23:39:00.001Z2019-06-14T00:17:17.077ZEasier connection to Sherlock's DTNsWe know that easy access to data is essential, and that moving data around is a key part of every user's workflow on Sherlock. We also know that two-factor authentication (2FA) can sometimes get in the way, and hinder the ability to get...<p>We know that easy access to data is essential, and that moving data around is a key part of every user’s workflow on Sherlock.
We also know that <a href="https://uit.stanford.edu/service/authentication/twostep?utm_source=noticeable&amp;utm_campaign=sherlock.easier-connection-to-sherlock-s-dt-ns&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.axlbWNAaFoNduDkqnp5i&amp;utm_medium=newspage" target="_blank" rel="noopener">two-factor authentication (2FA)</a> can sometimes get in the way, and hinder the ability to get work done.</p> <p>2FA is an important security measure and a definitive improvement over traditional authentication methods: it helps protect our data and identity, and is becoming part of our daily lives, even outside of Sherlock. But it doesn’t necessarily add the same protection value on <a href="https://www.sherlock.stanford.edu/docs/storage/data-transfer/?utm_source=noticeable&amp;utm_campaign=sherlock.easier-connection-to-sherlock-s-dt-ns&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.axlbWNAaFoNduDkqnp5i&amp;utm_medium=newspage#data-transfer-nodes-dtns" target="_blank" rel="noopener">Data Transfer Nodes (DTNs)</a>, which only allow file transfers and don’t offer interactive shells, as it does on login nodes.</p> <p>Additional authentication steps also sometimes cause compatibility problems with file transfer programs that don’t support them.</p> <p>This is why we’re implementing changes to make data transfers easier, more seamless, and more flexible.</p> <h2>Duo is not required on Sherlock DTNs anymore</h2> <p>Starting today, two-step authentication is no longer required to transfer files to/from Sherlock’s DTNs.</p> <blockquote> <p><strong>Important</strong> Two-step authentication using Duo is still mandatory to connect to login nodes</p> </blockquote> <p>You can now connect to <code>dtn.sherlock.stanford.edu</code> either:</p> <ul> <li>interactively, using your SUNet ID and password,</li> <li>in a completely non-interactive way, using your Kerberos credentials (using <a 
href="https://www.sherlock.stanford.edu/docs/advanced-topics/connection/?utm_source=noticeable&amp;utm_campaign=sherlock.easier-connection-to-sherlock-s-dt-ns&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.axlbWNAaFoNduDkqnp5i&amp;utm_medium=newspage#gssapi" target="_blank" rel="noopener">GSSAPI</a>)</li> </ul> <p>An immediate benefit is that using <a href="https://www.sherlock.stanford.edu/docs/storage/data-transfer/?utm_source=noticeable&amp;utm_campaign=sherlock.easier-connection-to-sherlock-s-dt-ns&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.axlbWNAaFoNduDkqnp5i&amp;utm_medium=newspage#sshfs" target="_blank" rel="noopener">SSHFS</a> on Windows computers should be possible again.</p> <p>We hope this will help make your data more easily accessible on Sherlock, and broaden the range of data transfer programs you can use.</p> <p>As usual, don’t hesitate to get in touch at <a href="mailto:[email protected]" target="_blank" rel="noopener">[email protected]</a> if you have any questions or suggestions.</p> Kilian Cavalotti[email protected]
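<p>For the non-interactive option above, here’s a minimal sketch of what the client-side setup could look like, assuming a standard OpenSSH client and valid Kerberos tickets (obtained with <code>kinit</code>); this is an illustration, not an official configuration:</p>

```
# Hypothetical ~/.ssh/config fragment: enable GSSAPI authentication for the
# DTN only, so file transfers can proceed without password or Duo prompts.
Host dtn.sherlock.stanford.edu
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
```

<p>With that fragment in place, tools like <code>scp</code>, <code>rsync</code> or SSHFS pointed at <code>dtn.sherlock.stanford.edu</code> can authenticate using your existing Kerberos tickets.</p>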