urn:noticeable:projects:bYyIewUV308AvkMztxixSherlock changelogwww.sherlock.stanford.edu2024-02-08T00:29:40.623ZCopyright © SherlockNoticeablehttps://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/newspages/GtmOI32wuOUPBTrHaeki/01h55ta3gs1vmdhtqqtjmk7m4z-header-logo.pnghttps://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/newspages/GtmOI32wuOUPBTrHaeki/01h55ta3gs1vmdhtqqtjmk7m4z-header-logo.png#8c1515urn:noticeable:publications:VKxO5IXJlMStQurJnpwv2024-02-07T23:49:24.699Z2024-02-08T00:29:40.623ZSherlock goes full flashWhat could be more frustrating than anxiously waiting for your computing job to finish? Slow I/O that makes it take even longer is certainly high on the list. But not anymore! Fir, Sherlock’s scratch file system, has just undergone a major<p>What could be more frustrating than anxiously waiting for your computing job to finish? Slow I/O that makes it take even longer is certainly high on the list. But not anymore! <a href="https://news.sherlock.stanford.edu/publications/a-new-scratch?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Fir"><strong>Fir</strong></a><strong>,</strong> <strong>Sherlock’s scratch file system, has just undergone a major tech face-lift: it’s now</strong> <strong>a 10 PB all-flash storage system, providing an aggregate bandwidth of</strong> <strong>400 GB/sec</strong> (and &gt;800 kIOPS). Bringing Sherlock’s high-performance parallel scratch file system into the era of flash storage was not just a routine maintenance task, but a significant leap into the future of HPC and AI computing.</p><h2>But first, a little bit of context </h2><p>Traditionally, High-Performance Computing clusters face a challenge when dealing with modern, data-intensive applications. Existing HPC storage systems, long designed with spinning disks to provide efficient and parallel sequential read/write operations, often become bottlenecks for modern workloads generated by AI/ML or CryoEM applications. Those demand substantial data storage and processing capabilities, putting a strain on traditional systems.</p><p>So to accommodate those new needs and future evolution of the HPC I/O landscape, we at Stanford Research Computing, with the generous support of the <a href="https://doresearch.stanford.edu/who-we-are/office-vice-provost-and-dean-research?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Office of the Stanford VPDoR">Vice Provost and Dean of Research</a>, have been hard at work for over two years, revamping Sherlock's scratch with an all-flash system. </p><p>And it was not just a matter of taking delivery of a new turn-key system. 
As most things we do, it was done entirely in-house: from the original vendor-agnostic design, upgrade plan, budget requests, procurement, gradual in-place hardware replacement at the Stanford Research Computing Facility (SRCF), deployment and validation, performance benchmarks, to the final production stages, all of those steps were performed with minimum disruption for all Sherlock users.</p><h2>The technical details</h2><p>The <code>/scratch</code> file system on Sherlock is using <a href="https://wiki.lustre.org/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Lustre">Lustre</a>, an open-source, parallel file system that supports many requirements of leadership class HPC environments. And as you probably know by now, Stanford Research Computing loves <a href="https://github.com/stanford-rc?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="open source">open source</a>! We actively contribute to the Lustre community and are a proud member of <a href="https://opensfs.org/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="OpenSFS">OpenSFS</a>, a non-profit industry organization that supports vendor-neutral development and promotion of Lustre.</p><p>In Lustre, file metadata and data are stored separately, with Object Storage Servers (OSS) serving file data on the network. Each OSS pair and associated storage devices forms an I/O cell, and Sherlock's scratch has just bid farewell to its old HDD-based I/O cells. In their place, new flash-based I/O cells have taken the stage, each equipped with 96 x 15.35TB SSDs, delivering mind-blowing performance.</p><p>Sherlock’s <code>/scratch</code> has 8 I/O cells and the goal was to replace every one of them. Our new I/O cell has 2 OSS with Infiniband HDR at 200Gb/s (or 25GB/s) connected to 4 storage chassis, each with 24 x 15.35TB SSD (dual-attached 12Gb/s SAS), as pictured below:</p><p><span style="color: #000000;"></span></p><figure><img src="https://lh7-us.googleusercontent.com/gI-D9jEmQeMntz4clh3TNYF60Q6Xep5cMcwQqHL3TGX_9H7L0m_6MgjDlPfSQrUtSBsh5l9bVa8Nddamm4BHzsQwk1S5Q5s9Wq_i8wdGGcXXnOD5wW_kqTJDQXjdwGEb7VYN1gSNPHccCYBc9iEzgTM" alt="" height="284" loading="lazy" title="" width="562"></figure><br><br>Of course, you can’t just replace each individual rotating hard-drive with a SSD, there are some infrastructure changes required, and some reconfiguration needed. The upgrade, executed between January 2023 and January 2024, was a seamless transition. 
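<p><em>A quick aside for the curious:</em> the resulting layout is visible from any Sherlock node with the standard Lustre <code>lfs</code> client, which reports the OSTs (object storage targets) behind <code>/scratch</code> and their usage. The commands below are a generic, read-only sketch rather than Sherlock-specific tooling, and their output will of course vary:</p><pre><code># list the MDTs/OSTs backing /scratch, with their capacity and usage
$ lfs df -h /scratch
# show the OST indices the file system can stripe files over
$ lfs osts /scratch
# display the default striping policy applied to new files
$ lfs getstripe -d /scratch</code></pre>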
Old HDD-based I/O cells were gracefully retired, one by one, while flash-based ones progressively replaced them, gradually boosting performance for all Sherlock users throughout the year.<br><span style="color: #000000;"><figure><img src="https://lh7-us.googleusercontent.com/B7lwfOxhKxKc-kDeQZkZ63exdm99PnDvete7-03-wD3906KQ_BaUOAGpzuNRa1nrZ_UdcCz_XcPusFZGA60zH6xWSMR60WDz-C6q-qg2BetwYGf1Ytpevnr0Hg5cN9kVPnEVRkeRRfqJBXje3AvmAXo" alt="" height="332" loading="lazy" title="" width="472"></figure></span><br>All of those replacements happened while the file system was up and running, serving data to the thousands of computing jobs that run on Sherlock every day. Driven by our commitment to minimize disruptions to users, our top priority was to ensure uninterrupted access to data throughout the upgrade. Data migration is never fun, and we wanted to avoid having to ask users to manually transfer their files to a new, separate storage system. This is why we developed and <a href="https://git.whamcloud.com/?p=fs%2Flustre-release.git;a=commit;h=1121816c4a4e1bb2ef097c4a9802362181c43800&amp;utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="del_ost commit">contributed</a> a new feature in Lustre, which allowed us to seamlessly remove existing storage devices from the file system, before the new flash drives could be added. More technical details about the upgrade have been <a href="http://www.eofs.eu/wp-content/uploads/2024/02/2.5-stanfordrc_s_thiell.pdf?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="presentation slides">presented</a> during the <a href="https://www.eofs.eu/index.php/events/lad-22/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="LAD'22">LAD’22</a> conference.<p></p><p><strong>Today, we are happy to announce that the upgrade is officially complete, and Sherlock stands proud with a whopping 9,824 TB of solid-state storage in production. No more spinning disks in sight!</strong></p><h2>Key benefits</h2><p>For users, the immediately visible benefits are quicker access to their files, faster data transfers, shorter job execution times for I/O intensive applications. 
More specifically, every key metric has been improved:</p><ul><li><p>IOPS: over <strong>100x</strong> (results may vary, see below)</p></li><li><p>Backend bandwidth: <strong>6x</strong> (128 GB/s to 768 GB/s)</p></li><li><p>Frontend bandwidth: <strong>2x</strong> (200 GB/s to 400 GB/s)</p></li><li><p>Usable volume: <strong>1.6x</strong> (6.1 PB to 9.8 PB)<br></p></li></ul><p>In terms of measured improvement, the graph below shows the impact of moving to full-flash storage for reading data from 1, 8 and 16 compute nodes, compared to the previous <code>/scratch</code> file system: </p><p><span style="color: #000000;"></span></p><figure><img src="https://lh7-us.googleusercontent.com/a1wBmS1DW--_SfmLz5iyYRChlTp8MSuE7VKNKinX2nBgzb6iRiNeiSqa5zuXQrTvN1YztMqTLBVPdc_gqA1lrqOpQh7ZA1FzsNdS4VToP_okzXIhbWdzS2rWtUD33joDAaFV4m7eSMQp6DB8se6PY_Y" alt="" height="387" loading="lazy" title="" width="624"></figure><p></p><p>And we even tried to replicate the I/O patterns of <a href="https://github.com/google-deepmind/alphafold?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-goes-full-flash&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VKxO5IXJlMStQurJnpwv&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="AlphaFold">AlphaFold</a>, a well-known AI model to predict protein structure, and the benefits are quite significant, with up to 125x speedups in some cases:</p><p><span style="color: #000000;"></span></p><figure><img src="https://lh7-us.googleusercontent.com/4qvJD4MDJwjdlyKLcE4F24ZaaqanbQHjS1CkxPVWvzBKHphgLLAfa0QoepWrbOYOtwLFnYLrwLHTyS1NatKDItsDI63mlC1mxhac6RSFKSHCLyiEOykLBnHw7ziqM5uQ0VTVmmLd5BPPJpNF6bNUN70" alt="" height="335" loading="lazy" title="" width="624"></figure><br><br>This upgrade is a major improvement that will benefit all Sherlock users, and Sherlock’s enhanced I/O capabilities will allow them to approach data-intensive tasks with unprecedented efficiency. We hope it will help support the ever-increasing computing needs of the Stanford research community, and enable even more breakthroughs and discoveries. <p></p><p>As usual, if you have any question or comment, please don’t hesitate to reach out to Research Computing at <a href="mailto:[email protected]" rel="noopener nofollow" target="_blank" title="[email protected]">[email protected]</a>. 🚀🔧<br><br></p>Stéphane Thiell & Kilian Cavalotti[email protected]urn:noticeable:publications:fVC8v76vTKAPzyy0I0Lh2023-04-27T01:05:18.100Z2023-04-27T19:00:07.260ZInstant lightweight GPU instances are now availableWe know that getting access to GPUs on Sherlock can be difficult and feel a little frustrating at times. Which is why we are excited to announce the immediate availability of our new instant lightweight GPU instances!<p>We know that getting access to GPUs on Sherlock can be difficult and feel a little frustrating at times. Demand has been steadily growing, leading to long pending times, and waiting in line rarely feels great, especially when you have important work to do. </p><p>Which is why we are excited to announce the immediate availability of our latest addition to the Sherlock cluster: <strong>instant lightweight GPU instances</strong>! Every user can now get immediate access to a GPU instance, for a quick debugging session or to explore new ideas in a Notebook.<br><br>GPUs are the backbone of high-performance computing. 
They’ve become an integral component of the toolbox for many users, and are essential for deep learning, scientific simulations, and many other applications. But you don’t always need a full-fledged, top-of-the-line GPU for all your tasks. Sometimes all you want is to run a quick test to prototype an idea, debug a script, or explore new data in an interactive Notebook. For this, the new lightweight GPU instances on Sherlock will give you instant access to a GPU, without having to wait in line and compete with other jobs for resources you don’t need.<br><br>Sherlock’s instant lightweight GPU instances leverage NVIDIA’s <a href="https://www.nvidia.com/en-us/technologies/multi-instance-gpu/?utm_source=noticeable&amp;utm_campaign=sherlock.instant-lightweight-gpu-instances-are-now-available&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.fVC8v76vTKAPzyy0I0Lh&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="NVIDIA Multi-Instance GPU">Multi-Instance GPU</a> (MIG) to provide multiple fully isolated GPU instances on the same physical GPU, each with their own high-bandwidth memory, cache, and compute cores. Those lightweight instances are ideal for small to medium-sized jobs, and lower the barrier to entry for all users<br><br>Similar to the interactive sessions available through the <code>dev</code> partition, Sherlock users can now request a GPU via the <code>sh_dev</code> command, and get immediate access with the following command:</p><pre><code>$ sh_dev -g 1</code></pre><p>For interactive apps in the <a href="https://www.sherlock.stanford.edu/docs/user-guide/ondemand/?utm_source=noticeable&amp;utm_campaign=sherlock.instant-lightweight-gpu-instances-are-now-available&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.fVC8v76vTKAPzyy0I0Lh&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock OnDemand docs">Sherlock OnDemand</a> interface, requesting a GPU in the <code>dev</code> partition will initiate an interactive session with access to a lightweight GPU instance.<br></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/fVC8v76vTKAPzyy0I0Lh/01h55ta3gsgn6y7qksqsnbat6e-image.png" alt="" height="265.6474576271186" loading="lazy" title="" width="443.99999999999994"></figure><p></p><p><br>So now, everyone gets a GPU, no questions asked! 😁</p><p style="text-align: center;"></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/fVC8v76vTKAPzyy0I0Lh/01h55ta3gsjw21hcsf0z70n9he-image.png" alt="" loading="lazy" title=""></figure><p></p><p><br>We hope those new instances will improve access to GPUs on Sherlock, enable a wider range of use cases, with all the flexibility and performance you need to get your work done, and lead to even more groundbreaking discoveries!</p><p>As always, thanks to all of our users for your continuous support and patience as we work to improve Sherlock, and if you have any question or comment, please don’t hesitate to reach out at <a href="mailto:[email protected]" rel="noopener" target="_blank">[email protected]</a>.<br></p>Kilian Cavalotti[email protected]urn:noticeable:publications:MARmnxM2JHvznq8MaK6q2022-12-14T17:27:18.657Z2022-12-14T17:27:26.687ZMore free compute on Sherlock!We’re thrilled to announce that the free and generally available normal partition on Sherlock is getting an upgrade! 
With the addition of 24 brand new SH3_CBASE.1 compute nodes, each featuring one AMD EPYC 7543 Milan 32-core CPU and 256 GB<p>We’re thrilled to announce that the free and generally available <code>normal</code> partition on Sherlock is getting an upgrade!<br><br>With the addition of 24 brand new <a href="https://www.sherlock.stanford.edu/docs/orders/?h=cbase&amp;utm_source=noticeable&amp;utm_campaign=sherlock.more-free-compute-on-sherlock&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.MARmnxM2JHvznq8MaK6q&amp;utm_medium=newspage#configurations" rel="noopener nofollow" target="_blank" title="Sherlock node configurations">SH3_CBASE.1</a> compute nodes, each featuring one <a href="https://www.amd.com/en/products/cpu/amd-epyc-7543?utm_source=noticeable&amp;utm_campaign=sherlock.more-free-compute-on-sherlock&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.MARmnxM2JHvznq8MaK6q&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="AMD EPYC 7543">AMD EPYC 7543</a> Milan 32-core CPU and 256 GB of RAM, Sherlock users now have 768 more CPU cores at their disposal. Those new nodes will complement the existing 154 compute nodes and 4,032 cores in that partition, for a <strong>new total of 178 nodes and 4,800 CPU cores.</strong><br><br>The <code>normal</code> partition is Sherlock’s shared pool of compute nodes, which is available <a href="https://www.sherlock.stanford.edu/?utm_source=noticeable&amp;utm_campaign=sherlock.more-free-compute-on-sherlock&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.MARmnxM2JHvznq8MaK6q&amp;utm_medium=newspage#how-much-does-it-cost" rel="noopener nofollow" target="_blank" title="Sherlock cost">free of charge</a> to all Stanford Faculty members and their research teams, to support their wide range of computing needs. <br><br>In addition to this free set of computing resources, Faculty can supplement these shared nodes by <a href="https://www.sherlock.stanford.edu/docs/orders/?utm_source=noticeable&amp;utm_campaign=sherlock.more-free-compute-on-sherlock&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.MARmnxM2JHvznq8MaK6q&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Purchasing Sherlock compute nodes">purchasing additional compute nodes</a>, and become Sherlock owners.
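</p><p><em>As an unofficial illustration,</em> the current size of the <code>normal</code> partition can be checked at any time from a login node with the standard Slurm <code>sinfo</code> command; the output format below is an arbitrary choice, and the numbers will keep evolving as nodes are added or retired:</p><pre><code># show node and CPU counts (allocated/idle/other/total) for the shared partition
$ sinfo -p normal -o "%P %D %C"</code></pre><p>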
By investing in the cluster, PI groups not only receive exclusive access to the nodes they purchased, but also get access to all of the other owner compute nodes when they’re not in use, thus giving them access to the <a href="https://www.sherlock.stanford.edu/docs/tech/facts/?utm_source=noticeable&amp;utm_campaign=sherlock.more-free-compute-on-sherlock&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.MARmnxM2JHvznq8MaK6q&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock facts">whole breadth of Sherlock resources</a>, currently over 1,500 compute nodes, 46,000 CPU cores and close to 4 PFLOPS of computing power.<br><br>We hope that this new expansion of the <code>normal</code> partition, made possible thanks to additional funding provided by the University Budget Group as part of the FY23 budget cycle, will help support the ever-increasing computing needs of the Stanford research community, and enable even more breakthroughs and discoveries.<br><br>As usual, if you have any question or comment, please don’t hesitate to reach out at <a href="mailto:[email protected]" rel="noopener" target="_blank">[email protected]</a>.<br><br><br><br></p>Kilian Cavalotti[email protected]urn:noticeable:publications:Hdh5qDe3icyS6vJXdQpt2021-11-30T17:00:00Z2021-11-30T18:27:25.812ZFrom Rome to Milan, a Sherlock catalog updateIt’s been almost a year and a half since we first introduced Sherlock 3.0 and its major new features: brand new CPU model and manufacturer, 2x faster interconnect, much larger and faster node-local storage, and more! We’ve now reached an<p>It’s been almost a year and a half since we first <a href="https://news.sherlock.stanford.edu/publications/sherlock-3-0-is-here?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock 3.0">introduced Sherlock 3.0</a> and its major new features: brand new CPU model and manufacturer, 2x faster interconnect, much larger and faster node-local storage, and more! We’ve now reached an inflexion point in Sherlock’s current generation and it’s time to update the hardware configurations available for purchase in the <a href="https://www.sherlock.stanford.edu/catalog?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock catalog">Sherlock catalog</a>.<br><br>So today, <strong>we’re introducing a new Sherlock catalog refresh</strong>, a Sherlock 3.5 of sorts.</p><h1>The new catalog</h1><p>So, what changes?
What stays the same?<br>In a nutshell, you’ll continue to be able to purchase the existing node types that you’re already familiar with:</p><p><strong>CPU configurations:</strong></p><ul><li><p><code>CBASE</code>: base configuration ($)</p></li><li><p><code>CPERF</code>: high core-count configuration ($$)</p></li><li><p><code>CBIGMEM</code>: large-memory configuration ($$$$)</p></li></ul><p><strong>GPU configurations</strong></p><ul><li><p><code>G4FP32</code>: base GPU configuration ($$)</p></li><li><p><code>G4TF64</code>: HPC GPU configuration ($$$)</p></li><li><p><code>G8TF64</code>: best-in-class GPU configuration ($$$$)</p></li></ul><p>But they now come with better and faster components!<br><br><em>To avoid confusion, the configuration names in the catalog will be suffixed with a index to indicate the generational refresh, but will keep the same global denomination. For instance, the previous <code>SH3_CBASE</code> configuration is now replaced with a <code>SH3_CBASE.1</code> configuration that still offers 32 CPU cores and 256 GB of RAM.</em></p><h2>A new CPU generation</h2><p>The main change in the existing configuration is the introduction of the new <a href="https://www.amd.com/en/processors/epyc-7003-series?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="AMD EPYC™ 7003 Series Processors">AMD 3rd Gen EPYC Milan</a> CPUs. In addition to the advantages of the previous Rome CPUs, this new generation brings:</p><ul><li><p>a new micro-architecture (Zen3)</p></li><li><p>a ~20% performance increase in instructions completed per clock cycle (IPC)</p></li><li><p>enhanced memory performance, with a unified 32 MB L3 cache</p></li><li><p>improved CPU clock speeds</p></li></ul><p></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/Hdh5qDe3icyS6vJXdQpt/01h55ta3gsmvbh16hzp4z34xt9-image.jpg" alt="" loading="lazy" title=""></figure><p></p><p>More specifically, for Sherlock, the following CPU models are now used:</p><table><tbody><tr><th data-colwidth="105"><p>Model</p></th><th><p>Sherlock 3.0 (Rome)</p></th><th><p>Sherlock 3.5 (Milan)</p></th></tr><tr><td data-colwidth="105"><p><code>CBASE</code></p></td><td><p>1× 7502 (32-core, 2.50GHz)</p></td><td><p>1× 7543 (32-core, 2.75GHz)</p></td></tr><tr><td data-colwidth="105"><p><code>CPERF</code></p></td><td><p>2× 7742 (64-core, 2.25GHz)</p></td><td><p>2× 7763 (64-core, 2.45GHz)</p></td></tr><tr><td data-colwidth="105"><p><code>CBIGMEM</code></p></td><td><p>2× 7502 (32-core, 2.50GHz)</p></td><td><p>2× 7543 (32-core, 2.75GHz)</p></td></tr><tr><td data-colwidth="105"><p><code>G4FP32</code></p></td><td><p>1× 7502 (32-core, 2.50GHz)</p></td><td><p>1× 7543 (32-core, 2.75GHz)</p></td></tr><tr><td data-colwidth="105"><p><code>G4TF64</code></p></td><td><p>2× 7502 (32-core, 2.50GHz)</p></td><td><p>2× 7543 (32-core, 2.75GHz)</p></td></tr><tr><td data-colwidth="105"><p><code>G8TF64</code></p></td><td><p>2× 7742 (64-core, 2.25GHz)</p></td><td><p>2× 7763 (64-core, 2.45GHz)</p></td></tr></tbody></table><p>In addition to IPC and L3 cache improvements, the new CPUs also bring a frequency boost that will provide a substantial performance improvement.<br></p><h2>New GPU options</h2><p>On the GPU front, the two main changes are the re-introduction of the <code>G4FP32</code> model, and the doubling of GPU memory 
all across the board.<br><br>GPU memory is quickly becoming the constraining factor for training deep-learning models that keep increasing in size. Having large amounts of GPU memory is now key for running medical imaging workflows, computer vision models, or anything that requires processing large images.</p><p>The entry-level <code>G4FP32</code> model is back in the catalog, with a new <a href="https://www.nvidia.com/en-us/data-center/a40/" rel="noopener nofollow" target="_blank" title="NVIDIA A40">NVIDIA A40 GPU</a> in an updated <code>SH3_G4FP32.2</code> configuration. The A40 GPU not only provides higher performance than the previous model it replaces, but it also comes with twice as much GPU memory, with a whopping 48GB of GDDR6.<br></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/Hdh5qDe3icyS6vJXdQpt/01h55ta3gsf1ph3715608pjxhk-image.png" alt="" loading="lazy" title=""></figure><p></p><p>The higher-end <code>G4TF64</code> and <code>G8TF64</code> models have also been updated with newer AMD CPUs, as well as updated versions of the <a href="https://www.nvidia.com/en-us/data-center/a100/?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="NVIDIA A100">NVIDIA A100 GPU</a>, now each featuring a massive 80GB of HBM2e memory.<br></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/Hdh5qDe3icyS6vJXdQpt/01h55ta3gsx86d48cxj1sx0zf1-image.png" alt="" loading="lazy" title=""></figure><p></p><h1>Get yours today!</h1><p>For more details and pricing, please check out the <a href="https://www.sherlock.stanford.edu/catalog?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener" target="_blank">Sherlock catalog</a> <em>(SUNet ID required)</em>.<br><br>If you’re interested in <a href="https://www.sherlock.stanford.edu/docs/orders/?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="purchasing process">getting your own compute nodes</a> on Sherlock, all the new configurations are available for purchase today, and can be ordered online through the <a href="https://www.sherlock.stanford.edu/order?utm_source=noticeable&amp;utm_campaign=sherlock.from-rome-to-milan-a-sherlock-catalog-update&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.Hdh5qDe3icyS6vJXdQpt&amp;utm_medium=newspage" rel="noopener" target="_blank">Sherlock order form</a> <em>(SUNet ID required)</em>.<br><br>As usual, please don’t hesitate to <a href="mailto:[email protected]" rel="noopener" target="_blank">reach out</a> if you have any questions!</p>Kilian Cavalotti[email protected]urn:noticeable:publications:nxYhogleTbG5uVFkz1FC2021-04-03T00:00:00Z2021-04-03T00:50:17.280Z3.3 PFlops: Sherlock hits expansion milestoneSherlock is a traditional High-Performance Computing cluster in many aspects.
But unlike most similarly-sized clusters where hardware is purchased all at once, and refreshed every few years, it is in constant evolution. Almost like a<p><a href="https://www.sherlock.stanford.edu?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock">Sherlock</a> is a traditional High-Performance Computing cluster in many aspects. But unlike most similarly-sized clusters where hardware is purchased all at once, and refreshed every few years, it is in constant evolution. Almost like a living organism, it changes all the time: mostly expanding as individual PIs, research groups, labs and even whole <a href="https://www.stanford.edu/academics/schools/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Stanford’s Seven Schools">Schools</a> contribute computing resources to the system; but also sometimes contracting, when older equipment is retired.</p><h2>A significant expansion milestone</h2><p>A few days ago, Sherlock reached a major expansion milestone, largely owing to significant purchases from the <a href="https://earth.stanford.edu?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="School of Earth, Energy &amp; Environmental Sciences">School of Earth, Energy &amp; Environmental Sciences</a>, but also thanks to multiple existing owner groups who decided to renew their investment in Sherlock by purchasing additional hardware. <br><br>With these recent additions, Sherlock reached a theoretical power of over <strong>3 Petaflops</strong>, 3 thousand million million (10<sup>15</sup>) floating-point operations per second.
That would place it around the 150th position in the most recent <a href="https://top500.org/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="TOP500 list">TOP500</a> list of the most powerful computer systems in the world.<br><br>Among the newly added nodes, a number of <code><a href="https://news.sherlock.stanford.edu/publications/new-gpu-options-in-the-sherlock-catalog?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="New GPU options">SH3_G8TF64</a></code><a href="https://news.sherlock.stanford.edu/publications/new-gpu-options-in-the-sherlock-catalog?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="New GPU options"> nodes</a>, each featuring 128 CPU cores, 1TB of RAM, 8x <a href="https://www.nvidia.com/en-us/data-center/a100/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="A100 GPU">A100 SXM4 GPUs</a> (NVLink) and two Infiniband HDR interfaces providing 400Gb/s of interconnect bandwidth, both for storage and inter-node communication. Those nodes alone provide over half a Petaflop of computing power.<br><br>Sherlock now features over <strong>1,700 compute nodes</strong>, occupying 45 data-center racks, and consuming close to half a megawatt of power. Over <strong>44,000 CPU cores</strong>, more than 120 Infiniband switches and close to 20 miles of cables help support the daily computing activities of over 5,000 users. <em>For even more facts and numbers, checkout the <a href="https://www.sherlock.stanford.edu/docs/overview/tech/facts/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock facts">Sherlock Facts</a> page!</em></p><h2>A steady growth</h2><p>Since in first days in 2014, and its initial 120 nodes, Sherlock has been growing at a steady pace. 
Three generations and as many Infiniband fabrics later, and after a few months of slowdown at the beginning of 2020, expansion has resumed and is going stronger than ever: <br></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/nxYhogleTbG5uVFkz1FC/01h55ta3gsq59rjhfw223xjdtv-image.png" alt="" loading="lazy" title=""></figure><p></p><table><tbody><tr><td><p></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/nxYhogleTbG5uVFkz1FC/01h55ta3gsc37fabfksytx5c91-image.png" alt="" loading="lazy" title=""></figure><p></p></td><td><p></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/nxYhogleTbG5uVFkz1FC/01h55ta3gshsb65ge585zgsjzr-image.png" alt="" loading="lazy" title=""></figure><p></p></td></tr></tbody></table><h2>The road ahead</h2><p>To keep expanding Sherlock and continue to serve the computing needs of the Stanford research community, rack space used by first generation Sherlock nodes needs to be reclaimed to make room for the next generation. Those 1st-gen nodes have been running well over their initial service life of 4 years, and in most cases, we’ve even been able to keep them running for an extra year. But data-center space being the hot property it has now become, and since demand for new nodes is not exactly dwindling down, we’ll be starting to retire the older Sherlock nodes to accommodate the ever-increasing requests for more computing power. We’ve started working on renewal plans with those node owners, and the process is already underway. <br><br>So for a while, Sherlock will shrink in size, as old nodes are retired. Before it can start growing again!</p><h2>Catalog changes</h2><p>As we move forward, the <a href="https://www.sherlock.stanford.edu/catalog/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock Catalog">Sherlock Compute Nodes Catalog</a> is also evolving, to follow the latest technological trends, and to adapt to the computing needs of our research community.<br><br>As part of this evolution, the <a href="https://news.sherlock.stanford.edu/publications/sh-3-g-4-fp-32-nodes-are-back-in-the-catalog?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-expansion-milestone&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.nxYhogleTbG5uVFkz1FC&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="SH3_G4FP32 nodes are back in the catalog!">recently announced</a> <code>SH3_G4FP32</code> configuration is sadly not available anymore, as vendors suddenly and globally discontinued the consumer-grade GPU model that was powering this configuration. They don’t have plans to bring back anything comparable, so that configuration had to be pulled from the catalog, unfortunately.<br><br>On a more positive note, a significant and exciting catalog refresh is coming up, and will be announced soon. Stay tuned! 🤫</p><hr><p>As usual, we want to sincerely thank every one of you, Sherlock users, for your patience when things break, your extraordinary motivation and your continuous support. 
We’re proud of supporting your amazing work, and Sherlock simply wouldn’t exist without you.<br><br>Happy computing and don’t hesitate to&nbsp;<a href="mailto:[email protected]" rel="noopener" target="_blank">reach out</a> if you have any questions!</p>Kilian Cavalotti[email protected]urn:noticeable:publications:P3xY1hwDWMe8tR48vPEj2021-02-05T18:20:00Z2021-02-05T20:10:21.673ZTracking NFS problems down to the SFP levelWhen NFS problems turn out to be... not NFS problems at all.<blockquote><p><em>This is part of our </em><a href="https://news.sherlock.stanford.edu/labels/blog?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock blog series">technical blog series</a><em> about things that happen behind-the-scenes on Sherlock, and which are part of our ongoing effort to keep it up and running in the best possible conditions for our beloved users.</em></p></blockquote><p><strong>For quite a long time, we've been chasing down an annoying </strong><a href="https://en.wikipedia.org/wiki/Network_File_System?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="NFS"><strong>NFS</strong></a><strong> timeout issue that seemed to only affect </strong><a href="https://news.sherlock.stanford.edu/posts/sherlock-3-0-is-here?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Sherlock 3.0"><strong>Sherlock 3.0</strong></a><strong> nodes.</strong></p><p>That issue would impact both login and compute nodes, both NFSv4 user mounts (like <code><a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage#home" rel="noopener nofollow" target="_blank" title="$HOME">$HOME</a></code> and <code><a href="https://www.sherlock.stanford.edu/docs/storage/filesystems/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage#group_home" rel="noopener nofollow" target="_blank" title="$GROUP_HOME">$GROUP_HOME</a></code>) and NFSv3 system-level mounts (like the one providing <a href="https://www.sherlock.stanford.edu/docs/software/list/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="software">software installations</a>), and occur at random times, on random nodes. It was not widespread enough to be causing real damage, but from time to time, a NFS mount would hang and block I/O for a job, or freeze a login session. 
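</p><p><em>(For readers who have never chased one of these down: a hung NFS mount usually shows up with a couple of generic Linux symptoms. The checks below are only a sketch, not Sherlock-specific tooling nor necessarily the exact commands we ran at the time.)</em></p><pre><code># the kernel's NFS client logs unresponsive servers ("nfs: server ... not responding")
$ dmesg -T | grep -i "nfs: server"
# processes stuck in uninterruptible sleep (D state) are a telltale sign of a hung mount
$ ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'</code></pre><p>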
When that happened, the node would still be able to ping all of the NFS servers' IP addresses, even remount the same NFS file system with the exact same options in another mount point, and no discernable network issue was apparent on the nodes. Sometimes, the stuck mounts would come back to life on their own, sometimes they would stay hanging forever.</p><h2>Is it load? Is it the kernel? Is it the CPU?</h2><p>It kind of looked like it could be correlated with <a href="https://en.wikipedia.org/wiki/Load_(computing)?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="load">load</a>, and mostly appear when multiple jobs were doing NFS I/O on a given node, but we never found conclusive proof that it was the case. The only distinguishable element was that the issue was only observed on Sherlock 3.0 nodes and never affected older Sherlock 1.0/2.0 nodes. So we started suspecting something about the kernel NFS client, maybe some oddity with AMD Rome CPUs: after all, they were all quite new, and the nodes had many more cores than the previous generation. So maybe they had more trouble handling the parallel operations, ended up with a deadlock or something.</p><p>But still, all the Sherlock nodes are using the same kernel, and only the Sherlock 3.0 nodes were affected, so it appeared unlikely to be a kernel issue. </p><h2>The NFS servers maybe?</h2><p>We then started looking at the NFS servers. Last December’s maintenance was actually an attempt at resolving those timeout issues, even though it proved fruitless in that aspect. We got in touch with vendor support to explore possible explanations, but nothing came out of it and our support case went nowhere. Plus, if the NFS servers were at fault, it would likely have affected all Sherlock nodes, not just a subset.</p><h2>It’s the NFS client parameters! Or is it?</h2><p>So back to the NFS client, we've started looking at the NFS client mount parameters. The <a href="https://www.google.com/search?q=nfs+timeout&amp;&amp;utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">petazillion web hits</a> about "nfs timeout" didn't really help in that matter, but in the process we found pretty interesting <a href="https://lore.kernel.org/linux-nfs/[email protected]/T/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">discussions</a> about read/write sizes and read-ahead. 
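</p><p><em>(For context, and as a generic sketch only: the parameters in question are the ones negotiated at mount time, and any NFS client can report what it actually ended up with.)</em></p><pre><code># list each NFS mount with its effective options (vers, proto, rsize/wsize, timeo, retrans, ...)
$ nfsstat -m
# the same information, straight from the kernel's view
$ grep ' nfs' /proc/mounts</code></pre><p>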
We tried tweaking all of those parameters left and right, deployed various configs on the compute nodes (A/B testing FTW!), but the timeout still happened.</p><h2>The lead</h2><p>In the end, what gave us a promising lead was an <a href="https://blog.noc.grnet.gr/2018/08/29/a-performance-story-how-a-faulty-qsfp-crippled-a-whole-ceph-cluster/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">article</a> found on the <a href="https://grnet.gr/en/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">GRNET</a> <a href="https://blog.noc.grnet.gr/?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">blog</a> that explain how the authors tracked down a defective <a href="https://en.wikipedia.org/wiki/Small_form-factor_pluggable_transceiver?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="QSFP">QSFP</a> that was causing issues in their <a href="https://en.wikipedia.org/wiki/Ceph_(software)?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="Ceph">Ceph</a> cluster. Well, it didn't take long to realize that there was a similar issue between those Sherlock nodes and the NFS servers. Packet loss was definitely involved.</p><p>The tricky part, as described in the blog post, is that the packet loss only manifested itself when using large <a href="https://en.wikipedia.org/wiki/Internet_Control_Message_Protocol?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">ICMP</a> packets, close to the <a href="https://en.wikipedia.org/wiki/Maximum_transmission_unit?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank">MTU</a> upper limit. 
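</p><p><em>(A quick back-of-the-envelope note on the packet sizes used in the tests below, assuming the usual 9000-byte jumbo-frame MTU: the largest ICMP payload that still fits in a single frame is 9000 - 20 bytes of IPv4 header - 8 bytes of ICMP header = 8972 bytes, hence the <code>-s 8972</code> flag; <code>-M do</code> sets the Don’t Fragment bit, so anything that can’t cross a link intact shows up as loss instead of being silently fragmented. The target address below is just a placeholder.)</em></p><pre><code># 9000 (MTU) - 20 (IPv4 header) - 8 (ICMP header) = 8972 bytes of ICMP payload
$ ping -M do -s 8972 -c 10 &lt;nfs-server-ip&gt;</code></pre><p>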
When using regular packet size, no problem was apparent.</p><p>For instance, this regular ping didn’t show any loss:</p><pre><code># ping -c 50 10.16.90.1 | grep loss
50 packets transmitted, 50 received, 0% packet loss, time 538ms</code></pre><p>But when cranking up the packet size:</p><pre><code># ping -s 8972 -c 50 10.16.90.1 | grep loss
50 packets transmitted, 36 received, 28% packet loss, time 539ms</code></pre><p>What was even funnier is that not all Sherlock 3.0 nodes were experiencing loss to the same NFS server nodes. For instance, from one client node, there was packet loss to just one of the NFS servers:</p><pre><code>client1# clush -Lw 10.16.90.[1-8] --worker=exec ping -s 8972 -M do -c 10 -q %h | grep loss
10.16.90.1: 10 packets transmitted, 10 received, 0% packet loss, time 195ms
10.16.90.2: 10 packets transmitted, 8 received, 20% packet loss, time 260ms
10.16.90.3: 10 packets transmitted, 10 received, 0% packet loss, time 193ms
10.16.90.4: 10 packets transmitted, 10 received, 0% packet loss, time 260ms
10.16.90.5: 10 packets transmitted, 10 received, 0% packet loss, time 200ms
10.16.90.6: 10 packets transmitted, 10 received, 0% packet loss, time 264ms
10.16.90.7: 10 packets transmitted, 10 received, 0% packet loss, time 196ms
10.16.90.8: 10 packets transmitted, 10 received, 0% packet loss, time 194ms</code></pre><p>But from another client, sitting right next to it, no loss to that server, but packets dropped to another one instead:</p><pre><code>client2# clush -Lw 10.16.90.[1-8] --worker=exec ping -s 8972 -M do -c 10 -q %h | grep loss
10.16.90.1: 10 packets transmitted, 8 received, 20% packet loss, time 190ms
10.16.90.2: 10 packets transmitted, 10 received, 0% packet loss, time 198ms
10.16.90.3: 10 packets transmitted, 10 received, 0% packet loss, time 210ms
10.16.90.4: 10 packets transmitted, 10 received, 0% packet loss, time 197ms
10.16.90.5: 10 packets transmitted, 10 received, 0% packet loss, time 196ms
10.16.90.6: 10 packets transmitted, 10 received, 0% packet loss, time 243ms
10.16.90.7: 10 packets transmitted, 10 received, 0% packet loss, time 201ms
10.16.90.8: 10 packets transmitted, 10 received, 0% packet loss, time 213ms</code></pre><h2>The link</h2><p>That all started to sound like a faulty stack link, or a problem in one of the <a href="https://en.wikipedia.org/wiki/Link_aggregation?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage#Link_Aggregation_Control_Protocol" rel="noopener nofollow" target="_blank" title="LACP">LACP</a> links between the different switch stacks (Sherlock’s and the NFS servers’).
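</p><p><em>(As an aside, the host side of each link can be checked for the same kind of symptoms with standard Linux counters. Interface names here are placeholders and the commands are a generic sketch, not necessarily what was run at the time.)</em></p><pre><code># per-interface RX/TX error and drop counters
$ ip -s link show dev eth0
# NIC/driver-level statistics, which often include CRC/FCS error counters
$ ethtool -S eth0 | grep -iE 'crc|err|drop'</code></pre><p>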
We didn’t find anything obviously out-of-place in reviewing the switches’ configuration, so we went back to the switches’ documentation to try to understand how to check counters and identify bad links (which gave us the opportunity to mumble about documentation that is not in sync with actual commands, but that’s another topic...).</p><p>So we dumped the hardware counters for each link involved in the NFS connections, and on a switch, on the NFS client’s side, there was this:</p><pre><code>te1/45 Ingress FCSDrops : 0
te1/46 Ingress FCSDrops : 0
te2-45 Ingress FCSDrops : 0
te2-46 Ingress FCSDrops : 0
te3-45 Ingress FCSDrops : 0
te3-46 Ingress FCSDrops : 1064263014
te4-45 Ingress FCSDrops : 0
te4-46 Ingress FCSDrops : 0</code></pre><p>Something standing out, maybe?</p><p>In more detail:</p><pre><code>#show interfaces te3/46
TenGigabitEthernet 3/46 is up, line protocol is up
Port is part of Port-channel 98
[...]
Input Statistics:
18533249104 packets, 35813681434965 bytes
[...]
1064299255 CRC, 0 overrun, 0 discarded</code></pre><p>The CRC number indicates the number of CRC <em>failures</em>, packets which failed checksum validation. All the other ports on the switch were at 0. So clearly something was off with that port. </p><h2>The culprit: a faulty SFP!</h2><p>We decided to try to shut that port down (after all, it’s just 1 port out of an 8-port LACP link), and immediately, all the packet loss disappeared.</p><p>So we replaced the optical transceiver in that port, hoping that swapping that SFP would resolve the CRC failure problem. After re-enabling the link, the number of dropped packets seemed to have decreased. But it did not totally disappear…</p><h2>The <em>real</em> culprit: the <em>other</em> SFP</h2><p>Thinking a little more about it, since the errors were actually <strong>Ingress</strong> FCSDrops on the switch, it didn’t seem completely unreasonable to consider that those frames were received by the switch already corrupted, and thus, that they would have been mangled by either the transceiver on the other end of the link, or maybe in-flight by a damaged cable. So maybe we’ve been pointing fingers at an SFP, and maybe it was innocent… 😁<br><br>We checked the switch’s port on the NFS server’s side, and the checksum errors and drop counts were all at 0. We replaced that SFP anyway, just to see, and this time, bingo: no more CRC errors on the other side. </p><p>Which led us to the following decision tree:</p><ul><li><p>if a port has RX/receiving/ingress errors, it’s probably <strong>not</strong> its fault, and the issue is most likely with its peer at the other end of the link,</p></li><li><p>if a port has TX/transmitting/egress errors, it’s probably the source of the problem,</p></li><li><p>if both ports at each end of a given link have errors, the cable is probably at fault.</p></li></ul><p>By the way, if you’re wondering, here’s what an SFP looks like:<br></p><figure><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/P3xY1hwDWMe8tR48vPEj/01h55ta3gs1drsabkq2fschedq-image.jpg" alt="" loading="lazy" title=""></figure><p></p><h1>TL;DR</h1><p><strong>We had seemingly random NFS timeout issues.
They turned out to be caused by a defective SFP, that was eventually identified through the port error counter of the switch at the other end of the link.</strong></p><p>There's probably a lesson to be learned here, and we were almost disappointed that DNS was not involved (because <a href="https://isitdns.com?utm_source=noticeable&amp;utm_campaign=sherlock.tracking-nfs-problems-down-to-the-sfp-level&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.P3xY1hwDWMe8tR48vPEj&amp;utm_medium=newspage" rel="noopener nofollow" target="_blank" title="it's always DNS">it's always DNS</a>), but in the end, we were glad to finally find a rational explanation to those timeouts. And since that SFP replacement, not a single NFS timeout has been logged.<br><br></p>Kilian Cavalotti[email protected]urn:noticeable:publications:NGxi6lYLPRYFL9aZSN8O2020-11-05T01:49:00.001Z2020-11-05T02:26:57.373ZSH3_G4FP32 nodes are back in the catalog!A new GPU option is available in the Sherlock catalog... again! After a period of unavailability and a transition between GPU generations, where previous models were retired while new ones were not available yet, we're pleased to...<p><strong>A new GPU option is available in the <a href="https://www.sherlock.stanford.edu/catalog?utm_source=noticeable&amp;utm_campaign=sherlock.sh-3-g-4-fp-32-nodes-are-back-in-the-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.NGxi6lYLPRYFL9aZSN8O&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock catalog</a>… again!</strong></p> <p>After a period of unavailability and a transition between GPU generations, where previous models were retired while new ones were not available yet, we’re pleased to announce that the entry-level GPU node configuration is now back in the catalog. 
With a vengeance!</p> <p>Built around the same platform as the previous <code>SH3_G4FP32</code> generation, the new <strong><code>SH3_G4FP32.1</code></strong> model features:</p> <ul> <li>32 CPU cores</li> <li>256 GB of memory</li> <li>2TB of local NVMe scratch space</li> <li>4x <a href="https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3090/?utm_source=noticeable&amp;utm_campaign=sherlock.sh-3-g-4-fp-32-nodes-are-back-in-the-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.NGxi6lYLPRYFL9aZSN8O&amp;utm_medium=newspage" target="_blank" rel="noopener">GeForce RTX 3090</a> GPUs, each featuring 24GB of GPU memory</li> <li>a 200GB/s Infiniband HDR interface</li> </ul> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/NGxi6lYLPRYFL9aZSN8O/01h55ta3gsa6gy01b3kh67raaf-image.png" alt="rtx3090.png"></p> <p>Particularly well-suited for applications that don’t require full double-precision computations (FP64), the top-of-the-line RTX 3090 GPU is based on the latest NVIDIA <a href="https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/?utm_source=noticeable&amp;utm_campaign=sherlock.sh-3-g-4-fp-32-nodes-are-back-in-the-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.NGxi6lYLPRYFL9aZSN8O&amp;utm_medium=newspage" target="_blank" rel="noopener">Ampere</a> architecture and provides what’s probably the best performance/cost ratio on the market today for those use cases, and delivers almost twice the performance of the previous generations on many ML/AI workloads, as well as a significant boost for Molecular Dynamics and CryoEM applications.</p> <p>For more details and pricing, please check out the <a href="https://www.sherlock.stanford.edu/catalog?utm_source=noticeable&amp;utm_campaign=sherlock.sh-3-g-4-fp-32-nodes-are-back-in-the-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.NGxi6lYLPRYFL9aZSN8O&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock catalog</a> <em>(SUNet ID required)</em>, and if you’re interested in purchasing your own compute nodes for Sherlock, the new <code>SH3_G4FP32.1</code> configuration is available for purchase today, and can be ordered online though the <a href="https://www.sherlock.stanford.edu/order?utm_source=noticeable&amp;utm_campaign=sherlock.sh-3-g-4-fp-32-nodes-are-back-in-the-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.NGxi6lYLPRYFL9aZSN8O&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock order form</a> <em>(SUNet ID required)</em>.</p> Kilian Cavalotti[email protected]urn:noticeable:publications:pQO6ll118TRDHxHxfmj12020-09-18T18:00:00.001Z2020-09-18T22:53:40.922ZNew GPU options in the Sherlock catalogToday, we're introducing the latest generation of GPU accelerators in the Sherlock catalog: the NVIDIA A100 Tensor Core GPU. 
Each A100 GPU features 9.7 TFlops of double-precision (FP64) performance, up to 312 TFlops for deep-learning...<p>Today, we’re introducing the latest generation of GPU accelerators in the Sherlock catalog: the <a href="https://www.nvidia.com/en-us/data-center/a100/?utm_source=noticeable&amp;utm_campaign=sherlock.new-gpu-options-in-the-sherlock-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.pQO6ll118TRDHxHxfmj1&amp;utm_medium=newspage" target="_blank" rel="noopener">NVIDIA A100 Tensor Core GPU</a>.</p> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/pQO6ll118TRDHxHxfmj1/01h55ta3gsm3jne0v3cpprnkn5-image.jpg" alt="ampere-a100.jpg"></p> <p>Each A100 GPU features <strong>9.7 TFlops</strong> of double-precision (FP64) performance, up to <strong>312 TFlops</strong> for deep-learning applications, <strong>40GB</strong> of HBM2 memory, and <strong>600GB/s</strong> of interconnect bandwidth with 3rd generation <strong>NVLink</strong> connections<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>.</p> <h2>New Sherlock Catalog options</h2> <p>Targeting the most demanding HPC and DL/AI workloads, the three new GPU node options we’re introducing today should cover the most extreme computing needs:</p> <ul> <li>a refreshed version of the <code>SH3_G4FP64.1</code> configuration features 32x CPU cores, 256GB of memory and 4x A100 PCIe GPUs</li> <li>the new <code>SH3_G4TF64</code> model features 64 CPU cores, 512GB of RAM, and 4x A100 SXM4 GPUs (NVLink)</li> <li>and the most powerful configuration, <code>SH3_G8TF64</code>, comes with 128 CPU cores, 1TB of RAM, 8x A100 SXM4 GPUs (NVLink) and <em>two</em> Infiniband HDR HCAs for a whopping 400Gb/s of interconnect bandwidth to keep those GPUs busy</li> </ul> <p>You’ll find all the details in the <a href="http://www.sherlock.stanford.edu/docs/overview/orders/catalog?utm_source=noticeable&amp;utm_campaign=sherlock.new-gpu-options-in-the-sherlock-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.pQO6ll118TRDHxHxfmj1&amp;utm_medium=newspage" target="_blank" rel="noopener"><strong>Sherlock catalog</strong></a> <em>(SUNet ID required)</em>.</p> <p>All those configurations are available for order today, and can be ordered online through the Sherlock <a href="http://www.sherlock.stanford.edu/docs/overview/orders/form?utm_source=noticeable&amp;utm_campaign=sherlock.new-gpu-options-in-the-sherlock-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.pQO6ll118TRDHxHxfmj1&amp;utm_medium=newspage" target="_blank" rel="noopener">order form</a> <em>(SUNet ID required)</em>.</p> <h2>Other models’ availability</h2> <p>We’re working on bringing a replacement for the entry-level <code>SH3_G4FP32</code> model back in the catalog as soon as possible. We’re unfortunately dependent on GPU availability, as well as on the adaptations required for server vendors to accommodate the latest generation of consumer-grade GPUs.
We’re expecting a replacement configuration in the same price range to be available by the end of the calendar year.</p> <p>As usual, please don’t hesitate to <a href="mailto:[email protected]" target="_blank" rel="noopener">reach out</a> if you have any questions!</p> <hr class="footnotes-sep"> <section class="footnotes"> <ol class="footnotes-list"> <li id="fn1" class="footnote-item"><p>In-depth technical details are available in the <a href="https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth?utm_source=noticeable&amp;utm_campaign=sherlock.new-gpu-options-in-the-sherlock-catalog&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.pQO6ll118TRDHxHxfmj1&amp;utm_medium=newspage" target="_blank" rel="noopener">NVIDIA Developer blog</a> <a href="#fnref1" class="footnote-backref">↩</a></p> </li> </ol> </section> Kilian Cavalotti[email protected]urn:noticeable:publications:GiqpTJFx844GaHTJX1kP2020-08-20T00:42:00.001Z2020-08-20T00:44:03.173ZSherlock 3.0 is here!It's been a long, long, way too long of a wait, but despite a global pandemic, heatwaves, thunderstorms, power shutoffs, fires and smoke, it's finally here! Today, we're very excited to announce the immediate availability of Sherlock 3...<p>It’s been a long, long, way too long of a wait, but despite a global pandemic, heatwaves, thunderstorms, power shutoffs, fires and smoke, it’s finally here!</p> <p><strong>Today, we’re very excited to announce the immediate availability of <em>Sherlock 3.0</em>, the third generation of the Sherlock cluster.</strong></p> <h2>What is Sherlock 3.0?</h2> <p>First, let’s take a quick step back for context.</p> <p>The Sherlock cluster is built around core Infiniband fabrics, which connect compute nodes together and allow them to work as a single entity. As we expand Sherlock over time, more compute nodes are added to the cluster, and when a core fabric reaches capacity, a new one needs to be spun up. This is usually a good opportunity to refresh the compute node hardware characteristics, as well as continue expanding and renewing ancillary equipment and services, such as login nodes, DTNs, storage systems, etc. 
The collection of compute and service nodes connected to the same Infiniband fabric constitutes a sort of island, or <em>generation</em>, that could live on its own, but is actually an integral part of the greater, unified Sherlock cluster.</p> <p>So far, since its inception in 2014, Sherlock has grown over two generations of nodes: the first one built around an FDR (56Gb/s) Infiniband fabric, and the second one, started in 2017, around an EDR (100Gb/s) fabric.</p> <p>Late last year, that last EDR fabric reached capacity, and after a long and multifactorial hiatus, today, we’re introducing the third generation of Sherlock, architected around a new Infiniband fabric, and a completely refreshed compute node offering.</p> <h2>What does it look like?</h2> <p>Sherlock still looks like a bunch of black boxes with tiny lights, stuffed in racks 6ft high, and with an insane number of cables going everywhere.</p> <p>But in more technical details, Sherlock 3.0 features:</p> <ul> <li><p><strong>a new, faster interconnect | Infiniband HDR, 200Gb/s</strong><br> The new interconnect provides more bandwidth and lower latency to all the new nodes on Sherlock, for either inter-node communication in large parallel MPI applications, or for accessing the <code>$SCRATCH</code> and <code>$OAK</code> parallel file systems.<br> <em>Sherlock is one of the first HPC clusters in the world to provide 200Gb/s to the nodes.</em></p></li> <li><p><strong>new and faster processors | AMD 2nd generation EPYC (Rome) CPUs</strong><br> To take advantage of the doubled inter-node bandwidth, a brand new generation of CPUs was required, to provide enough internal bandwidth between the CPUs and the network interfaces. The <a href="https://www.amd.com/en/processors/epyc-7002-series?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage" target="_blank" rel="noopener">AMD Rome CPUs</a> are actually the first (and currently still the only) x86 CPU model to provide PCIe Gen4 connectivity, which enables faster local and remote I/O and can unlock 200Gb/s network speeds.<br> Those CPUs are also faster, draw less power, and provide more cores per socket than the ones found in the previous generations of Sherlock nodes, with a minimum of 32 CPU cores per node.</p></li> <li><p><strong>more (and faster) internal storage | 2TB NVMe per node</strong><br> Sherlock 3.0 nodes now each feature a minimum of 2TB of local NVMe storage (over 10x the previous amount), for applications that are particularly sensitive to IOPS rates.</p></li> <li><p><strong>refreshed <code>$HOME</code> storage</strong><br> More nodes means more computing power, but it also means more strain on the shared infrastructure. To absorb it, we’ve also refreshed and expanded the storage cluster that supports the <code>$HOME</code> and <code>$GROUP_HOME</code> storage spaces, to provide higher bandwidth, more IOPS, and better availability.</p></li> <li><p><strong>more (and faster) login and DTN nodes</strong><br> Sherlock 3.0 also features 8 brand-new login nodes that are part of the <code>login.sherlock.stanford.edu</code> login pool, each with a pair of AMD 7502 CPUs (for a total of 64 cores) and 512 GB of RAM.
As well as a new pair of dedicated <a href="https://www.sherlock.stanford.edu/docs/storage/data-transfer/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#data-transfer-nodes-dtns" target="_blank" rel="noopener">Data Transfer Nodes (DTNs)</a></p></li> <li><p><strong>refreshed and improved infrastructure</strong><br> The list would be too long to go through exhaustively, but between additional service nodes to better scale the distributed cluster management infrastructure, improved Ethernet topology between the racks, and a refreshed hardware framework for the job scheduler, all the aspects of Sherlock have been rethought and improved.</p></li> </ul> <h2>What does it change for me?</h2> <p>In terms of habits and workflows: nothing. You don’t have to change anything and can continue to use Sherlock exactly the way you’ve been using it so far.</p> <p>Sherlock is still a single cluster, with the same:</p> <ul> <li>single point of entry at <code>login.sherlock.stanford.edu</code>,</li> <li>single and ubiquitous data storage space (you can still access all of your data on all the file systems, from all the nodes in the cluster),</li> <li>single application stack (you can load the same module and run the same software on all Sherlock nodes).</li> </ul> <p>But it now features a third island, with a new family of compute nodes.</p> <p>One thing you’ll probably notice pretty quickly is that your pending times in queue for the <code>normal</code>, <code>bigmem</code> and <code>gpu</code> partitions have been dropping. Considerably.</p> <p>This is because, thanks to the generous sponsorship of the <a href="https://provost.stanford.edu?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage" target="_blank" rel="noopener">Stanford Provost</a>, we’ve been able to add the following resources to Sherlock’s public partitions:</p> <table> <thead> <tr><th>partition</th><th>#nodes</th><th>node specs</th></tr> </thead> <tbody> <tr><td><code>normal</code></td><td>72</td><td>32-core (1x <a href="https://www.amd.com/en/products/cpu/amd-epyc-7502?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#product-specs" target="_blank" rel="noopener">7502</a>) w/ 256GB RAM</td></tr> <tr><td><code>normal</code></td><td>2</td><td>128-core (2x <a href="https://www.amd.com/en/products/cpu/amd-epyc-7742?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#product-specs" target="_blank" rel="noopener">7742</a>) w/ 1TB RAM</td></tr> <tr><td><code>bigmem</code></td><td>1</td><td>64-core (2x <a href="https://www.amd.com/en/products/cpu/amd-epyc-7502?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#product-specs" target="_blank" rel="noopener">7502</a>) w/ 4TB RAM</td></tr> <tr><td><code>gpu</code></td><td>16</td><td>32-core (1x <a 
href="https://www.amd.com/en/products/cpu/amd-epyc-7502P?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#product-specs" target="_blank" rel="noopener">7502P</a>) w/ 256GB RAM and 4x <a href="https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2080-ti/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#specs" target="_blank" rel="noopener">RTX 2080 Ti</a> GPUs</td></tr> <tr><td><code>gpu</code></td><td>2</td><td>32-core (1x <a href="https://www.amd.com/en/products/cpu/amd-epyc-7502P?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#product-specs" target="_blank" rel="noopener">7502P</a>) w/ 256GB RAM and 4x <a href="https://www.nvidia.com/en-us/data-center/v100/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#specs" target="_blank" rel="noopener">V100S</a> GPUs</td></tr> <tr><td><strong>Total</strong></td><td><strong>93</strong></td><td><strong>3,200 cores, 30TB RAM, 72 GPUs</strong></td></tr> </tbody> </table> <p>Those new Sherlock 3.0 nodes are adding over twice the existing computing power available for free to every Sherlock user in the <code>normal</code>, <code>bigmem</code> and <code>gpu</code> partitions.</p> <h3>How can I use the new nodes?</h3> <p>It’s easy! You can keep submitting your jobs as usual, and the scheduler will automatically try to pick the new nodes that satisfy your request requirements if they’re available.</p> <p>If you want to target the new nodes specifically, take a look at the output of <code>sh_node_feat</code>: all the new nodes have features defined that allow the scheduler to specifically select them when your job requests particular constraints.</p> <p>For instance, if you want to select nodes:</p> <ul> <li>with HDR IB connectivity, you can use <code>-C IB:HDR</code></li> <li>with AMD Rome CPUs, you can use <code>-C CPU_GEN:RME</code></li> <li>with 7742 CPUs, you can use <code>-C CPU_SKU:7742</code></li> <li>with Turing GPUs, you can use <code>-C GPU_GEN:TUR</code></li> </ul> <h2>Can I get more of it?</h2> <p>Absolutely! 
<h2>Can I get more of it?</h2> <p>Absolutely! And we’re ready to take orders today.</p> <p>If you’re interested in getting your own compute nodes on Sherlock, we’ve assembled a catalog of select configurations that you can choose from, and worked very hard with our vendors to maintain price ranges comparable with our previous generation offerings.</p> <p>You’ll find the detailed configurations and pricing in the <em>Sherlock Compute Nodes Catalog</em>, and we’ve also prepared an <em>Order Form</em> that you can use to provide the required information to purchase those nodes.</p> <ul> <li><p><strong>Sherlock catalog</strong><br> <a href="http://www.sherlock.stanford.edu/docs/overview/orders/catalog?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage" target="_blank" rel="noopener">http://www.sherlock.stanford.edu/docs/overview/orders/catalog</a></p></li> <li><p><strong>Order form</strong><br> <a href="http://www.sherlock.stanford.edu/docs/overview/orders/form?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage" target="_blank" rel="noopener">http://www.sherlock.stanford.edu/docs/overview/orders/form</a></p></li> </ul> <p>For complete details about the purchasing process, please take a look at<br> <a href="https://www.sherlock.stanford.edu/docs/overview/orders/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage" target="_blank" rel="noopener">https://www.sherlock.stanford.edu/docs/overview/orders/</a> and as usual,<br> please let us know if you have any questions.</p> <hr> <p>Finally, we wanted to sincerely thank every one of you for your patience while we were working on bringing up this new cluster generation, in an unexpectedly complicated global context. We know it’s been a very long wait, but hopefully it will have been worthwhile.</p> <p>Happy computing and don’t hesitate to <a href="mailto:[email protected]" target="_blank" rel="noopener">reach out</a>!</p> <p><em>Oh, and <a href="https://www.sherlock.stanford.edu/docs/overview/introduction/?utm_source=noticeable&amp;utm_campaign=sherlock.sherlock-3-0-is-here&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.GiqpTJFx844GaHTJX1kP&amp;utm_medium=newspage#user-community" target="_blank" rel="noopener">Sherlock is on Slack now</a>, so feel free to come join us there too!</em></p> Kilian Cavalotti[email protected]urn:noticeable:publications:VMnlU6zZGRceQom9u1Wh2019-12-03T23:30:00.001Z2019-12-04T21:49:36.010ZAdventures in storage_This is part of our blog series about behind-the-scenes things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users.
Now that Sherlock’s old storage system has been retired, we...<blockquote> <p><em>This is part of our blog series about behind-the-scenes things we do on a regular basis on Sherlock, to keep it up and running in the best possible conditions for our users.<br> Now that <a href="https://news.sherlock.stanford.edu/posts/a-new-scratch-is-here?utm_source=noticeable&amp;utm_campaign=sherlock.adventures-in-storage&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VMnlU6zZGRceQom9u1Wh&amp;utm_medium=newspage" target="_blank" rel="noopener">Sherlock’s old storage system has been retired</a>, we can finally tell that story. It all happened in 2016.</em></p> </blockquote> <p>Or: <em>How we replaced more than 1 PB of hard drives, while continuing to serve files to unsuspecting users.</em></p> <p><strong>TL;DR:</strong> The parallel filesystem in <a href="//www.sherlock.stanford.edu" target="_blank" rel="noopener">Stanford’s largest HPC cluster</a> has been affected by frequent and repeated hard-drive failures since its early days. A defect was identified that affected all of the 360 disks used in 6 different disk arrays. A major swap operation was planned to replace the defective drives. Multiple hardware disasters piled up to make matters worse, but in the end, all of the initial disks were replaced, while retaining 1.5 PB of user data intact, and keeping the filesystem online the whole time.</p> <h2>History and context</h2> <p><em>Once upon a time, in a not so far away datacenter…</em></p> <p>We, <a href="https://srcc.stanford.edu" target="_blank" rel="noopener">Stanford Research Computing Center</a>, manage many high-performance computing and storage systems at Stanford. In 2013, in an effort to centralize resources and advance computational research, a new HPC cluster, <a href="//www.sherlock.stanford.edu" target="_blank" rel="noopener">Sherlock</a>, was deployed. To provide best-in-class computing resources to all faculty and facilitate research in all fields, this campus-wide cluster features a high-performance, <a href="http://lustre.org?utm_source=noticeable&amp;utm_campaign=sherlock.adventures-in-storage&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VMnlU6zZGRceQom9u1Wh&amp;utm_medium=newspage" target="_blank" rel="noopener">Lustre</a>-based parallel filesystem.</p> <p>This filesystem, called <code>/scratch</code>, was designed to provide high-performance storage for temporary files during simulations. Initially made of three I/O cells, the filesystem had been designed to be easily expanded with more hardware as demand and utilization grew. Each I/O cell consisted of:</p> <ul> <li>2x object storage servers,</li> <li>2x disk arrays, with: <ul> <li>dual RAID controllers,</li> <li>5 drawers of 12 disks each,</li> <li>60 4TB SAS disks total.</li> </ul></li> </ul> <p><img src="https://storage.noticeable.io/projects/bYyIewUV308AvkMztxix/publications/VMnlU6zZGRceQom9u1Wh/01h55ta3gs0cjq7j0mcby72khd-image.jpg" alt="MD3260"></p> <p>Each disk array was configured with 6x 10-disk RAID6 LUNs, and with every SAS path being redundant, the two OSS servers could act as a high-availability pair. This is a pretty common Lustre setup.</p> <p>Close to a petabyte in size, this filesystem quickly became the go-to solution for many researchers who didn’t really have any other option to store and compute against their often large data sets.
Over time, the filesystem was expanded several times and eventually roughly tripled in size:</p> <table> <thead> <tr><th></th><th style="text-align:right"># disk arrays</th><th style="text-align:right"># OSTs</th><th style="text-align:right"># disks</th><th style="text-align:right">size</th></tr> </thead> <tbody> <tr><td><strong>initially</strong></td><td style="text-align:right">6</td><td style="text-align:right">36</td><td style="text-align:right">360</td><td style="text-align:right">1.1 PB</td></tr> <tr><td><strong>ultimately</strong></td><td style="text-align:right">18</td><td style="text-align:right">108</td><td style="text-align:right">1080</td><td style="text-align:right">3.4 PB</td></tr> </tbody> </table> <p>As the filesystem grew, it ended up containing close to 380 million inodes (that is, filesystem entries, like files, directories or links). Please keep that in mind; it turns out that’s an important factor for the following events.</p> <h2>The initial issue</h2> <p>All was fine and dandy in storage land, and we had our share of failing disks, as everybody does. We were replacing them as they failed, sending them back to our vendor, and getting new ones in return. Datacenter business as usual.</p> <p>Except, a lot of disks were failing. Like, really <em>a lot</em>, as in one every other day.</p> <p>We eventually came to the conclusion that our system had been installed with a batch of disks with shorter-than-average lifespans. They were all from the same disk vendor, manufactured around the same date. But we didn’t worry too much.</p> <p>Until that day when 2 disks failed within 3 hours of each other. In the same disk array. <strong>In. The. Same. <a href="https://en.wikipedia.org/wiki/Logical_unit_number?utm_source=noticeable&amp;utm_campaign=sherlock.adventures-in-storage&amp;utm_content=publication+link&amp;utm_id=bYyIewUV308AvkMztxix.GtmOI32wuOUPBTrHaeki.VMnlU6zZGRceQom9u1Wh&amp;utm_medium=newspage" target="_blank" rel="noopener">LUN</a>.</strong></p> <p>To give some context, <strong>one</strong> failed drive in a 10-disk RAID6 array is no big deal: data can be reconstructed from the 9 remaining physical disks without any problem. If by any chance one of those remaining disks suffers from a problem and data cannot be read from it, there is still enough redundancy to reconstruct the missing data and all is well.<img src="http://www.dcig.com/wp-content/uploads/images/Blog_RAID6.jpg" alt="RAID6 8+2"></p> <p>A single drive failure is handled quite transparently by the disk array:</p> <ul> <li>it emits an alert,</li> <li>you replace the failed disk,</li> <li>it detects the drive has been replaced,</li> <li>it starts rebuilding it from data and parity on the other disks of the LUN,</li> <li>about 24 hours later, you have a brand new LUN, all shiny and happy again.</li> </ul> <p>But <strong>two</strong> failed disks, on the other hand, that’s pretty much like a Russian roulette session: you may be lucky and pull it off, but there’s a good chance you won’t. While the LUN is missing 2 disks, there is no redundancy left to reconstruct the data. Meaning that any read error on any of the remaining 8 disks will lead to data loss, as the controller won’t be able to reconstruct anything. And worse, any bit flip during reads will go completely unnoticed, as there is no parity left to check the data. Which means that you can potentially be reconstructing completely random garbage on your drives.</p>
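<p>To put a rough number on that risk (with purely illustrative figures, not the actual specs of those particular drives): assuming a typical unrecoverable-read-error rate of 1 in 10<sup>15</sup> bits, reading the 8 surviving 4TB disks of a doubly-degraded 10-disk RAID6 LUN gives a very non-trivial chance of hitting at least one unreadable sector:</p>
<pre><code># back-of-the-envelope estimate, with assumed (not measured) drive specs
python3 -c "
import math
ure_rate = 1e-15                  # assumed spec: 1 unrecoverable read error per 1e15 bits
bits_to_read = 8 * 4e12 * 8       # 8 surviving disks x 4 TB x 8 bits per byte
p = -math.expm1(bits_to_read * math.log1p(-ure_rate))
print(f'{p:.0%} chance of at least one unreadable sector')   # ~23%
"
</code></pre>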
<p>Given that, it didn’t take us long to pick up the phone and call our vendor.</p> <p>They confirmed our findings: in our initial set of 6 disk arrays, over the course of 2 years, we had already replaced about 60 disks out of 360, at a rate of 5-10 failures per month. Way higher than expected.</p> <p>The LUN rebuild eventually completed fine, without any problem, but that double failure acted as a serious warning. So we started thinking about ways to solve our problem. And that’s when the sticky cheese hit the fan…</p> <h3>When problems pile up</h3> <p>Three days after the double failure, we had an even more serious hardware event: one drawer in another disk array misbehaved, reporting itself as degraded, and 6 disks failed in that same drawer over the course of a few minutes. A 7th disk was evicted a few hours later, leaving 2 LUNs without any parity in that single array. Joy all over. In a few minutes, the situation we were dreading a few days earlier had just happened twice in the same array. We were a disk away from losing serious amounts of data (we’re talking 30TB per LUN). And as past experience proved, those disks were not of the most reliable kind…</p> <p>We got our vendor to dispatch a replacement drawer to us under the terms of our H+4 support contract. Except they didn’t have any replacement drawer in stock that they could get to us in 4 hours. So they overnight’d it and we got the replacement drawer the following day.</p> <p>We diligently replaced the drawer and rebuilds started on those 7 drives in the disk array. Which, yes, means that one LUN was rebuilding without any redundancy. Like the one from the other disk array the week before. And as everyone probably guessed, things didn’t go that well the second time: that LUN stayed degraded, despite all the rebuild operations being done and all the physical disks’ states being "optimal". Turned out the interface and the internal controller state disagreed on the status of a drive. On our vendor’s suggestion, we replaced that drive, a new rebuild started, and then abruptly stopped mid-course: the state of the LUN was still "degraded".</p> <p>And then, we had the sensible yet completely foolish idea of calling vendor support on a weekend.</p> <p>Hilarity and data loss ensued.</p> <h3>Never trust your hardware vendor support on weekends</h3> <p>We were in a situation where a LUN was degraded, and a recently failed drive had just failed to rebuild, yet was showing up as "optimal" in the management interface. The vendor support technician then had the brilliant idea of forcefully "reviving" that drive. Which had the immediate effect of putting back online a drive that had only been partially reconstructed, <em>i.e.</em> on which 100% of the data had to be considered bit waste.<br> And the LUN stayed in that state, serving ridiculously out-of-sync, inaccurate and pretty much random data to our Lustre OSS servers for about 15 minutes. Fifteen minutes. Nine hundred full seconds. A lot of bad things can (and did) happen in 900 seconds.</p> <p>Luckily, the Lustre filesystem quickly realized it was being lied to, so it did the only sane thing to do: it blocked all I/O and set that device read-only. Of course, some filesystem-level corruption happened during the process.</p> <p>We had to bring that storage target down and check it multiple times with <code>fsck</code> to restore its data structure consistency.
About 1,500 corrupted entries were found, detached from the local filesystem map and stored in the <code>lost+found</code> directory. That means that all those 1,500 objects, which were previously part of files, were now orphaned from the filesystem, as it had no way of knowing what file they belonged to anymore. So it tossed them in <code>lost+found</code> as it couldn’t do much else with them.</p> <p>And on our cluster, users trying to access those files were kindly greeted with an error message which, as error messages sometimes go, was only remotely related to the matter at hand: <code>cannot allocate memory</code>.</p> <p>With (much better) support from our filesystem vendor, we were able to recover the vast majority of those 1,500 files, and re-attach them to the filesystem where they originally were. For Lustre admins, the magic word here is <code>ll_recover_lost_found_objs</code>.</p> <p>So in the end, we “only” lost 29 files in the battle. We contacted each one of the owners to let them know about the tragic fate of their files, and most of them barely flinched, their typical response being: "Oh yeah, I know that’s temporary storage anyway, let me upload a new copy of that file from my local machine".</p> <p>We know, we’re blessed with terrific users.</p> <h2>The tablecloth trick</h2> <p>Now, this was just the starters; we hadn’t really had a chance to tackle the real issue yet. We were merely absorbing the fallout of that initial drawer failure, but we hadn’t done anything to address the high failure rate of our disk drives.</p> <p>Our hardware vendor, well aware of the underlying reliability issue, as the same scenario had happened in other places too, kindly agreed to replace all of our remaining original disks. That is, about 300 of them:</p> <table> <thead> <tr><th style="text-align:right">disk array</th><th style="text-align:right">HDDs already replaced</th><th style="text-align:right">total HDDs</th><th style="text-align:right">HDDs to replace</th></tr> </thead> <tbody> <tr><td style="text-align:right">DA00</td><td style="text-align:right">16</td><td style="text-align:right">60</td><td style="text-align:right">44</td></tr> <tr><td style="text-align:right">DA01</td><td style="text-align:right">15</td><td style="text-align:right">60</td><td style="text-align:right">45</td></tr> <tr><td style="text-align:right">DA02</td><td style="text-align:right">14</td><td style="text-align:right">60</td><td style="text-align:right">46</td></tr> <tr><td style="text-align:right">DA03</td><td style="text-align:right">13</td><td style="text-align:right">60</td><td style="text-align:right">47</td></tr> <tr><td style="text-align:right">DA04</td><td style="text-align:right">8</td><td style="text-align:right">60</td><td style="text-align:right">52</td></tr> <tr><td style="text-align:right">DA05</td><td style="text-align:right">15</td><td style="text-align:right">60</td><td style="text-align:right">45</td></tr> <tr><td style="text-align:right"><strong>total</strong></td><td style="text-align:right"><strong>81</strong></td><td style="text-align:right"><strong>360</strong></td><td style="text-align:right"><strong>279</strong></td></tr> </tbody> </table> <p>The strategy devised by that same vendor was:</p> <blockquote> <p>"We’ll send you a whole new disk array, filled with new disks, and you’ll replicate your existing data there".</p> </blockquote> <p>To which we replied:</p> <blockquote> <p>“Uh, sorry, that won’t work.
You see, those arrays are part of a larger Lustre filesystem; we can’t really replicate data from one to another without a downtime. And we would need a downtime long enough to allow us to copy 240TB of data. Six times, 'cause you know, we have six arrays. Oh, and our users don’t like downtimes.”</p> </blockquote> <p>So we had to find another way.</p> <p>Our preference was to minimize manipulations on the filesystem and keep it online as much as possible during this big disk replacement operation. So we leaned toward the path of least resistance, and let the RAID controllers do what they do best: compute parities and write data. We ended up removing each one of those bad disks, one at a time, replacing it with a new disk, and letting the controller rebuild the LUN.</p> <p>Each rebuild operation took about 24 hours, so obviously, replacing ~300 disks one at a time wasn’t such a thrilling idea: assuming somebody would be around 24/7 to swap in a new drive as soon as the previous one finished, that would make the whole operation last almost a full year. Not very practical.</p> <p>So we settled on doing them in batches, replacing one disk in each of the 36 LUNs in each batch. That would allow the RAID controllers to rebuild several LUNs in parallel, and cut the overall length of the operation. Instead of 300 sequential 24-hour rebuilds, we would only need 5 waves of disk replacements, which shouldn’t take more than a couple weeks total.</p> <p>Should we mention the fact that our adored vendor suggested that, since we were using RAID6, if we wanted to speed things up even more, we could potentially consider replacing two drives at a time in each LUN, but that they wouldn’t recommend it? Nah, right, we shouldn’t.</p> <h3>Remove the disks, keep the data</h3> <p>So they went away and shipped us new disks. That’s where the "tablecloth trick" analogy is fully realized: we were indeed removing disk drives from our filesystem, while keeping the data intact, and inserting new disks underneath to replace them. Which would really be like pulling the tablecloth, putting a new one in place, and keeping the dishes intact.</p> <p><img src="http://67.media.tumblr.com/8b26967a16e1cf8e193b02c86c29f2e0/tumblr_inline_o91qsbH6pq1raprkq_500.gif" alt="tablecloth"></p> <p>But you know, things never go as planned, and while we started replacing that first batch of disks, we realized that those unreliable drives? Well, they were really unreliable.</p> <h3>When things go south</h3> <p>No less than five additional disks failed during that same first wave of rebuilds. Four of them in the same array (<code>DA00</code>). To make things worse, in one of those LUNs, one additional disk failed during the rebuild and then, unreadable sectors were encountered on a 3rd disk. Which led to data loss and a corrupted LUN.</p> <p>We contacted our vendor, which basically said: "LUN is lost, restore from backup". Ha ha! Of course we have backups for a 3PB Lustre filesystem, and of course we can restore an individual OST without wreaking complete havoc on the rest of the filesystem’s coherency. For some reason, our vendor support recommended that we delete the LUN, recreate it, and let the Lustre file system re-populate the data. We are still trying to understand what they meant.</p> <p>On the bright side, they engaged our software vendor to provide some more assistance at the filesystem level and devise a recovery strategy.
We had one of our own rolling already, and it turned out to be about the same.</p> <h3>Relocating files</h3> <p>Since we still had access to the LUN, our approach was to migrate all the files out of that LUN as quickly as possible and relocate them on other OSTs in Lustre, re-initialize the LUN at the RAID level, and then reformat it and re-insert it in the filesystem. Or, more precisely:</p> <ol> <li>deactivate the OST on the MDT to avoid new object creation,</li> <li>use <code>lfs_migrate</code> to relocate files out of that OST, using either Robinhood or the results of <code>lfs find</code> to identify files residing on that OST (the former can be out of date, the latter was quite slow),</li> <li>make sure the OST was empty (<code>lfs find</code> again),</li> <li>disable the OST on clients, so they didn’t use it anymore,</li> <li>reactivate the OST on the MDS to clear up orphaned objects (while the OST is disconnected from the MDT, file relocations are not synchronized to the OST, so objects are orphaned there and take up space unnecessarily),</li> <li>back up the OST configuration (so it could be recreated with the same parameters, including its index),</li> <li>reinitialize the LUN in the disk array, and retain its configuration (most importantly its WWID),</li> <li>reformat the OST with Lustre,</li> <li>restore the OST configuration (especially its index),</li> <li>reactivate the OST.</li> </ol>
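<p>For the Lustre-curious, here is roughly what the first few steps translate to on the command line. The file system name, OST index and device names below are placeholders, and the exact syntax varies between Lustre releases, so treat this as an outline rather than a recipe:</p>
<pre><code># 1. on the MDS: deactivate the OST so the MDT stops allocating new objects on it
lctl --device scratch-OST0024-osc-MDT0000 deactivate

# 2. on a client: find files with objects on that OST and move them to other OSTs
lfs find /scratch --obd scratch-OST0024_UUID -type f | lfs_migrate -y

# 3. check that nothing is left on that OST
lfs find /scratch --obd scratch-OST0024_UUID | wc -l

# 4. on the clients: stop using that OST altogether
lctl set_param osc.scratch-OST0024-*.active=0

# 5. back on the MDS: reactivate the OST so orphaned objects can be cleaned up
lctl --device scratch-OST0024-osc-MDT0000 activate
</code></pre>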
<p>What can go wrong in a 10-step procedure? Turns out, it kind of all stopped at step 1.</p> <h4>Making sure nobody writes files to a LUN anymore</h4> <p>In order to be able to migrate all the files from an OST, you need to make sure that nobody can write new files to it anymore. How could you empty an OST if new files keep being created on it?<br> There are several approaches to this, but it took us a few tries to get it right where we wanted it to be.</p> <p>First, you can try to ‘deactivate’ the OST by making it read-only on the MDT. It means that users can still read the existing files on the OST, but the MDT won’t consider it for new file creations. Sounds great, except for one detail: when you do this, the OST is disconnected from the MDT, so inodes occupied by files that are being migrated are not reported as freed up to the MDT. The consequence is that the MDT still thinks that the inodes are in use, and you end up in a de-synchronized state, with orphaned inodes on your OST. Not good.</p> <p>So you need, at some point, to reconnect your OST to the MDT. Except as soon as you do this, new files get created on it, and you need to deactivate the OST, migrate them again, and bam, new orphan inodes again. Back to square one.</p> <p>Another method is to mark the OST as "degraded", which is precisely designed to handle such cases: an OST undergoing maintenance or a RAID rebuild, during which it shouldn’t be used to create new files. So, we went ahead and marked our OST as "degraded". Until we realized that files were still being created on it. It turns out that this was because of some uneven usage in our OSTs (they were added to the filesystem over time, so they were not all filled to the same level): if there’s too much unbalanced utilization among OSTs, the Lustre QOS allocator will ignore the "degraded" flag and prioritize rebalancing usage over obeying OST degradation flags.</p> <p>Our top-notch filesystem vendor support suggested an internal setting to set on the OST (<code>fail_loc=0x229</code>, don’t try this at home) to artificially mark the OST as "out-of-space", which would have the double benefit of leaving it connected to the MDT for inode cleanup and preventing new file creation there. Unfortunately, this setting had the unexpected side effect of making the load spike on the MDS, practically rendering the whole filesystem unusable.</p> <p>So we ended up deciding to temporarily sacrifice good load balancing across OSTs, and disabled the QOS allocator. This allowed us to mark our OST as "degraded" and keep it connected to the MDT, so inodes associated with migrated files would effectively be cleaned up, while still preventing new file creation. This worked great.</p>
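<p>For reference, and with placeholder names again (the exact parameter paths depend on the Lustre release, <code>lov</code> vs. <code>lod</code> in particular), that combination boils down to something like:</p>
<pre><code># on the OSS serving the target: flag the OST as degraded
lctl set_param obdfilter.scratch-OST0024.degraded=1

# on the MDS: raise the QOS threshold to 100% to force round-robin allocation,
# effectively disabling the space-balancing (QOS) allocator
lctl set_param lov.scratch-MDT0000-mdtlov.qos_threshold_rr=100
</code></pre>
<p>Restoring <code>qos_threshold_rr</code> to its default re-enables balanced allocation once the migration is over.</p>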
<p><img src="https://cloud.githubusercontent.com/assets/186807/17631228/14e610e8-6078-11e6-814d-f874d07ef1ff.png" alt="lfs_migrates"></p> <p>We let our migration complete, and at the end both OSTs were completely empty, devoid of any file.</p> <h3>Zombie LUN</h3> <p><em>Because any good story needs zombies.</em></p> <p>Once we had finished emptying our OSTs, we then needed to fix them at the RAID level. Because, remember, everything went to hell after multiple disk failures during a LUN rebuild. Meaning that in their current state, those two LUNs were unusable and had to be re-initialized. We had high hopes we would be able to do this from the disk array management tools. Unfortunately, our hardware vendor didn’t think it would be possible, and strongly recommended destroying the LUN and rebuilding it with the same disks.</p> <p>The problem with that approach is that it would have generated a different identifier for our LUNs, meaning we would have had to change the configuration of our multipath layer, and more importantly, swap the old WWIDs with the new ones in our Lustre management tool. Which is not supported.</p> <p>Thing is, we’re kind of stubborn. And we didn’t want to change WWIDs. So we looked for a way to re-initialize those LUNs in place. Sure enough, failing multiple drives in the LUN rendered it inoperable. And nothing in the GUI seemed to be possible from there, besides "calling support for assistance". And you know, we tried that before, so no thanks, we’ll pass.</p> <p>Finally, exploring the CLI options, we found one (<code>revive diskGroup</code>) that did exactly what we were looking for: after replacing the 2 defective disks (which had made the LUN fail), we revived it from the CLI, and it happily sprang back to life. With all its parameters intact, so from the servers’ point of view, it was like nothing ever happened.</p> <h3>Restore Lustre</h3> <p>So, all that was left to do was to reformat the OSTs and restore the parameters we had backed up before failing and reviving the LUNs.</p> <h2>Wrap up</h2> <p>Everything was a smooth ride from there. While working on repairing our two failed OSTs, we were continuously replacing those ~300 defective hard drives, one at a time, and monitoring the rebuild processes. At any given time, we had something like 36 LUNs rebuilding (6 arrays, 6 LUNs each) to maximize throughput.</p> <h3>Disk replacement</h3> <p>Our hardware vendor was sending us replacement drives in batches, and we replaced 1 disk in each LUN pretty much every day for about 3 weeks.<br> We built a tool to follow the replacements and select the next disks to replace (obviously placement was important, as we didn’t want to remove multiple disks from the same LUN). The tool allowed us to see the number of disks left to replace and the status of current rebuilds, and when possible, it selected the next disks to replace by making them blink in the disk arrays.</p> <h3>The end</h3> <p>Just because that’s how lucky we are, another drawer failed during the last rounds of disk replacements. It took an extra few days to get a replacement on site and replace it. Fortunately, no unreadable sectors were encountered during the recovery.</p> <p>It took a few more days to clear out the remaining drawer and controller errors and to make sure that everything was stable and in working order. The official end of the operation was declared on May 17th, 2016, about 4 months after the initial double-disk failure.</p> <p>We definitely learned a lot in the process, way more than we could ever have dared to ask for. And it was quite the adventure, the kind that we hope will not happen again. But considering everything that happened, we’re very glad the damage was limited to a handful of files and didn’t have a much broader impact.</p> Kilian Cavalotti[email protected]