3.3 PFlops: Sherlock hits expansion milestone

Sherlock is a traditional High-Performance Computing cluster in many respects. But unlike most similarly sized clusters, where hardware is purchased all at once and refreshed every few years, it is in constant evolution. Almost like a living organism, it changes all the time: mostly expanding, as individual PIs, research groups, labs and even whole Schools contribute computing resources to the system, but also sometimes contracting, when older equipment is retired.

A significant expansion milestone

A few days ago, Sherlock reached a major expansion milestone, largely owing to significant purchases from the School of Earth, Energy & Environmental Sciences, but also thanks to multiple existing owner groups who decided to renew their investment in Sherlock by purchasing additional hardware.

With these recent additions, Sherlock reached a theoretical peak performance of over 3 Petaflops, or 3 thousand million million (3 × 10^15) floating-point operations per second. That would place it around the 150th position in the most recent TOP500 list of the most powerful computer systems in the world.

Among the newly added nodes are a number of SH3_G8TF64 nodes, each featuring 128 CPU cores, 1TB of RAM, 8x A100 SXM4 GPUs (NVLink) and two Infiniband HDR interfaces providing 400Gb/s of interconnect bandwidth, both for storage and inter-node communication. Those nodes alone provide over half a Petaflop of computing power.
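For readers who want to check the orders of magnitude, here is a minimal Python sketch of the unit arithmetic behind those figures. It only uses numbers quoted in this post (the 3.3 PFlops headline and the "over half a Petaflop" from the new GPU nodes, taken here as exactly 0.5). The function and variable names are ours, for illustration only, and the resulting share is a rough estimate, not an official figure.

```python
# 1 petaflop/s = 10^15 floating-point operations per second (SI prefix "peta")
PETA = 10**15


def flops_from_petaflops(pflops):
    """Convert a performance figure in PFlops to plain FLOP/s."""
    return pflops * PETA


# Figures quoted in this post (the GPU value is a lower bound: "over half a Petaflop")
total_pflops = 3.3
gpu_nodes_pflops = 0.5

total_flops = flops_from_petaflops(total_pflops)
gpu_share = gpu_nodes_pflops / total_pflops

print(f"{total_pflops} PFlops is about {total_flops:.1e} FLOP/s")
print(f"New GPU nodes: roughly {gpu_share:.0%} of the theoretical total")
```

In other words, under these assumptions the new GPU nodes account for roughly one seventh of Sherlock's theoretical computing power.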

Sherlock now features over 1,700 compute nodes, occupying 45 data-center racks, and consuming close to half a megawatt of power. Over 44,000 CPU cores, more than 120 Infiniband switches and close to 20 miles of cables help support the daily computing activities of over 5,000 users. For even more facts and numbers, check out the Sherlock Facts page!

A steady growth

Since its first days in 2014, with its initial 120 nodes, Sherlock has been growing at a steady pace. Three generations and as many Infiniband fabrics later, and after a few months of slowdown at the beginning of 2020, expansion has resumed and is going stronger than ever:

The road ahead

To keep expanding Sherlock and continue serving the computing needs of the Stanford research community, the rack space used by first-generation Sherlock nodes needs to be reclaimed to make room for the next generation. Those 1st-gen nodes have been running well past their initial service life of 4 years, and in most cases, we’ve even been able to keep them running for an extra year. But with data-center space being the hot property it has now become, and since demand for new nodes is not exactly dwindling, we’ll be starting to retire the older Sherlock nodes to accommodate the ever-increasing requests for more computing power. We’ve started working on renewal plans with those node owners, and the process is already underway.

So for a while, Sherlock will shrink in size as old nodes are retired, before it can start growing again!

Catalog changes

As we move forward, the Sherlock Compute Nodes Catalog is also evolving, to follow the latest technological trends, and to adapt to the computing needs of our research community.

As part of this evolution, the recently announced SH3_G4FP32 configuration is sadly no longer available, as vendors suddenly and globally discontinued the consumer-grade GPU model that powered it. They don’t have plans to bring back anything comparable, so that configuration had to be pulled from the catalog.

On a more positive note, a significant and exciting catalog refresh is coming up, and will be announced soon. Stay tuned! 🤫

As usual, we want to sincerely thank every one of you, Sherlock users, for your patience when things break, your extraordinary motivation and your continuous support. We’re proud of supporting your amazing work, and Sherlock simply wouldn’t exist without you.

Happy computing and don’t hesitate to reach out if you have any questions!