A newer, faster and better /scratch

As we just announced, Sherlock now features a brand new storage system for /scratch. But what was the old system, what does the new one look like, and how did the move happen? Read on to find out!

The old

Since its early days, Sherlock ran its /scratch filesystem on a storage system that was donated by Intel and Dell.

Dubbed Regal, it was one of the key components of the Sherlock cluster when we started it in early 2014, with an initial footprint of about 100 compute nodes. Its very existence allowed us to scale the cluster to more than 1,500 nodes today, almost entirely through Faculty and PI contributions to its condominium model. That’s 15x growth in 5 years, and adoption has been spectacular.

Regal was initially just over 1PB when it was deployed in May 2013, which was quite substantial at the time. And as with the compute part of the cluster, its modular design allowed us to expand it to over 3PB with contributions from individual research groups.

We had a number of adventures with that system, including a large-scale disk replacement operation, where we replaced about a petabyte of hard drives in production while continuing to serve files to users, and a literal drawer explosion in one of the disk arrays!

It’s been fun, and again, invaluable to our users.

But the time has come to retire it and replace it with a newer, faster and better solution, to accommodate the ever-growing storage needs of Sherlock’s ever-growing community.

The new

This year, we stood up a completely new and separate /scratch filesystem for Sherlock.

Nicknamed Fir (we like trees), this new storage system features:

multiple metadata servers and faster metadata storage for better responsiveness with interactive operations,
faster object storage servers,
a faster backend interconnect, for lower latency operations across storage servers,
more and faster storage routers to provide more bandwidth from Sherlock to /scratch,
more space to share amongst all Sherlock users,
a newer version of Lustre which provides:
- improved client performance,
- dynamic file striping, to automatically adapt file layout and I/O performance to a file’s size,
- and much more!
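
To make the dynamic striping idea more concrete, here is a minimal sketch of how a size-dependent layout can work: the file is described by a list of extents, and each extent is striped over more storage targets than the previous one, so small files stay on a single target while large files spread out as they grow. Lustre calls this kind of layout a progressive file layout. The thresholds and stripe counts below are made up for illustration and are not Fir’s actual settings; the function names are hypothetical too.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3

# Hypothetical layout: (extent end in bytes, stripe count).
# An end of None means "until the end of the file".
# These numbers are illustrative assumptions, not Fir's configuration.
LAYOUT = [
    (128 * MiB, 1),   # first 128 MiB: a single storage target
    (4 * GiB, 4),     # up to 4 GiB: striped over 4 targets
    (None, 16),       # everything beyond: striped over 16 targets
]


def stripe_count_at(offset, layout=LAYOUT):
    """Return the stripe count of the extent that holds byte `offset`."""
    for end, count in layout:
        if end is None or offset < end:
            return count
    raise ValueError("layout does not cover this offset")


def stripe_count_for_size(size, layout=LAYOUT):
    """Return the stripe count of the extent holding the file's last byte."""
    if size <= 0:
        return layout[0][1]
    return stripe_count_at(size - 1, layout)


if __name__ == "__main__":
    for size in (10 * MiB, 1 * GiB, 100 * GiB):
        print(size, "bytes ->", stripe_count_for_size(size), "stripe(s)")
```

With a layout like this, users do not have to choose a stripe count up front: the same default works for a small configuration file and for a multi-terabyte dataset.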

And not to brag, but Fir was ranked #15 in the 10-node challenge category of the IO-500 list of the fastest storage systems in the world, released at SC’19. So yes, it’s decently fast.

The migration

Now, usually, when a new filesystem is made available on a computing system, there are two approaches:

One is to make the new system available under a new mount point (like /scratch2) and tell users: “here’s the new filesystem, the old one will go away soon, you have until next Monday to get your files there and update all your scripts.”
This usually results in a lot of I/O traffic going on at once from all the users rushing to copy their data to the new space, potential mistakes, confusion, and in the end, a lot of frustration, additional work and unnecessary stress on everyone. Not good.

The other one is for sysadmins to copy all of the existing data from the old system to the new one in the background, in several passes, and then schedule a (usually long) downtime to run a last synchronization pass and replace the old filesystem with the new one under the same mount point (/scratch).
This also puts significant load on the filesystem while the synchronization passes are running, taking I/O resources away from legitimate user jobs. It’s usually a very long process, and in the end it brings old and abandoned files over to the new storage system, wasting precious space. Not optimal either.

So we decided to take another route, and devised a new scheme. We spent some time (and had some fun!) designing and developing a new kind of overlay layer, to bridge the gap between Regal and Fir, and to transparently migrate user data from one to the other.

We (aptly) named this layer migratefs and open-sourced it at:
https://github.com/stanford-rc/fuse-migratefs

The main idea of migratefs is to take advantage of user activity to:

distribute the data transfer tasks across all of the cluster nodes, to reduce the overall migration time,
only migrate data that is actively in use, and leave older files that are neither accessed nor modified on the old storage system, resulting in a new storage system that only stores relevant data,
migrate all the user data transparently, without any downtime.
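
The idea of migrating on access can be sketched in a few lines. This is a toy model, not how migratefs itself is implemented: it assumes two plain directories standing in for the old and new storage systems, and it assumes that any access through the layer triggers the migration (the real tool may use different triggers). The class and function names (`MigratingOverlay`, `demo`) are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


class MigratingOverlay:
    """Toy two-layer overlay: `lower` is the old store, `upper` the new one."""

    def __init__(self, lower, upper):
        self.lower = Path(lower)
        self.upper = Path(upper)

    def _migrate(self, relpath):
        """Return the path in the upper layer, moving the file up if needed."""
        new = self.upper / relpath
        old = self.lower / relpath
        if not new.exists() and old.exists():
            new.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(old, new)  # copy data and metadata to the new store
            old.unlink()            # then free the space on the old store
        return new

    def open(self, relpath, mode="r"):
        """Open a file through the overlay; the access migrates it first."""
        new = self._migrate(relpath)
        new.parent.mkdir(parents=True, exist_ok=True)
        return open(new, mode)

    def listdir(self):
        """Merged view: files from both layers appear under one namespace."""
        names = set()
        for layer in (self.lower, self.upper):
            for p in layer.rglob("*"):
                if p.is_file():
                    names.add(str(p.relative_to(layer)))
        return names


def demo():
    """Read one file through the overlay and report where each file lives."""
    with tempfile.TemporaryDirectory() as tmp:
        lower, upper = Path(tmp, "regal"), Path(tmp, "fir")
        lower.mkdir()
        upper.mkdir()
        (lower / "active.dat").write_text("results")
        (lower / "stale.dat").write_text("forgotten")

        fs = MigratingOverlay(lower, upper)
        before = fs.listdir()
        with fs.open("active.dat") as f:  # a user job reads its data
            content = f.read()

        return {
            "content": content,
            "view_unchanged": fs.listdir() == before,
            "active_on_new": (upper / "active.dat").exists(),
            "active_on_old": (lower / "active.dat").exists(),
            "stale_on_old": (lower / "stale.dat").exists(),
            "stale_on_new": (upper / "stale.dat").exists(),
        }


if __name__ == "__main__":
    print(demo())
```

In this sketch, the file that a job touches ends up on the new store, the untouched file stays on the old one, and the merged listing looks the same before and after. Because the copy happens on whichever node performs the access, the transfer work is naturally spread across the cluster.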

So over the last few months, all of the active user data on Regal has been seamlessly migrated to Fir, without users having to modify any of their job scripts, and all without any downtime.

This is why, if you’re using $SCRATCH or $GROUP_SCRATCH today, you are already using the new storage system: all your active data is there, ready to be used in your compute jobs.

Next steps

Now, Regal has been emptied of all of its data and has been retired. It’s currently being un-racked to make room for future Sherlock developments. And stay tuned, because… epic changes are coming!

Sherlock changelog

www.sherlock.stanford.edu
