ClusterShell on Sherlock
timestamp1670036242756
Ever wondered how your jobs were doing while they were running? Keeping an eye on a log file is nice, but what if you could quickly gather process lists, usage metrics and other data points from all the nodes your multi-node jobs are running on, all at once?
Enter ClusterShell, the best parallel shell application (and library!) of its kind.
With ClusterShell on Sherlock, you can quickly run a command on all the nodes your job is running on and see the output live, so you can check on your applications and processes in real time instead of waiting for your job to end to see how it did. And thanks to its tight integration with the job scheduler, there's no need to fiddle with manual node lists anymore: all it needs is a job ID!
You allocated a few nodes in an interactive session and want to distribute some files to each node’s local storage devices? Check: ClusterShell has a copy mode just for this.
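As a rough sketch of what copy mode can look like, the small wrapper below pushes one local file to every node of a job. The function name copy_to_job_nodes is made up for illustration, and the file name and destination directory in the example call are placeholders to adapt to your own session; only the clush options (-w, --copy, --dest) and the @job: node set come from ClusterShell and the Sherlock setup described above.

```shell
# Hypothetical helper (not part of ClusterShell): copy a local file to
# all the nodes allocated to a job.
#   $1: job ID
#   $2: local file to send
#   $3: destination directory on each node
copy_to_job_nodes() {
    # --copy sends the local file to every node in the set,
    # --dest sets the path it lands in on the remote side.
    clush -w "@job:$1" --copy "$2" --dest "$3"
}

# Example call (placeholder file and destination):
#   copy_to_job_nodes "$JOBID" input.dat /tmp
```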
Want to double-check that your processes are correctly laid out? Check: you can run a quick command to check the process tree across the nodes allocated to your job with:
$ clush -w @job:$JOBID pstree -au $USER
and verify that all your processes are running correctly.
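On larger jobs, per-node output can get long. One possible approach, sketched below, is to let clush fold identical output together with its -b option, so that nodes reporting the same result are shown as a single group. The function name count_my_procs is hypothetical, and the sketch assumes pgrep is available on the compute nodes; it counts the processes you own on each node of the job.

```shell
# Hypothetical helper (not part of ClusterShell): count your processes
# on each node of a job, grouping nodes that report the same count.
#   $1: job ID
count_my_procs() {
    # -b gathers nodes with identical output into one block;
    # pgrep -c prints the number of matching processes,
    # -u restricts the match to the given user.
    clush -b -w "@job:$1" pgrep -c -u "$(id -un)"
}

# Example call:
#   count_my_procs "$JOBID"
```

A node that reports a different count from the others stands out right away, which is a quick hint that the process layout may not be what you expected.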
You’ll find more details and examples in our Sherlock documentation, at https://www.sherlock.stanford.edu/docs/software/using/clustershell
Questions, ideas, or suggestions? Don’t hesitate to reach out to [email protected] to let us know!