Abstract
The theme of this blog is “Simplicity”. Today’s HPC user has an overabundance of choices when it comes to HPC schedulers, clouds, infrastructure in those clouds, and data management solutions. Let's simplify it! Using CycleCloud as the nucleus, my intent is to show how simple it is to deploy a Slurm cluster on the Hammerspace data platform using the standard NFS protocol. And for good measure, we will use a new feature in CycleCloud called Scheduled Events, which will automatically unmount the NFS shares when the VMs are shut down.
CycleCloud and SLURM
Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that delivers a fully managed Slurm workload environment on Azure, with no manual infrastructure or Slurm configuration required.
To get started, go to the Azure Marketplace and search for “Azure CycleCloud for Slurm”.
I have not provided a detailed breakdown of the deployment steps for Azure CycleCloud Workspace for Slurm, as Kiran Buchetti does an excellent job of that in his blog here. It is a worthwhile read, so please take a minute to review it.
Getting back to the theme of this blog, the simplicity of Azure CycleCloud Workspace for Slurm is one of its most important value propositions. Here are my top reasons why:
- CycleCloud Workspace for Slurm is a single template for creating an entire cluster. Without it, a user would have to manually install CycleCloud, install Slurm, configure the compute partitions, attach storage, and so on. Instead, you fill out a Marketplace template and a working cluster is live in 15-20 minutes.
- Preconfigured best practices: prebuilt Slurm nodes, partitions, and network and security rules are set up for the end user. No deep knowledge of HPC or Slurm is required!
- Automatic cost control: Workspace for Slurm is designed to deploy compute nodes only when a job is submitted, and it automatically shuts them down after the job completes. Moreover, Workspace for Slurm comes with preconfigured partitions (a GPU partition, an HTC spot partition), so end users can submit jobs to the right partition based on performance and budget.
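To make the partition choice concrete, here is a minimal sketch of a Slurm batch script targeting an assumed HTC spot partition (the partition name is illustrative; run `sinfo` on your cluster to list the real ones):

```shell
#!/bin/bash
#SBATCH --partition=htc       # assumed spot/HTC partition name; check `sinfo`
#SBATCH --ntasks=1
#SBATCH --job-name=demo
# Slurm reads the #SBATCH lines at submit time; to bash they are plain
# comments, so this file remains an ordinary shell script.
echo "running on $(hostname)"
```

Submit it with `sbatch demo.sh`: CycleCloud provisions a node in that partition, runs the job, and scales the node away once it completes.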
Now that we have a cluster built, let's turn our attention to data management. I have chosen to highlight the Hammerspace Data Platform in this blog. Why? Because it is a powerful solution that provides high performance and global access to CycleCloud HPC/AI nodes. Sticking true to our theme... it is also incredibly simple to integrate with CycleCloud.
Who is Hammerspace?
Before discussing integration, let's take a minute to introduce you to Hammerspace. Hammerspace is a software-defined data orchestration platform that provides a global file system across on-premises infrastructure and public clouds. It enables users and applications to access and manage unstructured data anywhere at any time. This all comes without the need to copy, migrate, or manually manage data. Hammerspace’s core philosophy is that “Data should follow the user, not the other way around”.
Great information on Hammerspace at the following link: Hammerspace Whitepapers
Linux Native
Hammerspace's foundation as a data platform is built natively into the Linux kernel, requiring no additional software installation on any node. The company’s goal is to deliver a high-performance plug-and-play model, using standard NFS protocols (v3, v4, pNFS), that makes scalable, high-performance file access familiar to any Linux system administrator.
Let’s break down why the native kernel approach is important to a CycleCloud Workspace for Slurm user:
- POSIX-compliant, high-performance file access with no code changes required. No agents are needed on the hosts and no additional CycleCloud templates are needed. From a CycleCloud perspective, Hammerspace is simply an “external NFS”.
- No re-staging of jobs required. It's NFS: all the compute nodes can access the same data, regardless of where it resides. The days of copying and moving data between compute nodes are over.
- Seamless mounting. Native NFS mounts can be added easily in CycleCloud, and files are instantly available for Slurm jobs with no unnecessary job prep time. We will take a deeper dive into this topic in the next section.
How to mount NFS
Native NFS mounts can be added easily to CycleCloud, as in the example below.
NFS mounts can be entered on the Marketplace template or, alternatively, via the scheduler. For Hammerspace, click on External NFS, enter the IP of the Hammerspace Anvil metadata server, add your mount options, and that’s it.
The example below uses NFS mounts for /sched and /data.
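Under the hood, those template entries amount to standard NFS mounts on every node. A sketch of the resulting fstab-style entries, with a hypothetical Anvil IP and common mount options (your IP and options will differ):

```
# /etc/fstab entries the provisioned nodes end up with (IP is hypothetical)
10.1.0.4:/sched  /sched  nfs  rw,hard,vers=4.1,tcp  0 0
10.1.0.4:/data   /data   nfs  rw,hard,vers=4.1,tcp  0 0
```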
Once the nodes are provisioned, log in to any of them and the shares will be mounted.
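A quick way to confirm the shares from the command line is to read the kernel's mount table. A small sketch (the `nfs_mounts` helper is mine for illustration, not a CycleCloud or Hammerspace tool):

```shell
#!/bin/bash
# Print the mount points of all NFS/NFSv4 mounts listed in a mounts table
# (defaults to the live kernel table, /proc/mounts).
nfs_mounts() {
    awk '$3 ~ /^nfs/ { print $2 }' "${1:-/proc/mounts}"
}

# On a provisioned node this should include /sched and /data
nfs_mounts
```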
On the Hammerspace user interface, we see the /sched share deployed along with its relevant IOPS, growth, and file statistics.
That’s it. That’s all it takes to mount a powerful parallel file system to CycleCloud. Now let's look at the benefits of a Hammerspace/CycleCloud implementation:
- Simplified data management: CycleCloud orchestrates HPC infrastructure on demand, and Hammerspace ensures that the data is immediately available whenever the compute comes up. Hammerspace will also place data in the right location or tier based on its policy-driven management, reducing the need for manual scripting to move data onto lower-cost tiers of storage.
- No application refactoring: applications need no additional agents and no code changes to benefit from a global access system like Hammerspace.
CycleCloud Scheduled Events
The last piece of the story is the shutdown/termination process. The HPC jobs are complete, and now it is time to shut down the nodes and save costs. But what happens to the NFS mounts on each node? Prior to CycleCloud 8.2.2, if shares were not unmounted properly, NFS mounts could hang indefinitely waiting for IO. Users can now take advantage of Scheduled Events in CycleCloud, a feature that lets you place a script on your HPC nodes that is automatically executed when a supported event occurs. In our case, the supported event is node termination.
The following is taken straight from the CycleCloud documentation here.
CycleCloud supports enabling Terminate Notification on scaleset VMs (e.g., execute nodes). To do this, set EnableTerminateNotification to true on the nodearray. This will enable it for scalesets created for this nodearray. To override the timeout allowed, you can set TerminateNotificationTimeout to a new time. For example, in a cluster template:
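A sketch of what that looks like in a cluster template (the nodearray name is illustrative; the two attribute names come from the CycleCloud documentation quoted above):

```
[[nodearray execute]]
    # Enable the Terminate scheduled event on the scaleset VMs
    EnableTerminateNotification = true
    # Optional: allow up to 10 minutes for terminate-time scripts to finish
    TerminateNotificationTimeout = 10m
```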
The script to unmount an NFS share during a terminate event is trivial:
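A minimal sketch, assuming the /sched and /data shares from earlier (the script name is illustrative; `umount -l` detaches lazily so in-flight IO cannot hang the shutdown):

```shell
#!/bin/bash
# onTerminate.sh (illustrative name): unmount NFS shares on node termination.
set -uo pipefail

unmount_shares() {
    for m in "$@"; do
        if mountpoint -q "$m"; then
            # Lazy unmount: detach immediately, clean up once IO drains
            umount -l "$m" && echo "unmounted $m"
        else
            echo "skipping $m (not mounted)"
        fi
    done
}

unmount_shares /sched /data
```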
Add it to your project's spec:
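Assuming a standard cluster-init project layout (the project name is illustrative, and the exact file name CycleCloud expects for terminate-event scripts can vary by version, so check the Scheduled Events docs), the script lives under the spec's scripts directory:

```
hammerspace-cleanup/            # created with `cyclecloud project init`
├── project.ini
└── specs/
    └── default/
        └── cluster-init/
            └── scripts/
                └── onTerminate.sh   # terminate-time unmount script
```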
Attach it to the shutdown task:
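A sketch of wiring the project into the nodearray so the script ships with every execute node (project name and version are illustrative):

```
[[nodearray execute]]
    EnableTerminateNotification = true
        [[[cluster-init hammerspace-cleanup:default:1.0.0]]]
```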
Simple! Now a user can run a job and terminate the nodes after job completion without worrying about what that does to the backend storage. No more cleanup! This means cost savings, operational efficiency, and resource cleanliness (no more stale Azure resources like IPs, NICs, and disks cluttering up a subscription).
Conclusion
Azure CycleCloud, along with Slurm and the Hammerspace Data Platform, provides a powerful, scalable, and cost-efficient solution for HPC in the cloud. CycleCloud automates the provisioning (and elastic scaling up and down) of the infrastructure, Slurm manages job scheduling, and Hammerspace delivers a global data environment with high-performance parallel NFS.
Ultimately, the most important element of the solution is its simplicity. Hammerspace enables HPC organizations to focus on solving core problems instead of the headache of managing infrastructure, setup, and unpredictable storage mounts. By reducing the administrative overhead needed to run HPC environments, the solution described in this blog will help organizations accelerate time to results, lower costs, and drive innovation across industries.