CycleCloud + Hammerspace
Abstract

The theme of this blog is "Simplicity". Today's HPC user has an overabundance of choices when it comes to HPC schedulers, clouds, infrastructure in those clouds, and data management solutions. Let's simplify it! Using CycleCloud as the nucleus, my intent is to show how simple it is to deploy a Slurm cluster on the Hammerspace data platform while using a standard NFS protocol. And for good measure, we will use a new feature in CycleCloud called Scheduled Events, which will automatically unmount the NFS share when the VMs are shut down.

CycleCloud and SLURM

Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that delivers a fully managed Slurm workload environment on Azure, without requiring manually configured infrastructure or Slurm settings. To get started, go to the Azure Marketplace and search for "Azure CycleCloud for Slurm".

I have not provided a detailed breakdown of the steps for Azure CycleCloud Workspace for Slurm, as Kiran Buchetti does an excellent job of that in the blog here. It is a worthwhile read, so please take a minute to review.

Getting back to the theme of this blog, simplicity is one of the most important value propositions of Azure CycleCloud Workspace for Slurm. Here are my top reasons why:

- CycleCloud Workspace for Slurm is a single template for entire cluster creation. Without it, a user would have to manually install CycleCloud, install Slurm, configure the compute partitions, attach storage, and so on. Instead, you fill out a Marketplace template and a working cluster is live in 15-20 minutes.
- Preconfigured best practices: prebuilt Slurm nodes, partitions, network, and security rules are done for the end user. No deep knowledge of HPC or Slurm is required!
- Automatic cost control: Workspace for Slurm is designed to deploy compute only when a job is submitted, and the solution will automatically shut nodes down after a job completes. Moreover, Workspace for Slurm comes with preconfigured partitions (GPU partition, HTC spot partition), so end users can submit jobs to the right partition based on performance and budget.

Now that we have a cluster built, let's turn our attention to data management. I have chosen to highlight the Hammerspace Data Platform in this blog. Why? Namely, because it is a powerful solution that provides high performance and global access to CycleCloud HPC/AI nodes. Sticking true to our theme, it is also incredibly simple to integrate with CycleCloud.

Who is Hammerspace?

Before discussing integration, let's take a minute to introduce you to Hammerspace. Hammerspace is a software-defined data orchestration platform that provides a global file system across on-premises infrastructure and public clouds. It enables users and applications to access and manage unstructured data anywhere, at any time, without the need to copy, migrate, or manually manage data. Hammerspace's core philosophy is that "data should follow the user, not the other way around". There is great information on Hammerspace at the following link: Hammerspace Whitepapers

Linux Native

Hammerspace's foundation as a data platform is built natively into the Linux kernel, requiring no additional software installation on any nodes. The company's goal is to deliver a high-performance, plug-and-play model, using standard NFS protocols (v3, v4, pNFS), that makes high-performance, scalable file access familiar to any Linux system administrator.
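To make the "standard NFS" point concrete, here is a minimal, hypothetical example of what mounting a Hammerspace-backed share on a compute node looks like. The server address, export path, and mount options are illustrative assumptions, not values from this deployment; CycleCloud performs the equivalent mount automatically once the share is configured in the template (see the next section).

```bash
# Hypothetical example only: the address 10.1.0.4 and the export /data are placeholders.
sudo mkdir -p /data
sudo mount -t nfs -o vers=4.2,nconnect=8,hard,noatime 10.1.0.4:/data /data

# Confirm the share is visible like any other Linux filesystem
df -h /data
```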
Let's break down why the native kernel approach is important to a CycleCloud Workspace for Slurm user:

- POSIX-compliant, high-performance file access with no code changes required.
- No agents needed on the hosts, and no additional CycleCloud templates needed. From a CycleCloud perspective, Hammerspace is simply an "external NFS" target.
- No re-staging of jobs required. It's NFS, so all the compute nodes can access the same data regardless of where it resides. The days of copying and moving data between compute nodes are over.
- Seamless mounting. Native NFS mounts can be added easily in CycleCloud, and files are instantly available for Slurm jobs with no unnecessary job prep time. We will take a deeper dive into this topic in the next section.

How to export NFS

Native NFS mounts can be added easily to CycleCloud, as in the example below. NFS mounts can be entered on the Marketplace template or, alternatively, via the scheduler. For Hammerspace, click on External NFS, enter the IP of the Hammerspace Anvil metadata server, add your mount options, and that's it. The example below uses NFS mounts for /sched and /data.

Once the nodes are provisioned, log into any of the nodes and the shares will be mounted. On the Hammerspace user interface, we see the /sched share deployed with the relevant IOPS, growth, and file statistics.

That's it. That's all it takes to mount a powerful parallel file system to CycleCloud. Now let's look at the benefits of a Hammerspace/CycleCloud implementation:

- Simplified data management: CycleCloud orchestrates HPC infrastructure on demand, and Hammerspace ensures that the data is immediately available whenever the compute comes up. Hammerspace will also place data in the right location or tier based on its policy-driven management, which reduces the need for manual scripting to put data on lower-cost tiers of storage.
- No application refactoring: applications do not need additional agents, nor do they have to change, to benefit from a global-access system like Hammerspace.

CycleCloud Scheduled Events

The last piece of the story is the shutdown/termination process. The HPC jobs are complete, and now it is time to shut down the nodes and save costs. What happens to the NFS mounts that are on each node? Prior to CycleCloud 8.2.2, if nodes were not unmounted properly, NFS mounts could hang indefinitely waiting for I/O. Users can now take advantage of "Scheduled Events" in CycleCloud, a feature that lets you put a script on your HPC nodes that is executed automatically when a supported event occurs. In our case, the supported event is a node termination. The following is taken straight from the CycleCloud documentation here:

CycleCloud supports enabling Terminate Notification on scaleset VMs (e.g., execute nodes). To do this, set EnableTerminateNotification to true on the nodearray. This will enable it for scalesets created for this nodearray. To override the timeout allowed, you can set TerminateNotificationTimeout to a new time. For example, in a cluster template (see the sketch at the end of this section).

The script to unmount an NFS share during a terminate event is not trivial, but using it is: add it to your CycleCloud project spec and attach it to the shutdown task. Simple! Now a user can run a job and terminate the nodes after job completion without worrying about what it does to the backend storage. No more cleanup! This is cost savings, operational efficiency, and resource cleanliness (no more stale Azure resources like IPs, NICs, and disks cluttering up a subscription).
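To fill in the two snippets referenced above, here is a rough sketch. The nodearray name, timeout value, and mount points are assumptions for illustration, and a production unmount script may need to do more (for example, draining in-flight I/O) than this minimal version; consult the CycleCloud documentation for the exact template syntax.

```ini
# Sketch of a cluster template fragment (nodearray name and timeout value are assumptions)
[[nodearray execute]]
EnableTerminateNotification = true
TerminateNotificationTimeout = 10m
```

```bash
#!/bin/bash
# Sketch of a shutdown/terminate script: lazily unmount the NFS shares so the node
# can shut down cleanly instead of hanging on the mounts. Mount points are assumptions.
for mnt in /sched /data; do
    if mountpoint -q "$mnt"; then
        umount -l "$mnt"
    fi
done
```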
Conclusion

Azure CycleCloud, along with Slurm and the Hammerspace Data Platform, provides a powerful, scalable, and cost-efficient solution for HPC in the cloud. CycleCloud automates the provisioning (and the elastic scaling up and down) of the infrastructure, Slurm manages job scheduling, and Hammerspace delivers a global data environment with high-performance parallel NFS. Ultimately, the most important element of the solution is its simplicity. Hammerspace enables HPC organizations to focus on solving core problems instead of the headache of managing infrastructure, setup, and unpredictable storage mounts. By reducing the administrative overhead needed to run HPC environments, the solution described in this blog will help organizations accelerate time to results, lower costs, and drive innovation across all industries.

Benchmark Different Capacities for EDA Workloads on Microsoft HPC Storages
Overview

Semiconductor (or Electronic Design Automation [EDA]) companies prioritize reducing time to market (TTM), which depends on how quickly tasks such as chip design validation and pre-foundry work can be completed. Faster TTM also helps save on EDA licensing costs, as less time spent on each job means licenses are freed up sooner. To achieve shorter TTM, storage solutions are crucial. As illustrated in the article "Benefits of using Azure NetApp Files for Electronic Design Automation (EDA)" (1*), with the Large Volume feature, which requires a minimum size of 50TB, a single Azure NetApp Files Large Volume can reach an I/O rate of up to 652,260 at 2ms latency, and 826,379 at the performance edge (~7ms).

Objective

In real-world production, EDA files—such as tools, libraries, temporary files, and output—are usually stored in different volumes with varying capacities. Not every EDA job needs extremely high I/O rates or throughput. Additionally, cost is a key consideration, since larger volumes are more expensive. The objective of this article is to share benchmark results for different storage volume sizes: 50TB, 100TB, and 500TB, all using the Large Volume feature. We also included a 32TB case—where the Large Volume feature isn't available on ANF—for comparison with Azure Managed Lustre File System (AMLFS), another Microsoft HPC storage solution. These benchmark results can help customers evaluate their real-world needs, considering factors like capacity size, I/O rate, throughput, and cost.

Testing Method

EDA workloads are classified into two primary types, frontend and backend, each with distinct requirements for the underlying storage and compute infrastructure. Frontend workloads focus on logic design and the functional aspects of chip design and consist of thousands of short-duration parallel jobs with an I/O pattern characterized by frequent random reads and writes across millions of small files. Backend workloads focus on translating logic design to physical design for manufacturing and consist of hundreds of jobs involving sequential reads and writes of fewer, larger files. The choice of a storage solution to meet this unique mix of frontend and backend workload patterns is non-trivial. Frontend and backend EDA workloads are very demanding on storage solutions: standard industry benchmarks indicate a high I/O profile that includes a substantial amount of NFS access, lookup, create, getattr, link, and unlink operations, as well as small and large file read and write operations. This blog contains the output from performance testing of an industry-standard benchmark for EDA. For this particular workload, the benchmark represents the I/O blend typical of a company running both frontend and backend EDA workloads in parallel.

Testing Environment

We used 10 E64dsv5 client VMs connecting to a single ANF or AMLFS volume, with the nconnect mount option (for ANF), to generate enough load for the benchmark. The client VM tuning and configuration are the same as specified in (1*).

ANF mount options: nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,noatime,nconnect=8
AMLFS mount: sudo mount -t lustre -o noatime,flock

All resources reside in the same VNET and, where possible, the same Proximity Placement Group to ensure low network latency.

Figure 1. High level architecture of the testing environment
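For reference, the client-side mounts described above look roughly like the following. The server addresses, export names, and mount points are placeholders, while the option strings are the ones listed in the testing environment.

```bash
# Azure NetApp Files volume (NFSv3 with nconnect) -- server IP and export are placeholders
sudo mount -t nfs -o nocto,actimeo=600,hard,rsize=262144,wsize=262144,vers=3,tcp,noatime,nconnect=8 \
    10.0.2.4:/eda-vol /mnt/anf

# Azure Managed Lustre (AMLFS) -- MGS address is a placeholder
sudo mount -t lustre -o noatime,flock 10.0.3.4@tcp:/lustrefs /mnt/amlfs
```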
Benchmark Results

EDA jobs are highly latency sensitive. For today's more complex chip designs, 2 milliseconds of latency per EDA operation is generally seen as the ideal target, while the performance edge limit is around 7 milliseconds. We listed the I/O rates achieved at both latency points for easier reference. Throughput (in MB/s) is also included, as it is essential for many backend tasks and the output phase. (Figure 2, Figure 3, Figure 4, and Table 1.)

For cases where the Large Volume feature is enabled, we observe the following:

- 100TB with the Ultra tier and 500TB with the Standard, Premium, or Ultra tier can reach over a 640,000 I/O rate at 2ms latency. This is consistent with the 652,260 stated in (1*). A 500TB Ultra volume can even reach a 705,500 I/O rate at 2ms latency.
- For workloads not requiring as much I/O rate, either 50TB with the Ultra tier or 100TB with the Premium tier can reach a 500,000 I/O rate. For an even smaller job, 50TB with the Premium tier can reach 255,000 at a lower cost.
- For scenarios where throughput is critical, 500TB with the Standard, Premium, or Ultra tier can all reach 10-12 GB/s of throughput.

Figure 2. Latency vs. I/O rate: Azure NetApp Files - one Large Volume
Figure 3. Achieved I/O rate at 2ms latency & performance edge (~7ms): Azure NetApp Files - one Large Volume
Figure 4. Achieved throughput (MB/s) at 2ms latency & performance edge (~7ms): Azure NetApp Files - one Large Volume
Table 1. Achieved I/O rate and throughput at both latency points: Azure NetApp Files - one Large Volume

For cases with less than 50TB of capacity, where the Large Volume feature is not available for ANF, we included Azure Managed Lustre File System (AMLFS) for comparison. With the same 32TB volume size, a regular ANF volume achieves about 90,000 I/O at 2ms latency, while an AMLFS Ultra volume (500 MB/s/TiB) can reach roughly double that, around 195,000. This shows that AMLFS is a better choice for performance when the Large Volume feature isn't available on ANF. (Figure 5.)

Figure 5. Achieved I/O rate at 2ms latency: ANF regular volume vs. AMLFS

Summary

This article shared benchmark results for different storage capacities needed for EDA workloads, including 50TB, 100TB, and 500TB volumes with the Large Volume feature enabled. It also compared a 32TB volume—where the Large Volume feature isn't available on ANF—to Azure Managed Lustre File System (AMLFS), another Microsoft HPC storage option. These results can help customers choose or design storage that best fits their needs by balancing capacity, I/O rate, throughput, and cost.

With the Large Volume feature, 100TB Ultra and 500TB Standard, Premium, or Ultra tiers can achieve over 640,000 I/O at 2ms latency. For jobs that need less I/O, 50TB Ultra or 100TB Premium can reach 500,000, while 50TB Premium offers 255,000 at a lower cost. When throughput matters most, 500TB volumes across all tiers can deliver 10-12 GB/s. If you have a smaller job or can't use the Large Volume feature, Azure Managed Lustre File System (AMLFS) gives you better performance than a regular ANF volume.

A final reminder: this article primarily provided benchmark results to help semiconductor customers design their storage solutions, considering capacity size, I/O rate, throughput, and cost. It did not address other important criteria, such as heterogeneous integration or legacy compliance, which are also important when selecting an appropriate storage solution.

References

Benefits of using Azure NetApp Files for Electronic Design Automation (EDA)
Learn more about Azure Managed Lustre

Announcing the AI Infrastructure on Azure repository
Today we're excited to release the AI Infrastructure on Azure repository—a one-stop reference for teams building large-scale AI clusters on Azure.

Authors

- Davide Vanzo - Senior Technical Program Manager - Azure Specialized
- Jer-Ming Chia - Principal Technical Program Manager - Azure Specialized
- Jesse Lopez - Senior Technical Program Manager - Azure Specialized
- Jingchao Zhang - Senior Technical Program Manager - Azure Specialized
- Paul Edwards - Principal Technical Program Manager - Azure Specialized
- Wolfgang De Salvador - Senior Product Manager - Azure Storage

Introduction

When building a supercomputer on Azure for AI workloads, teams must stitch together orchestration, storage, and compute components. They often spend weeks fine-tuning those configurations for peak performance. This repo delivers well-tested Infrastructure-as-Code blueprints for fully integrated clusters that prioritize reliability and performance, and that can be used to reproduce our published benchmarks.

Design Considerations

Building an AI supercomputer on Azure spans many moving parts: VM family selection (e.g. ND GB200 v6 vs ND H200 v5), deployment model (fully containerized AKS clusters to traditional HPC), and storage strategy—you can even run training without POSIX file systems by tuning your data lifecycle, as detailed in several blog posts and sessions. Other impactful design drivers include:

Storage & I/O
- Capacity needs (dataset, checkpoints, logs)
- Throughput & IOPS (sequential vs random access patterns)
- Filesystem interface (POSIX-compliant vs API-native cloud storage)
- Tiering strategy

Software & Orchestration
- AI framework & version (e.g. Megatron-LM, LLM Foundry, DeepSpeed)
- Container runtime (e.g. enroot+pyxis, Singularity, Docker)
- Scheduler/orchestration integration (e.g. Slurm, Kueue, Volcano)
- OS image & driver stack (e.g. Ubuntu/HPC image, NVIDIA drivers, IB drivers)
- Node health checks (checks for InfiniBand fabric performance/health, GPU errors, etc.)

Workflow & Automation
- Checkpoint frequency & size (impacts storage performance)
- Data staging/ingest (pre-processing on CPU nodes vs GPU nodes)
- Monitoring & logging (telemetry pipelines, DCGM, Prometheus)

Systems optimizations
- CPU configs (NUMA topology files & affinity overrides)
- NCCL tuning (topology mapping, P2P chunk size, channel count)
- IB fabric tuning (queue-per-connection, zero-copy transfers)
- Storage tuning (mount options, I/O scheduler, parallel-FS striping)

Given the breadth of these design considerations, landing on an optimal configuration can be challenging. This repo's purpose is to centralize our battle-tested configurations and optimization guidance—so you can push the health, reliability, and performance of your Azure AI supercomputers to the limit. We've also published end-to-end benchmarks here, giving you clear baselines to compare your own deployments against. The repo also includes recommended node- and cluster-level health checks, baseline performance benchmarks, and a sample LLM training run using this configuration.

Configuration guidance

In this initial release, the repo provides a ready-to-run template for a "canonical" Slurm-managed HPC cluster, leveraging Azure NetApp Files for networked storage and Azure Managed Lustre Service for parallel filesystem performance. This section of the repository is meant to contain well-tested infrastructure-as-code configurations for AI supercomputers on Azure that have been widely tested and adopted.

Storage Guidance

The repository also provides guidance for choosing storage backends.
For instance, it discusses evaluating Azure Managed Lustre tiers to match the size and performance required for specific training jobs. One of the key elements to optimize in distributed training is checkpoint time. This is critical for GPU utilization, and it is strongly connected to filesystem throughput. An example of this scenario for a GPT-3-style model (175B parameters) is presented in the repository for the case of Azure Managed Lustre. In a similar way, we present guidance on how to use BlobFuse2 with Azure Blob Storage for training jobs. Azure Blob Storage has demonstrated the ability to reach 25 Tbps of egress bandwidth on a single account in a recent Microsoft Build session. Moreover, the repository is meant to host guidance on specific filesystem tunings to maximize delivered performance.

Node and Cluster-level Healthchecks

Validating cluster readiness before large-scale training runs helps catch system issues early, so you don't waste compute cycles and can hit the performance baselines. We recommend running a series of healthchecks at both the node and cluster level to catch hardware or software issues early.

We recommend using AzureHPC Node Health Checks (AzNHC) to validate node-level functionality. Built on the LBNL NHC framework, AzNHC adds Azure-specific hardware tests for HPC and GPU VM SKUs. It includes SKU-specific tests such as GPU availability, NVLink health, ECC memory error checks, device-to-host and host-to-device bandwidth tests, InfiniBand throughput (GDR and non-GDR), topology validation, and intra-node NCCL all-reduce benchmarks. It runs inside a Docker container that can be invoked easily.

In parallel, at the cluster level, testing inter-node GPU communication with NCCL all-reduce benchmarks is an effective way to measure collective bandwidth across your fleet. The Azure HPC image includes the prebuilt nccl-tests suite in /opt/nccl-tests/build/, which can be run across all nodes via MPI (an illustrative invocation is sketched later in this post). The recommended NCCL settings (CollNet/NVLS, GDR, and relaxed PCI ordering) provide optimal collective performance and serve as the baseline. The repo includes best practices for running these validation tests.

Benchmarks

A recently published set of benchmarks demonstrates near-linear scaling from 8 up to 1,024 NDv5 H100 GPUs (Standard_ND96isr_H100_v5), delivering training performance on par with NVIDIA's reference DGX systems and underscoring Azure's infrastructure scalability and efficiency for large-scale AI workloads. These benchmarks ran on the repo's reference architecture. The recipes, deployment instructions, and full benchmark results are all available inside the examples section of the repository.

Workload Examples

Equally important, the repo contains real-world examples of end-to-end AI training, including best practices for training-job data preparation and execution. Currently, the examples section covers the Megatron-LM GPT 175B case and the LLM Foundry MPT-30B and MPT-70B cases. The current examples are focused on the Azure CycleCloud Workspace for Slurm architecture, but we plan to extend them to additional orchestration solutions in the future. These guides allow interested users to configure sample distributed training jobs, relying on important configuration guidance for their environment and infrastructure.
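As a flavor of the cluster-level validation described in the healthchecks section above, an NCCL all-reduce sweep across two nodes can be launched roughly as follows. The hostnames, process layout, and tuning flags are illustrative assumptions; the repository's own run scripts are the authoritative reference.

```bash
# Illustrative two-node NCCL all-reduce benchmark on ND H100 v5 (8 GPUs per node).
# Hostnames and environment settings are placeholders; adjust to your cluster.
mpirun -np 16 -H node001:8,node002:8 \
    --bind-to numa --map-by ppr:8:node \
    -x LD_LIBRARY_PATH \
    -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```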
What's next

The repository presented in this blog post will be expanded with additional scenarios, best practices, and configuration recipes. We will periodically share updates on new content and on the evolution of the catalog. We welcome contributions, and we encourage you to open requests for any new content you would find useful. Thank you to all our readers!

Computer-Aided Engineering "CAE" on Azure
Table of Contents:

- What is Computer-Aided Engineering (CAE)?
- Why Move CAE to Cloud? Cloud vs. On-Premises
- What Makes Azure Special for CAE Workloads?
- What Makes Azure Stand out Among Public Cloud Providers? "InfiniBand Interconnect"
- Key CAE Workloads on Azure
- Azure HPC VM Series for CAE Workloads
- CAE Software Partnership "ISVs"
- Robust Ecosystem of System Integrator "SI" Partners
- Real-World Use Case: Automotive Sector
- The Future of CAE is Cloud-Native
- Final Thoughts

--------------------------------------------------------------------------------------------------------

1. What is Computer-Aided Engineering "CAE"?

Computer-Aided Engineering (CAE) is a broad term that refers to the use of computer software to aid in engineering tasks. This includes simulation, validation, and optimization of products, processes, and manufacturing tools. CAE is integral to modern engineering, allowing engineers to explore ideas, validate concepts, and optimize designs before building physical prototypes. CAE encompasses various fields such as finite element analysis (FEA), computational fluid dynamics (CFD), and multibody dynamics (MBD). CAE tools are widely used in industries like automotive, aerospace, and manufacturing to improve product design and performance. For example, in the automotive industry, CAE tools help reduce product development costs and time while enhancing the safety, comfort, and durability of vehicles. CAE tools are often used to analyze and optimize designs created within CAD (Computer-Aided Design) software.

CAE systems typically involve three phases:

- Pre-processing: defining the model and the environmental factors to be applied to it.
- Analysis solver: performing the analysis, usually on high-powered computers.
- Post-processing: visualizing the results.

In a world where product innovation moves faster than ever, Computer-Aided Engineering (CAE) has become a cornerstone of modern design and manufacturing. From simulating airflow over an F1 car to predicting stress in an aircraft fuselage, CAE allows engineers to explore ideas, validate concepts, and optimize designs—before a single prototype is built.

--------------------------------------------------------------------------------------------------------

2. Why Move CAE to Cloud? Cloud vs. On-Premises

Historically, CAE workloads were run on-premises due to their compute-intensive nature and large data requirements. Traditional CAE methods—dependent on expensive, on-premises HPC clusters—are facing a tipping point, and many organizations are now embracing cloud-based CAE. When considering whether to use cloud or on-premises solutions, there are several factors to consider:

- Cost and Maintenance: on-premises solutions require a large upfront investment in hardware and ongoing costs for maintenance and upgrades. Cloud solutions, on the other hand, spread costs over time and often result in a lower total cost of ownership.
- Security and Privacy: on-premises solutions offer control over security but require significant resources to manage. Cloud providers offer advanced security features and compliance certifications, often surpassing what individual companies can achieve on their own.
- Scalability and Flexibility: cloud solutions provide unmatched scalability and flexibility, allowing businesses to quickly adjust resources based on demand. On-premises solutions can be more rigid and require additional investments to scale.
- Reliability and Availability: cloud providers offer high availability and disaster recovery options, often with service level agreements (SLAs) guaranteeing uptime. On-premises solutions depend on the company's infrastructure and may require additional investments for redundancy and disaster recovery.
- Integration and Innovation: cloud solutions often integrate seamlessly with other cloud services and offer continuous innovation through regular updates and new features, and they make it possible to run more simulations in parallel, reducing time-to-solution, accelerating the product development cycle, and shortening time to market. On-premises solutions may lag in terms of innovation and require manual integration efforts.
- Global Access: teams can collaborate and access data/models from anywhere. Cloud gives you global, on-demand supercomputing access without the physical, financial, and operational burden of traditional on-premises clusters.

In summary, the choice between cloud and on-premises solutions depends on various factors, including cost, performance, security, maintenance, flexibility, and specific business needs. Cloud provides customers with global scalability, high availability, and a broad range of capabilities within a secure, integrated platform. It enables organizations to concentrate on core product innovation, accelerating their journey to market.

The following comparison shows Azure vs. on-premises for CAE workloads, aspect by aspect:

- Global Reach
  - Cloud (Azure): 60+ regions worldwide — deploy compute close to users, customers, or engineers.
  - On-premises: limited to where physical hardware is located (one or a few sites).
- Access Flexibility
  - Cloud (Azure): access from anywhere with secure authentication (VPN/SSO/Conditional Access).
  - On-premises: access generally restricted to the internal corporate network or VPN.
- Collaboration
  - Cloud (Azure): teams across continents can work on shared HPC clusters easily.
  - On-premises: remote collaboration can be slow and complex; security risks are higher.
- Elastic Scaling
  - Cloud (Azure): instantly scale resources up/down globally based on demand. Start small, grow big — then shrink when needed.
  - On-premises: scaling requires buying, installing, and maintaining new hardware.
- Time to Deploy
  - Cloud (Azure): no wait for procurement; minutes to spin up a new HPC cluster in a new region.
  - On-premises: weeks or months to procure, rack, and configure hardware in a new location.
- Disaster Recovery
  - Cloud (Azure): built-in regional redundancy, backup options, and replication across regions.
  - On-premises: disaster recovery requires manual setup and physical duplication.
- Compliance & Data Residency
  - Cloud (Azure): choose specific Azure regions to meet compliance (GDPR, HIPAA, ITAR, etc.).
  - On-premises: need to build compliant infrastructure manually.
- Network Latency
  - Cloud (Azure): optimize by deploying close to users; fast backbone network across regions.
  - On-premises: bound by physical proximity; long-distance remote work suffers latency.
- Maintenance
  - Cloud (Azure): Azure handles hardware upgrades, security patches, and downtime minimization.
  - On-premises: in-house IT teams are responsible for all hardware, software, and patching.
- Security at Scale
  - Cloud (Azure): Microsoft commits to investing $20B in cybersecurity over five years; Azure invests >$1B annually in cybersecurity and is ISO, SOC, and GDPR certified globally.
  - On-premises: requires dedicated resources to manage security protocols and maintain visibility across all systems. This can be more complex and resource-intensive compared to cloud solutions.
- Cost Optimization
  - Cloud (Azure): operates on a pay-as-you-go model, enabling businesses to scale usage and costs as needed. This avoids the capital expenditure of purchasing hardware. Azure also offers various pricing options and discounts, such as reserved capacity, spot pricing, and Azure Hybrid Benefit, which can significantly reduce costs — massive cost control flexibility.
  - On-premises: requires significant upfront capital investment in hardware, software licenses, and infrastructure setup. These costs include purchasing and maintaining physical servers, which are subject to technological obsolescence. Ongoing expenses include system maintenance, support, power consumption, and cooling.
- Innovation
  - Cloud (Azure): access to the latest GPUs and CPUs (like H100, H200, GB200, AMD MI300X, HBv3, HBv4, HBv5).
  - On-premises: needs investments in hardware refresh cycles.
- Managed Storage
  - Cloud (Azure): offers agility with instant provisioning. Scalability is virtually unlimited, with automatic scale up or down. Fully managed, including updates, patches, backup, etc. High availability and DR through redundancy, geo-replication, and automated DR options. Security through enterprise-grade encryption at rest and in transit plus compliance certifications. Pay-as-you-go or reserved pricing with no upfront hardware cost (CapEx). Global access through the internet. Innovation through continuous improvements with AI-driven optimization.
  - On-premises: offers control but demands heavy investment in hardware and time-consuming deployment. Scaling is limited by physical hardware capacity. Must be managed by in-house IT teams, requiring significant time, expertise, and resources. Redundancy and DR must be designed, funded, and maintained manually. Security depends on in-house capabilities and requires investment. High upfront capital expenditure (CapEx). Access is limited to local networks unless extended with complex remote-access solutions. Innovation depends on hardware refresh cycles, limited by expense and infrequency.
- Software Images & Marketplace
  - Cloud (Azure): instant access to thousands of pre-built software images via the Marketplace. Speedy deployment of complete environments in minutes from ready-to-use templates. Huge ecosystem — access to Microsoft, open-source, and third-party vendor solutions — constantly updated. Automated maintenance and updates, as Marketplace software often comes with built-in update capabilities, auto-patching, and cloud-optimized versions. Cost flexibility through pay-as-you-go (PAYG) licensing, bring-your-own-license (BYOL) options, or subscription models. Innovation through early access to beta, cloud-native, and AI-enhanced software from top vendors through the Marketplace. Security is safeguarded, as Marketplace images are verified against cloud provider security and compliance standards.
  - On-premises: software must be sourced, manually installed, and configured; deployment, installation, environment setup, and configuration can take days or weeks. Limited by licensing agreements, internal vendor contracts, and physical hardware compatibility. Manual updates are required, and IT must monitor, download, test, and apply patches individually. Large upfront license purchases are often needed, and renewal and true-up costs can be complex and expensive. Innovation is limited, as new software adoption is delayed by procurement, budgeting, and testing cycles. Security assurance depends on internal vetting processes and manual hardening.

--------------------------------------------------------------------------------------------------------

3. What Makes Azure Special for CAE Workloads?

Microsoft Azure is a cloud platform enabling scalable, secure, and high-performance CAE workflows across industries. Our goal in Azure is to provide the CAE field with a one-stop, best-in-class technology platform, rich with solution offerings and supported by a robust ecosystem of partners.
Azure offers several unique features and benefits that make it particularly well-suited for Computer-Aided Engineering (CAE) workloads:

- GPU Acceleration: Azure provides powerful GPU options, such as NVIDIA GPUs, which significantly enhance the performance of leading CAE tools. This results in improved turnaround times, reduced power consumption, and lower hardware costs. For example, tools like Ansys Speos for lighting simulation and CPFD's Barracuda Virtual Reactor have been optimized to take advantage of these GPUs.
- High-Performance Computing (HPC): Azure offers specialized HPC solutions, such as the HBv3 and HBv4/HX series, which are designed for high-performance workloads. These solutions provide the computational power needed for complex simulations and analyses.
- Scalability and Flexibility: Azure's cloud infrastructure allows for easy scaling of resources to meet the demands of CAE workloads. This flexibility ensures that you can handle varying levels of computational intensity without the need for significant upfront investment in hardware.
- Integration with Industry Tools: Azure supports a wide range of CAE software and tools, making it easier to integrate existing workflows into the cloud environment. This includes certification and optimization of CAE tools on Azure.
- Support for Hybrid Environments: Azure provides solutions for hybrid cloud environments, allowing you to seamlessly integrate on-premises resources with cloud resources. This is particularly useful for organizations transitioning to the cloud or requiring a hybrid setup for specific workloads.
- Global Reach: as of April 2025, Microsoft Azure operates over 60 announced regions and more than 300 data centers worldwide, making it the most expansive cloud infrastructure among major providers. Azure ensures low latency and high availability for CAE workloads, regardless of where your team is located.

These features collectively make Azure a powerful and flexible platform for running CAE workloads, providing the computational power, scalability, and security needed to handle complex engineering simulations and analyses.

--------------------------------------------------------------------------------------------------------

4. What Makes Azure Stand out Among Public Cloud Providers? "InfiniBand Interconnect"

InfiniBand interconnect is one of the key differentiators that makes Microsoft Azure stand out among public cloud providers, especially for high-performance computing (HPC) and CAE workloads. Here's what makes InfiniBand a game changer, unique, and impactful on Azure:

a) Ultra-Low Latency & High Memory Bandwidth

InfiniBand on Azure delivers 200 Gbps interconnect speeds (up to 400 Gbps with HDR/NDR in some cases, and 800 Gbps for the latest SKU, "HBv5", currently in preview). This ultra-low-latency, high-throughput network is ideal for tightly coupled parallel workloads such as CFD, FEA, weather simulations, and molecular modeling. When the newly added AMD SKU, HBv5, transitions from preview to general availability (GA), memory bandwidth will no longer be a limitation for workloads such as CFD and weather simulations. The HBv5 offers an impressive 7 TB/s of memory bandwidth, which is 8 times greater than the latest bare-metal and cloud alternatives. It also provides nearly 20 times more bandwidth than Azure HBv3 and Azure HBv2, which use the 3rd Gen EPYC™ with 3D V-Cache "Milan-X" and the 2nd Gen EPYC™ "Rome" respectively.
Additionally, the HBv5 delivers up to 35 times more memory bandwidth compared to a 4-5-year-old HPC server nearing the end of its hardware lifecycle.

b) RDMA (Remote Direct Memory Access) Support

RDMA enables direct memory access between VMs, bypassing the CPU, which drastically reduces latency and increases application efficiency — a must for HPC workloads.

c) True HPC Fabric in the Cloud

Azure is the only major public cloud provider that offers InfiniBand across multiple VM families, such as:

- HBv3/HBv4 (for CFD, FEA, Multiphysics, molecular dynamics)
- HX-series (structural analysis)
- ND (GPU + MPI)

This allows scaling MPI workloads across thousands of cores — something typically limited to on-premises supercomputers.

d) Production-Grade Performance for CAE

Solvers like ANSYS Fluent, STAR-CCM+, Abaqus, and MSC Nastran have benchmarked extremely well on Azure, thanks in large part to the InfiniBand-enabled infrastructure. If you're building CAE, HPC, or AI workloads that rely on ultra-fast communication between nodes, Azure's InfiniBand-powered VM SKUs offer the best cloud-native alternative to on-premises HPC clusters.

--------------------------------------------------------------------------------------------------------

5. Key CAE Workloads on Azure

CAE isn't a one-size-fits-all domain. Azure supports a broad spectrum of CAE applications, such as:

- Computational Fluid Dynamics (CFD): ANSYS Fluent, Ansys CFX, Siemens Simcenter STAR-CCM+, Convergent Science CONVERGE CFD, Autodesk CFD, OpenFOAM, NUMECA Fine/Open, Altair AcuSolve, Simerics MP+, Cadence Fidelity CFD, COMSOL Multiphysics (CFD Module), Dassault Systèmes XFlow, etc.
- Finite Element Analysis (FEA): ANSYS Mechanical, Dassault Systèmes Abaqus, Altair OptiStruct, Siemens Simcenter 3D, MSC Nastran, Autodesk Fusion 360 Simulation, COMSOL Multiphysics (Structural Module), etc.
- Thermal & Electromagnetic Simulation: COMSOL Multiphysics, Ansys HFSS, CST Studio Suite, Ansys Mechanical (Thermal Module), Siemens Simcenter 3D Thermal, Dassault Systèmes Abaqus Thermal, etc.
- Crash & Impact Testing: Ansys LS-DYNA, Altair Radioss, ESI PAM-Crash, Siemens Simcenter Madymo, Dassault Systèmes Abaqus "Explicit", Ansys Autodyn, etc.

These applications require a combination of powerful CPUs, a large memory footprint, high memory bandwidth, and low-latency interconnects. Some applications also offer GPU-accelerated versions. All of these are available in Azure's purpose-built HPC VM families.

--------------------------------------------------------------------------------------------------------

6. Azure HPC VM Series for CAE Workloads

Azure offers specialized VM series tailored for CAE applications. These VMs support RDMA-enabled InfiniBand networking, critical for scaling CAE workloads across nodes in parallel simulations.

CPU:
- HBv3, HBv4 Series: ideal for memory-intensive workloads like CFD and FEA, offering high memory bandwidth and low-latency interconnects.
- HX Series: optimized for structural analysis applications, providing significant performance boosts for solvers like MSC Nastran and others.

GPU:
- ND Series: GPU-accelerated VMs optimized for CAE workloads, offering high double-precision compute, large memory bandwidth, and scalable performance with NVIDIA H100, H200, GB200 and AMD MI300X GPUs.

The highest-performing compute-optimized CPU offering in Azure today is the HBv4/HX series, featuring 176 cores of 4th Gen AMD EPYC processors with 3D V-Cache technology ("Genoa-X").
Below is a sample performance comparison of four different AMD SKU generations against the Intel "HCv1-Skylake" SKU, using the Ansys Fluent (F1 Racecar 140M cells) model. The full performance and scalability of HBv4 and HX-series VMs with Genoa-X CPUs is HERE.

--------------------------------------------------------------------------------------------------------

7. CAE Software Partnership "ISVs"

Independent Software Vendors (ISVs) play a critical role on Azure by bringing trusted, industry-leading applications to the platform. Their solutions — spanning CAE, CFD, FEA, data analytics, AI, and more — are optimized to run efficiently on Azure's scalable infrastructure. ISVs ensure that customers can seamlessly move their workloads to the cloud without sacrificing performance, compatibility, or technical support. They also drive innovation by collaborating with Azure engineering teams to deliver cloud-native, HPC-ready, and AI-enhanced capabilities, helping businesses accelerate product development, simulations, and decision-making. Below is a partial list of these ISVs and their offerings on Azure:

- ANSYS Access: SaaS platform built on Azure, offering native cloud experiences for Fluent, Mechanical, LS-DYNA, HFSS, etc.
- Altair One: SaaS platform on Azure supporting Altair solvers such as HyperWorks, OptiStruct, Radioss, AcuSolve, etc.
- Siemens Simcenter: validated on Azure for fluid, structural, and thermal simulation with solvers such as STAR-CCM+, NX, and Femap.
- Dassault Systèmes: solvers such as Abaqus, CATIA, SIMULIA, and XFlow.
- COMSOL: for its flagship solver, COMSOL Multiphysics.
- CPFD Software: CPFD Software has optimized its simulation tool Barracuda Virtual Reactor for Azure, enabling engineers to perform particle-fluid simulations efficiently.

--------------------------------------------------------------------------------------------------------

8. Robust Ecosystem of System Integrator "SI" Partners

Azure CAE System Integrators (SIs) are specialized partners that assist organizations in deploying and managing CAE workloads on Microsoft Azure. These SIs provide expertise in cloud migration, HPC optimization, and integration of CAE applications, enabling businesses to leverage Azure's scalable infrastructure for engineering simulations and analyses.

a) What Do Azure CAE System Integrators Offer?

Azure CAE SIs deliver a range of services tailored to the unique demands of engineering and simulation workloads:

- Cloud Migration: transitioning on-premises CAE applications and data to Azure's cloud environment.
- HPC Optimization: configuring Azure's HPC resources to maximize performance for CAE tasks.
- Application Integration: ensuring compatibility and optimal performance of CAE software (e.g., ANSYS, Siemens, Altair, Abaqus) on Azure.
- Managed Services: ongoing support, monitoring, and maintenance of CAE environments on Azure.

b) Leading Azure CAE System Integrators

Several SIs have been recognized for their capabilities in deploying CAE solutions on Azure. A partial list: Rescale, TotalCAE, Oakwood Systems, UberCloud "SIMR", Capgemini, Accenture, Hexagon Manufacturing Intelligence.

c) Benefits of Collaborating with Azure CAE SIs

By partnering with Azure CAE System Integrators, organizations can effectively harness the power of cloud computing to enhance their engineering and simulation capabilities. Engaging with Azure CAE SIs can provide:

- Expertise: access to professionals experienced in both CAE applications and Azure infrastructure.
- Efficiency: accelerated deployment and optimization of CAE workloads.
- Scalability: ability to scale resources up or down based on project requirements.
- Cost Management: optimized resource usage leading to potential cost savings.

--------------------------------------------------------------------------------------------------------

9. Real-World Use Case: Automotive Sector

Rimac used Azure cloud computing to help with the design, testing, and manufacturing of its next-generation components and sportscars, and it's gaining even greater scale and speed in its product development processes with a boost from Microsoft Azure HPC. Rimac's Azure HPC environment uses Azure CycleCloud to organize and orchestrate clusters—putting together different cluster types and sizes flexibly and as necessary. The solution includes Azure Virtual Machines, running containers on Azure HBv3 virtual machines with 3rd Gen AMD EPYC™ "Milan" processors with AMD 3D V-Cache, which are much faster than previous-generation Azure virtual machines for explicit calculations. Rimac's solution takes full advantage of the power of AMD, which offers the highest-performing x86 CPU for technical computing.

"We've gained a significant increase in computational speed with AMD, which leads to lower utilization of HPC licenses and faster iterations," says Ivan Krajinović, Head of Simulations, Rimac Technology. "However complex the model we need to create, we know that we can manage it with Azure HPC. We now produce more highly complex models that simply wouldn't have been possible on our old infrastructure."

--------------------------------------------------------------------------------------------------------

10. The Future of CAE is Cloud-Native

The next frontier in CAE is not just lifting and shifting legacy solvers into the cloud—it is enabling cloud-native simulation pipelines, including:

- AI-assisted simulation tuning
- Serverless pre/post-processing workflows
- Digital twins integrated with IoT data on Azure
- Cloud-based visualization with NVIDIA Omniverse

With advances in GPU acceleration, parallel file systems (like Azure Managed Lustre File System, AMLFS), and intelligent job schedulers, Azure is enabling this next-gen CAE transformation today.

--------------------------------------------------------------------------------------------------------

11. Final Thoughts

Moving CAE to Azure is more than a tech upgrade—it's a shift in mindset. It empowers engineering teams to simulate more, iterate faster, and design better—without being held back by hardware constraints. If you're still running CAE workloads on aging, capacity-constrained systems, now is the time to explore what Azure HPC can offer. Let the cloud be your wind tunnel, your test track, your proving ground.

--------------------------------------------------------------------------------------------------------

Let's Connect

Have questions or want to share how you're using CAE in the cloud? Let's start a conversation! We'd love to hear your thoughts! Leave a comment below and join the conversation. 👇

#CAE #HPC #AzureHPC #EngineeringSimulation #CFD #FEA #CloudComputing #DigitalEngineering #MicrosoftAzure

Using Azure CycleCloud with Weka
What is Azure CycleCloud?

Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing HPC environments on Azure. With Azure CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically scale the infrastructure to run jobs efficiently at any scale. CycleCloud is used for running workloads like scientific simulations, rendering tasks, genomics and bioinformatics, financial modeling, artificial intelligence, machine learning, and other data-intensive operations that require large amounts of compute power. CycleCloud supports GPU computing, which is useful for the workloads described above.

One of the strengths of Azure CycleCloud is its ability to automatically scale resources up or down based on demand. If your workload requires more GPU power (such as for deep learning training), CycleCloud can provision additional GPU-enabled instances as needed. The question remains: if the GPUs provisioned by CycleCloud are waiting on storage I/O operations, not only is application performance severely impacted, the GPUs are also underutilized, meaning you are not fully exploiting the resources you are paying for! This brings us to Weka.io. But before we talk about the problems WEKA and CycleCloud solve, let's talk about what WEKA is.

What is WEKA?

The WEKA® Data Platform was purpose-built to seamlessly and sustainably deliver speed, simplicity, and scale that meets the needs of modern enterprises and research organizations without compromise. Its advanced, software-defined architecture supports next-generation workloads in virtually any location with cloud simplicity and on-premises performance. At the heart of the WEKA® Data Platform is a modern, fully distributed parallel filesystem, WekaFS™, which can span thousands of NVMe SSDs spread across multiple hosts and seamlessly extend itself over S3-compatible object storage.

You can deploy WEKA software on a cluster of Microsoft Azure LSv3 VMs with local SSD to create a high-performance storage layer. WEKA can also take advantage of Azure Blob Storage to scale your namespace at the lowest cost. You can automate your WEKA deployment through HashiCorp Terraform templates for fast, easy installation. Data stored in your WEKA environment is accessible to applications through multiple protocols, including NFS, SMB, POSIX, and S3.

Key components of the WEKA Data Platform in Azure include:

- The architecture is deployed directly in the customer tenant, within a subscription ID of the customer's choosing.
- WEKA software is deployed across 6 or more Azure LSv3 VMs. The LSv3 VMs are clustered to act as one single device.
- The WekaFS™ namespace is extended transparently onto Azure hot Blob storage.
- Scale-up and scale-down functions are driven by Logic Apps and Function Apps.
- All client secrets are kept in Azure Key Vault.
- Deployment is fully automated using Terraform WEKA templates.

What is the integration?

Using the Weka-CycleCloud template available here, any compute nodes deployed via CycleCloud will automatically install the WEKA agent and automatically mount the WEKA filesystem. Users can deploy 10, 100, even 1,000s of compute nodes, and they will all mount the fastest storage in Azure (WEKA). A rough sketch of what this automation does on each node is shown below. Full integration steps are available here: WEKA/CycleCloud for Slurm Integration
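For intuition, the per-node automation boils down to something like the sketch below. The backend address and filesystem name are placeholder assumptions, and the exact commands baked into the Weka-CycleCloud template may differ.

```bash
# Rough sketch of what each CycleCloud compute node does at provisioning time.
# WEKA_BACKEND and the filesystem name "default" are placeholder assumptions.
WEKA_BACKEND=10.0.1.10

# Install the WEKA agent from one of the backend VMs
curl "http://${WEKA_BACKEND}:14000/dist/v1/install" | sudo sh

# Mount the WekaFS filesystem so jobs see shared, high-performance storage
sudo mkdir -p /mnt/weka
sudo mount -t wekafs "${WEKA_BACKEND}/default" /mnt/weka
```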
Benefits

The combined solution brings together the best of both worlds. With the CycleCloud/WEKA template, customers get:

- Simplified HPC management. With CycleCloud, you can provision clusters with a few clicks using preconfigured templates, and the clusters will all be mounted directly to WEKA.
- A high-performance, end-to-end architecture. CycleCloud and WEKA allow users to combine the benefits of CPUs/GPUs with ultra-fast storage. This is essential to ensure high throughput and low latency for computational workloads. The goal is to ensure that the storage subsystem can keep up with the high-speed demands of the CPU/GPU, especially in scenarios where you're running compute-heavy workloads like deep learning, scientific simulations, or large-scale data processing.
- Cost optimization #1. Both CycleCloud and WEKA allow for autoscaling (up and down). Adjust the number of compute resources (CycleCloud) as well as the number of storage backend nodes (WEKA) based on workload needs.
- Cost optimization #2. WEKA offers intelligent data tiering to help optimize performance and storage costs. The tiering system is designed to automatically move data between different storage classes based on access patterns, which maximizes efficiency while minimizing expenses.

Conclusion

The CycleCloud and WEKA integration delivers a simplified HPC (AI/ML) cloud management platform, exceptional performance for data-intensive workloads, and cost optimization via elastic scaling, flash optimization, and data tiering, all in one user interface. This enables organizations to achieve high throughput, low latency, and optimal CPU/GPU resource utilization for their most demanding applications and use cases. Try it today!

Special thanks to Raj Sharma and the WEKA team for their work on this integration!
Deploying ZFS Scratch Storage for NVMe on Azure Kubernetes Service (AKS)

This guide demonstrates how to use ZFS LocalPV to efficiently manage the NVMe storage available on Azure NDv5 H100 VMs. Equipped with eight 3.5TB NVMe disks, these VMs are tailored for high-performance workloads like AI/ML and large-scale data processing. By combining the flexibility of AKS with the advanced storage capabilities of ZFS, you can dynamically provision stateful node-local volumes while aggregating NVMe disks for optimal performance.
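A minimal sketch of the idea, assuming the OpenEBS ZFS LocalPV provisioner and illustrative pool, disk, and class names: the node's NVMe disks are aggregated into a ZFS pool, and a StorageClass then lets pods request scratch volumes from it. Details (how the pool is created on AKS nodes, striping layout, compression settings) vary by deployment.

```bash
# On the node (e.g. via a privileged DaemonSet or node customization): aggregate the
# eight local NVMe disks into one striped pool. Disk paths are placeholders.
zpool create zfspv-pool \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
    /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1

# Expose the pool to AKS through a StorageClass backed by ZFS LocalPV.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfs-nvme-scratch
provisioner: zfs.csi.openebs.io
parameters:
  poolname: "zfspv-pool"
  fstype: "zfs"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF
```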
Breaking the Speed Limit with WEKA File System on top of Azure Hot Blob

WEKA delivers unbeatable performance for your most demanding applications running in Microsoft Azure, supporting high I/O, low latency, small files, and mixed workloads with zero tuning and automatic storage rebalancing. We examine how WEKA's patented filesystem, WekaFS™, and its parallel processing algorithms accelerate Blob storage performance. The WEKA® Data Platform is purpose-built to deliver speed, simplicity, and scale that meets the needs of modern enterprises and research organizations without compromise. At the heart of the WEKA® Data Platform is a modern, fully distributed parallel filesystem, WekaFS™, which can span thousands of NVMe SSDs spread across multiple hosts and seamlessly extend itself over compatible object storage.

Migrate data to Azure Managed Lustre retaining POSIX attributes
In this blog, you learn how to copy data to your Azure Managed Lustre file system and then to long-term storage in Azure Blob Storage while retaining certain POSIX attributes, including permissions and user and group ownership. This process uses export jobs with the archive process.
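As a hedged illustration of the copy step (paths and mount points are placeholders), a transfer that preserves permissions, ownership, timestamps, ACLs, and extended attributes can be done with rsync before the export/archive job moves the data on to Blob Storage:

```bash
# Copy into the Azure Managed Lustre mount while preserving POSIX attributes.
# Source and destination paths are placeholders.
rsync -aAX --numeric-ids /mnt/source-data/ /mnt/amlfs/project-data/
```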