These huge computing resources are usually built for specific, computationally expensive numeric problems. In the case of Summit, its main use case is Nuclear Weapons Simulation, though interestingly, this isn't one of the use cases advertised on its website!
All modern High Performance Computing (HPC) architectures are of the basic design shown below.
The challenge is how to build a software framework to distribute code and process it in parallel over such a hardware architecture.
Historically, the main framework used has been the Message Passing Interface, or MPI. In MPI, tasks are distributed over the HPC architecture and communicate with each other via messages. The challenge with MPI is that the Software Engineer has to decide how to split the tasks, define the message interface, and choose which communication model the tasks will run under (a minimal sketch follows the list below). These communication models break down into:
- Point to Point - where two tasks send messages to each other directly
- Broadcast - where data is published to all tasks
- Scatter - where data is broken into partitions and distributed to multiple tasks
- Gather - essentially the reverse of scattering, where a single task gathers data back from multiple tasks
- Reduce - where a single task combines (for example, sums) the data from each remote task into one result. This is essentially a Gather with an aggregation applied
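To make that concrete, here is a minimal sketch of the kind of decisions MPI pushes onto you, written with the mpi4py bindings (mpi4py isn't discussed above; it's simply a common way to drive MPI from Python). You decide explicitly which rank sends what to whom:

```python
# Run with: mpiexec -n 4 python mpi_example.py
# Minimal mpi4py sketch: Point to Point send/recv plus a Reduce.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Point to Point: rank 0 sends a message directly to rank 1
if rank == 0:
    comm.send({'payload': 42}, dest=1, tag=0)
elif rank == 1:
    msg = comm.recv(source=0, tag=0)
    print(f"rank 1 received {msg}")

# Reduce: every rank contributes a value, rank 0 receives the sum
local = rank * 10
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(f"reduced total = {total}")
```

Every one of those choices (ranks, tags, which collective to use) is yours to make and maintain by hand.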
This is where the Python Dask library comes to the rescue. Dask excels in that it provides a familiar Python NumPy and Pandas like interface for common numeric, scientific and data science problems, with a 'bag' API for more general purpose map/reduce like computation, suited to unstructured data. The real power of Dask, though, comes from the fact that it builds an optimal task graph for you, so you can concentrate your effort on solving the domain problem, not on how to maximise the resources of the HPC.
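As an illustration of that Bag API, here is a short map/reduce-style sketch over some made-up unstructured records (the data and field names are purely illustrative, not from this article):

```python
# Illustrative sketch of the Dask Bag API on unstructured records (made-up data).
import dask.bag as db

records = db.from_sequence([
    {"user": "alice", "status": 200},
    {"user": "bob",   "status": 500},
    {"user": "alice", "status": 200},
], npartitions=2)

# Map/filter/reduce style: count successful requests per user
counts = (records
          .filter(lambda r: r["status"] == 200)
          .map(lambda r: r["user"])
          .frequencies())
print(counts.compute())  # e.g. [('alice', 2)]
```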
Let's see some code. Here's how you get a Dask Cluster started:
```python
from dask.distributed import Client, progress

client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client
```

Once spun up and available, Dask should return a message telling you the number of Cores and Memory the cluster has allocated. If you're using a shared HPC Cluster with other workloads running, you may not get all the Cores and Memory you request.
Let's create a large numeric array of data, using the familiar NumPy-like syntax.
```python
import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))
x
```

Let's now carry out a simple computation on this array.
```python
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z
# dask.array<mean_agg-aggregate, shape=(5000,), dtype=float64, chunksize=(500,)>
```

Notice here that z has not returned an answer, but a pointer to a new Dask array. By default, all calls to the Dask API are lazy, and it's only when you issue a `.compute()` call that the whole preceding task graph gets submitted to the Cluster for computation and evaluation, like this:
```python
z.compute()
# array([0.99524228, 1.01138959, 1.00196082, ..., 0.99702404,
#        1.00168843, 0.99625349])
```

The power of Dask is that you can get started on a Laptop with Python, then transfer your algorithm to an HPC Cluster and scale up the computation with no change to the code.
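In practice, the only line that has to change is where the Client connects. Here is a sketch of the same computation pointed at a remote scheduler; the address is a placeholder for whatever endpoint your cluster exposes:

```python
from dask.distributed import Client
import dask.array as da

# Point the client at a remote scheduler instead of a local one.
# The address below is a placeholder; substitute your cluster's scheduler endpoint.
client = Client("tcp://scheduler.example.com:8786")

# Exactly the same computation as before, now distributed across the cluster
x = da.random.random((10000, 10000), chunks=(1000, 1000))
z = (x + x.T)[::2, 5000:].mean(axis=1)
print(z.compute())
```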
Ah, you'll say, I don't have access to an HPC like Summit. True, most people don't have access to an HPC Cluster. I'm personally lucky that I work for a company that has its own private HPC environment. Unless you work in Academia, or for a large industrial company, typically in the Automotive, Aerospace / Defence and Pharmaceutical industries, you're unlikely to be able to access that level of compute.
This is where Cloud Computing comes to the rescue, in particular services such as Amazon EC2 Spot Instances. EC2 Spot Instances allow you to request compute resources at a substantial discount to On-Demand rates. This is because Amazon has the right to interrupt your compute with only two minutes' notice. For example, at the time of writing this article, you can have an m4.16xlarge (64 vCPU, 256GB RAM) Spot Instance at roughly $1 per hour, which is incredible. This particular configuration comes with a potential interruption rate of greater than 20%; however, if you optimise your compute workload in Dask to suit, for example by keeping all 64 vCPUs busy, roughly 80% of the time you may see no interruptions at all.
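If you want Dask to provision those Spot Instances for you, the separate dask-cloudprovider package (not covered in this article) offers an EC2Cluster class. Treat the constructor arguments below as assumptions to check against that package's documentation rather than a definitive recipe:

```python
# Sketch only: requires the dask-cloudprovider package and configured AWS credentials.
# Argument names are assumptions; verify them against the dask-cloudprovider docs.
from dask_cloudprovider.aws import EC2Cluster
from dask.distributed import Client

cluster = EC2Cluster(
    instance_type="m4.16xlarge",  # 64 vCPU / 256GB RAM, as discussed above
    n_workers=4,
)
client = Client(cluster)

# ... run the same Dask array code as before ...

cluster.close()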
So there you have it. Super Computing is now available to everyone. All you have to do is work out what massive computational problem to solve. I'd recommend looking at this video for inspiration.
Further info can be found here:
- Dask - https://dask.org/, https://blog.dask.org/
- Amazon EC2 Spot Instances - https://aws.amazon.com/ec2/spot/