Saturday, 9 February 2019

HPC Super Computing for Mortals

The US has recently wrestled back the HPC Super Computer Crown from China with Summit, a 200 PetaFLOP Goliath with over 9,000 22-core IBM POWER9 CPUs and a mind-boggling 27,000-plus NVIDIA Tesla V100 GPUs.

These huge computing resources are usually built for specific, computationally expensive numerical problems. In the case of Summit, its main use case is Nuclear Weapons Simulation, though interestingly, this isn't one of the use cases advertised on its website!

All modern High Performance Computing (HPC) architectures are of the basic design shown below.


The challenge is how to build a software framework to distribute and parallel-process code over such a hardware architecture.

Historically, the main framework used has been the Message Passing Interface, or MPI. In MPI, tasks are distributed over the HPC architecture and communicate with each other via messages. The challenge with MPI is that the Software Engineer has to decide how to split the tasks, what the message interface looks like, and which communication pattern the tasks will use. These patterns break down into:
  • Point to Point - where two tasks send messages to each other directly
  • Broadcast - where data is published to all tasks
  • Scatter - where data is broken into partitions and distributed to multiple tasks
  • Gather - essentially the reverse of scattering, where a single task gathers data back from multiple tasks
  • Reduce - where a single task processes and combines the data from each remote task (summing partial results, for example). This is very much a coordinated version of Gather with a combining step
For simple parallel problems, such as algorithms which can be classed as embarrassingly parallel, the design choice can be relatively simple. However, as the set of algorithms and processing pipelines becomes more complicated, the MPI implementation becomes challenging. Often, with MPI, more time can be spent on the 'plumbing' than on writing the code to solve the domain problem. It's no surprise that HPC software development with MPI is a very specialist skill, and the advantages of large-scale computing are out of reach for the 'average' Software Engineer or Data Scientist.
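
To make those patterns a little more concrete, here's a minimal sketch of Scatter and Reduce using the mpi4py bindings. mpi4py isn't used elsewhere in this post and the data is made up purely for illustration - the point is that you, the engineer, are writing all of this plumbing by hand.

# Minimal MPI Scatter/Reduce sketch (illustrative only). Run with e.g.:
#   mpirun -n 4 python mpi_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Scatter: the root task breaks the data into one partition per task
data = [list(range(i * 10, (i + 1) * 10)) for i in range(size)] if rank == 0 else None
chunk = comm.scatter(data, root=0)

# Each task works on its own partition
partial_sum = sum(chunk)

# Reduce: the root task combines the partial results from every task
total = comm.reduce(partial_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("total =", total)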

This is where the Python Dask library comes to the rescue. Dask excels because it provides a familiar NumPy- and Pandas-like interface for common numeric, scientific and data science problems, with a 'bag' API for more general-purpose map/reduce style computation suited to unstructured data. The real power of Dask, though, comes from the fact that it builds an optimised task graph for you, so you can concentrate your effort on solving the domain problem, not on how to maximise the resources of the HPC.
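
As a quick taste of that 'bag' API, here's a minimal sketch of a map/filter/reduce pipeline. The numbers and lambdas are made up purely for illustration, and it runs happily on a laptop with no cluster at all.

import dask.bag as db

# A 'bag' of a million records, split into 8 partitions
records = db.from_sequence(range(1000000), npartitions=8)

result = (records
          .map(lambda x: x * x)           # transform each element
          .filter(lambda x: x % 3 == 0)   # keep a subset
          .sum())                         # reduce to a single value

print(result.compute())                   # nothing runs until .compute()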

Now let's see how you get a Dask cluster started.
from dask.distributed import Client, progress

# Start a local 'cluster': a single worker with four threads and a 2GB memory limit
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client
Once spun up and available, Dask should return a message telling you the number of Cores and the amount of Memory the cluster has allocated. If you're using a shared HPC Cluster with other workloads running, you may not get all the Cores and Memory you request.

Let's create a large numeric array of data, using the familiar NumPy-like syntax.
from dask.distributed import Client, progress
import dask.array as da

# A 10,000 x 10,000 array of random numbers, split into 1,000 x 1,000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x
dask.array<..., shape=(10000, 10000), dtype=float64, chunksize=(1000, 1000)>
Let's now carry out a simple computation on this array.
y = x + x.T                      # add the array to its transpose
z = y[::2, 5000:].mean(axis=1)   # slice it and take a mean along each row
z
dask.array<mean_agg-aggregate, shape=(5000,), dtype=float64, chunksize=(500,)>
Notice here that z has not returned an answer, but a pointer to a new Dask array. By default, all calls to the Dask API are lazy; it's only when you issue a .compute() call that the whole preceding task graph gets submitted to the cluster for computation and evaluation - like this.
z.compute()
array([0.99524228, 1.01138959, 1.00196082, ..., 0.99702404, 1.00168843,
       0.99625349])
The power of Dask is that you can get started on a laptop with Python, then transfer your algorithm to an HPC Cluster and scale up the computation with no change to the code.
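
In practice, 'no change to the code' means swapping the local Client above for one pointed at a real cluster's scheduler. The address below is a placeholder - substitute your own scheduler's - but everything else stays the same.

from dask.distributed import Client

# Connect to an existing Dask scheduler running on the cluster
# ('tcp://10.0.0.1:8786' is a placeholder address)
client = Client('tcp://10.0.0.1:8786')

# x, y, z and z.compute() from the examples above now run on the cluster, unchanged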

Ah, you'll say, I don't have access to an HPC like Summit. True, most people don't have access to an HPC Cluster. I'm personally lucky that I work for a company that has its own private HPC environment. Unless you work in Academia, or for a large industrial company, typically in the Automotive, Aerospace / Defence and Pharmaceutical industries, you're unlikely to be able to access that level of compute.

This is where Cloud Computing comes to the rescue, in particular services such as Amazon EC2 Spot Instances. EC2 Spot Instances allow you to request compute resources at substantial discounts to On-Demand rates. This is because Amazon have the right to interrupt your compute with only 2 minutes' notice. For example, at the time of writing this article, you can have an m4.16xlarge (64 vCPU, 256GB RAM) Spot Instance at ~$1 per hour - which is incredible. This particular configuration comes with a potential interruption rate of greater than 20%, but if you shape your Dask workload to suit - for example, keeping all 64 vCPUs busy so runs finish quickly - then roughly 80% of the time you won't see an interruption at all.
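
If you want to script the Spot request itself, the boto3 SDK will do it in a few lines. This is only a sketch: the region, AMI ID and key pair below are placeholders, and in practice you'd add security groups, storage and your own bid strategy.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')   # placeholder region

response = ec2.request_spot_instances(
    SpotPrice='1.00',            # the maximum hourly price you're willing to pay
    InstanceCount=1,
    Type='one-time',
    LaunchSpecification={
        'ImageId': 'ami-xxxxxxxx',       # placeholder AMI
        'InstanceType': 'm4.16xlarge',
        'KeyName': 'my-key-pair',        # placeholder key pair
    },
)

print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])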

So there you have it. Super Computing is now available to everyone. All you have to do is work out which massive computation problem to solve. I'd recommend looking at this video for inspiration.


Further info can be found at:

  • Dask - https://dask.org/, https://blog.dask.org/
  • Amazon EC2 Spot Instances - https://aws.amazon.com/ec2/spot/

Sunday, 14 December 2014

SuperComputing goes Embedded

Graphics Processors, or more specifically General Purpose Graphics Processing Units (GPGPUs), have been steadily making inroads into the SuperComputer market over the last 5 years or so. Their high floating-point performance coupled with lower power requirements is driving the shift from CPU to GPU cores. The graph below from NVIDIA demonstrates this trend.


There are only two major players in the GPGPU market, AMD and NVIDIA, with NVIDIA seen as the market leader, particularly with their Kepler architecture.

Earlier this year NVIDIA announced a breakthrough in Embedded System-on-a-Chip (SoC) design with the Tegra K1. The diagram below outlines the Tegra K1 architecture.


It is essentially an ARM Cortex-A15 2.3GHz 4+1 core CPU mated with a 192-core Kepler GPU, providing an amazing ~350 GFLOPS of compute at < 10W of power.

Obviously NVIDIA have an eye on the mobile gaming market, building on their Shield strategy, but they equally recognise that this step change in GFLOPS/W opens up major opportunities in the embedded market. This ranges from real-time vision and computation for autonomous cars, to advanced imaging for defence applications, UAVs in particular. In fact, General Electric Intelligent Platforms have signed a deal with NVIDIA to license the Tegra K1 SoC for their next generation of embedded vehicle computing and avionics systems.

But the really great thing about the Tegra K1 is that NVIDIA have released a development board called the Jetson TK1, which retails for an amazing $192 in the US.

I've purchased one myself and have started to get to grips with the challenges of CUDA programming. Once you've mastered the conceptual shift to parallel programming with CUDA, it starts to become relatively straightforward to develop algorithms and computations that take advantage of the GPU.

If you want to find out more detail on the Jetson TK1, then I'd recommend visiting the Jetson page on eLinux.org.

Sunday, 30 November 2014

IoT Comes of Age

Over the past couple of years the Internet of Things (IoT) has been one of the key buzzwords in the IT industry. The idea is simple: with more and more devices connected to the Internet, there's an opportunity to connect machines with people and Enterprises, and to gather and analyse huge quantities of data. General Electric (GE) is probably one of the larger organisations leading the thinking in this area, with their Industrial Internet concept.

If Cisco's mobile data traffic forecasts are anything to go by, then IoT is just at the cusp of exponential growth.

One of the key challenges of realising an IoT solution is the limitations of wireless network technologies, including 3G/4G, Bluetooth, Wi-Fi, ZigBee etc. - in particular, their demands on power and their limits in range.

Until now that is.

A number of Semiconductor manufacturers, including TI and Atmel, have now begun to develop Sub 1-GHz RF MCUs (Micro-Controller Units) with very low power requirements. These devices open up interesting applications, particularly in the field of remote sensing and mesh networks.

To give you an idea of how capable these new RF MCUs are, take a look at the video below from TI, demonstrating a battery-powered sensor sending data at ~1.2kbps over 25km!


This technology is not just the preserve of large companies or Electronics Engineers. If you have a Raspberry Pi (and to be honest, who hasn't?), then you can get in on this Sub 1-GHz revolution with a RasWIK - a Raspberry Pi Wireless Inventors Kit - for just £49.99 from a UK company called Ciseco.

This kit is based upon the TI CC1100 RF MCU, but Ciseco have taken away the challenge of writing your own over-the-air protocol by developing their own firmware layer, which they call LLAP - the Lightweight Logical Application Protocol.

The kit bundles an XRF Transceiver for your Pi, and an XRF-enabled Arduino UNO R3 with a bunch of sensors and LEDs, to get you building your IoT platform.

I have had this kit for 6 months now. Within a week of it arriving in the post, I had a wireless home temperature monitoring solution sending data to the Internet.
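
To give a flavour of what that looks like in code, here's a rough sketch of reading LLAP frames from the XRF on the Pi's serial port with pyserial. The serial device, baud rate and the 'TMPA' temperature message are assumptions based on Ciseco's LLAP documentation, so treat it as illustrative rather than definitive.

import serial

# LLAP frames are 12 ASCII characters: 'a' + 2-char device ID + payload padded with '-'
ser = serial.Serial('/dev/ttyAMA0', 9600, timeout=1)   # assumed port and baud rate

while True:
    frame = ser.read(12).decode('ascii', errors='ignore')
    if len(frame) == 12 and frame.startswith('a'):
        device_id = frame[1:3]
        payload = frame[3:].rstrip('-')
        if payload.startswith('TMPA'):                 # assumed temperature message
            print(device_id, 'temperature:', payload[4:])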

This Sub 1-GHz RF technology, in my opinion, is the leap that IoT has been waiting for. It opens up the opportunity to build very low-cost RF sensor networks that can run on coin-cell batteries for, potentially, years between battery changes.

Now, what to do with all that data? That's for another post.

The Freedom from Locks

I'm currently working on a project where I need to (i) cope with very high data rates over a shared memory buffer and (ii) squeeze as much processing power out of the (pretty low-power) CPU as I possibly can. Oh, and it's going to have to be multi-threaded.

The system is on a POSIX platform and I experimented with queues and shared memory, but none of them gave the performance I needed. One of the big issues is making the application thread safe, and that usually involves locks. Locks are expensive and are, on most compilers and CPU targets, pretty slow - there is an interesting article here comparing performance. Locks are costly to acquire and release, and there's always the potential for deadlock and contention scenarios.

So, I looked at adopting a lock-free ring / circular buffer approach. The challenge is to create a data structure and algorithm that allows multiple threads to pass data between them concurrently without invoking any locks.

The diagram below shows the structure of a Ring Buffer.


This structure is often used in the classic multi-threaded Producer / Consumer problem, but it comes with potential concurrency problems:
  • The Producer must find free space in the queue
  • The Consumer must find the next item
  • The Producer must be able to tell the queue is full
  • The Consumer must be able to tell if the queue is empty
All these operations could incur a contention / locking issue. So how do you get around this?

Firstly, you need to define a static, fixed, global data structure with free-running counters:
/* Free-running read (head) and write (tail) counters, plus the buffer itself.
   CAPACITY must be a power of two so that (index & mask) wraps correctly. */
static volatile sig_atomic_t tail=0;
static volatile sig_atomic_t head=0;
static char buffer[CAPACITY];
static int mask=CAPACITY-1;
The volatile keyword in C/C++ tells the compiler that the value held by the variable can be modified by another thread, process, or even an external device. In fact, the volatile keyword is often used to read data from an external piece of hardware that uses memory-mapped I/O.

sig_atomic_t is an integer data type that the compiler guarantees is never partially written or partially read in the presence of asynchronous interrupts. It is essentially used for signal handling in multi-threaded / multi-process contexts.

The combination of volatile and sig_atomic_t gives us a shared variable whose individual reads and writes are indivisible - no thread ever observes a partially written value - which is exactly what this Producer / Consumer design relies on.

So, we've declared head and tail as our read and write positions in our circular buffer. Now, how do we insert a piece of data into the ring buffer?
int Offer(char e) {
 int currentTail;
 int wrapPoint;

 currentTail=tail;
 wrapPoint=currentTail-CAPACITY;
 if (head<=wrapPoint) {
  return 0;                     /* buffer is full */
 }
 buffer[currentTail & mask]=e;  /* wrap the free-running counter into the buffer */
 tail=currentTail+1;
 return 1;
}
Now let's retrieve a byte from the buffer.
char Poll(void) {
 char e;
 int i;
 int currentHead;

 currentHead=head;
 if (currentHead>=tail) {
  return 0;                     /* buffer is empty */
 }
 i=currentHead & mask;          /* wrap the free-running counter into the buffer */
 e=buffer[i];
 head=currentHead+1;
 return e;
}
Now, here's the Producer / Consumer test code that invokes our lock-free buffer.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>      /* for clock_gettime() */
#include "LockFreeBuffer.h"
 
#define REPETITIONS 10000
#define TEST_VALUE 65
#define ITERATIONS 256
#define BILLION 1E9
 
void *Producer(void *arg);
void *Consumer(void *arg);
 
int main(void)
{
 int i,t;
 double secs;
 struct timespec start,stop;
 pthread_t mythread[2];
 
 printf("Starting...\n");
 for (i=0;i<ITERATIONS;i++) {
  Init();
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&start);
  if (pthread_create(&mythread[0],NULL,Producer,NULL)) {
   perror("Thread failed to create...\n");
   exit(-1);
  }
  if (pthread_create(&mythread[1],NULL,Consumer,NULL)) {
   perror("Thread failed to create...\n");
   exit(-1);
  }
  for(t=0;t<2;t++) {
   if (pthread_join(mythread[t],NULL)) {
    perror("Thread failed to join...\n");
    exit(-1);
   }
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&stop);
  secs=(stop.tv_sec-start.tv_sec)+((stop.tv_nsec-start.tv_nsec)/BILLION);
  printf("Operations/sec: %4.0f\n",REPETITIONS/secs);
 }
 printf("Completed...\n");
 exit(0);
}
 
void *Producer(void *arg)
{
 int i;
 
 i=REPETITIONS;
 do {
  while (!Offer(TEST_VALUE));      /* spin until there is space in the buffer */
 } while (0!=--i);
 pthread_exit(NULL);
}
 
void *Consumer(void *arg)
{
 char buf;
 int i;
 
 i=REPETITIONS;
 buf=0;
 do {
  while (!(buf=Poll()));           /* spin until a byte is available (mirrors the Producer's retry) */
 } while (0!=--i);
 pthread_exit(NULL);
}
If you're interested in more on lock-free buffers, check out the LMAX Disruptor for Java.

Sunday, 23 October 2011

Unsung Heroes - Dennis Ritchie

A week after the death of Steve Jobs, it was announced that Dennis Ritchie had passed away at the age of 70.

There is huge media focus on Jobs, quite understandably and rightly so, but Ritchie, in my view, contributed so much more to the world of technology we now see around us today.

To be fair, a number of mainstream media (MSM) outlets did run with the story, including an obituary in the UK Guardian.

Ritchie joined Bell Labs in 1967 to work on Multics - the pioneering OS started in MIT / Bell Labs in the 60s, taken over by Honeywell in the 70s.

Ritchie joined the Multics programme at Bell at a point of turmoil. Multics was failing to deliver. Bell dropped interest in Multics in 1969, but Ritchie, with fellow "co-conspirators" Thompson, McIlroy and Ossanna, knew there was a need for a time-sharing OS to support their programming and language development interests.

During 1969, Thompson started to develop a game called Space Travel on Multics, but with the shutdown of the Multics programme he'd lose his game, and so he started to port it to FORTRAN on a GE-635. The game was slow and costly on the 635, as computer time was charged by the hour in those days.

So to keep his gaming interest alive, Thompson got access to a PDP-7 minicomputer that had, for the time, a good graphics processor and terminal. It wasn't long before Ritchie and Thompson had programmed the PDP-7 in Assembler to get the raw performance they wanted for Space Travel. In essence, they had to build an OS on the PDP-7 to support the game development. They called this OS Uniplexed Information and Computing Systems (UNICS) as a reference to, and pun on, their ill-fated Multics programme. UNICS got shortened to "Unix".

In 1970 Bell Labs got a PDP-11, and Ritchie and the team began to port Unix to it. By this time the features and stability of the OS were growing. By 1971 Bell Labs had started to see commercial potential in what Ritchie and the team had put together on the PDP-11, and by the end of 1971 the first release of Unix was made.

Bell Labs was, essentially, part of a state-sanctioned monopoly in the Telecom space and was not allowed to profit commercially, so, basically, they gave Unix away free to academic and Government institutions. Given that the period also coincided with the birth of large-scale networking and the TCP/IP protocol, it's no coincidence that Unix became synonymous with the growth of the Internet.

Once Unix was ported to the PDP-11, Ritchie and the team set about getting a high-level language up and running on Unix. Thompson set about developing a FORTRAN port, but during this development he became influenced by earlier work at MIT on a language called BCPL, and the result became known simply as B. The goal of the compiler was to try to bridge the traditional high-level languages of FORTRAN and COBOL with the low-level systems capabilities of Assembler.

Through a number of iterations B morphed into C, and the language we know today became pretty much complete by 1973.

Ritchie's work on C culminated in the classic text The C Programming Language, first published in 1978. I purchased a copy while (attempting) to teach myself C in 1984. In fact, I still have that copy of the book.

Look at any computing device today, from a mobile phone (iPhone / Android) to the Flight Control Computers on a UAV, and the operating system running these devices can be directly traced to Ritchie's pioneering work in the late 60s and early 70s.

In terms of the legacy of Ritchie's work on C, it's the basis of numerous modern programming languages in wide use today, from C#, Java and JavaScript, to influencing scripting languages like Python, Ruby and Groovy.

Steve Jobs can certainly be credited with the turnaround of Apple and with bringing design and aesthetics to consumer technology, but it is Dennis Ritchie we should remember for providing the core foundations of computing today.

Dennis MacAlistair Ritchie, computer scientist, born 9 September 1941; died 12 October 2011.

Dennis Ritchie's Home Page on Bell Labs website.