Sunday, 14 December 2014

SuperComputing goes Embedded

Graphics Processors, or more specifically General Purpose Graphics Processing Units (GPGPUs), have been steadily making inroads into the SuperComputer market over the last 5 years or so. Their high rate of floating point performance coupled with lower power requirements driving the shift from CPU to GPU cores. The graph below from NVIDIA demonstrates this trend.


There are only two major players in the GPGPU market, AMD and NVIDIA, with NVIDIA seen as the market leader, particularly with their Kepler architecture.

Earlier this year NVIDIA made an announcement on a breakthrough in Embedded System-on-a-Chip (SoC) design with the Tegra K1. The diagram below outlines the Tegra K1 architecture.


Essentially an ARM A15 2.3GHz 4+1 core CPU mated with a 192 core Kepler GPU providing an amazing ~350 GFLOPS of compute at < 10W of power.

Obviously NVIDIA have an eye on the mobile gaming market, building on their Shield strategy, but equally recognise this step change in GFLOPS/W opens up major opportunities in the embedded market. This ranges from real-time vision and computation for autonomous cars, to advanced imaging applications for defence applications, UAVs in particular. In fact, General Electric Intelligent Platforms have signed a deal with NVIDIA to license the Tegra K1 SoC for their next generation embedded vehicle computing and avionics systems.

But, the really great thing about the Tegra K1 is that NVIDIA have released a development board call the Jetson K1, which retails for an amazing $192 in the US.

I've purchased one myself and have started to get to grips with the challenges of CUDA programming. Once you've mastered the conceptual shift to parallel programming with CUDA, then it starts to become relatively straightforward to develop algorithms and computations that take advantage of the GPU.

If you want to find out more detail on the Jetson K1, then I'd recommend visiting the Jetson page on eLinux.org.

Sunday, 30 November 2014

IoT Comes of Age

Over the past couple of years Internet of Things (IoT) has been one of the key buzzwords in the IT industry. The idea is simple, with more and more devices connected to the Internet there's opportunity to connect machines with people and Enterprises and gather and analyse huge quantities of data. General Electric (GE) is probably one of the larger organisations who is a thought leader in this area with their Industrial Internet concept.

If Cisco's mobile data traffic forecasts are anything to believe, then IoT is just at the cusp of exponential growth.
One of the key challenges of realising an IoT solution is the limitations of wireless network technologies, including 3/4G, Bluetooth, Wi-Fi, ZigBee etc. In particular demands on power and limits in range.

Until now that is.

A number of Semiconductor manufacturers, including TI and Amtel, have now begun to develop Sub 1-GHz RF MCUs (Micro-Controller Unit) with very low power requirements. These devices now open up interesting applications, particular in the field of remote sensing and mesh networks.

To give you an idea how capable these new RF MCUs are, take a look at the video below from TI, demonstrating a battery powered sensor sending data at ~1.2kbps over 25km!


This technology is not just the preserve of large companies or Electronics Engineers. If you have a Raspberry Pi (and to be honest, who hasn't), then you can get in on this Sub 1-Ghz revolution with a RasWIK - a Raspberry Pi Wireless Inventors Kit for just £49.99 from a UK company called Ciseco.

This kit is based upon the TI CC 1100 RF MCU, but Ciseco have made the challenge of writing your own over-the-air protocol, by developing their own firmware layer they call LLAP - Lightweight Logical Application Protocol.

The kit bundles a XRF Transceiver for your PI, and an XRF enabled Arduino UNO R3 with a bunch of sensors and LEDs to get you building your IoT platform.

I have had this kit for 6 months now. Within a week of this arriving in the post, I had a wireless home temperature monitoring solution sending data to the Internet.

This Sub 1-GHz RF technology, in my opinion, is the leap that IoT has been waiting for. This opens up the opportunity to build very low cost RF sensor networks that can run on Coin-Cell batteries for, potentially, years before requiring new batteries.

Now, what to do with all that data? That's for another post.

The Freedom from Locks

I'm currently working on a project where I need to (i) cope with very high data rates over a shared memory buffer and (ii) squeeze as much processing power out of the (pretty low power) CPU as I possibly can. Oh, it it's going to have to be multi-threaded.

The system is on a POSIX platform and I experimented with Queues, Shared Memory but non of them gave the performance I needed. One of the big issues is making the application thread safe, and that usually involves locks. Locks are expensive and are, on most compilers and CPU targets, pretty slow. There is an interesting article here comparing performance. Locks are costly to acquire and release, and there's always the potentially for deadlock and contention scenarios.

So, I looked to adopting a Lock Free Ring / Circular Buffer approach. The challenge is to create a data structure and algorithm that allows multiple threads pass data between them concurrently without invoking any locks.

The diagram below shows the structure of a Ring Buffer.


This structure is often used in the classic multi threaded Producer / Consumer problem, but there are potential concurrency problems with this structure:
  • The Producer must find free space in the queue
  • The Consumer must find the next item
  • The Producer must be able to tell the queue is full
  • The Consumer must be able to tell if the queue is empty
All these operations could incur a contention / lock issue. So how to you get around this?

Firstly you need to define a static fixed global data structure with free running counters:
static volatile sig_atomic_t tail=0;
static volatile sig_atomic_t head=0;
static char buffer[CAPACITY];
static int mask=CAPACITY-1;
The Volatile keyword in C/C++ tells the compiler that the value held by the variable can be modified by another thread, process, or even external device. In fact the Volatile keyboard is often used to detect data from an external piece of hardware that uses memory mapped i/o.

sig_atomic_t is an Integer data type that tells the compiler to ensure that the variable is not partially written or partially read in the presence of asynchronous interrupts. Essentially used for signal handling in multi threaded / process contexts.

The combination of a volatile sig_atomic_t integer is that it creates a inter process thread safe signal handling variable that ensures any operation completes in a single CPU cycle that cannot be interrupted.

So, we've declared tail and head as our read and write positions in our circular buffer. Now, how do we insert a piece of data into the ring buffer?
int Offer(char e) {
 int currentTail;
 int wrapPoint;

 currentTail=tail;
 wrapPoint=currentTail-CAPACITY;
 if (head<=wrapPoint) {
  return 0;
 }
 buffer[currentTail % mask]=e;
 tail=currentTail+1;
 return 1;
}
Now lets retrieve a byte from the buffer.
char Poll(void) {
 char e;
  int i;
 int currentHead;

 currentHead=head;
 if (currentHead>=tail) {
  return 0;
 }
 i=currentHead & mask;
 e=buffer[i];
 head=currentHead+1;
 return e;
}
Now, here's Producer / Consumer thread that invokes our lock free buffer.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include "LockFreeBuffer.h"

#define REPETITIONS 10000
#define TEST_VALUE 65
#define ITERATIONS 256
#define BILLION 1E9

void *Producer(void *arg);
void *Consumer(void *arg);

int main(void)
{
 int i,t;
 double secs;
 struct timespec start,stop;
 pthread_t mythread[2];

 printf("Starting...\n");
 for (i=0;i<ITERATIONS;i++) {
  Init();
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&start);
  if (pthread_create(&mythread[0],NULL,Producer,NULL)) {
   perror("Thread failed to create...\n");
   exit(-1);
  }
  if (pthread_create(&mythread[1],NULL,Consumer,NULL)) {
   perror("Thread failed to create...\n");
   exit(-1);
  }
  for(t=0;t<2;t++) {
   if (pthread_join(mythread[t],NULL)) {
    perror("Thread failed to join...\n");
    exit(-1);
   }
  }
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&stop);
  secs=(stop.tv_sec-start.tv_sec)+((stop.tv_nsec-start.tv_nsec)/BILLION);
  printf("Operations/sec: %4.0f\n",REPETITIONS/secs);
 }
 printf("Completed...\n");
 exit(0);
}

void *Producer(void *arg)
{
 int i;

 i=REPETITIONS;
 do {
  while (!Offer(TEST_VALUE));
 } while (0!=--i);
 pthread_exit(NULL);
}

void *Consumer(void *arg)
{
 char buf;
 int i;

 i=REPETITIONS;
 buf=0;
 do {
  while (buf=Poll());
 } while (0!=--i);
 pthread_exit(NULL);
}
If you're interested in more on lock free buffers, check out the LMAX Distruptor for Java