The Parallella and its Epiphany coprocessor are capable of very high performance. However, many new users find it very hard to get started with developing software for this platform. The libraries and tools found in the Epiphany SDK are rather low-level, and good documentation and support are hard to find because the community is relatively small compared to those of other small-form-factor computing devices such as the Arduino and the Raspberry Pi.

Providing better libraries and tools that cater to the average developer, who may or may not be familiar with parallel programming, has been a continued (community) effort. This has resulted in many higher-level libraries based on technologies and models such as MPI, BSP, and OpenMP. However, all these libraries, including the Epiphany BSP library that was developed by us, still require the developer to write in C, which is not necessarily the right language for every application. While MPI continues to dominate the HPC landscape, many new ‘Big Data’ technologies such as Hadoop and Spark have been developed that allow the user to write massively parallel programs using an API that is much higher level than the aforementioned technologies. These programs are generally built by defining large distributed data sets and performing operations directly on these sets, instead of manually writing the individual send and receive calls at the appropriate places (referred to as the ‘transport layer’).

Because of the hardware constraints, it is not easy to port these new technologies to the Parallella. However, the Epiphany compiler does support C++. Using modern C++ constructs such as smart pointers, lambda functions, and range-based for loops, it is entirely possible to write small, performant libraries for the Epiphany that allow for writing parallel programs without taking a performance hit, while no longer having to worry about the transport layer. Instead, higher-level functions are used and composed to construct complex algorithms. One of the new developments in the OpenCL project is the announcement of SYCL, which is very similar in nature to the set-up we describe.

However, C++ programs do not work out of the box on the Parallella. Without some initial work, the binaries will be far too large to fit in the local memory. In the first part of this article we will get you started with running C++ programs on the Parallella. In the second part we will describe our plans for a new library based on Epiphany BSP and C++ that will make developing software that targets the Epiphany processor much easier.

Compiling C++ on the Epiphany

A common misconception is that C++ is not suitable for embedded systems because it produces large binaries. Indeed, the following snippet produces a binary with more than 100 KB of code:

#include <vector>

int main() {
    std::vector<int> vec;
    return 0;
}

With a local memory size of 32 KB per core this is not suitable for the Epiphany platform. Note that just looking at the size of the produced file will not tell you the size of the code, because the executable file contains a lot of other info like symbols required for linking. Instead one can use our tool epiphany-bin-info (link to GitHub) that shows the actual code size for Epiphany binaries (amongst other things). It is also possible to use tools like readelf and objdump although this may take a little more effort.

There are, however, a few easy tricks after which the same source file produces less than 6 KB of code (even including the Epiphany BSP library). Compiling with -Os (optimize for size) helps, but it is not enough on its own.

The main reason for C++ binaries being larger than C binaries is exception handling. Imagine a function f that has a try/catch block in which it calls a function g, which in turn calls a function h. When the function h throws an exception, it will be caught by the catch block of f. The only way this is possible is if all functions include elaborate exception handling code that correctly destroys local objects, so that an exception in h safely passes back to g and then back to f.
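The f, g, h scenario can be tried out on an ordinary host compiler with exceptions enabled. The sketch below is only an illustration: the names Tracer and trace_log are hypothetical helpers that record the order of events, and the bodies of f, g and h are made up for this example. It shows that g, which contains no try/catch at all, still has to destroy its local object while the exception travels from h to f.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log that records the order of events.
std::vector<std::string> trace_log;

// A small object whose destructor records when it is destroyed.
struct Tracer {
    std::string name;
    explicit Tracer(std::string n) : name(std::move(n)) {}
    ~Tracer() { trace_log.push_back("destroy " + name); }
};

void h() {
    Tracer t("h-local");
    throw std::runtime_error("error in h");
}

void g() {
    Tracer t("g-local");
    h();  // g has no try/catch, yet its local object must still be destroyed
}

// Returns true if the exception thrown in h was caught here.
bool f() {
    try {
        g();
    } catch (const std::runtime_error&) {
        trace_log.push_back("caught in f");
        return true;
    }
    return false;
}
```

After calling f(), trace_log holds the destruction of the local in h, then the destruction of the local in g, and only then the entry for the catch block in f. The compiler has to emit the code and tables that make this ordering possible for every function on the call path.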

What you should therefore do is use the flag -fno-exceptions, which disables the generation of such code. However, exception handling code can still end up in your binary if you use new and delete. Even if you do not use them explicitly, objects like std::vector handle memory allocation with new and delete. To solve this, one can simply replace these operators with versions that call malloc and free.

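#include <cstddef>  // std::size_t
#include <cstdlib>  // std::malloc, std::free

// Allocate with malloc instead of the default, throwing operator new.
void* operator new(std::size_t size) {
    return std::malloc(size);
}
// Release memory obtained from the operator new above.
void operator delete(void* ptr) noexcept {
    std::free(ptr);
}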

There are a few more functions that need to be overridden. See reduce_binary_size.cpp in our GitHub repo for a complete file that you can include.

Another option you should use is -std=c++14 (as opposed to an older standard such as C++11). This is because C++14 improves upon earlier C++ revisions with respect to lambdas and move semantics. Although these extra features might seem like something that makes the code larger, they are actually optimizations that reduce code size.

In summary, your compile command should look something like this:

e-g++ -std=c++14 -Os -fno-exceptions -o output_file.elf source.cpp reduce_binary_size.cpp

Introducing Bulk

Bulk is an alternative interface for writing parallel programs in C++ in bulk-synchronous style. The main goal of the project is to do away with the unnecessary boilerplate and ubiquitous pointer arithmetic that is found in, for example, the BSPlib standard. Here we will give a preview of the computations that are possible in the current version of Bulk, which we will release as an open-source project soon.

Initializing the system

The central object in Bulk is called a hub. The hub keeps track of the state of the system, such as the number of processors in use and the distributed variables that have been created, and it facilitates sending and receiving data by providing communication buffers. The hub can spawn an SPMD (single program, multiple data) section to be run on a given number of processors, by which we mean any independent entity that can execute code (such as an Epiphany core).

hub.spawn(hub.available_processors(), [&hub](int s, int p) {
    std::cout << "Hello, world " << s << "/" << p << std::endl;
});

Here, s is the processor id, and p the number of processors that are running the SPMD block.
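To make the shape of the SPMD callback concrete, here is a minimal host-side sketch. It is not the Bulk implementation: toy_hub and greet_all are hypothetical names, and the sections are simply run one after another in a loop, whereas a real backend would run them concurrently on separate cores.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a hub; NOT the Bulk implementation.
// It runs the SPMD section once per processor id, one after another.
class toy_hub {
  public:
    explicit toy_hub(int processors) : processors_(processors) {
        assert(processors > 0);
    }

    int available_processors() const { return processors_; }

    // Calls section(s, p) for every processor id s in [0, p).
    void spawn(int p, const std::function<void(int, int)>& section) const {
        assert(p > 0 && p <= processors_);
        for (int s = 0; s < p; ++s) {
            section(s, p);
        }
    }

  private:
    int processors_;
};

// Collects the greeting produced by every processor.
std::vector<std::string> greet_all(int processors) {
    toy_hub hub(processors);
    std::vector<std::string> greetings;
    hub.spawn(hub.available_processors(), [&greetings](int s, int p) {
        greetings.push_back("Hello, world " + std::to_string(s) + "/" +
                            std::to_string(p));
    });
    return greetings;
}
```

Calling greet_all(4) returns four strings, one per processor id, each produced by the same lambda with a different value of s.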

Distributed variables

Supersteps can be seen as the sections of code in between calls to a global synchronisation. A distributed variable is declared once, in the same superstep, on each processor. The variable can then have a different value on each processor (the image of the variable), and the images are identified with each other using the order in which the variables are declared. This allows processors to read from, and write to, the images on remote processors.

auto a = hub.create_var<int>();

hub.put(hub.next_processor(), s, a);
hub.sync();

// ... a.value() is now updated
// and contains the id of the previous processor

We can also obtain the value of a variable from a remote processor:

auto b = hub.get<int>(hub.next_processor(), a);
hub.sync();

// ... b.value() is now available

Here, the type of b is a future<T>, since its value is only valid after a future call to sync.
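The key point of the put example is that a write only becomes visible after the synchronisation. The following sketch models this for a single distributed int variable. It is a sequential toy model under our own assumptions, not Bulk itself: toy_var and shift_ids are hypothetical names, all images live in one address space, and the next processor is assumed to be (s + 1) % p.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one distributed int variable on p processors.
// Not the Bulk implementation: all images live in one address space.
class toy_var {
  public:
    explicit toy_var(int p) : images_(p, 0) { assert(p > 0); }

    int processors() const { return static_cast<int>(images_.size()); }

    // Local image of processor s.
    int& local(int s) { return images_.at(s); }

    // Request: write `value` into the image on processor `target`.
    // The write is only buffered here.
    void put(int target, int value) {
        assert(target >= 0 && target < processors());
        pending_.push_back({target, value});
    }

    // End of the superstep: apply every buffered put.
    void sync() {
        for (const auto& w : pending_) {
            images_[w.target] = w.value;
        }
        pending_.clear();
    }

  private:
    struct pending_write {
        int target;
        int value;
    };
    std::vector<int> images_;
    std::vector<pending_write> pending_;
};

// Every processor s puts its own id into the image on processor (s + 1) % p.
std::vector<int> shift_ids(int p) {
    toy_var a(p);
    for (int s = 0; s < p; ++s) {
        a.put((s + 1) % p, s);
    }
    // Before sync() no image has changed yet.
    for (int s = 0; s < p; ++s) {
        assert(a.local(s) == 0);
    }
    a.sync();
    std::vector<int> result;
    for (int s = 0; s < p; ++s) {
        result.push_back(a.local(s));
    }
    return result;
}
```

For p = 4, shift_ids returns 3, 0, 1, 2: after the synchronisation every image contains the id of the previous processor. A get can be modelled the same way, with the read request buffered and the result filled in during sync, which is why its result is a future.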

Message passing

Messages can also be sent to other processors. The syntax is greatly simplified when compared to the MPI or BSP code that accomplishes the same result:

for (int t = 0; t < p; ++t) {
    // send a message with tag `s` and content `s * s` to every processor (including this one)
    hub.send<int, int>(t, s, s * s);
}

hub.sync();

for (auto message : hub.messages<int, int>()) {
    std::cout << message.tag << ", " << message.content << std::endl;
}
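The send, sync, read pattern above can be mimicked with a small sequential model. Everything in this sketch is an assumption made for illustration: toy_message, toy_mailboxes and exchange_squares are hypothetical names, and discarding unread messages at the next sync is modelled on BSPlib rather than taken from Bulk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical message type mirroring the tag/content pair used above.
struct toy_message {
    int tag;
    int content;
};

// Hypothetical model of per-processor message queues; not the Bulk
// implementation.
class toy_mailboxes {
  public:
    explicit toy_mailboxes(int p) : in_flight_(p), delivered_(p) {
        assert(p > 0);
    }

    // Queue a message for `target`; it is not readable until sync().
    void send(int target, int tag, int content) {
        in_flight_.at(target).push_back({tag, content});
    }

    // End of the superstep: in-flight messages become readable, and messages
    // delivered in the previous superstep are discarded (an assumption).
    void sync() {
        delivered_.swap(in_flight_);
        for (auto& queue : in_flight_) {
            queue.clear();
        }
    }

    // Messages that processor s can read in the current superstep.
    const std::vector<toy_message>& messages(int s) const {
        return delivered_.at(s);
    }

  private:
    std::vector<std::vector<toy_message>> in_flight_;
    std::vector<std::vector<toy_message>> delivered_;
};

// Runs the loop from the text for all p processors and returns the messages
// that processor `reader` receives.
std::vector<toy_message> exchange_squares(int p, int reader) {
    toy_mailboxes boxes(p);
    for (int s = 0; s < p; ++s) {
        for (int t = 0; t < p; ++t) {
            boxes.send(t, s, s * s);
        }
    }
    boxes.sync();
    return boxes.messages(reader);
}
```

With p = 3, every processor receives three messages, one from each sender, carrying the sender id as tag and its square as content.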

Co-arrays

Co-arrays originate as a Fortran extension. They allow arrays to have an extra dimension that ranges over the processors. In its simplest form, a co-array is similar to the distributed variables we introduced before, but it uses a different syntax:

auto a = hub.create_var<int>();
a(4) = 1; // the same as hub.put<int>(4, 1, a);
a = 2;

Here the first statement sends the value 1 to processor 4 (valid only after the next global synchronisation), while the second statement writes the value 2 to the local image of the variable. This can be seen as an array of total size p, where p is the number of processors, of which each processor holds only a single value.

This can easily be extended to 2-dimensional arrays, which have a local size greater than one. These can be created as follows:

auto a = hub.create_coarray<int>(local_size);
a(4)[2] = 1;
a[3] = 2;

This will first write 1 to the element with index 2 on processor 4, and next write 2 to the local array element with index 3.
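One way such a syntax can be built in C++ is with small proxy objects: a(t) returns a handle to the image on processor t, indexing that handle returns a handle to a single element, and assigning to the element records a remote write. The sketch below illustrates the idea in a sequential toy model. It is not how Bulk is implemented, all type and function names are hypothetical, and we assume that remote writes are buffered until the next synchronisation while local writes take effect immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage for a co-array: p images of local_size ints each.
class toy_coarray_store {
  public:
    toy_coarray_store(int p, int local_size)
        : p_(p), local_size_(local_size),
          data_(static_cast<std::size_t>(p) *
                    static_cast<std::size_t>(local_size), 0) {
        assert(p > 0 && local_size > 0);
    }

    // Element i of the image on processor t.
    int& at(int t, int i) {
        assert(t >= 0 && t < p_ && i >= 0 && i < local_size_);
        return data_[static_cast<std::size_t>(t) * local_size_ + i];
    }

    // Buffer a remote write until the next sync().
    void queue_write(int t, int i, int value) {
        assert(t >= 0 && t < p_ && i >= 0 && i < local_size_);
        pending_.push_back({t, i, value});
    }

    // End of the superstep: apply all buffered remote writes.
    void sync() {
        for (const auto& w : pending_) {
            at(w.t, w.i) = w.value;
        }
        pending_.clear();
    }

  private:
    struct pending_write {
        int t;
        int i;
        int value;
    };
    int p_;
    int local_size_;
    std::vector<int> data_;
    std::vector<pending_write> pending_;
};

// Element on a remote image: assignment is buffered, not applied.
struct remote_element {
    toy_coarray_store& store;
    int t;
    int i;
    void operator=(int value) { store.queue_write(t, i, value); }
};

// The image of the co-array on processor t.
struct remote_image {
    toy_coarray_store& store;
    int t;
    remote_element operator[](int i) { return remote_element{store, t, i}; }
};

// The co-array as seen from processor s.
struct toy_coarray {
    toy_coarray_store& store;
    int s;
    remote_image operator()(int t) { return remote_image{store, t}; }
    int& operator[](int i) { return store.at(s, i); }  // local and immediate
};

struct coarray_demo_result {
    int remote_before;
    int remote_after;
    int local_value;
};

// Repeats the two statements from the text, seen from processor 0.
coarray_demo_result run_coarray_demo() {
    toy_coarray_store store(8, 5);  // 8 processors, local size 5 (arbitrary)
    toy_coarray a{store, 0};
    a(4)[2] = 1;  // buffered write to element 2 on processor 4
    a[3] = 2;     // immediate write to local element 3
    int before = store.at(4, 2);
    store.sync();
    return {before, store.at(4, 2), store.at(0, 3)};
}
```

In this model the element on processor 4 is still 0 right after the assignment and becomes 1 only after sync, while the local element is 2 straight away.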

Other features

We also introduce higher-level functions such as fold, map, and reduce, which support user-defined functions. In combination with distributed data objects, these make programming the Parallella, as well as other parallel and distributed systems, a matter of writing algorithms instead of sending individual data elements, without losing the power and flexibility of a low-level library.
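As an impression of what a fold with a user-defined function amounts to, consider the following sequential sketch. The name distributed_fold and its signature are hypothetical and not part of any released API. Each inner vector stands for the data held by one processor; every processor first folds its own part, and the partial results are then combined, which on real hardware would involve one round of communication and a synchronisation. The operation is assumed to be associative, with the given identity element.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: folds data that is split over several processors.
// local_parts[s] models the data held by processor s.
int distributed_fold(const std::vector<std::vector<int>>& local_parts,
                     const std::function<int(int, int)>& op, int identity) {
    // Step 1: every processor folds its own part (no communication needed).
    std::vector<int> partial;
    for (const auto& part : local_parts) {
        partial.push_back(
            std::accumulate(part.begin(), part.end(), identity, op));
    }
    // Step 2: combine the partial results. On real hardware this step
    // follows a communication phase and a synchronisation.
    return std::accumulate(partial.begin(), partial.end(), identity, op);
}
```

For example, folding the parts {1, 2, 3}, {4, 5} and {6} with addition and identity 0 gives 21, the same as folding the concatenated data on a single processor.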