This post is part of a series of blog posts on Epiphany BSP. In the previous part of this series we have given the code of a BSP program in which we compute the dot product of two vectors. In this post we will give a detailed introduction to the API of Epiphany BSP, which should clarify anything you did not understand about that program. We will give an introduction to all primitive functions and explain the EBSP interface in detail.

Epiphany BSP simplifies the communication between the Epiphany cores and the host processor

If you want to write EBSP programs you need to have access to a Parallella board with a recent version of the Epiphany SDK (ESDK) installed. To follow along with this tutorial it is perhaps easiest to clone the EBSP example project found on GitHub and write the code in src/ecore_code.c and src/host_code.c. Alternatively you can download the latest EBSP release from the release page. To (re)build the program issue make from the command-line and afterwards run bin/host_program.

This post is organized in a number of sections:

  1. Getting started shows how to write and run a simple program using EBSP.
  2. Variables introduces the concept of registered variables, and how to read/write from/to them.
  3. Message passing introduces BSP message passing and how it is used in EBSP.
  4. Vertical data transfer shows how to send and share data between the host processor and the Epiphany processor.
  5. Conclusion summarizes the introduced concepts, and shows where to go if you want to learn more about programming using EBSP.

This is by no means an exhaustive introduction to the bulk-synchronous parallel programming paradigm. However, even if you are completely new to BSP, the Parallella platform, or both – as long as you are somewhat comfortable with programming in C, you should be able to write real, parallel programs on the Parallella after reading this post. As an added bonus, since BSP runs on a large number of systems, ranging from multicore processors to clusters and supercomputers (using BSP on top of MPI), your programs can easily be adjusted to run on other parallel systems as well. If anything is unclear, or you need any help setting up your Parallella for EBSP development, please leave a comment at the bottom of this post!

Getting started: Hello World

EBSP programs are written in single-program multiple-data (SPMD) style. This means that each core runs the same code, but obtains different data. Later we will see how we can transfer data to and from the Epiphany cores, but for now our first step will be to get the cores to output their designated core number (called pid for processor identifier). Like all programs written for the Parallella, an EBSP program consists of two parts. One part contains the code that runs on the host processor, the ARM chip that hosts the Linux OS. The other part contains the code that runs on each Epiphany core. In heterogeneous computing it is common to call this second part the kernel. A host program consists of at least four EBSP functions, which are generally used as in the following example:

// file: host_code.c

#include <host_bsp.h>

int main(int argc, char **argv)
{
    bsp_init("ecore_program.srec", argc, argv);
    bsp_begin(16);
    ebsp_spmd();
    bsp_end();

    return 0;
}

The first call to bsp_init initializes the EBSP library. The first argument is the filename of the (compiled) kernel program, and the second and third arguments are the program arguments. Next we tell the EBSP system how many cores we would like to use (in this case, all 16 cores of a standard Parallella board) by calling bsp_begin, passing 16 as its first argument. The call to ebsp_spmd starts the execution of the kernel program on the 16 cores. When the execution has finished, we finalize the EBSP system by calling bsp_end without arguments.

Next we write the kernel for our Hello World program. Besides outputting “Hello World” we also show the processor number. The code looks like this:

// file: ecore_code.c

#include <e_bsp.h>

int main()
{
    bsp_begin();
    int s = bsp_pid();
    int p = bsp_nprocs();
    ebsp_message("Hello World from processor %d / %d", s, p);
    bsp_end();

    return 0;
}

Let us also go over the kernel code line by line. First we initialize the EBSP system on the core by calling bsp_begin. In a kernel program this call does not require any arguments, since there is no additional program to run! Next we obtain our designated processor number (commonly called s) using bsp_pid, and the total number of processors (commonly called p) by calling bsp_nprocs. We then output a message to the host using ebsp_message. This function is used exactly like printf in ordinary C programs. Again we finalize the system with a call to bsp_end, which cleans up the EBSP system.

You may have noticed that some EBSP functions, which we will refer to as primitives, are prefixed with bsp_ while others are prefixed by ebsp_. This is because the EBSP library introduces some functions that are not in the BSPlib standard but that can be very helpful when programming for the Epiphany.

Running this program should result in output similar to the following:

$08: Hello World from processor 8 / 16
$01: Hello World from processor 1 / 16
$07: Hello World from processor 7 / 16
$02: Hello World from processor 2 / 16
$15: Hello World from processor 15 / 16
$03: Hello World from processor 3 / 16
$10: Hello World from processor 10 / 16
$06: Hello World from processor 6 / 16
$12: Hello World from processor 12 / 16
$13: Hello World from processor 13 / 16
$05: Hello World from processor 5 / 16
$04: Hello World from processor 4 / 16
$11: Hello World from processor 11 / 16
$14: Hello World from processor 14 / 16
$09: Hello World from processor 9 / 16
$00: Hello World from processor 0 / 16

The output has the form $[pid]: output. As we see, indeed the EBSP kernel is being run on every core! Note that there are no guarantees about which core gets to the ebsp_message statement first, and therefore the output need not be in order of processor number.

Variables

If we want to write more interesting EBSP programs, we need to have a way to communicate between the different Epiphany cores. In EBSP communication happens in one of two ways: using message passing, which we will introduce later, or via registered variables. An EBSP variable exists on every processor, but it does not have to have the same size on every Epiphany core.

Variable registration

We register a variable by calling bsp_push_reg:

int a = 0;
bsp_push_reg(&a, sizeof(int));
bsp_sync();

Here we declare an integer a, and initialize it to zero. Next we register the variable with the BSP system by passing its local address and its size.

To ensure that all cores have registered the variable, we perform a barrier synchronisation after the registration. At a barrier, each Epiphany core halts execution until every other core has reached the same point in the program, so it synchronizes the program execution between the Epiphany cores. Only one variable may be registered between consecutive calls to bsp_sync!

Putting and getting values

Registered variables can be written to or read from by other cores. In BSP this is referred to as putting a value into a variable, or getting the value of a variable. For example, to write our processor ID to the next core we can write:

int b = s;
bsp_put((s + 1) % p, &b, &a, 0, sizeof(int));
bsp_sync();

Let us explain this code line by line. As in the Hello World example, s and p hold the processor id and the number of processors respectively. In our call to bsp_put we pass the following arguments (in order):

  1. The pid of the target (i.e. the receiving) processor.
  2. A pointer to the source data that we want to copy to the target processor.
  3. A pointer representing a registered variable. Note that this pointer refers to the registered variable on the sending processor; the EBSP system matches it to the corresponding variable on the target processor, so that it knows which remote address to write to.
  4. The offset (in bytes) from which we want to start writing at the target processor.
  5. The number of bytes to copy.

Before we want to use a communicated value on the target processor, we need to again perform a barrier synchronisation by calling bsp_sync. This ensures that all the outstanding communication gets resolved. After the call to bsp_sync returns, we can use the result of bsp_put on the target processor. The code between two consecutive calls to bsp_sync is called a superstep.

When we receive the data, we can for example write the result to the standard output. Below we give the complete program which makes use of bsp_put to communicate with another processor. Here, and in the remainder of this post we will only write the code in between the calls to bsp_begin and bsp_end, the other code is identical to the code in the Hello World example.

int s = bsp_pid();
int p = bsp_nprocs();

int a = 0;
bsp_push_reg(&a, sizeof(int));
bsp_sync();

int b = s;
bsp_put((s + 1) % p, &b, &a, 0, sizeof(int));
bsp_sync();

ebsp_message("received: %i", a);

This results in the following output:

$01: received: 0
$02: received: 1
$07: received: 6
$00: received: 15
...

Here we have suppressed the output from the other cores. As we can see, each core correctly receives the processor id of the previous core!

An alternative way of communication is getting the value of a registered variable from a remote core. The syntax is very similar:

a = s;
int b = 0;
bsp_get((s + 1) % p, &a, 0, &b, sizeof(int));
bsp_sync();

The arguments for bsp_get are:

  1. The pid of the processor we want to get the value from.
  2. A pointer representing a registered variable.
  3. The offset (in bytes) at the remote processor from which we want to start reading.
  4. A pointer to the local destination.
  5. The number of bytes to copy.

And again, we perform a barrier synchronisation to ensure the data has been transferred. If you are familiar with concurrent programming, then you might think we are at risk of a race condition! What if processor s reaches the bsp_get statement before processor (s + 1) % p has set the value of a equal to its processor number? Do we then obtain zero? In this case we do not have to worry: no data transfer is initiated until each core has reached bsp_sync. Indeed we receive the correct output:

$01: received: 2
$03: received: 4
$11: received: 12
$14: received: 15
...

Unbuffered communication

So far we have discussed writing to, and reading from, variables using bsp_put and bsp_get. These two functions are buffered. When calling bsp_put, for example, the source value at the time of the function call is guaranteed to be sent to the target processor, but it does not get sent until the next barrier synchronisation – so behind the scenes the EBSP library stores a copy of the data. The BSP standard was originally designed for distributed-memory systems with very high latency, for which this design makes a lot of sense. On the Epiphany platform, however, it adds a lot of unnecessary overhead, since the data is copied to external memory.

This problem is not unique to the Epiphany platform. Together with the MulticoreBSP library, which targets modern multicore processors, two additional BSP primitives were introduced that provide unbuffered variable communication: bsp_hpput and bsp_hpget. Here the hp prefix stands for high performance.

However, although their function signatures are completely identical, these are not meant as drop-in replacements for bsp_put and bsp_get. They are unsafe in the sense that the data transfer happens immediately. This means that when using these functions you should be aware of possible race conditions, which notoriously lead to mistakes that can be very hard to debug.

To facilitate writing code using only unbuffered communication we will expose an ebsp_barrier function in the next EBSP release that performs a barrier synchronisation without transferring any outstanding communication that has arisen from calls to bsp_put and bsp_get. Let us look at an example program using these unbuffered variants.

int s = bsp_pid();
int p = bsp_nprocs();

int a = 0;
bsp_push_reg(&a, sizeof(int));
bsp_sync();

int b = s;
// barrier ensures b has been written to on each core
bsp_sync();

bsp_hpput((s + 1) % p, &b, &a, 0, sizeof(int));

// barrier ensures data has been received
bsp_sync();
ebsp_message("received: %i", a);

When writing or reading large amounts of data between different bsp_sync calls, the hp... functions are much more efficient, both in running speed and in local memory usage (which is valuable because local memory is scarce). However, extra care is needed to synchronize correctly between the cores. For example, if we remove either of the two bsp_sync calls in the previous program, we introduce a race condition.

We test the program, and see that the output is indeed identical to before:

$01: received: 0
$08: received: 7
$02: received: 1
$10: received: 9
...

Message passing

The next subject we will discuss is passing messages between Epiphany cores. Message passing relies on a message queue, which is available to every processor, and provides a way to communicate between cores without having to register variables. This can be very useful when the amount of data varies from core to core, and it is not clear beforehand how the data will be distributed. Keep in mind, however, that message passing is a lot slower than the alternative communication methods, since it utilizes the external memory.

A BSP message has a tag and a payload. The tag identifies the message, and the payload contains the actual data. The size (in bytes) of a tag is universal, i.e. it is the same across all Epiphany cores (as well as the host). The tagsize can be set using bsp_set_tagsize:

int tagsize = sizeof(int);
bsp_set_tagsize(&tagsize);
bsp_sync();

The tagsize must be set on each core simultaneously, that is to say in the same superstep. Alternatively, the tagsize can be set on the host before issuing ebsp_spmd. For compatibility reasons, the call to bsp_set_tagsize writes the old tagsize to its argument. We also provide an alternative way to obtain the tagsize, by simply calling ebsp_get_tagsize.

int tagsize = ebsp_get_tagsize();

After setting the tagsize (and synchronizing), we are ready to start sending messages. We can send a message using bsp_send:

int tag = 1;
int payload = 42 + s;
bsp_send((s + 1) % p, &tag, &payload, sizeof(int));
bsp_sync();

We first need to declare variables holding the tag and the payload. In our case these are integers, but in general you can use any data type. In order, the arguments of bsp_send are:

  1. The pid of the processor we want to send the message to.
  2. A pointer to the tag data.
  3. A pointer to the payload data.
  4. The size of the payload. Note that you can vary this size between every send call, contrary to the tagsize.

After synchronizing, the target processor can receive the message. To receive messages, we must first inspect the queue:

int packets = 0;
int accum_bytes = 0;
bsp_qsize(&packets, &accum_bytes);

The call to bsp_qsize writes the number of packets to its first argument, and the total number of bytes in the queue to its second argument. Next, we can loop over the packets, moving each one to the local core:

int payload_in = 0;
int payload_size = 0;
int tag_in = 0;
for (int i = 0; i < packets; ++i) {
    bsp_get_tag(&payload_size, &tag_in);
    bsp_move(&payload_in, sizeof(int));
    ebsp_message("payload: %i, tag: %i", payload_in, tag_in);
}

We use two new primitives here. First we obtain, for each packet (note that here we only have a single packet), the payload size and the incoming tag using bsp_get_tag. The payload itself is moved using bsp_move. Its first argument should point to a buffer large enough to store the payload data, and its second argument is the number of bytes to move. Note that we could use the obtained payload size to allocate a buffer large enough to hold the payload, and pass that to bsp_move. Keep in mind that if fewer bytes are moved than the size of the payload, the remaining data is thrown away. Here we know all messages contain a single integer, so we can write the payload into a local variable directly.

We finish our discussion of inter-core BSP message passing by providing a complete program that sends messages around:

int s = bsp_pid();
int p = bsp_nprocs();

int tagsize = sizeof(int);
bsp_set_tagsize(&tagsize);
bsp_sync();

int tag = 1;
int payload = 42 + s;
bsp_send((s + 1) % p, &tag, &payload, sizeof(int));
bsp_sync();

int packets = 0;
int accum_bytes = 0;
bsp_qsize(&packets, &accum_bytes);

int payload_in = 0;
int payload_size = 0;
int tag_in = 0;
for (int i = 0; i < packets; ++i) {
    bsp_get_tag(&payload_size, &tag_in);
    bsp_move(&payload_in, sizeof(int));
    ebsp_message("payload: %i, tag: %i", payload_in, tag_in);
}

This code results in the following output:

$02: payload: 43, tag: 1
$08: payload: 49, tag: 1
$00: payload: 57, tag: 1
$13: payload: 54, tag: 1
...

Message passing is a very general and powerful technique for when communicating through variables proves too restrictive. However, the flexibility of message passing comes with a performance penalty, because the buffers involved are too large to store on a single core. As before, bsp_hpput and bsp_hpget should be your preferred way of communicating if you are optimizing for speed.

Transferring data up and down

Writing kernels for the Epiphany is only useful when you can provide them with data to process. The easiest way to send data from the host program running on the host processor to the Epiphany cores is completely analogous to message passing between cores. So far the code we have written on the host only initializes the BSP system, starts the SPMD program on the Epiphany, and finalizes the system afterwards. Before the call to ebsp_spmd we can prepare messages containing e.g. initial data for the Epiphany cores. This works completely identically to inter-core message passing, using ebsp_set_tagsize instead of bsp_set_tagsize, and ebsp_send_down instead of bsp_send:

// file: host_code.c
int n = bsp_nprocs();

int tagsize = sizeof(int);
ebsp_set_tagsize(&tagsize);

int tag = 1;
int payload = 0;
for (int s = 0; s < n; ++s) {
    payload = 1000 + s;
    ebsp_send_down(s, &tag, &payload, sizeof(int));
}

These messages are available like any other on the Epiphany cores, but only between the call to bsp_begin and the first call to bsp_sync. So on the Epiphany cores we can read the messages using:

// file: ecore_code.c

bsp_begin();

// here the messages from the host are available
int packets = 0;
int accum_bytes = 0;
bsp_qsize(&packets, &accum_bytes);

int payload_in = 0;
int payload_size = 0;
int tag_in = 0;
for (int i = 0; i < packets; ++i) {
    bsp_get_tag(&payload_size, &tag_in);
    bsp_move(&payload_in, sizeof(int));
    ebsp_message("payload: %i, tag: %i", payload_in, tag_in);
}

// after this call the messages are invalidated
bsp_sync();
... // remainder of the program, see below

Resulting in the output:

$00: payload: 1000, tag: 1
$03: payload: 1003, tag: 1
$02: payload: 1002, tag: 1
$13: payload: 1013, tag: 1
...

A similar method can be used to send data up (from the Epiphany cores to the host). If you have followed along with our discussion so far the second half of the kernel code should be self explanatory:

// file: ecore_code.c
... // obtain initial data, see above

int payload = payload_in + 1000;
int tag = s;
ebsp_send_up(&tag, &payload, sizeof(int));

bsp_end();

Note that now we are using our processor number as the tag, such that the host can use the tag to differentiate between messages coming from different cores. Usage of ebsp_send_up is limited to the final superstep, i.e. between the last call to bsp_sync and the call to bsp_end. In the host program we can read the resulting messages similarly to how we read them on the Epiphany processor:

// file: host_code.c

...
ebsp_spmd();

int packets = 0;
int accum_bytes = 0;
ebsp_qsize(&packets, &accum_bytes);

int payload_in = 0;
int payload_size = 0;
int tag_in = 0;
for (int i = 0; i < packets; ++i) {
    ebsp_get_tag(&payload_size, &tag_in);
    ebsp_move(&payload_in, sizeof(int));
    printf("payload: %i, tag: %i\n", payload_in, tag_in);
}

bsp_end();

This gives the output:

payload: 2001, tag: 1
payload: 2013, tag: 13
payload: 2003, tag: 3
payload: 2002, tag: 2
...

For the first time we have written data to the cores, applied a transformation to the data using the Epiphany cores, and sent it back up to the host program.

Message passing is a nice way to get initial data to the Epiphany cores, and to get the results of computations back to the host. However, it is rather restrictive, and does not give the user much control over the way the data is sent down. An alternative approach is given by ebsp_write and ebsp_read. These calls require manually addressing the local memory on each core. Every core has 32 KB of local memory, corresponding to addresses 0x0000 up to 0x8000. With the default EBSP settings, the program data is placed at the start of this space (i.e. at 0x0000), and the stack grows downwards from the end (i.e. from 0x8000). Using ebsp_write from the host program, you can prepare data at specific addresses in the local memory of each core:

int data[4] = { 1, 2, 3, 4 };
for (int s = 0; s < n; ++s) {
    ebsp_write(s, data, (void*)0x5000, 4 * sizeof(int));
}

This writes 4 integers to each core, starting at address 0x5000. Similarly, ebsp_read can be used to retrieve data from the cores. We would not recommend this approach for users just beginning with the Parallella, and EBSP in particular. A better way to move large amounts of data to and from the Epiphany processor uses data streams, which will be introduced in the next EBSP release. Streams allow data to be moved in predetermined chunks, which are acted upon independently. We will explain this approach in detail in a future blog post.

Conclusion

The EBSP API can be a bit daunting at first, especially since we have chosen to adhere rather strictly to the BSPlib standard. However, once you get used to calling the BSP primitives, it provides a very convenient and straightforward way to implement parallel programs on the Parallella. We suggest you make extensive use of the EBSP documentation while developing, since it should include every detail you need to know when writing EBSP programs and kernels. The EBSP library also comes with a number of examples, which can be found in the main GitHub repository and should make for interesting reading.

In a future post we will introduce more advanced EBSP features, such as the concept of data streaming, which allows for complex algorithms involving e.g. large matrices, and for explicit use of the external memory.