This post is the first in a series of blog posts on Epiphany BSP. Here we give a brief introduction to the BSP model and the Epiphany architecture found on the Parallella board, which was released after a successful Kickstarter campaign in October 2012.

The Parallella is a small “credit card-sized computer” which features two processors. The general-purpose processor is a dual-core ARM, but what makes the board special is the Epiphany co-processor, which has 16 RISC cores.

The BSP (bulk-synchronous parallel) model is a model for designing parallel algorithms: it prescribes how a parallel computation should be carried out. Programs written in this model consist of a number of supersteps, which in turn consist of local computations and non-blocking communication. A (barrier) synchronisation at the end of each superstep guarantees that all communication within that step has completed.

# The birth of Epiphany BSP

As part of a graduate course on parallel algorithms, we started developing an initial implementation of the BSP model for the Parallella in December 2014, which we call Epiphany BSP (EBSP). Over the years many algorithms had already been implemented on top of the BSP model, and we set out to create a library that would require minimal changes to these algorithms for them to run on the Epiphany platform.

Epiphany BSP extends the BSP model to facilitate communication between a host processor and a co-processor. We also introduce primitive functions specific to using BSP in an embedded environment. Illustration by Marijke Buurlage.

We had very little experience developing for embedded systems, and we soon realized that we could not take the many luxuries of common platforms for granted. For example, an Epiphany core only has 32kB of local memory. This means that executable code, user data and the stack all have to be crammed into very little space. Furthermore, the address space is completely flat and has to be managed manually.

After some weeks we had a working proof-of-concept, and we launched the first public alpha in April 2015. The library provides a convenient way to develop parallel software for the Epiphany architecture, and boasts an easy-to-understand interface.

Currently we are preparing for our next release, which will introduce data streams to our library, coupled with a convenient API. This will greatly increase the performance of EBSP, and will allow software to tackle computations on a much larger scale.

In the remainder of this post we will introduce a simple BSP algorithm, and give code implementing it, so that you can get a feeling for the basic capabilities and syntax of EBSP.

# Computing the dot product with BSP

As a relatively simple example of a BSP algorithm, consider the parallelization of the dot product of two vectors $\vec{v}$ and $\vec{w}$: $$\alpha = \vec{v} \cdot \vec{w} = \sum_{i = 0}^{n - 1} v_i w_i$$

We assume that the vectors have $n$ components and we want to use $p$ cores to compute the dot product. In the case of the Epiphany chip we have $p = 16$. We can develop a simple algorithm in bulk-synchronous parallel terms by describing the different steps involved.

1. We distribute $\vec{v}$ and $\vec{w}$ over the $p$ processors cyclically. This means that the $s$th processor holds the components $v_i$ and $w_i$ if and only if $i \% p = s$.
2. Each processor $s$ computes the partial sum $$\alpha_s = \sum_{i \% p = s} v_i w_i$$ and sends this sum to the first processor.
3. The first processor with number $s = 0$ combines the $p$ partial sums to compute the final product: $$\alpha = \sum_{s = 0}^{p-1} \alpha_s$$

We can analyse the expected running time of this algorithm by first counting the number of FLOPs (floating-point operations). In the normal (sequential) algorithm the processor performs $n$ multiplications and $n$ additions, such that the total number of FLOPs is $2n$.

In the parallel version each core only computes a maximum of $\left\lceil n / p \right\rceil$ additions and multiplications; however, we also introduce communication between the cores. To express this in terms of FLOPs, the BSP cost model introduces the latency $l$ and the communication ratio $g$. The latency can be seen as the time needed to initiate any communication, and the communication ratio is the time needed to send or receive one word of data, both expressed in terms of FLOPs. Both of these parameters are characteristic for a certain platform. Using these parameters we can express the total time needed for the parallel algorithm (in terms of FLOPs) as:

$$T_{\text{par}} = 2 \left\lceil \frac{n}{p} \right\rceil + pg + 2p + 2l$$

Indeed, we communicate $p$ words, and the first processor needs to perform $2p$ extra FLOPs to compute the final sum. If this quantity is smaller than $2n$ (which depends on the specific values of $n, p, g$ and $l$) then parallelizing the dot product is worthwhile. The BSP cost model is a convenient way to predict the performance and benefit of using a parallel algorithm.

# Implementation in EBSP

To give you an idea of how this algorithm can be implemented on top of EBSP, we show the code for the Epiphany processor. You may have noticed that we did not analyse the cost of distributing the vectors over the processors. Here, too, we will simply assume the vectors have been initialized. Sending data down to the Epiphany processor can be done in many different ways, and the EBSP library supports multiple methods for data transfer.

```c
int main()
{
    // initialize the BSP system
    bsp_begin();
    int s = bsp_pid();
    int p = bsp_nprocs();

    // initialize v and w of size n / p
    ...

    // (1a) compute the partial sum
    float alpha_s = 0.0f;
    for (int j = 0; j < (n / p); ++j) {
        // note that the vector index i = s + j * p
        alpha_s += v[j] * w[j];
    }

    // (1b) send the partial sum to s = 0
    // first we must register our variable
    // here we reserve a specific part of our memory for the incoming array
    // in production code we would recommend a more elaborate approach using
    // e.g. BSP message passing
    float* alpha_incoming = (float*)0x6000;
    bsp_push_reg(alpha_incoming,            // location of variable
        (s == 0) ? sizeof(float) * 16 : 0); // size of variable
    bsp_sync(); // the registration takes effect after a synchronisation

    // next we send the data
    bsp_put(0,                  // target processor
        &alpha_s,               // local source of data
        alpha_incoming,         // destination variable
        s * sizeof(float),      // write offset
        sizeof(float));         // number of bytes to send

    // the barrier synchronisation guarantees all partial sums have arrived
    bsp_sync();

    // (2) we then compute the final sum on s = 0
    float alpha = 0.0f;
    if (s == 0) {
        for (int t = 0; t < p; ++t) {
            alpha += alpha_incoming[t];
        }
    }

    bsp_end();
}
```

# Developing with EBSP

In a future blog post we will give a detailed introduction to Epiphany BSP and will discuss all the primitive functions that combine to make up the EBSP API. If you are interested in learning more about EBSP, we also provide extensive documentation of the library. If you are interested in studying the BSP model in detail we recommend the book *Parallel Scientific Computation: A Structured Approach using BSP and MPI* by Rob Bisseling.