In this article we assume you have some knowledge about Epiphany BSP. If you don’t, do not worry, we recommend you start here.

## The Parallella architecture

The Parallella board has two processors; the host processor, a dual-core ARM processor, and the co-processor, the famous Epiphany chip. The Epiphany chip has has 16 RISC (Reduced Instruction Set Computer) cores. The Epiphany chip is what makes the Parallella special, and is one of the worlds most energy efficient processors (measured in flop/watt). However, writing software for this chip can be a challenge, which is one of the reasons we developed Epiphany BSP. One of the challenges is that Epiphany cores have a very limited amount of memory available; only 32 kb! There are bigger memory banks available, but read/write speed to this memory is very slow. What to do when we need to process a lot of data? In this article we will discuss our solution!

## An application

Some things we might like to do on the Parallella:

• find a list of words frequently occuring in a 1 MB text file

• generate a list with the first million prime numbers

• calculate the sum of two large vectors

On an Epiphany core, we cannot ‘just’ do these things in memory as we would normally. The input is too big, the output is too big, or both! Notice, however, that in these examples, we do not need random access. Streaming the data (iterating over the data sequentially) would be enough. So instead of keeping the whole input/output in memory, we can split the data into smaller pieces that do fit in the local Epiphany core memory. We will call these pieces chunks.

The flow of pseudo-streamed data

Note that we do have fast random access within one chunk, and that it is possible to go through the chunks in any order we like, even repeating chunks if we want to. Allowing such non-sequential access breaks with the classical streaming paradigm. However it turns out that allowing such non-sequential access does have applications! One such application is matrix-matrix multiplication, especially Cannons algorithm can be adopted nicely to take advantage of pseudo-streams. An implementation of Cannons algorithm is provided in the examples section of Epiphany BSP. Since our technique does not satisfy the definition of Streams, we refer to it as pseudo-streaming.

## Using pseudo-streams in Epiphany BSP

Lets take a look at one of the applications of pseudo-streaming, the inner product, and how to implement it using Epiphany BSP. The inner product takes two vectors of the same length, multiplies their elements and sums the results.

First we have to send our input data, $a_i$ and $b_i$, from the host (the ARM core) to the co-processor (Epiphany cores). The library will split the data into chunks.

ebsp_create_down_stream((void*)a, pid, n_bytes, n*sizeof(int));
ebsp_create_down_stream((void*)b, pid, n_bytes, n*sizeof(int));

The arguments are:

• (void*)a, a pointer to an input vector

• pid, the processor id of the recieving Epiphany core

• n_bytes, the number of bytes in the input vector

• n*sizeof(int), the number of bytes in one chunk

On the receiving Epiphany core, we first have to open a down stream;

ebsp_open_down_stream((void**)&a, 0);
ebsp_open_down_stream((void**)&b, 1);

The arguments are:

• (void**)&a, a pointer to a place where chunks are stored

• 0, the id of the stream

Once the streams are opened, our library imediately starts the transfer of the first chunk using the DMA engine. The core can continue calculating things while this chunk is copied. We can then request a new chunk;

a_size = ebsp_move_chunk_down(&a, 0, 1);
b_size = ebsp_move_chunk_down(&b, 1, 1);

The first two arguments are identical to those of ebsp_open_down_stream, the last argument is a boolean to enable/disable prealloc mode.

The ebsp_move_chunk_down call is blocking; it halts the core until the chunk has been copied completely. It does, however, start the transfer of the next chunk (if prealloc is enabled). Finally we iterate over this chunk to calculate part of the inner product;

for (unsigned offset = 0; offset < a_size/sizeof(int); offset++) {
int ai = (int*)(a[offset]);
int bi = (int*)(b[offset]);
sum += ai * bi;
}

Putting it all together;

ebsp_open_down_stream((void**)&a, 0);
ebsp_open_down_stream((void**)&b, 1);

while (1) {
a_size = ebsp_move_chunk_down(&a, 0, 1);
b_size = ebsp_move_chunk_down(&b, 1, 1);

if (a_size == 0)
break;

for (unsigned offset = 0; offset < a_size/sizeof(int); offset++) {
int ai = (int*)(a[offset]);
int bi = (int*)(b[offset]);
sum += ai * bi;
}
}

Notice that a loop is added to go through all chunks, and that a_size is set to zero when there are no chunks left.

You can read more about pseudo-streams in Epiphany BSP in our documentation.

# Want to know more?

Want to know more about pseudo-streaming? We have written a paper on this (and more), a link will appear here once it is published. Jan-Willem Buurlage has written more about this in his Masters Thesis.