# Introduction

In this blog post, we describe how we measured the speed of the 16-core Epiphany chip on the Parallella. In particular we cover the speeds of data transfers from one Epiphany core to another as well as transfer speeds from and to external memory. The results were used to optimize the implementation of the EBSP library.

Let us first introduce some of the relevant terms and concepts for readers unfamiliar with the system. The Parallella board contains a dual core ARM processor, generally referred to as the host processor. The board also contains an Epiphany chip consisting of 16 Epiphany cores, or E-cores. Every E-core has 32 kB of memory, referred to as local memory. There is 1 GB of ‘normal’ RAM on the board, of which 32 MB is dedicated to communication between the host processor and E-cores and can be accessed directly by the E-cores. This 32 MB of RAM is referred to as external memory. There are many different ways to refer to these different types of memory; we have written a detailed overview of the memory types in the documentation of EBSP.

Every E-core has a DMA engine (DMA stands for Direct Memory Access) built in. For our purposes this can be viewed as a separate core that can only transfer memory. In practice, the DMA engine is given a source and destination address, along with some other options. It will then start transferring data while the E-core continues with other operations.

# The different transfer speeds

There are quite a few different speeds that can be tested. There are three entities on the Parallella that are able to perform a data transfer:

• The host ARM processor
• An individual Epiphany core
• An E-core DMA engine

There are two ‘targets’ that data can be written to or read from:

• External memory
• Local E-core memory

Important: the scenario described here applies to the combination of the Epiphany chip with the Parallella development board. Although the core-to-core speeds should be valid for the Epiphany chip in general, the speeds to external memory are restricted by the specific hardware components on the Parallella.

The diagram below shows the six resulting possibilities for data transfer, labelled A to F, where each of these can be used both for reading and writing.

The E-cores are connected by a mesh network through which the data travels between different cores. This mesh network is connected to the rest of the hardware (external memory) by something we will refer to as the e-link. We should note that technically the e-link is only one part of the connection; there is also an FPGA involved.

If a single core is doing consecutive 8-byte writes to external memory (label C in the above diagram), then the hardware activates a burst-mode that increases transfer speed. This is supposed to work for the DMA transfers as well (label D) but our measurements do not show this.

For transfers performed by the E-cores or the E-core DMA engines, we can make the distinction between a busy and nonbusy mesh network and/or e-link. (We assume that the host and E-cores are never using the network at the same time.) If the mesh network or e-link is a bottleneck then we can expect slower speeds when all cores are writing or reading at the same time.

Putting this all together, there are quite a few different combinations of entity and target that we want to measure. This is summarized in the following table:

| Executing entity | Network state | External read | External write | Other E-core read | Other E-core write |
|------------------|---------------|---------------|----------------|-------------------|--------------------|
| E-core           | busy          | ?             | ?              | ?                 | ?                  |
| E-core           | nonbusy       | ?             | ?              | ?                 | ?                  |
| E-core DMA       | busy          | ?             | ?              | ?                 | ?                  |
| E-core DMA       | nonbusy       | ?             | ?              | ?                 | ?                  |
| host             | n.a.          | ?             | ?              | ?                 | ?                  |

This gives us a total of 20 numbers, and even then we do not have a complete picture: the amount of data you transfer influences the speed as well. Indeed, when reading or writing a large chunk of data, the average time spent per byte decreases, i.e. we obtain higher speeds. So to properly benchmark the system, instead of 20 numbers, you actually need to measure 20 speed-versus-size graphs. We have performed extensive measurements, and will cover the most interesting results in this blog post.

# Data transfer details

To understand some of the results that follow later, for example why reading is slower than writing, we need to know some technical details, which we explain in this section.

The E-cores can read and write data using the load and store processor instructions. A single read or write instruction can read or write 1, 2, 4 or 8 bytes.

These instructions generate what we will call a data request packet or a data write packet, which then travels through the system. When a data request packet arrives at its destination (the place you want to read from), a data write packet is generated that travels back and delivers the actual data. All data write packets can hold up to 8 bytes of data (payload), even if you are only reading or writing a single byte. This means the 1-, 2- and 4-byte packets carry unused padding bytes that are sent along. To obtain maximum speed, one should therefore always use 8-byte data transfers.

If you want to perform 8-byte data transfers, we suggest you do not use memcpy. Unfortunately, the default memcpy implementation used when compiling a C program with the Epiphany gcc compiler does not perform 8-byte transfers and is sometimes even stored in external memory itself. The following C snippet can be used to accomplish 8-byte transfers. Note that it requires both the source and destination addresses to be 8-byte aligned, or the ldrd and strd instructions will fail.

```c
// Only works if destination and source are 8-byte aligned
// and nbytes is a multiple of 8.
void transfer(void* destination, const void* source, int nbytes)
{
    long long* dst = destination;
    const long long* src = source;
    int count = nbytes >> 3; // divide by 8
    for (int i = 0; i < count; i++)
        *dst++ = *src++;
}
```

## Transfers performed by the E-core

The strd instruction sends out a data write packet, after which the core can continue execution immediately. This means the instruction only takes a single clock cycle unless the network is busy. The data itself might arrive much later at its destination (hundreds of clock cycles later if this is external memory). As read packets cannot overtake write packets in the network, it is guaranteed that the data is stored before any subsequent read of the same address.

The ldrd instruction sends out a data request packet (which later generates a data write packet from the other side), but the core blocks execution until the data is read and returned to the core. This is an obvious requirement, since you expect to be able to use the data after a read instruction. It is a problem, however, for reading large chunks of data from external memory: every ldrd instruction will take hundreds of clock cycles. If you only want to read 8 bytes (or less) in total, there is not much that can be done to improve speeds. For larger chunks, the DMA engine provides a solution (see below).

The diagram below illustrates the difference between the reading and writing mechanisms.

Diagram illustrating the difference between reading and writing data on the Epiphany system. The vertical direction shows time and the horizontal direction shows the location within the mesh network.

## Transfers performed by the E-core DMA engine

Using the DMA engine has two advantages:

1. The E-core can do other computations while the DMA engine transfers data

2. Reads (of large chunks) are many times faster. As opposed to the ldrd instruction, the DMA engine can send out read packets without blocking (waiting for results to arrive). This is illustrated in the diagram below. Only when sending the very last read packet does the DMA wait for the data to be returned (this is an optional feature called MSGMODE).

Diagram illustrating the workings of the DMA engine. The vertical direction shows time and the horizontal direction shows the location within the mesh network.

Note that the DMA is of no benefit for very small data transfers because there is a startup cost involved. As our results will show later, setting up the DMA (filling a struct and passing it to the DMA) and starting it costs about 500 clock cycles. Using the DMA is advantageous if the data to be transferred is larger than a few hundred bytes (the exact threshold also depends on whether the transfer concerns external memory or E-core memory).

## Transfers performed by the host

A specific, fixed, section of physical RAM is used as external memory. A normal Linux application (as opposed to a driver) can not access the RAM directly by physical addresses but only indirectly by virtual addresses, so a driver is needed to access the external memory. The way it works is that the Epiphany driver /dev/epiphany allows the host application to call mmap which then creates a page table entry (in this case with the non-cached flag) for the host application, pointing to the correct locations in external memory. With this memory mapping, there is a virtual address that the host process can write to, which corresponds to the correct region of RAM. The non-cached property makes sure that when the host writes to external memory, the host processor cache is bypassed and so the data will actually be present in external memory. This results in easier communication with the Epiphany chip, but makes transferring large amounts of data very slow.

# Results

The two tables below show our results for transferring large chunks of data (2048 KB). For busy results, all cores are reading or writing simultaneously and the speed that is shown is per core. For nonbusy results, only one core is reading or writing and the other cores are waiting (looping).

## Transfers to External memory

| Executing entity | Network state | External read | External write |
|------------------|---------------|---------------|----------------|
| E-core           | busy          | 8.3 MB/s      | 14.1 MB/s      |
| E-core           | nonbusy       | 8.9 MB/s      | 270 MB/s *     |
| E-core DMA       | busy          | 11 MB/s       | 12.1 MB/s      |
| E-core DMA       | nonbusy      | 80 MB/s       | 230 MB/s       |
| host             | n.a.          | 48 MB/s       | 98 MB/s        |

\* 270 MB/s is with burst mode. The speed is 226 MB/s without burst mode.

## Transfers to (Other) E-core

| Executing entity | Network state | Other E-core read | Other E-core write |
|------------------|---------------|-------------------|--------------------|
| E-core           | busy          | 173 MB/s          | 509 MB/s           |
| E-core           | nonbusy       | 171 MB/s          | 441 MB/s           |
| E-core DMA       | busy          | 250 MB/s          | 1000 MB/s          |
| E-core DMA       | nonbusy      | 300 MB/s          | 1000 MB/s          |
| host             | n.a.          | 5.3 MB/s          | 42.4 MB/s          |

For total data throughput in the busy benchmarks, multiply the per-core-speed by 16.

## External memory analysis

Let's first look at the transfer speeds to external memory. The bottleneck here is the connection to external memory and not the Epiphany chip itself. In other words: a core can always supply data faster than the connection to external memory can handle it.

### Writing to external memory

The bandwidth is close to 226 MB/s when the cores are writing to it (without burst mode) and this bandwidth is shared over all cores. When only a single core is writing it can achieve the full 226 MB/s, and with 16 cores writing at the same time, the bandwidth is divided and one obtains a writing speed of about 14 MB/s per core. When a single core does consecutive writes (and other cores do nothing) then the transfer speed is increased to about 270 MB/s due to the burst mode introduced before.

The writing speed using the DMA engine is lower than we expected: the DMA engine should be at least as fast as the cores themselves, except for the overhead cost of starting up the DMA. With a chunk size of 2048 KB, this overhead should be negligible. Note that this does not mean you should do all writes without the DMA. On the contrary, the E-core can do computations while the DMA engine writes data, so it still gives a large speedup if your specific application allows such parallelism.

One might expect (like we did) that the host processor can write to external memory very quickly, as this memory is residing in the normal RAM. It turns out that this transfer speed is relatively slow. A reason for this might be that the Epiphany driver only allows non-cached access to external memory as stated before.

Instead of looking at the transfer speeds for only 2048 KB chunks, we can also look at speed versus size graphs. The following graph shows read and write speeds in the nonbusy situation (only a single core is reading or writing simultaneously) corresponding to the second and fourth row in the first table.

Writing and reading speeds to external memory in MB/s in the nonbusy state.

Note that we have performed the benchmark on every one of the 16 cores (while the other cores block in a loop) and all results are merged in this graph, causing the noise.

The data of the read transfer speeds shows what we expect. The non-DMA transfer is bottlenecked by the fact that the E-core has to wait for the data to return, and a larger amount of data does not improve average speed. The DMA read transfers however do gain speed because they do not have this blocking limitation.

The graph of the burst mode transfer speeds (shown above) shows jumps at specific positions. One possible explanation for this is that the hardware burst mode was somehow interrupted and had to be reinitialized. A second peculiarity is the decreasing speed of the non-burst mode write speeds with a peak at 70 bytes. This might suggest that we should always divide our data in blocks of 70 bytes and transfer those blocks individually, but this is not true. To understand these results better, we have to look at a different representation of the same data.

Clock cycle cost of writing to external memory in the nonbusy state. The reading speeds are left out because, at 150 000 cycles, they do not fit nicely on this graph.

The above graph shows the total time used $T(n)$ (measured in clock cycles) as a function of data size $n$ (in bytes). It is essentially the same data as shown in the previous graph. If $T(n)$ is the number of clock cycles it takes to transfer $n$ bytes (shown in this last graph), then the transfer takes $T(n)/(600\,\mathrm{MHz})$ seconds at the 600 MHz clock, so the speed in bytes per second (shown in the first graph) is $n \cdot 600\,\mathrm{MHz} / T(n)$.

In this last graph we can see that there are actually two writing speeds involved in the non-DMA transfer without burst mode (shown in green), corresponding to the two slopes of the line. The slope changes somewhere around 70 bytes. This means the first 70 bytes are written quickly (pushed out of the core into the network), after which the process slows down. This might indicate that the connection between the cores and external memory can ‘hold’ up to 70 bytes of data before reaching a bottleneck. When this is filled up, the core has to wait and the speed decreases.

The above graph also clearly shows the startup cost of slightly less than 500 cycles for the DMA transfers. Note: this includes filling a struct that could in principle be reused which was not done in this benchmark. Without the DMA the core can transfer about 150 bytes of data in this time, showing that one should only use the DMA for external memory writes when the data size is larger than about 150 bytes.

Because of the blocking nature of the ldrd instruction, a single core (without using the DMA engine) cannot utilize the full external memory bandwidth. The bottleneck is now the core itself and the speed is only 8.9 MB/s. As the cores spend most of their time in a blocking state, reading with all cores simultaneously increases the total transfer speed by a factor of almost 16, resulting in 8.3 MB/s per core. The DMA engine does not have this limitation, which is why the nonbusy DMA speed for external reads is much higher than the nonbusy read speed without DMA.

The slow host reading speeds can again be attributed to the non-cached memory access.

The speed versus size graph for external memory reads in the busy situation is shown below.

External memory reading speeds in MB/s in the busy state.

Note that in the busy state every core can measure a different speed during the same measurement, for the same chunk size. These different values are shown above in a single graph. The different read speeds can be attributed to the geometry of the chip in combination with the round-robin routing scheme used at each node. The routing scheme tells us how read packets from different directions are handled, and the round-robin scheme ensures that all directions are handled equally, one by one. This does however cause some cores to have slower read speeds, because they are located further away from the e-link connection. Let's say, for simplicity, that the e-link is located at the bottom-right corner of the 4x4 core grid. The round-robin arbiter at that core would then alternate between packets coming from the north, packets coming from the west, and packets from the bottom-right core itself. This means that in this case the bottom-right core gets a third of the available bandwidth. Different distances to the e-link then correspond to different speeds, which is exactly what the above graph suggests.

## E-core memory analysis

Now we will comment on the second table, which is about the transfer speeds within the Epiphany chip. First of all note how all these speeds are higher than the external memory transfer speeds, since the data does not ever have to leave the Epiphany chip.

The table shows that the difference between busy and nonbusy data transfers is relatively small. This implies that the mesh network is efficient and does not get ‘clogged’ when handling many data transfers simultaneously. For data transfers without the DMA engine, we expect higher writing than reading speeds, just as in the external memory case. This is because the core can quickly send out many write packets without waiting for them to arrive at their destination. As soon as a packet leaves the core and is ‘in the network’, the next packet is handled. For read requests, the core has to wait for the resulting data to come back. The numbers in the table show that this is indeed the case: writes are a lot faster than reads.

For DMA transfers however, we expect this effect to be less pronounced since the DMA engine does not have to wait for reading data to return (except for the last packet). Still, the table shows that DMA writes are many times faster than reads. This might indicate that we are now pushing the limits of the mesh network. Note that read request packets travel through a different network than the write packets and these networks have different speeds. See the Epiphany Architecture Reference for details on the difference between the different mesh networks (called xMesh, rMesh, cMesh).

The two graphs below show the writing speeds as a function of data size. Both graphs represent the same data, similar to the first two graphs shown in this blog post. The first graph shows that the average writing speed for the DMA engine depends greatly on the amount of data you send. The second graph shows that the actual speed (the slope of the lines) is fairly constant, but once the 500-cycle startup cost is included, the average speed is much lower.

E-core memory write speeds in MB/s.

Clock cycle cost of writing to E-core memory.

The graphs also suggest that to win back the startup cost of the DMA, one needs to send more than roughly 300 bytes. As noted before, this startup cost is a little lower if one reuses the DMA info struct, so that the time spent filling this struct is not included in the startup time.

# Conclusion

This blog post has covered some of the data transfer speeds on the Parallella board and should give developers an indication of whether or not their (future) programs are bottlenecked by one or more of the data channels of the Parallella. For a more detailed report on the Epiphany platform, we refer the reader to the master's thesis by Tom Vocke (not affiliated with Coduin), which contains more elaborate benchmarks.