*Johannes Lampel - Projects/bifurk*

A simple SIMD application example using the nVec class for calculating
*f(n+1) = f(n) \* (1 - f(n)) \* u* (where *u* is the x axis in the following
picture and *f(n)* the y axis), the so-called logistic growth function.

Since each iteration performs the same operations for every *u*, a vector is
initialized with those different *u* values, and a number of adjacent pixels of
the following picture are then calculated together, according to the
dimensionality of the vector. Because most of these operations can be done
inside the caches, and the transfer rate to main RAM therefore matters little,
the calculation can be sped up by more than a factor of 3. Don't use too small
vector sizes, since that would increase the overhead and lower the speedup
gained from SSE, but do keep the data inside your cache. The SOM simulator, for
example, had such big data sets that the SSE speedup wasn't noticeable, because
the bottleneck was the memory transfer between the CPU and the RAM.

The following picture shows the performance for different vector sizes, for
the usual floating-point (FLPT) calculations and for those using SSE. The
calculations used 32-bit floating-point numbers; the system was a Pentium 4
at 2.6 GHz.

With one-dimensional vectors the performance of the FLPT and the SSE versions
is the same, as expected. In the graph of the FLPT calculation we can see a
significant drop from 2^12 to 2^14, i.e. at a vector size right above 2^12 * 4
bytes = 16 kByte, which is the size of the P4's L1 data cache. The maximum of
the SSE graph might be explained by the page size of 4 kB. At a vector size of
2^9 * 4 bytes we get a speedup of more than the theoretically possible factor
of 4 (SSE does 4 FLPT operations in parallel). The reason could be that I used
*_aligned_malloc* for the SSE memory allocations, while I used the unaligned
*new* for the standard FLPT routines; aligned allocation is better in terms of
cache-line usage.

Source : bifurk.cpp