[Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters
Michael Brown
spambox at emboss.co.nz
Thu Nov 20 11:01:31 PST 2008
Jeff Layton wrote:
>> offhand, I'd guess that adaptive grids will be substantially harder
>> to run efficiently on a GPU than a uniform grid.
>
> One key thing is that unstructured grid codes don't work as well.
> The problem is the indirect addressing.
Bingo. GPUs are still GPUs, and are still heavily optimized for coherent
data access patterns. If cell (x, y) depends on data at (x, y), then cell
(x + 1, y) had better depend on data at (x + 1, y), or performance will
suffer terribly. In C-speak:
x += C[i][j];
is good, and
x += C[Idx[i][j]];
is bad. Similarly bad is non-coherent branching, due to the thread grouping.
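As a rough illustration (a made-up CUDA sketch, with the arrays flattened
to 1-D rather than the 2-D indexing above), the two patterns look
something like this:

__global__ void coherent_add(float *x, const float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += C[i];       /* neighbouring threads read neighbouring
                               cells, so the reads coalesce */
}

__global__ void indirect_add(float *x, const float *C,
                             const int *Idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += C[Idx[i]];  /* gather through Idx: the reads scatter
                               and coalescing is lost */
}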
The ideal workload is one that has minimal or no branching, and can be
mapped into a computational model where you have a 1-, 2-, or 3-dimensional
arrangement of cells, where the computation (including the relative position
of any data lookups) for each cell does not change. IME, as soon as you
depart significantly from this workload, you often start to see
order-of-magnitude drops in performance. Additionally, the round-trip
CPU->GPU->CPU latency is horrific (on the order of 1 ms on my 8800GTX on
Vista, though I'm not sure about the newer cards or other OSes), so unless
you can get a good pipeline going, bouncing computation between the CPU and
GPU can wreck the overall performance. This also makes it very hard to
scale out to more than one card.
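To give a feel for what "a good pipeline" means, here is a rough,
hypothetical sketch using CUDA streams (the kernel and buffer names are
made up, and it assumes pinned host memory plus hardware/driver support
for overlapping copies with compute):

#include <cuda_runtime.h>

#define NBATCH 4
#define N      (1 << 20)

__global__ void process_batch(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 0.5f + 1.0f;  /* stand-in for the real work */
}

void run_pipelined(void)
{
    float *h_buf[NBATCH], *d_buf[NBATCH];
    cudaStream_t stream[NBATCH];
    size_t bytes = N * sizeof(float);

    for (int b = 0; b < NBATCH; b++) {
        cudaMallocHost((void **)&h_buf[b], bytes); /* pinned, so the
                                                      copies can be async */
        cudaMalloc((void **)&d_buf[b], bytes);
        cudaStreamCreate(&stream[b]);
    }

    /* queue all batches; copies and kernels in different streams overlap */
    for (int b = 0; b < NBATCH; b++) {
        cudaMemcpyAsync(d_buf[b], h_buf[b], bytes,
                        cudaMemcpyHostToDevice, stream[b]);
        process_batch<<<(N + 255) / 256, 256, 0, stream[b]>>>(d_buf[b], N);
        cudaMemcpyAsync(h_buf[b], d_buf[b], bytes,
                        cudaMemcpyDeviceToHost, stream[b]);
    }

    /* CPU-side processing of each batch as it completes */
    for (int b = 0; b < NBATCH; b++)
        cudaStreamSynchronize(stream[b]);
}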
I've spent a fair amount of time tweaking a bit of software that at its
core is an RKF45 adaptive integrator running over a number of independent
entities, with some other GPU-unfriendly code (very branchy and with PRNGs).
The best method I've found for this code is to do the integration substeps
on the GPU, but all other processing on the CPU. If a requested substep has
excessive error, the GPU doesn't worry about it; it just passes the better
step size back to the CPU and doesn't update the data. The CPU then notices
that the returned "next" step size is smaller than the step size it sent,
and handles the situation accordingly. Subdividing steps on the GPU (or
simply looping around with smaller step sizes until the error is
sufficiently small) is a performance loss. Additionally, since the entities
are essentially independent, I can have multiple sets in progress at once.
The peak seems to be breaking the work into four sets, presumably
corresponding to one being sent to the GPU, one being processed on the GPU,
one coming back from the GPU, and one being processed on the CPU. The
performance gain going from one set to four is about a factor of 2.5.
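For what it's worth, the host-side accept/reject handshake looks roughly
like the sketch below. The names are hypothetical, the rkf45_substep
kernel itself isn't shown, and the real code also keeps the four sets in
flight as described above:

#include <cuda_runtime.h>

/* GPU kernel, implemented elsewhere: attempts one RKF45 substep per
   entity, updates d_state only if the error estimate is acceptable,
   and always writes its suggested next step size into d_h_next. */
__global__ void rkf45_substep(float *d_state, const float *d_h,
                              float *d_h_next, int nentities);

void advance_set(float *d_state, float *d_h, float *d_h_next,
                 float *h_sent, float *h_next, int nentities,
                 cudaStream_t stream)
{
    cudaMemcpyAsync(d_h, h_sent, nentities * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    rkf45_substep<<<(nentities + 127) / 128, 128, 0, stream>>>(
        d_state, d_h, d_h_next, nentities);
    cudaMemcpyAsync(h_next, d_h_next, nentities * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    for (int e = 0; e < nentities; e++) {
        if (h_next[e] < h_sent[e]) {
            /* the GPU rejected the step and left the state untouched;
               just resubmit this entity with the smaller step size */
        } else {
            /* step accepted: the branchy CPU-side work (PRNGs etc.)
               for this entity goes here */
        }
        h_sent[e] = h_next[e];  /* either way, use the suggested size next */
    }
}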
Cheers,
Michael