[Beowulf] Multicore Is Bad News For Supercomputers
Michael Brown
spambox at emboss.co.nz
Fri Dec 5 12:36:44 PST 2008
Mark Hahn wrote:
>> (Well, duh).
>
> yeah - the point seems to be that we (still) need to scale memory
> along with core count. not just memory bandwidth but also concurrency
> (number of banks), though "ieee spectrum online for tech insiders"
> doesn't get into that kind of depth :(
I think this needs to be elaborated a little for those who don't know the
layout of SDRAM ...
A typical chip that may be used in a 4 GB DIMM would be a 2 Gbit SDRAM chip,
of which there would be 16 (total 32 Gbits = 4 Gbytes). Each chip
contributes 8 bits towards the 64-bit DIMM interface, so there are two
"ranks", each made up of 8 chips. Each rank operates independently of the
other, but both share (and are limited by) the bandwidth of the memory
channel. From here I'm going to be using the Micron MT47H128M16 as the SDRAM
chip, because I have the datasheet, though other chips are probably very
similar.
Each SDRAM chip internally is made up of 8 banks of 32 K * 8 Kbit memory
arrays. Each bank can be controlled separately but shares the DIMM
bandwidth, much like each rank does. Before accessing a particular memory
cell, the whole 8 Kbit "row" needs to be activated. Only one row can be
active per bank at any point in time. Once the memory controller is done
with a particular row, it needs to be "precharged", which basically equates
to writing it back into the main array. Activating and precharging are
relatively expensive operations - precharging one row and activating another
takes at least 11 cycles (tRTP + tRP) and 7 cycles (tRCD) respectively at
top speed (DDR2-1066) for the Micron chips mentioned, during which no data
can be read from or written to the bank. Precharging takes another 4 cycles
if you've just written to the bank.
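To put some numbers on that, here's a quick back-of-the-envelope in C. The
geometry and cycle counts are just the ones quoted above for the Micron part
at DDR2-1066, so treat them as illustrative rather than gospel.

#include <stdio.h>

int main(void)
{
    /* 8 banks x 32 K rows x 8 Kbit per row should come to 2 Gbit */
    unsigned long long bits = 8ULL * 32 * 1024 * 8 * 1024;
    printf("chip capacity: %llu Gbit\n", bits >> 30);

    /* cost of closing one row and opening another, in cycles */
    int tRTP_plus_tRP = 11;   /* read-to-precharge + precharge */
    int tRCD          = 7;    /* activate-to-access            */
    printf("row change after a read:  %d cycles\n", tRTP_plus_tRP + tRCD);
    printf("row change after a write: %d cycles\n", tRTP_plus_tRP + 4 + tRCD);
    return 0;
}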
The second thing to know is that processors operate in cacheline-sized
blocks. Current x86 cache lines are 64 bytes, IIRC. In a dual-channel system
with channel interleaving, odd-numbered cachelines come from one channel,
and even-numbered cachelines from the other. So each cacheline fill requires
8 bytes read per chip (which fits in nicely with the standard burst length
of 8, since each read is 8 bits), coming out to 128 cachelines per row. Like
channel interleaving, bank interleaving is also used (there's a rough C
sketch of the resulting address decode just after the list below). So:
- Cacheline 0 comes from channel 0, bank 0
- Cacheline 1 comes from channel 1, bank 0
- Cacheline 2 comes from channel 0, bank 1
- Cacheline 3 comes from channel 1, bank 1
:
:
- Cacheline 14 comes from channel 0, bank 7
- Cacheline 15 comes from channel 1, bank 7
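Here's the address-decode sketch I mentioned. It assumes 64-byte cachelines,
2 channels, 8 banks and 128 cachelines per row per bank per channel, exactly
as laid out above; real memory controllers may hash or swizzle the bits, so
take the bit positions as an assumption.

#include <stdint.h>
#include <stdio.h>

struct dram_loc { unsigned channel, bank, column; uint64_t row; };

static struct dram_loc decode_interleaved(uint64_t paddr)
{
    uint64_t cl = paddr >> 6;       /* cacheline index                  */
    struct dram_loc loc;
    loc.channel = cl & 1;           /* odd/even cachelines alternate    */
    loc.bank    = (cl >> 1) & 7;    /* next 3 bits pick the bank        */
    loc.column  = (cl >> 4) & 127;  /* 128 cachelines per row           */
    loc.row     = cl >> 11;         /* a new row every 128 KB           */
    return loc;
}

int main(void)
{
    for (uint64_t addr = 0; addr < 16 * 64; addr += 64) {
        struct dram_loc l = decode_interleaved(addr);
        printf("cacheline %2llu -> channel %u, bank %u\n",
               (unsigned long long)(addr >> 6), l.channel, l.bank);
    }
    return 0;
}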
So this pattern repeats every 1 KB, and every 128 KB a new row needs to be
opened on each bank. IIRC, rank interleaving is done on AMD quad-core
processors, but not the older dual-core processors nor Intel's discrete
northbridges. I'm not sure about Nehalem.
This is all fine and dandy on a single-core system. The bank interleaving
allows the channel to stay busy by using another bank while one bank is
being activated or precharged. With a good prefetcher, you can hit close to
100% utilization of the channel. However, it can cause problems on a
multi-core system. Say you have two cores, each scanning through a separate
1 MB block of memory. Each core is demanding a different row from the same
bank, so the memory controller has to keep changing rows. This may not
appear to be an issue at first glance - after all, we have 128 cycles
between each CPU hitting a particular bank (8 bursts * 8 cycles per burst *
2 processors sharing bandwidth), so we've got 64 cycles between row changes.
That's over twice what we need (unless we're using 1 GB or smaller DIMMs,
which only have 4 pages, so things become tight).
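Worked through in C, using the same counting as above (the cycle numbers are
the DDR2-1066 ones from the Micron datasheet, so again, illustrative):

#include <stdio.h>

int main(void)
{
    int bursts_per_pass  = 8;   /* one burst to each of the 8 banks     */
    int cycles_per_burst = 8;   /* as counted in the text above         */
    int cores            = 2;   /* both sharing the channel             */
    int tRTP_plus_tRP    = 11;  /* close the old row, cycles            */
    int tRCD             = 7;   /* open the new row, cycles             */

    int between_hits = bursts_per_pass * cycles_per_burst * cores;  /* 128 */
    int between_rows = between_hits / cores;                        /* 64  */
    int row_change   = tRTP_plus_tRP + tRCD;                        /* 18  */

    printf("cycles between one core's hits on a bank: %d\n", between_hits);
    printf("cycles between row changes on that bank:  %d\n", between_rows);
    printf("cycles to actually change the row:        %d\n", row_change);
    return 0;
}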
The killer, though, is latency - instead of a 4-ish cycle CAS delay per
read, we're now looking at 22 cycles for a precharge + activate + CAS. In a
streaming situation this doesn't hurt too much, as a good prefetcher would
already be indicating it needs the next cacheline. But if you've got access
patterns that aren't extremely prefetcher-friendly, you're going to suffer.
Simply cranking up the number of banks doesn't help this. You've still got
thrashing, you're just thrashing more banks. Turning up the cacheline size
can help, as you transfer more data per stall. The extreme solution is to
turn off bank interleaving. Our memory layout now looks like:
- Cacheline 0 comes from channel 0, bank 0, row 0, offset 0 bits
- Cacheline 1 comes from channel 1, bank 0, row 0, offset 0 bits
- Cacheline 2 comes from channel 0, bank 0, row 0, offset 64 bits
- Cacheline 3 comes from channel 1, bank 0, row 0, offset 64 bits
:
:
- Cacheline 254 comes from channel 0, bank 0, row 0, offset 8 Kbit - 64 bits
- Cacheline 255 comes from channel 1, bank 0, row 0, offset 8 Kbit - 64 bits
- Cacheline 256 comes from channel 0, bank 0, row 1, offset 0 bits
- Cacheline 257 comes from channel 1, bank 0, row 1, offset 0 bits
So a new row every 16 KB, and a new bank every 512 MB (and a new rank every
4 GB).
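For completeness, the same sort of decode sketch with bank interleaving off
(channel interleaving kept). Again the bit positions are an assumption for
illustration, but they reproduce the 16 KB row / 512 MB bank spacing above:

#include <stdint.h>
#include <stdio.h>

struct dram_loc { unsigned channel, bank, column; uint64_t row; };

static struct dram_loc decode_no_bank_interleave(uint64_t paddr)
{
    uint64_t cl = paddr >> 6;        /* cacheline index              */
    struct dram_loc loc;
    loc.channel = cl & 1;            /* channels still alternate     */
    loc.column  = (cl >> 1) & 127;   /* 128 cachelines per row/chan  */
    loc.row     = (cl >> 8) & 32767; /* 32 K rows: a new row / 16 KB */
    loc.bank    = (cl >> 23) & 7;    /* a new bank every 512 MB      */
    return loc;
}

int main(void)
{
    struct dram_loc l = decode_no_bank_interleave(256 * 64); /* cacheline 256 */
    printf("cacheline 256 -> channel %u, bank %u, row %llu\n",
           l.channel, l.bank, (unsigned long long)l.row);
    return 0;
}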
For a single core, this generally doesn't have a big effect, since the
18-cycle precharge+activate delay can often be hidden by a good prefetcher,
and in any case only comes around every 16 KB (as opposed to every 128 KB
with bank interleaving, so it's a bit more frequent, though for large memory
blocks it's a wash). However, this is a big killer for multicore - if you
have two cores walking through the same 512 MB area, they'll be thrashing
the same bank. Not only does latency suffer, but bandwidth as well since the
other 7 banks can't be used to cover up the wasted time. Every 8 cycles of
reading will require 18 cycles of sitting around waiting for the bank,
dropping bandwidth by about 70%.
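That "about 70%" is just the stall-to-total ratio, spelled out:

#include <stdio.h>

int main(void)
{
    double burst = 8.0;    /* cycles of useful data transfer per pass   */
    double stall = 18.0;   /* cycles of precharge + activate in between */
    printf("channel time wasted: %.1f%%\n", 100.0 * stall / (burst + stall));
    return 0;
}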
However, with proper OS support this can be a bit of a win. By associating
banks (512 MB memory blocks) to cores in the standard NUMA way, each core
can be operating out of its own bank. There's no bank thrashing at all,
which allows much looser requirements on activation and precharge, which in
turn can allow higher speeds. With channel interleaving, we can have up to 8
cores/threads operating in this way. With independent channels (a la
Barcelona) we can do 16. Of course, this isn't ideal either. A row change
will stall the associated CPU and can't be hidden, so ideally we want at
least 2 banks per CPU, interleaved. Also, shared memory will be hurt under
this scheme (bandwidth and latency) since it will experience bank thrashing
and will only have 2 banks. To cover the activate and precharge times, we
need at least 4 banks, so for a quad-core CPU we need a total of 16 memory
banks in the system, partly interleaved. 8 banks per core can improve
performance further with certain access patterns. Also, to keep good
single-core performance, we'll need to use both channels. In this case,
4-way bank interleaving per channel (so two sets of 4-way interleaves), with
channel interleaving and no rank interleaving would work, though again 8-way
bank interleaving would be better if there's enough to go around.
This setup is electronically obtainable in current systems, if you use two
dual-rank DIMMs per channel and no rank interleaving. In this case, you have
8-way bank interleaving, with channel interleaving and with the 4 ranks in
contiguous memory blocks. With AMD's Barcelona, you can get away with a
single dual-rank DIMM per channel if you run the two channels independently
(though in this case single-threaded performance is compromised, because
each core will tend to only access memory on a single controller). An
8-thread system like Nehalem + hyperthreading would ideally like 64 banks.
Because of Nehalem's wonky memory controller (seriously, who was the guy in
charge who settled on three channels? I can imagine the joy of the memory
controller engineers when they found out they'd have to implement a
divide-by-three in a critical path) it'd be a little more difficult to get
working there, though there are still enough banks to go around (12 banks per
thread).
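For what it's worth, the 12-banks-per-thread figure falls out like this,
assuming each of the three channels carries two dual-rank DIMMs (that
population is my assumption, along the lines of the two-DIMMs-per-channel
setup above):

#include <stdio.h>

int main(void)
{
    int channels       = 3;   /* Nehalem's three memory channels      */
    int dimms_per_chan = 2;   /* assumed population                   */
    int ranks_per_dimm = 2;
    int banks_per_rank = 8;   /* DDR3 devices, like DDR2, have 8      */
    int threads        = 8;   /* quad core + hyperthreading           */

    int total = channels * dimms_per_chan * ranks_per_dimm * banks_per_rank;
    printf("total banks: %d (%d per thread)\n", total, total / threads);
    printf("ideal for 8 banks per thread: %d\n", threads * 8);
    return 0;
}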
However, I'm not aware of any OSes that support this quasi-NUMA. I'm guessing
it could be hacked into Linux without too much trouble, given that real NUMA
support is already there. It's something I've been meaning to look into for
a while, but I've never had the time to really get my hands dirty trying to
figure out Linux's NUMA architecture.
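Just to make the idea concrete, here's a purely hypothetical sketch of the
address bookkeeping such support would need - with bank interleaving off,
physical memory falls into 512 MB bank-aligned regions, and the allocator
would steer each core towards its own. The names and the placement policy
are made up for illustration; nothing like this exists in Linux today as far
as I know.

#include <stdint.h>
#include <stdio.h>

#define BANK_SPAN (512ULL << 20)   /* 512 MB per bank, per the layout above */

/* Which bank does a physical address land in (ignoring ranks)? */
static unsigned bank_of(uint64_t paddr)
{
    return (unsigned)(paddr / BANK_SPAN) & 7;   /* 8 banks per rank */
}

/* Hypothetical placement policy: core N is only handed pages whose bank
 * satisfies bank % num_cores == N, so cores never thrash each other's rows.
 * Shared memory would need separate treatment, as noted above. */
static int page_ok_for_core(uint64_t paddr, unsigned core, unsigned num_cores)
{
    return bank_of(paddr) % num_cores == core;
}

int main(void)
{
    uint64_t paddr = 3 * BANK_SPAN;   /* an address 1.5 GB into memory */
    for (unsigned core = 0; core < 4; core++)
        printf("core %u may use 0x%llx: %s\n", core,
               (unsigned long long)paddr,
               page_ok_for_core(paddr, core, 4) ? "yes" : "no");
    return 0;
}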
Cheers,
Michael