[Beowulf] Teraflop chip hints at the future
Richard Walsh
rbw at ahpcrc.org
Fri Feb 16 14:17:20 PST 2007
Jim Lux wrote:
> At 07:03 AM 2/13/2007, Richard Walsh wrote:
>> Yes, but how much does it really abandon von Neumann? It is just a lot
>> of little von Neumann machines unless the mesh is fully programmable
>> and the DRAM stacks can source data for any operation on any CPU as
>> the application's data flows through the application kernel(s),
>> however it is laid out across the chip. And in that case it is a
>> multi-core ASIC emulating an FPGA ... why not just use an FPGA ...
>> ;-) ... and avoid wasting all those hard-wired functional units that
>> won't be needed for this or that particular kernel.
> In fact, modern high density FPGAs (viz Xilinx Virtex II 6000 series)
> have partitioned their innards into little cells, some with ALU and
> combinatorial logic and a little memory, some with lots of memory and
> not so much logic.
Hey Jim,
Yes, I do understand this, although attention for double-precision ops
on FPGAs is focused on the Xilinx Virtex-5 at 65 nm. You can already get
a PCIe card version, I think. My comments about the new 80-core Intel
ASIC were meant to suggest two things. The first is that the ability to
program your own core (a la VHDL, Verilog, Mitrion-C, Handel-C, etc.),
specific to your kernel, is in theory more circuit-efficient, so if you
are going to have multiple cores, consider having them be programmable.
It's like a plumber who brings into the house only, and all of, the
tools he needs to do the job at hand.
The second point I was trying to make is that all cyclic re-referencing
of the same store (local or remote) is a reflection of the von Neumann
model (even to the stacked DRAM in the new Intel chip). When the
processor cannot "swallow the kernel whole," it has to consume it in
von Neumann-like bites, which imply register, cache, and memory writes.
Part of the programmable-core approach is making the connections between
upstream and downstream hardware in a data-flow fashion that replaces
some number of cyclic stores with in-line passes to the next collection
of functional units required by the application's specific kernel. In
this way the "diameter" of the re-reference cycle is enlarged, and the
latency penalty is therefore reduced.
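As a rough software analogy (a minimal sketch only; the three-stage
kernel and array names are made up, and a real programmable core would
be written in VHDL/Verilog/Mitrion-C rather than C), the first routine
below round-trips an intermediate array through the store at every
stage, while the fused routine passes each value straight to the next
"functional unit" without touching memory in between:

    /* Hypothetical 3-stage kernel: scale, shift, square.
     * Von Neumann style: each stage writes its result back to memory
     * and the next stage re-reads it (cyclic re-referencing). */
    void kernel_stored(const double *in, double *tmp, double *out, int n)
    {
        for (int i = 0; i < n; i++) tmp[i] = 2.0 * in[i];     /* store      */
        for (int i = 0; i < n; i++) tmp[i] = tmp[i] + 1.0;    /* load/store */
        for (int i = 0; i < n; i++) out[i] = tmp[i] * tmp[i]; /* load/store */
    }

    /* Data-flow style: the intermediate never touches memory; it flows
     * in-line from one operation to the next, the way a programmed core
     * wires one functional unit's output to the next unit's input. */
    void kernel_fused(const double *in, double *out, int n)
    {
        for (int i = 0; i < n; i++) {
            double t = 2.0 * in[i] + 1.0;
            out[i] = t * t;
        }
    }

On an FPGA or a "wire-exposed" multi-core the fusion is literal wiring
between units rather than a compiler trick, which is the point about
enlarging the re-reference diameter.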
So while the ASIC cores in the new Intel chip are not programmable in
the FPGA sense, there is the hope/expectation that the on-chip
interconnect will give the data-flow benefits described. These are the
features of the multi-core TRIPS and Raw processors that allow them to
emulate ILP-, TLP-, and DLP-oriented architectures and applications.
The extent to which FPGAs are more flexible in this regard gives them
an advantage over less "wire-exposed" multi-core ASIC architectures.
There are obvious drawbacks to FPGAs ... they are not commodity enough,
programmability is poor and foreign, and the improvements (Mitrion-C)
generally consume 2x the circuits and run at 1/2 the clock the FPGA in
use is capable of. Joe Landman pointed out the large chunk of the device
that the interface architecture can consume, and for HPC-size data sets
you still need to stream data in and out of external memory (algorithms
must be pipelined). Still, it seems like over the long haul some of the
FPGA advantages mentioned will creep into the HPC space -- either on the
chip or via accelerators. Underwood at Sandia has a nice paper showing
that peak flop performance on FPGAs exceeded that of commodity CPUs in
the summer of 2004 (the same time Intel dropped the race to the 4.0 GHz
clock) ... although the data needs to be updated for the Virtex-5 and
the new multi-core processors.
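On the streaming point above, the host-side pattern tends to look
something like the sketch below (plain C with stand-in accelerator calls
I made up for illustration, not any vendor's actual API): the data set
is larger than the on-card memory, so it is pushed through the pipelined
kernel one block at a time.

    #include <stddef.h>
    #include <string.h>

    /* Stand-ins for limited on-card memory and a trivial kernel (y = 2x).
     * These are placeholders for a vendor interface, not a real API. */
    #define BLOCK 4096                 /* elements that fit "on card" */
    static double card_mem[BLOCK];     /* pretend device buffer       */

    static void accel_write(const double *h, size_t n) { memcpy(card_mem, h, n * sizeof *h); }
    static void accel_run(size_t n)   { for (size_t i = 0; i < n; i++) card_mem[i] *= 2.0; }
    static void accel_read(double *h, size_t n)        { memcpy(h, card_mem, n * sizeof *h); }

    /* Stream an HPC-sized array through the pipeline block by block,
     * since the whole data set cannot sit in on-card memory at once. */
    void stream_through_accelerator(const double *in, double *out, size_t n)
    {
        for (size_t off = 0; off < n; off += BLOCK) {
            size_t len = (n - off < BLOCK) ? (n - off) : BLOCK;
            accel_write(in + off, len);
            accel_run(len);
            accel_read(out + off, len);
        }
    }

Real designs overlap the transfers with compute (double buffering), but
the shape of the loop is the same.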
Here are some papers (you should be able to Google them) that I have
found useful/interesting:

    1. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay
       Architecture for ILP and Streams. Taylor et al.
    2. Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS
       Architecture.
    3. FPGAs vs. CPUs: Trends in Peak Floating-Point Performance.
       Keith Underwood.
    4. Architectures and APIs: Assessing Requirements for Delivering
       FPGA Performance to Applications. Underwood and Hemmert.
    5. 64-bit Floating-Point FPGA Matrix Multiplication. Yong Dou et al.
    6. Scalable and Modular Algorithms for Floating-Point Matrix
       Multiplication on FPGAs. Ling Zhuo and Viktor Prasanna.
    7. Computing Lennard-Jones Potentials and Forces with
       Reconfigurable Hardware.
> I think that as a general rule, the special purpose cores (ASICs) are
> going to be smaller, lower power, and faster (for a given technology)
> than the programmable cores (FPGAs). Back in the late 90s, I was
> doing tradeoffs between general
Here you are arguing for an ASIC for each typical HPC kernel ... a la
the GRAPE processor. I will buy that ... but a commodity multi-core CPU
is not HPC-special-purpose or low-power compared to an FPGA.
> purpose CPUs (PowerPCs), DSPs (ADSP21020), and FPGAs for some signal
> processing applications. At that time, the DSP could do the FFTs,
> etc, for the least joules and least time. Since then, however, the
> FPGAs have pulled ahead, at least for spaceflight applications. But
> that's not because of architectural superiority in a given process..
> it's that the FPGAs are benefiting from improvements in process
> (higher density) and nobody is designing space qualified DSPs using
> those processes (so they are stuck with the old processes).
Better process is good, but I think I hear you arguing for HPC-specific
ASICs again, like the GRAPE ... if they can be made cheaply, then you
are right ... take the bitstream from the FPGA CFD code I have written
and tuned, and produce 1000 ASICs for my special-purpose, CFD-only
cluster. I can run it at higher clock rates, but I may need a new chip
every time I change my code.
> Heck, the latest SPARC V8 core from ESA (LEON 3) is often implemented
> in an FPGA, although there are a couple of space qualified ASIC
> implementations (from Atmel and Aeroflex).
>
> In a high volume consumer application, where cost is everything, the
> ASIC is always going to win over the FPGA. For more specialized
> scientific computing, the trade is a bit more even ... But even so,
> the beowulf concept of combining large numbers of commodity computers
> leverages the consumer volume for the specialized application, giving
> up some theoretical performance in exchange for dollars.
Right, otherwise we would all be using our own version of GRAPE, but we
are all looking for the "New, New Thing" ... a new price-performance
regime to take us up to the next level. Is it going to be FPGAs, GPGPUs,
commodity multi-core, PIM, or novel 80-processor Intel chips? I think we
are in for a period of extended HPC market fragmentation, but in any
case I think two features of FPGA processing, the programmable core and
the data-flow programming model, have intrinsic/theoretical appeal.
These forces may be completely overwhelmed by other forces in the
marketplace, of course ...
Regards,
rbw
--
Richard B. Walsh
"The world is given to me only once, not one existing and one
perceived. The subject and object are but one."
Erwin Schroedinger
Project Manager
Network Computing Services, Inc.
Army High Performance Computing Research Center (AHPCRC)
rbw at ahpcrc.org | 612.337.3467