[Beowulf] MPI2007 out - strange pop2 results?

Gilad Shainer Shainer at mellanox.com
Fri Jul 20 18:31:14 PDT 2007


Hi Scott,

I always try to mention exactly what I am comparing against, and not to
make it out to be something it is not. In most cases I use the exact
same platform and mention the details. That makes the information much
more credible, don't you agree?

By the way, in the presentation you gave at ISC you did exactly the same
thing as my dear friends from QLogic did... sorry, I could not resist...

G  

-----Original Message-----
From: Scott Atchley [mailto:atchley at myri.com] 
Sent: Friday, July 20, 2007 6:21 PM
To: Gilad Shainer
Cc: Kevin Ball; beowulf at beowulf.org
Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?

Gilad,

And you would never compare your products against our deprecated
drivers and five-year-old hardware. ;-)

Sorry, couldn't resist. My colleagues are rolling their eyes...

Scott

On Jul 20, 2007, at 2:55 PM, Gilad Shainer wrote:

> Hi Kevin,
>
> I believe that your company has been using this list for pure
> marketing wars for a long time, so don't be surprised when someone
> responds.
>
> If you want to present technical or performance data and then draw
> conclusions from it, be sure to compare apples to apples. It is easy
> to use the results of your competitor's lower-performance device and
> then to attack his "architecture" or his entire product line. If this
> is not a marketing war, then I would be interested to know what you
> do call a marketing war....
>
> G
>
>
> -----Original Message-----
> From: Kevin Ball [mailto:kevin.ball at qlogic.com]
> Sent: Friday, July 20, 2007 11:27 AM
> To: Gilad Shainer
> Cc: Brian Dobbins; beowulf at beowulf.org
> Subject: RE: [Beowulf] MPI2007 out - strange pop2 results?
>
> Hi Gilad,
>
>   Thank you for the personal attack, which apparently came without
> even reading the email I sent.  Brian asked why the publicly
> available, independently run MPI2007 results from HP were worse on a
> particular benchmark than the Cambridge cluster MPI2007 results.  I
> talked about three contributing factors to that.  If you have other
> reasons you want to put forward, please do so based on data, rather
> than engaging in a blatant ad hominem attack.
>
>   If you want to engage in a marketing war, there are venues in which
> to do it, but I think that on the Beowulf mailing list data and
> coherent thought are probably more appropriate.
>
> -Kevin
>
> On Fri, 2007-07-20 at 10:43, Gilad Shainer wrote:
>> Dear Kevin,
>>
>> You continue to set world records in providing misleading
>> information. You had previously compared Mellanox-based products on
>> dual single-core machines to the "InfiniPath" adapter on dual
>> dual-core machines and claimed that with InfiniPath there are more
>> Gflops.... This latest release follows the same lines...
>>
>> Unlike QLogic InfiniPath adapters, Mellanox provides different
>> InfiniBand HCA silicon and adapters. There are four different silicon
>> chips, each with a different size, different power, different price
>> and different performance. There is the PCI-X device (InfiniHost),
>> the single-port device that was designed for best price/performance
>> (InfiniHost III Lx), the dual-port device that was designed for best
>> performance (InfiniHost III Ex), and the new ConnectX device that was
>> designed to extend the performance capabilities of the dual-port
>> device. Each device provides different price and performance points
>> (did I say different?).
>>
>> The SPEC results that you are using for Mellanox are for the
>> single-port device. And even that device (whose list price is
>> probably half that of your InfiniPath) had better results with 8
>> server nodes than yours.... Your comparison of InfiniPath to the
>> Mellanox single-port device should have been on price/performance and
>> not on performance. Now, if you really want to compare performance to
>> performance, why don't you use the dual-port device, or even better,
>> ConnectX? Well... I will do it for you. Every time I have compared my
>> performance adapters to yours, your adapters did not even come
>> close...
>>
>>
>> Gilad.
>>
>> -----Original Message-----
>> From: beowulf-bounces at beowulf.org [mailto:beowulf- 
>> bounces at beowulf.org] On Behalf Of Kevin Ball
>> Sent: Thursday, July 19, 2007 11:52 AM
>> To: Brian Dobbins
>> Cc: beowulf at beowulf.org
>> Subject: Re: [Beowulf] MPI2007 out - strange pop2 results?
>>
>> Hi Brian,
>>
>>    The benchmark 121.pop2 is based on a code that was already
>> important to QLogic customers before the SPEC MPI2007 suite was
>> released (POP, Parallel Ocean Program), and we have done a fair
>> amount of analysis trying to understand its performance
>> characteristics.  There are three things that stand out in
>> performance analysis on pop2.
>>
>>   The first point is that it is a very demanding code for the
>> compiler.  There has been a fair amount of work on pop2 by the
>> PathScale compiler team, and the fact that the Cambridge submission
>> used the PathScale compiler while the HP submission used the Intel
>> compiler accounts for some (the serial portion) of the advantage at
>> small core counts, though scalability should not be affected by this.
>>
>>   The second point is that pop2 is fairly demanding of IO.  Another
>> example to look at here is the comparison of the AMD Emerald Cluster
>> results to the Cambridge results; the Emerald cluster uses NFS over
>> GigE from a single server/disk, while Cambridge has a much more
>> optimized IO subsystem.  While on some benchmarks Emerald scales
>> better, on pop2 it scales only from 3.71 to 15.0 (4.04x) while
>> Cambridge scales from 4.29 to 21.0 (4.90x).  The HP system appears to
>> be using NFS over DDR IB from a single server with a RAID; it should
>> thus fall somewhere between Emerald and Cambridge in this regard.
>>
>>   The first two points account for some of the difference, but by no
>> means all of it.  The final one is probably the most crucial.  The
>> code pop2 uses a communication pattern consisting of many small- to
>> medium-sized (between 512 bytes and 4 KB) point-to-point messages
>> punctuated by periodic tiny (8-byte) allreduces.  The QLogic
>> InfiniPath architecture performs far better in this regime than the
>> Mellanox InfiniHost architecture.
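>>
>>   To make that concrete, here is a minimal, purely illustrative MPI
>> sketch of this kind of traffic.  It is not the actual POP/pop2
>> source; the ring exchange, the 2 KB message size and the loop counts
>> are made up for the example:
>>
>>   /* Illustrative only: a pop2-like pattern of small point-to-point
>>    * exchanges with ring neighbours, punctuated by a tiny 8-byte
>>    * MPI_Allreduce every "time step". */
>>   #include <mpi.h>
>>   #include <stdlib.h>
>>   #include <string.h>
>>
>>   #define MSG_BYTES 2048    /* somewhere in the 512 B - 4 KB range */
>>   #define NSTEPS    1000    /* number of simulated time steps */
>>
>>   int main(int argc, char **argv)
>>   {
>>       int rank, size;
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>       int left  = (rank - 1 + size) % size;
>>       int right = (rank + 1) % size;
>>
>>       char *sendbuf = malloc(MSG_BYTES);
>>       char *from_l  = malloc(MSG_BYTES);
>>       char *from_r  = malloc(MSG_BYTES);
>>       memset(sendbuf, rank & 0xff, MSG_BYTES);
>>
>>       for (int step = 0; step < NSTEPS; step++) {
>>           MPI_Request req[4];
>>
>>           /* many small/medium point-to-point messages ... */
>>           MPI_Irecv(from_l, MSG_BYTES, MPI_BYTE, left,  0,
>>                     MPI_COMM_WORLD, &req[0]);
>>           MPI_Irecv(from_r, MSG_BYTES, MPI_BYTE, right, 1,
>>                     MPI_COMM_WORLD, &req[1]);
>>           MPI_Isend(sendbuf, MSG_BYTES, MPI_BYTE, right, 0,
>>                     MPI_COMM_WORLD, &req[2]);
>>           MPI_Isend(sendbuf, MSG_BYTES, MPI_BYTE, left,  1,
>>                     MPI_COMM_WORLD, &req[3]);
>>           MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
>>
>>           /* ... punctuated by a periodic tiny (8-byte) allreduce,
>>            * e.g. a global residual or convergence check */
>>           double local = (double)rank, global = 0.0;
>>           MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
>>                         MPI_COMM_WORLD);
>>       }
>>
>>       free(sendbuf); free(from_l); free(from_r);
>>       MPI_Finalize();
>>       return 0;
>>   }
>>
>>   In a pattern like this, per-message overhead and the latency of the
>> tiny allreduce dominate rather than peak link bandwidth, which is why
>> message rate and small-message latency matter so much here.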
>>
>>   This is consistent with what we have seen in other application
>> benchmarking; even SDR InfiniBand based on the QLogic InfiniPath
>> architecture generally performs as well as DDR InfiniBand based on
>> the Mellanox InfiniHost architecture, and in some cases better.
>>
>>
>> Full disclosure:  I work for QLogic on the InfiniPath product line.
>>
>> -Kevin
>>
>>
>> On Wed, 2007-07-18 at 18:50, Brian Dobbins wrote:
>>> Hi guys,
>>>
>>>   Greg, thanks for the link!  It will no doubt take me a little
>>> while to parse all the MPI2007 info (even though there are only a
>>> few submitted results at the moment!), but one of the first things I
>>> noticed was that performance of pop2 on the HP blade system was
>>> beyond atrocious... any thoughts on why this is the case?  I can't
>>> see any logical reason for the scaling they have, which (being the
>>> first thing I noticed) makes me somewhat hesitant to put much stock
>>> into the results at the moment.  Perhaps this system is just a
>>> statistical blip on the radar which will fade into noise when
>>> additional results are posted, but until that time, it'd be nice to
>>> know why the results are the way they are.
>>>
>>>   To spell it out a bit, the reference platform is at 1 (ok, 0.994)
>>> on 16 cores, but then the HP blade system at 16 cores is at 1.94.
>>> Not bad there.  However, moving up we have:
>>>    32 cores - 2.36
>>>    64 cores - 2.02
>>>   128 cores - 2.14
>>>   256 cores - 3.62
>>>
>>>   So not only does it hover around 2.x for a while, but then going
>>> from 128 -> 256 cores it gets a decent relative improvement.  Weird.
>>>   On the other hand, the Cambridge system (with the same processors
>>> and a roughly similar interconnect, it seems) has the following
>>> scaling from 32 -> 256 cores:
>>>
>>>    32 cores - 4.29
>>>    64 cores - 7.37
>>>   128 cores - 11.5
>>>   256 cores - 15.4
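>>>
>>>   Just to spell out the arithmetic from those numbers: the HP system
>>> goes from 1.94 at 16 cores to 3.62 at 256 cores, about 1.9x the
>>> score on 16x the cores (roughly 12% parallel efficiency relative to
>>> its own 16-core run), while Cambridge goes from 4.29 to 15.4 over
>>> 32 -> 256 cores, about 3.6x on 8x the cores (roughly 45%).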
>>>
>>>   ... So, I'm mildly confused by the first results.  Granted,
>>> different compilers are being used, and presumably there are other
>>> differences, too, but I can't see how -any- of them could result in
>>> the scores the HP system got.  Any thoughts?  Anyone from HP (or
>>> QLogic) care to comment?  I'm not terribly knowledgeable about the
>>> MPI2007 suite yet, unfortunately, so maybe I'm just overlooking
>>> something.
>>>
>>>   Cheers,
>>>   - Brian
>>>
>>>