Scyld: bad scaling
Donald Becker
becker at scyld.com
Thu Sep 27 13:49:46 PDT 2001
On Wed, 26 Sep 2001, Ivan Rossi wrote:
> Recently I rebuilt our tiny 10-CPU cluster using Scyld. Before that I had been
> using RedHat 6.2 + LAM MPI. And I like it; it is easier to maintain.
> Unfortunately, after the rebuild, I found a marked performance degradation
> with respect to the former installation. In particular, I found
> disappointingly bad scaling for the application we use most, the MD program
> Gromacs 2.0.
>
> Now performance scales almost exactly as the square root of the number of
> nodes; that is, it takes 4 CPUs to double performance and 9 CPUs to triple it.
>
> Since no hardware has been changed, in my opinion it must be either the
> pre-compiled Scyld kernel, bpsh or Scyld MPICH. So I hope that some fine
> tuning of them will solve the problem.
There isn't an inherent problem with Scyld and scaling.
(Obviously we wouldn't have released a product with a specific problem.)
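To put the reported numbers in perspective, here is a small sketch of what square-root scaling implies for parallel efficiency. It assumes only the relationship described in the quoted message (speedup growing as the square root of the CPU count); the function name is made up for illustration.

```python
import math


def sqrt_scaling(n_cpus):
    """Return (speedup, efficiency) assuming speedup grows as sqrt(n_cpus)."""
    speedup = math.sqrt(n_cpus)
    # Parallel efficiency: achieved speedup relative to the ideal (linear) speedup.
    efficiency = speedup / n_cpus
    return speedup, efficiency


if __name__ == "__main__":
    for n in (1, 4, 9, 10):
        s, e = sqrt_scaling(n)
        print(f"{n:2d} CPUs: speedup {s:.2f}x, efficiency {e:.0%}")
```

Under that assumption, 4 CPUs run at 50% efficiency and 9 CPUs at about 33%, so a 10-CPU cluster would deliver only a little over three times single-CPU performance.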
Some things you should check first:
  - Verify that you are not seeing network errors:
      check /proc/net/dev for non-zero error counts.
  - Verify that you are using the SMP kernel:
      CPU1 should show some activity with beostat.
  - Verify that jobs are being placed on all nodes:
      beostat again.
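The first check can be scripted. The following is a minimal sketch, not a Scyld tool: it parses the text of /proc/net/dev on the machine where it runs and reports any non-zero error counters per interface. The function name and the choice of which counters count as "errors" are assumptions made for illustration; collisions are deliberately left out because they can be normal on half-duplex Ethernet.

```python
# Column order of the per-interface counters in /proc/net/dev
# (8 receive fields followed by 8 transmit fields).
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed"]

# Counters treated as errors here (an assumption; adjust to taste).
RX_BAD = ("errs", "drop", "fifo", "frame")
TX_BAD = ("errs", "drop", "fifo", "carrier")


def find_net_errors(text):
    """Return {interface: {counter: value}} for every non-zero error counter."""
    problems = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        try:
            values = [int(v) for v in rest.split()]
        except ValueError:
            continue  # not a counter line
        if len(values) < len(RX_FIELDS) + len(TX_FIELDS):
            continue
        rx = dict(zip(RX_FIELDS, values[:8]))
        tx = dict(zip(TX_FIELDS, values[8:16]))
        bad = {}
        for key in RX_BAD:
            if rx[key]:
                bad["rx_" + key] = rx[key]
        for key in TX_BAD:
            if tx[key]:
                bad["tx_" + key] = tx[key]
        if bad:
            problems[name.strip()] = bad
    return problems


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as f:
            report = find_net_errors(f.read())
    except OSError as exc:
        print("could not read /proc/net/dev:", exc)
    else:
        for iface, counters in sorted(report.items()):
            print(iface, counters)
        if not report:
            print("no non-zero error counters found")
```

The counters are cumulative since boot, so a small non-zero value is less telling than one that keeps growing while a job runs; comparing two readings taken a few minutes apart is more informative than a single snapshot.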
For reference, the Scyld releases up through "-8" use MPICH as the
base. We modified the process initiation code to work with the Scyld system
(it's now much faster to start jobs), but we did not change the run-time
code, e.g. the send/receive calls.
It's very easy to use LAM on Scyld; however, that's beyond the scope of
our commercial support.
Donald Becker becker at scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993