[Beowulf] Purdue Supercomputer
Alex Younts
alex at younts.org
Tue May 6 17:28:18 PDT 2008
So, more or less, the install day was a major success for us here at
Purdue. The party got started at 8:00AM local time..
We had around 40 people unboxing machines in the loading dock. After
an hour, they had gone through nearly 20 pallets of boxes. (We asked
them to take a break for some of the free breakfast we had..) The
bottleneck in getting machines racked was the limited aisle space
between the two rack rows and the need to get the rails in ahead of the
actual machines.
Around 12:00PM enough was unpacked, racked, and cabled to begin
software installation. By 1:00PM, 500 nodes were up and jobs were
running. By 4:00PM, everything was going. This morning, we hit the
last few mis-installs. Our DOA nodes were around 1% of the total order..
One of our nanotech researchers here got in a hero run of his code,
and pronounced the cluster perfect early this morning. Not a bad
turnaround and a very happy customer.
We were blown away by how quickly the teams moved through their jobs.
Of course, it wasn't surprising because we pulled a lot of the
technical talent from IT shops all around the University to work in
two-hour shifts. It was a great time to socialize and get to know the
faces behind the emails. The massive preparation effort that took
place beforehand brought the research computing group, the central
networking group and the data center folks together in ways that
hadn't happened before.
The physical networking was done in a new way for us.. We used a large
Foundry switch and the MRJ21 cabling system for it. Each rack gets 24
nodes, a 24 port passive patch panel, and 4 MRJ21 cables that run back
to the network switch. Then, there are just short patch cables between
the panel and each node in the rack (running through a side mounted
cable manager). Eventually, there'll be a cheap 24-port 100 Mbps switch
in each rack to provide dedicated out-of-band management for each node.
Most of the cabling was done by two-person teams: one person
unwrapping cables and the other running them in the rack. This
process wasn't the speediest, but things certainly look nice on the
backside..
The installation infrastructure was revitalized for this install. We
normally kickstart each node and then set up cfengine to run on the
first boot. Cfengine will go ahead and bring the node into the
cluster. To support this new cluster, we took five Dell 1850s and
turned them into an IPVS cluster: one acted as the manager and the
other four as the serving nodes, running cfengine and apache
(providing both cfengine and the kickstart packages).
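For anyone who wants to picture the IPVS piece, here's a rough sketch
of the sort of setup described above. It is not our literal config:
the virtual IP 10.0.0.10 and real-server addresses 10.0.0.11-14 are
made up, and it assumes plain NAT forwarding.

  # On the manager node: define a virtual HTTP service and
  # round-robin it across the four serving nodes.
  ipvsadm -A -t 10.0.0.10:80 -s rr
  for rip in 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14; do
      ipvsadm -a -t 10.0.0.10:80 -r $rip:80 -m
  done

  # cfengine traffic can be balanced the same way on cfservd's
  # default port (TCP 5308).
  ipvsadm -A -t 10.0.0.10:5308 -s rr

Kickstarting nodes then just point at the virtual IP for their
packages, so the load spreads across all of the apache instances
behind it. With -m (NAT) the replies have to route back through the
manager; direct routing (-g) is the other common choice.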
Since we use Red Hat Enterprise Linux for the OS on each node, we
upgraded the campus proxy server from a Dell 2650 to a beefy Sun
x4200. To keep a lot of load off the proxy, we kickstarted using the
latest release of RHEL4, so freshly installed nodes have little left
to pull through it.
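For flavor, a skeletal kickstart file in the spirit of what the nodes
get -- the install URL, partitioning, and %post contents below are
placeholders rather than our real configuration:

  install
  # Fetch the RHEL4 install tree over HTTP from the install
  # servers' virtual IP (hypothetical address, as above).
  url --url http://10.0.0.10/rhel4/
  lang en_US.UTF-8
  keyboard us
  network --bootproto dhcp
  rootpw changeme
  bootloader --location=mbr
  clearpart --all --initlabel
  autopart
  reboot

  %packages
  @ base
  cfengine

  %post
  # Have cfengine run on first boot so it can pull the node into
  # the cluster; the exact package and service names here are
  # illustrative.
  chkconfig cfexecd on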
So, those are some of the nitty-gritty details of what it took to get
this thing off the ground in just a few hours.
--
Alex Younts
Jim Lux wrote:
> At 03:20 PM 5/6/2008, Mark Hahn wrote:
>>> We have built out a beefy install infrastructure to support a lot of
>>> simultaneous installs...
>>
>> I'm curious to hear about the infrastructure.
>>
>> btw:
>> http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=207501882
>
> Interesting...
>
> 1000 computers, assume it takes 30 seconds to remove from the box and
> walk to the rack. that's 30,000 seconds, or about 500 minutes.. call it
> 8 hours. Assume you've got 10 racks and 10 people, so you get some
> parallelism... an hour to unpack and rack one pile.
>
>
> What wasn't shown in the video.. all the plugging and routing of network
> cables?
>
> Jim