[Beowulf] Purdue Supercomputer
Alex Younts
alex at younts.org
Tue May 6 17:28:18 PDT 2008
So, more or less, the install day was a major success for us here at
Purdue. The party got started at 8:00AM local time..
We had around 40 people unboxing machines in the loading dock. After
an hour, they had gone through nearly 20 pallets of boxes. (We asked
them to take a break for some of the free breakfast we had..) The
bottleneck in getting machines racked was the limited aisle space
between the two rack rows and the need to get the rails in ahead of the
actual machines.
Around 12:00PM enough was unpacked, racked, and cabled to begin
software installation. By 1:00PM, 500 nodes were up and jobs were
running. By 4:00PM, everything was going. This morning, we hit the
last few mis-installs. Our DOA nodes were around 1% of the total order..
One of our nanotech researchers here got in a hero run of his code,
and pronounced the cluster perfect early this morning. Not a bad
turnaround and a very happy customer.
We were blown away by how quickly the teams moved through their jobs.
Of course, it wasn't surprising because we pulled a lot of the
technical talent from IT shops all around the University to work in
two-hour shifts. It was a great time to socialize and get to know the
faces behind the emails. The massive preparation effort that took
place beforehand brought the research computing group, the central
networking group and the data center folks together in ways that
hadn't happened before.
The physical networking was done in a new way for us.. We used a large
Foundry switch and the MRJ21 cabling system for it. Each rack gets 24
nodes, a 24 port passive patch panel, and 4 MRJ21 cables that run back
to the network switch. Then, there are just short patch cables between
the panel and each node in the rack (running through a side mounted
cable manager). Eventually, there'll be a cheap 24-port 100 Mbps switch
in each rack to provide dedicated out-of-band management for each node.
Most of the cabling was done by two-person teams: one person
unwrapping cables and the other running them in the rack. This
process wasn't the speediest, but things certainly look nice on the
backside..
The installation infrastructure was revitalized for this install. We
normally kickstart each node and then set up cfengine to run on the
first boot. Cfengine will go ahead and bring the node into the
cluster. To support this new cluster, we took five Dell 1850s and
turned them into an IPVS cluster: one acted as the manager and the
other four as the serving nodes, running cfengine and apache
(providing both cfengine and the kickstart packages).
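For anyone who wants to picture the IPVS piece, here's a rough sketch
of the sort of setup described above. It is not our literal config:
the virtual IP 10.0.0.10 and real-server addresses 10.0.0.11-14 are
made up, and it assumes plain NAT forwarding.

  # On the manager node: define a virtual HTTP service and
  # round-robin it across the four serving nodes.
  ipvsadm -A -t 10.0.0.10:80 -s rr
  for rip in 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14; do
      ipvsadm -a -t 10.0.0.10:80 -r $rip:80 -m
  done

  # cfengine traffic can be balanced the same way on cfservd's
  # default port (TCP 5308).
  ipvsadm -A -t 10.0.0.10:5308 -s rr

Kickstarting nodes then just point at the virtual IP for their
packages, so the load spreads across all of the apache instances
behind it. With -m (NAT) the replies have to route back through the
manager; direct routing (-g) is the other common choice.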
Since we use Red Hat Enterprise Linux for the OS on each node, we
upgraded the campus proxy server from a Dell 2650 to a beefy Sun
x4200. To keep a lot of load off the proxy, we kickstarted using the
latest release of RHEL4, so freshly installed nodes have little left
to pull through it.
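For flavor, a skeletal kickstart file in the spirit of what the nodes
get -- the install URL, partitioning, and %post contents below are
placeholders rather than our real configuration:

  install
  # Fetch the RHEL4 install tree over HTTP from the install
  # servers' virtual IP (hypothetical address, as above).
  url --url http://10.0.0.10/rhel4/
  lang en_US.UTF-8
  keyboard us
  network --bootproto dhcp
  rootpw changeme
  bootloader --location=mbr
  clearpart --all --initlabel
  autopart
  reboot

  %packages
  @ base
  cfengine

  %post
  # Have cfengine run on first boot so it can pull the node into
  # the cluster; the exact package and service names here are
  # illustrative.
  chkconfig cfexecd on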
So, those are some of the nitty-gritty details of what it took to get
this thing off the ground in just a few hours.
--
Alex Younts
Jim Lux wrote:
> At 03:20 PM 5/6/2008, Mark Hahn wrote:
>>> We have built out a beefy install infrastructure to support a lot of
>>> simultaneous installs...
>>
>> I'm curious to hear about the infrastructure.
>>
>> btw:
>> http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=207501882
>
> Interesting...
>
> 1000 computers, assume it takes 30 seconds to remove from the box and
> walk to the rack. that's 30,000 seconds, or about 500 minutes.. call it
> 8 hours. Assume you've got 10 racks and 10 people, so you get some
> parallelism... an hour to unpack and rack one pile.
>
>
> What wasn't shown in the video.. all the plugging and routing of network
> cables?
>
> Jim