You'd think that if you were to spend $150,000 on a giant multi-CPU system, it would be able to go for longer than 2 or 3 months without blowing out its own RAM.
That's _another_ 2 gig of RAM down the tubes.
So, at this point I hesitate to recommend SGI hardware, considering that in 2 or 3 months we've had multiple hardware problems resulting in multiple days of downtime.
Anyone had any experience with Sun V20z systems? I've heard good things; I have a couple coming in for evaluation purposes, and I plan to check them out pretty thoroughly. The idea of having a system with REAL remote management capabilities is very appealing.
But SGI? Sheeeit. Would have considered them earlier, but no longer.
(no subject)
Date: 2005-02-22 04:39 pm (UTC)
Is what you're doing somehow not something that can be distributed across multiple machines? As far as I know, just about everything that MIT's Project Athena runs, and just about everything the MIT network group runs, that doesn't reasonably fit on a single computer with a four-digit price tag tends to be distributed across multiple servers (different users on different IMAP servers, multiple AFS servers each with a large RAID array, multiple machines for remote shell access, etc.), and I think they tend to keep a spare copy of each kind of hardware running in case something breaks.
(This didn't prevent a longer-than-24-hour outage one summer when they had filesystem corruption on one of the IMAP servers, of course, but that's the only outage I remember hearing about that lasted more than half a day.)
(no subject)
Date: 2005-02-22 04:50 pm (UTC)
We have a smallish cluster for distributable compute jobs (10 dual 2.4GHz Xeon nodes, 2GB RAM each, private gigE network), but this system is for jobs that require large chunks of shared memory-- it's got 32GB total, and we certainly have people who occasionally use a good 24-30GB of that.
This system is solely for compute, though-- everything else is distributed across various smallish systems, although we do have a single system for IMAP service. We don't have enough users that multiple IMAP systems would really be worth the bother. (We've got about 5000 undergrads and maybe 1500 assorted other users.) I believe that's significantly smaller than the number of users Athena supports.
We've got 10 shell/X systems and a dozen or so utility systems, plus dedicated kerberos and DNS servers.
In the future, we'll be using more smallish dual Opteron servers for all this stuff, with perhaps a quad Opteron with nice fast disk for a future mail server.
(no subject)
Date: 2005-02-22 05:03 pm (UTC)
They don't throw in much warranty for your $150k?
(no subject)
Date: 2005-02-22 06:43 pm (UTC)
We have had a few smallish compute clusters. We had an IBM SP system, 32 nodes of (I think) 400MHz Power CPUs, and we've got two stacks of dual Xeons for cluster stuff.
WPI's relationship with high-performance computing is kind of schizophrenic, though-- on the one hand, WPI wants to be known as a big research institution, and we certainly have some decent research going on. On the other hand, we haven't thrown resources (space being the most important and the most lacking) at any kind of HPC effort. The 10-node cluster exists because we were replacing our old compute system, a 4-CPU Alpha, and we had a budget for a new 4-way Alpha but decided to buy Intel boxes instead. The SP was replaced by the SGI. The other cluster is politically separate but physically local (which is a long story), and I don't have anything to do with it.
The bar for entering the HPC Top500 list is pretty low these days-- 256 3.0GHz-ish Xeons will get you there, or 400 older 2.0GHz Opterons. That's not a lot of cash these days-- figure that's 200 dual-CPU 1RU Sun V20z systems, meaning 5 racks of 40. That's also something like $4500/system, but I'd guess that Sun would cut you a deal if they got to brag about your cluster later. So call it $4000 each. Now we're talking 200 x $4000, or only(!) $800,000. Round it off to a cool million bucks for site prep, power, cabling, gigE switches, and extra disk for a file server, and you've got almost a turnkey entry into the top end of HPC.
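(If you want to sanity-check that back-of-envelope math, here's a minimal sketch in Python-- the per-node price, discount, and the ~$200k of overhead are assumptions pulled from the rough figures above, not real quotes.)

```python
# Rough back-of-envelope cluster cost, using the figures in the post above.
# All numbers are assumptions/illustrations, not vendor quotes.

nodes = 200                # dual-CPU 1RU systems -> 400 CPUs total
price_per_node = 4000      # assumed discounted price (list ~$4500)
overhead = 200_000         # assumed: site prep, power, cabling, gigE switches, file-server disk

hardware = nodes * price_per_node      # 200 * $4000 = $800,000
total = hardware + overhead            # ~ $1,000,000
racks = nodes // 40                    # 40 systems per rack -> 5 racks

print(f"Hardware: ${hardware:,}")      # Hardware: $800,000
print(f"Total:    ${total:,}")         # Total:    $1,000,000
print(f"Racks:    {racks}")            # Racks:    5
```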
For less than half that, you could do something pretty impressive on its own. For even less, you could stack up 3GHz P4's and still get respectable performance.
I dunno. It's a neat thing, but our compute resources here are still just toys, I'm afraid, much as we like to brag about them. I'd like to build something real sometime.
(no subject)
Date: 2005-02-23 12:17 am (UTC)
This is for a high-performance computing rig, too, so it's not just RAM, it's CPUs as well.
I think we're going Opteron from now on, though, and most likely Sun or somebody who knows how to make server-level hardware actually act like it's server-level hardware and not like a half-assed PC somebody threw together in their basement for kicks.