You'd think that if you were to spend $150,000 on a giant multi-CPU system, it would be able to go for longer than 2 or 3 months without blowing out its own RAM.
That's _another_ 2 gig of RAM down the tubes.
So, at this point I hesitate to recommend SGI hardware, considering that in 2 or 3 months we've had multiple hardware problems resulting in multiple days of downtime.
Anyone had any experience with Sun V20z systems? I've heard good things; I have a couple coming in for evaluation purposes, and I plan to check them out pretty thoroughly. The idea of having a system with REAL remote management capabilities is very appealing.
But SGI? Sheeeit. Would have considered them earlier, but no longer.
(no subject)
Date: 2005-02-22 04:39 pm (UTC)
Is what you're doing somehow not something that can be distributed across multiple machines? As far as I know, just about everything that MIT's Project Athena runs, and just about everything the MIT network group runs, that doesn't reasonably fit on a single computer with a four-digit price tag tends to be distributed across multiple servers (different users on different IMAP servers, multiple AFS servers each with a large RAID array, multiple machines for remote shell access, etc.), and I think they tend to keep a spare copy of each kind of hardware running in case something breaks.
(This didn't prevent a longer-than-24-hour outage one summer when they had filesystem corruption on one of the IMAP servers, of course, but that's the only outage I remember hearing about that lasted more than half a day.)
(no subject)
Date: 2005-02-22 04:50 pm (UTC)
We have a smallish cluster for distributable compute jobs (10 dual 2.4GHz Xeon nodes, 2GB RAM each, private gigE network), but this system is for jobs that require large chunks of shared memory-- it's got 32GB total, and we certainly have people who occasionally use a good 24-30GB of that.
This system is solely for compute, though-- everything else is distributed across various smallish systems, although we do have a single system for IMAP service. We don't have enough users that multiple IMAP systems would really be worth the bother. (We've got about 5000 undergrads and maybe 1500 assorted other users.) I believe that's significantly smaller than the number of users Athena supports.
We've got 10 shell/X systems and a dozen or so utility systems, plus dedicated kerberos and DNS servers.
In the future, we'll be using more smallish dual Opteron servers for all this stuff, with perhaps a quad Opteron with nice fast disk for a future mail server.
(no subject)
Date: 2005-02-22 05:03 pm (UTC)
They don't throw in much warranty for your $150k?
(no subject)
Date: 2005-02-22 06:43 pm (UTC)
We have had a few smallish compute clusters. We had an IBM SP system, 32 nodes of (I think) 400MHz Power CPUs, and we've got two stacks of dual Xeons for cluster stuff.
WPI's relationship with high-performance computing is kind of schizophrenic, though-- on the one hand, WPI wants to be known as a big research institution, and we certainly have some decent research going on. On the other hand, we haven't thrown resources (space being the most important and the most lacking) at any kind of HPC effort. The 10-node cluster exists because we were replacing our old compute system, a 4-CPU Alpha, and we had a budget for a new 4-way Alpha but decided to buy Intel boxes instead. The SP was replaced by the SGI. The other cluster is politically separate but physically local (which is a long story), and I don't have anything to do with it.
The bar for entering the HPC Top500 list is pretty low these days-- 256 3.0GHz-ish Xeons will get you there, or 400 older 2.0GHz Opterons. That's not a lot of cash these days-- figure that's 200 dual-CPU 1RU Sun V20z systems, meaning 5 racks of 40. That's also something like $4500/system, but I'd guess that Sun would cut you a deal if they got to brag about your cluster later. So call it $4000 each. Now we're talking 200 x $4000, or only(!) $800,000. Round it off to a cool million bucks for site prep, power, cabling, gigE switches, and extra disk for a file server, and you've got almost a turnkey entry into the top end of HPC.
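(If you want to sanity-check that back-of-envelope math, here's a minimal sketch in Python-- the per-node price, discount, and the ~$200k of overhead are assumptions pulled from the rough figures above, not real quotes.)

```python
# Rough back-of-envelope cluster cost, using the figures in the post above.
# All numbers are assumptions/illustrations, not vendor quotes.

nodes = 200                # dual-CPU 1RU systems -> 400 CPUs total
price_per_node = 4000      # assumed discounted price (list ~$4500)
overhead = 200_000         # assumed: site prep, power, cabling, gigE switches, file-server disk

hardware = nodes * price_per_node      # 200 * $4000 = $800,000
total = hardware + overhead            # ~ $1,000,000
racks = nodes // 40                    # 40 systems per rack -> 5 racks

print(f"Hardware: ${hardware:,}")      # Hardware: $800,000
print(f"Total:    ${total:,}")         # Total:    $1,000,000
print(f"Racks:    {racks}")            # Racks:    5
```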
For less than half that, you could do something pretty impressive on its own. For even less, you could stack up 3GHz P4's and still get respectable performance.
I dunno. It's a neat thing, but our compute resources here are still just toys, I'm afraid, much as we like to brag about them. I'd like to build something real sometime.
(no subject)
Date: 2005-02-23 12:17 am (UTC)
This is for a high-performance computing rig, too, so it's not just RAM, it's CPUs as well.
I think we're going Opteron from now on, though, and most likely Sun or somebody who knows how to make server-level hardware actually act like it's server-level hardware and not like a half-assed PC somebody threw together in their basement for kicks.