Thursday, April 30, 2009

Don't know where to start

Major re-thinks going on regarding the following:

Storage Services
Clouds for the academic research community
Operational costs of IT (OpEx)
Is green the new red?

Some random thoughts:

How come I just found out about the 'Prisoner' remake? It's in post-production right now.

No one really understands service levels.

No one ever thinks about risk in a way they can communicate to others.

'IT' as a concept space has grown too large to manage. We already knew that as a terminology generator it surpassed the DOD some time ago. But now I find that I hear terms, read articles and analysts' reports, and build high-level internal architectural scaffolds with which I understand what I am talking about and what I am asking for. It's just not likely that the people I am talking to, or requesting products from, use the same scaffold and model. The net result is that I am constantly either disappointed or feeling abused and/or ripped off.

Thursday, April 9, 2009

The little things strike again.

First, on the "my horse for a drink of water" front, our installation of close to a half million in hardware has been held up for lack of a 1-meter fiber cable. Now, we have plenty of fiber cables, LC to LC and SC to SC, but what we don't have is SC to LC, and that is what we need. The vendor was supposed to ship us one, but they don't have one either!

If you have not guessed, LC and SC are connectors! SC connectors are bigger than LC and seem to be prevalent in switching gear, at least what we get from Cisco and ProCurve. LC connectors are prevalent on the server side of things, such as 10GbE cards or Fibre Channel cards. Yes, you guessed it, the 'enterprise' switch vendors are in an enterprise that consists of switching gear and no servers. How long have servers needed switches to connect them to the network?

As another example of this, ProCurve just recently announced a new switch line designed for the machine room, which mainly means it has front-to-back cooling, with the front defined as where all the connectors are. Up until now, both Cisco and ProCurve switches had either side-to-back or back-to-front cooling, while all servers have front-to-back cooling. Try mixing that up in your racks for some turbulent air flow. Of course, this is all a matter of perspective, since while the 'front' of a switch is the front of the rack for most installations, the 'front' of a server is bereft of any connectors aside from the occasional USB port. Servers put their connectors in the 'back'.

So, as a guy who installs servers, I really don't care which perspective is right or 'better', just that everyone agrees!

As a footnote, not all server vendors put their connectors on the back; a rather small vendor (Capricorn), which supplies the Internet Archive with its servers, puts all the connectors on the front. Seems like they must have visited a machine room......

Now, for something completely different......

The second thing is, you may remember our NFS problems? Look at past postings for more information. You may also remember our 10GbE issues (hardware, drivers). We have now demonstrated, on the half of our 10 identical servers that can't transfer much NFS data, that they can't transfer much of any kind of data: FTP, SFTP, etc..... They can, however, transfer data just fine over Infiniband and over link-aggregated 1GbE links!

So we are certain that this problem is in the 10GbE path. Whether it's software, firmware, or chipset issues, we don't know yet.
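
If you want to run the same kind of isolation test, the idea is simple: push a few gigabytes of raw TCP data to each of the server's interface addresses in turn (10GbE, IPoIB, the aggregated 1GbE links) and compare the numbers. A minimal sketch of that idea, not our actual test harness (the port and sizes are arbitrary placeholders):

```python
#!/usr/bin/env python
# Minimal raw-TCP throughput probe. Run "server" on the receiving host, then
# "client <server_ip>" on the sender, once per interface address (10GbE,
# IPoIB, aggregated 1GbE) so the paths can be compared.
import socket
import sys
import time

PORT = 5001                      # arbitrary test port
CHUNK = 1024 * 1024              # 1 MB per send
TOTAL = 4 * 1024 * 1024 * 1024   # push ~4 GB per run

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    s.listen(1)
    conn, addr = s.accept()
    received = 0
    start = time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    elapsed = time.time() - start
    print("received %.1f GB in %.1f s (%.2f Gbit/s)"
          % (received / 1e9, elapsed, received * 8 / elapsed / 1e9))

def client(host):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, PORT))
    payload = b"\0" * CHUNK
    sent = 0
    start = time.time()
    while sent < TOTAL:
        s.sendall(payload)
        sent += len(payload)
    s.close()
    print("sent %.1f GB in %.1f s" % (sent / 1e9, time.time() - start))

if __name__ == "__main__":
    if len(sys.argv) == 2 and sys.argv[1] == "server":
        server()
    elif len(sys.argv) == 3 and sys.argv[1] == "client":
        client(sys.argv[2])
    else:
        sys.exit("usage: tcp_probe.py server | tcp_probe.py client <server_ip>")
```

If the number collapses only when you aim at the 10GbE address, the problem is in that path, regardless of which file protocol sits on top.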

Hey, didn't you say these servers were identical? Yes, they all came in on the same shipment, their BIOS and firmware all seem to be the same, and it doesn't matter which OS we load (official Solaris, OpenSolaris, or NexentaStor): the ones that have a problem have it with all three OS versions.

But there must be some difference somewhere; more sleuthing is required. By the way, when these kinds of things happen, don't believe the vendors when they say they are ready to support you. You are on your own.
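
For us, more sleuthing means dumping everything we can from each box and diffing a working server against a failing one. A rough sketch of that idea, assuming Solaris-ish inventory commands (the command list is just a sample, not our actual checklist):

```python
#!/usr/bin/env python
# Dump the same inventory commands on every server into a per-host file so
# working and failing boxes can be diffed against each other. The command
# list is only a sample; extend it with whatever your platform exposes
# (PCI inventory, driver and firmware revisions, etc.).
import socket
import subprocess

COMMANDS = [
    "uname -a",   # kernel / OS build
    "prtconf",    # Solaris device tree
    "modinfo",    # loaded kernel modules and their versions
]

def run(cmd):
    try:
        p = subprocess.Popen(cmd, shell=True,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT,
                             universal_newlines=True)
        out, _ = p.communicate()
        return out
    except OSError as err:
        return "failed to run %s: %s\n" % (cmd, err)

def main():
    host = socket.gethostname()
    with open("fingerprint-%s.txt" % host, "w") as f:
        for cmd in COMMANDS:
            f.write("===== %s =====\n" % cmd)
            f.write(run(cmd))
            f.write("\n")

if __name__ == "__main__":
    main()
```

Run it on a good box and a bad box, diff the two files, and anything that isn't identical becomes a suspect.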

I know the enterprise switch vendors are trying to tell us that 10GbE is the server-room fabric of the future, but in my experience Infiniband is far more robust and has better performance to boot! It's also cheaper! But eventually we need to get to Ethernet to get on the Internet, so thanks, Cisco and ProCurve, for ignoring Infiniband and leaving us small fry with endless grief.

Friday, April 3, 2009

LOST - in data storage confusion

Previously on LOST - The Data Center Edition: we were having problems with our particular enterprise implementation of Active Directory and Sun's CIFS server. We are switching vendors in an attempt to keep this project on the rails, as the derailment looming ahead would be the "BIG ONE" for us.

To accomplish that, we switched our storage node interconnect protocol from iSCSI to NFS (v3). We, being completely paranoid by now, decided to re-run our stress tests on the X4500 storage nodes, using NFS this time. Well, things have gone from bad to worse. So far, out of 8 X4500's, we have had only 2 pass the stress test. The ones that fail do so anywhere from 200GB to 385GB into the test. The two that work can transfer over a terabyte with no errors. What we see are client-side timeout errors which eventually lock the client machine, forcing a reboot of it.
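
For the curious, the stress test itself is nothing exotic: stream a large volume of data onto the NFS-mounted X4500 and watch for the writes to stall on the client. A stripped-down sketch of the idea, with a placeholder mount point, sizes, and stall threshold rather than our actual harness:

```python
#!/usr/bin/env python
# Stripped-down NFS write stress: stream big files onto the NFS mount and
# print progress, so a client-side stall or timeout shows up as a long gap
# (and a WARNING) between progress lines.
import os
import time

MOUNT_POINT = "/mnt/x4500-test"   # hypothetical NFS mount of the X4500
FILE_SIZE = 10 * 1024**3          # 10 GB per file
TOTAL_BYTES = 1024**4             # stop after ~1 TB
CHUNK = 8 * 1024 * 1024           # 8 MB per write
STALL_SECS = 30                   # a single write this slow smells like an NFS stall

def main():
    payload = os.urandom(CHUNK)
    written_total = 0
    file_no = 0
    start = time.time()
    while written_total < TOTAL_BYTES:
        path = os.path.join(MOUNT_POINT, "stress-%04d.bin" % file_no)
        with open(path, "wb") as f:
            written = 0
            while written < FILE_SIZE:
                t0 = time.time()
                f.write(payload)
                written += len(payload)
                lag = time.time() - t0
                if lag > STALL_SECS:
                    print("WARNING: %.0f s stall at %.1f GB total"
                          % (lag, (written_total + written) / 1e9))
        written_total += written
        file_no += 1
        print("%.1f GB written, %.0f s elapsed"
              % (written_total / 1e9, time.time() - start))

if __name__ == "__main__":
    main()
```

The failing nodes never get anywhere near the 1 TB mark before the client locks up; the passing ones sail right through.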

(Fast cut to another scene, in another part of the data center)

The Oracle systems folks are starting to scream: their systems are locking up, both client side and server side, and nothing, not even a trip to the remote management ILOM console or a mad dash to the data center to attach a serial cable, will bring these servers back to life. They must be power cycled. What's common here is that the Oracle servers are NFS-mounting X4500's.

(Cut to the last scene, a despondent group of IT folks in a darkened conference room, staring at log data.)

The leader says to everyone: does anyone know the difference between the servers that fail and the few that don't? The answer is not yet. What are our plans? Well, we need to exercise the vendor support path (we have two vendors we can try here and three options). We also decide to take a stab at running the venerable and slow-moving official Solaris 10 release on one of our failing X4500's and re-run the tests. We also decide to try swapping in a SUN 10GbE card, but we need to buy one and get fiber transceivers and packs for our switches.

The scene fades out with no joy in IT land, things are collapsing faster than we can send in the operatives to repair them.....deadlines are looming, the clock is ticking, we can see the broken tracks in our headlights, .... to be continued.