Friday, April 3, 2009

LOST - in data storage confusion

Previously on LOST - The data center edition, we were having problems with our particular enterprise implementation of Active Directory and SUN's CIFS server. We are switching vendors in an attempt to keep this project on the rails as the derailment looming ahead would be the "BIG ONE" for us.

To accomplish that, we switched our storage node interconnect protocol from iSCSI to NFS (v3). Being completely paranoid by now, we decided to re-run our stress tests on the X4500 storage nodes, using NFS this time. Well, things have gone from bad to worse. So far, out of eight X4500s, only two have passed the stress test. The ones that fail do so anywhere from 200GB to 385GB into the test. The two that work can transfer over a terabyte with no errors. What we see are client-side timeout errors that eventually lock up the client machine, forcing a reboot.
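For anyone who wants to try something similar, here is a rough sketch of the kind of transfer test described above. It is not our actual test harness; the function name, file name, and chunk size are all made up for illustration. It writes a given number of bytes into a directory (which you would point at an NFS mount), reads them back, compares checksums, and reports how far it got before any error.

```python
import hashlib
import os


def stress_transfer(target_dir, total_bytes, chunk_size=1 << 20):
    """Write total_bytes to a file under target_dir, then read it back.

    Returns (bytes_completed, error); error is None when the data
    read back matches what was written. Hypothetical helper, not the
    tool used in the post.
    """
    path = os.path.join(target_dir, "stress.dat")
    chunk = os.urandom(chunk_size)  # one random block, reused for every write
    written = hashlib.sha256()
    done = 0
    try:
        with open(path, "wb") as f:
            while done < total_bytes:
                n = min(chunk_size, total_bytes - done)
                f.write(chunk[:n])
                written.update(chunk[:n])
                done += n
            f.flush()
            os.fsync(f.fileno())  # push the data out to the server

        read_back = hashlib.sha256()
        verified = 0
        with open(path, "rb") as f:
            while True:
                buf = f.read(chunk_size)
                if not buf:
                    break
                read_back.update(buf)
                verified += len(buf)

        if read_back.digest() != written.digest():
            return verified, "checksum mismatch"
        return verified, None
    except OSError as exc:
        # Timeouts and I/O errors surface here, with the byte count so far.
        return done, str(exc)
    finally:
        if os.path.exists(path):
            os.remove(path)
```

Calling something like `stress_transfer("/mnt/x4500", 10**12)` (the mount path is an assumption) would attempt a one-terabyte run. Keep in mind that a hard client lockup like the ones we are seeing would never return from this function at all, so a real harness would also need to log progress somewhere off the machine under test.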

(Fast cut to another scene, in another part of the data center)

The Oracle systems folks are starting to scream. Their systems are locking up, both client side and server side, and nothing, not even a trip to the remote management ILOM console or a mad dash to the data center to attach a serial cable, will bring these servers back to life. They must be power cycled. What's common here is that the Oracle servers are NFS-mounting X4500s.

(Cut to the last scene, a despondent group of IT folks in a darkened conference room, staring at log data.)

The leader says to everyone: does anyone know the difference between the servers that fail and those few that don't? The answer is "not yet." What are our plans? Well, we need to exercise the vendor support path (we have two vendors we can try here and three options). We also decide to take a stab at running the venerable and slow-moving official Solaris 10 release on one of our failing X4500s and re-running the tests. We also decide to try swapping in a SUN 10GbE card, but we need to buy one and get fiber transceivers and packs for our switches.

The scene fades out with no joy in IT land. Things are collapsing faster than we can send in the operatives to repair them... deadlines are looming, the clock is ticking, we can see the broken tracks in our headlights... to be continued.
