Friday, January 23, 2009

Down the rabbit hole

It constantly amazes me when our IT works. The number of parts, and of inter-dependencies among them, is just staggering. Then there are the times when it does not work. I have been doing IT for over 25 years and it seems like the amount of failure has been growing. I have mostly been able to overcome and work around these failures, but I think I have met my match... in a few more weeks we will know for sure.

I have been mainly posting on our hardware travails. We install hardware in order to run IT systems. Lying behind most of the hardware troubles have been software troubles. Just to show you how inter-connected this all becomes, and how time expands beyond the patience of even the most understanding of managers, let me tell you one part of the on-going saga of building a large-scale NAS that is cheap enough to attract technically savvy bio-medical researchers.

This tale revolves around a feature of the NAS that we viewed as essential to attracting the largest pool of researchers. As you should understand from the way I am phrasing this, we have to recharge for storage and researchers are free to spend their grant money any way they like.

That feature was using our organization's central Active Directory (AD) for CIFS authentication (CIFS is a Microsoft network file service).

This sounds elementary for a NAS, no? So, the saga starts with us trying to join the NAS server software to the AD. That fails. We eventually discover (by using OpenSolaris truss debugging) the precise series of commands and failure responses. What it all boils down to is this:

We live within an OU which lies just below the O for the organization. This is a scenario that more than a few large, de-centralized organizations have. They have a structure that looks something like this:

O=BigCampus
    OU=Medical School
    OU=Engineering School
    OU=Law School

and so on. (Our actual structure is a little more complicated because it is a true forest, but nevertheless, the parts that are important are the OU parts.)

Now the central IT group, whose responsibility it is to run the Active Directory (AD), reserves administrative rights at the top level to itself, and provides each OU with administrative rights which are limited to that OU.

In the common, simple AD case, all servers have their own entry in a computers sub-tree which lives within the top level O. In the delegated case, the servers have entries in a computers sub-tree which lives within the OU.
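To make the difference concrete, here is a minimal sketch of where a server's entry would land in each case. It uses the simplified O/OU notation from this post rather than the naming of any real directory; the host name nas01, the "Computers" container name, and the computer_dn helper are all made up for illustration.

```python
def computer_dn(host, org, ou=None):
    """Build an illustrative distinguished name for a server's computer entry.

    Hypothetical helper: it follows this post's simplified O/OU notation,
    not the layout of any real directory.
    """
    # The entry itself, inside a computers sub-tree.
    parts = ["CN=" + host, "CN=Computers"]
    # Delegated case: the computers sub-tree lives inside the OU.
    if ou is not None:
        parts.append("OU=" + ou)
    # Simple case: the computers sub-tree hangs directly off the top-level O.
    parts.append("O=" + org)
    return ",".join(parts)


simple_dn = computer_dn("nas01", "BigCampus")
delegated_dn = computer_dn("nas01", "BigCampus", ou="Medical School")

print(simple_dn)
print(delegated_dn)
```

The only difference is the extra OU component in the path, but an administrator whose rights are limited to one OU can only create the second kind of entry. Software that always tries to create the first kind will fail for that administrator.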

Enter Sun CIFS. We are using a variant of OpenSolaris that includes Sun CIFS. Sun CIFS was designed assuming the simple case, and does not work in the delegated case. Our join failed, we got precise error messages, and eventually a bug was filed on our behalf by our NAS software vendor at Sun. That bug is 6691539 and is discussed at length in the OpenSolaris forum here. As you can see, this was back in April of 2008, and a fix was provided in an OpenSolaris release. Our vendor backported this fix to an earlier version of OpenSolaris due to version release controls of their own.

As we discovered, that fix got us further along but still did not address the entire problem. It is now approaching February of 2009 and we still can't join our AD. There is currently one other open bug related to this, and I suspect that it is the next problem in the chain of problems that we encountered. If not, then there are more 'bugs' related to this use case.

I thought we did a good job of describing the overall scenario we have here. Such scenarios are often called use cases nowadays. So we had a use case. We had an initial fault in that use case, and that specific fault was repaired. But the entire use case was not fixed, because one specific failure just led to the next specific failure, all related to the same use case.

Maybe that explains why it's going on a year now with no success. Maybe there are other explanations. All I know is that at least three other NAS software sources that we are playing with fully accommodate this use case.

Maybe the next release of OS X Server will finally have ZFS in it and we can switch off of Sun to Xserve running OS X. After all, the hardware underpinnings of both are x86 platforms...

Maybe, but then maybe we just start the hardware woes all over again and find other software problems. This is the rabbit hole...

This also brings up another problem for an IT group: how long do you wait when a fix is just another 'any day now' patch away? How many times can you allow that patch to lead to another bug, which needs another patch due 'any day now'? Then complicate that with the onset of hardware problems.

When do you call it quits?

If you are a small IT shop and have committed a significant fraction of your budget, then calling it quits puts the shop in jeopardy of not having enough resources to continue.

This was my call and I think I got it wrong. I usually am able to judge these things and cut our losses early. This time, since I had already cut loose on three previous vendors and one implementation project, I believed that I could not afford to change directions yet another time. Furthermore, I believed that because the core of what we were doing was open source, this would work, eventually. Eventually is not good enough. In the world of organizations, one needs a plan and an estimated due date. Not having one is a death blow. Most managers are used to IT projects being delayed and over budget, often significantly, but they always have a plan with expected dates. While we could provide dates based on optimism, after we failed to meet a series of them, our dates became meaningless. Furthermore, as you can see, we have no dates on the core problem resolution...

From the perspective of management, this situation looks like this:

You (meaning the IT group) said you would deliver a storage solution to us.

You have not done that in over a year (yes this saga has many more chapters in it).

Other IT groups have done this successfully.

We understand that your price target was much lower than the other IT groups. But time is money and your project has lost any savings you promised to deliver.

It's time for us, as managers, to move on.
