Wednesday, January 28, 2009

We have seen the enemy in the mirror.

Yesterday, we had one of those events that ended up being a self-inflicted denial-of-service attack on our entire infrastructure.

Our network provider is very security conscious and monitors all outgoing traffic, turning off a switch port if a device behind it starts exceeding a connections-per-second threshold. The idea is that such an event most likely means a bad guy is using one of our computers to launch an attack on someone else (you know, the zombie-army scenario), and by turning off the port that attack gets shut down.
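To make that concrete, here is a minimal sketch of the kind of watchdog our provider presumably runs. The threshold, the window, and the port-disable step are all my invention, not their actual values:

```python
import time
from collections import defaultdict

# Hypothetical connections-per-second watchdog; the threshold and the
# "disable" action are made-up stand-ins for whatever the provider uses.
THRESHOLD_CPS = 500      # allowed new outbound connections per second
WINDOW_SECS = 10         # sliding window used to estimate the rate

conn_times = defaultdict(list)   # switch port -> connection timestamps

def record_connection(port: str) -> None:
    """Call this for every new outbound connection seen on a port."""
    now = time.time()
    # Drop timestamps that have aged out of the window, then record.
    conn_times[port] = [t for t in conn_times[port] if now - t <= WINDOW_SECS]
    conn_times[port].append(now)

def port_over_threshold(port: str) -> bool:
    """True if the port's recent connection rate exceeds the threshold."""
    rate = len(conn_times[port]) / WINDOW_SECS
    if rate > THRESHOLD_CPS:
        print(f"disabling {port}: {rate:.0f} conn/s > {THRESHOLD_CPS}")
        return True          # in real life, the switch port gets shut off
    return False
```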

So we had a DNS misconfiguration that we didn't catch, and it ended up generating enough network traffic to trigger the security filter. As it turns out, this particular switch port is where all our authentication traffic flows out to the authentication servers. So nobody could log in to any of our services...

I have no knowledge of other sites, but I know this is not the first time that a security measure has self-inflicted a wound. I am beginning to wonder whether all these fancy traffic monitors and protections have actually prevented more damage than they have inflicted on us. This is rather hard to measure. As an example of how hard, consider that my asteroid collision shield has been working quite well. In fact, I think you should all send me some money to help defray my costs of preventing the earth from being destroyed.

And this comment leads me to a quote from another blog post, one that I mentioned on Twitter earlier today. The blog is Lev Gonick's, from Case Western Reserve University, and the URL is:
http://blog.case.edu/lev.gonick/

Top 10 IT Trends for Higher Education in 2009


The post starts off this way: "What happens when tough economic times combine with fatigue across the campus community hyping the latest 'killer app', and the growing intolerance of disruptions to services occasioned by security-related activities." So I think you can sort of see the relationship to my story above. This hyping of the latest killer app also rang a bell, because today I also read another blog post, from the Burton Group. I won't put the URL for this one since it's behind a paywall. But the topic was a pithy dialog between Vitruvius, the famous Roman architect, and Socrates, the famous Greek philosopher, about the death of SOA. Seems like SOA is the IT architect's latest killer app.

Despite the cynicism about trendy technology, it seems we cannot stop holding up the flag of the latest technology and leading the charge. Gonick believes that clouds are the real deal. I tend to believe it as well. I think that tough economic times will result in one of two reactions within IT:

The first kind of reaction will be retrenchment: stop projects that hold risk because they use newer technology, and make our old, tried-and-true (and already purchased) technology last a few more years.

The second kind of reaction will come from the few, mostly smaller entities, who will see an opportunity to take a risk and change the competitive landscape by using a transformational technology.

And that leads me back to clouds and why we will be seeing an uptick in cloud computing despite the downturn. I think the last really transformational technology in IT was the PC, and it's getting pretty long in the tooth now.

We are starting to see embedded computing in the consumer market taking advantage of broadband services. We have DVD players, gaming consoles, and now TVs themselves that can connect to the Internet and stream movies and television shows. We see lots of creativity surrounding the iPhone and other broadband-capable phones and devices. You can go to Best Buy and get an Internet radio. Digital cameras will soon all have WiFi and GPS built in and will be able to stream up to the web.

What will drive all these devices is, basically, services built and hosted in the cloud. I watched as the PC transformed IT shops, and it was not because of those of us on the inside; it was because employees acting as consumers could bring the technology in house themselves. And that phenomenon is exactly what is driving cloud services right now: employees acting as consumers are bringing the end points of cloud services into your organization.

So that's the main reason why I think cloud computing can be the next transformational technology in IT: because organized IT is not driving it!

Friday, January 23, 2009

Down the rabbit hole

It constantly amazes me when our IT works. The number of parts and the inter-dependencies among them are just staggering. Then there are the times when it does not work. I have been doing IT for over 25 years, and it seems like the amount of failure has been growing. I have mostly been able to overcome and work around these failures, but I think I have met my match... in a few more weeks we will know for sure.

I have mainly been posting on our hardware travails. We install hardware in order to run IT systems. Lying behind most of the hardware troubles have been software troubles. Just to show you how interconnected this all becomes, and how time expands beyond the patience of even the most understanding of managers, let me tell you one part of the ongoing saga of building a large-scale NAS that is cheap enough to attract technically savvy bio-medical researchers.

This tale revolves around a feature of the NAS that we viewed as essential to attracting the largest pool of researchers. As you should understand from the way I am phrasing this, we have to recharge for storage, and researchers are free to spend their grant money any way they like.

That feature was using our organization's central Active Directory (AD) for CIFS authentication (CIFS is a Microsoft network file service).

This sounds elementary for a NAS, no? So, the saga starts with us trying to join the NAS server software to the AD. That fails. We eventually discover (by using opensolaris truss debugging) the precise series of commands and failure responses. What it all boils down to is this:

We live within an OU which lies just below the O for the organization. This is a scenario that more than a few large, decentralized organizations have. They have a structure that looks something like this:

O=BigCampus
OU=Medical School
OU=Engineering School
OU=Law School

and so on. (Our actual structure is a little more complicated because it is a true forest, but nevertheless, the parts that are important are the OU parts.)

Now, the central IT group, whose responsibility it is to run the Active Directory, reserves administrative rights at the top level for itself, and provides administrative rights to each OU which are limited to that OU.

In the common, simple AD case, all servers have their own entry in a computers sub-tree which lives within the top level O. In the delegated case, the servers have entries in a computers sub-tree which lives within the OU.
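To make the difference concrete, here is what the two computer-account locations look like as LDAP distinguished names. All the names below are illustrative, not our real ones:

```python
# Illustrative LDAP DNs for a computer account named "nas01".
# Every name here is made up; only the shape of the tree matters.

# Simple case: the machine account lands in the top-level Computers tree,
# which is where SUN CIFS assumes it can write.
simple_dn = "CN=nas01,CN=Computers,DC=bigcampus,DC=edu"

# Delegated case: our admin rights stop at our own OU, so the account has
# to be created inside it -- and this is where the join falls over.
delegated_dn = "CN=nas01,OU=Computers,OU=Medical School,DC=bigcampus,DC=edu"

print(simple_dn)
print(delegated_dn)
```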

Enter SUN CIFS. We are using a variant of opensolaris that includes SUN CIFS. SUN CIFS was designed assuming the simple case, and it does not work in the delegated case. Our join failed, we got precise error messages, and eventually a bug was filed on our behalf by our NAS software vendor at SUN. That bug is 6691539 and is discussed at length in the opensolaris forum. As you can see, this was back in April of 2008, and a fix was provided in an opensolaris release. Our vendor backported this fix to an earlier version of opensolaris due to version release controls of their own.
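For the curious, capturing those precise commands and failure responses under truss looked roughly like this. A sketch only: the smbadm arguments, domain name, and file paths are placeholders, not our exact invocation:

```python
import subprocess

# Run the failing AD join under truss and keep the syscall trace.
# The domain, user, and paths below are placeholders.
subprocess.run(
    ["truss", "-f", "-o", "/tmp/join.truss",     # -f follows child processes
     "smbadm", "join", "-u", "Administrator", "BIGCAMPUS.EDU"],
    text=True,
)

# truss flags failed system calls with "Err#", so fish those lines out.
with open("/tmp/join.truss") as trace:
    for line in trace:
        if "Err#" in line:
            print(line.rstrip())
```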

As we discovered, that fix got us further along but still did not address the entire problem. It is now approaching February of 2009 and we still can't join our AD. There is currently one other open bug related to this, and I suspect that it is the next problem in the chain of problems that we encountered. If not, then there are more 'bugs' related to this use case.

I thought we did a good job of describing the overall scenario we have here. Such scenarios are often called use cases nowadays. So we had a use case. We had an initial fault in that use case; that specific fault was repaired, but the entire use case was not fixed, because one specific failure just led to the next specific failure, all related to the same use case.

Maybe that explains why it's going on a year now with no success. Maybe there are other explanations. All I know is that at least three other NAS software sources that we are playing with fully accommodate this use case.

Maybe the next release of OS-X server will finally have ZFS in it and we can switch off of SUN to an X-Serve running OS-X. After all, the hardware underpinnings of both are x86 platforms...

Maybe, but then maybe we just start the hardware woes over again and find other software problems. This is the rabbit hole...

This also brings up another problem for an IT group: how long do you wait, when a fix is just another 'any day now' patch away? How many times can you accept that a patch leads to another bug, which needs another patch due 'any day now'? Then complicate that with the onset of hardware problems.

When do you call it quits?

If you are a small IT shop and have committed a significant fraction of your budget, then calling it quits puts the shop in jeopardy of not having enough resources to continue.

This was my call, and I think I got it wrong. I am usually able to judge these things and cut our losses early. This time, since I had already cut loose from three previous vendors and one implementation project, I believed that I could not afford to change directions yet another time. Furthermore, I believed that because the core of what we were doing was open source, this would work, eventually. Eventually is not good enough. In the world of organizations, one needs a plan and an estimated due date; not having one is a death blow. Most managers are used to IT projects being delayed and over budget, often significantly, but those projects always have a plan with expected dates. While we could provide dates based on optimism, after we failed to meet a series of them our dates became meaningless. And, as you can see, we have no dates for the core problem's resolution...

From the perspective of management, this situation looks like this:

You (meaning the IT group) said you would deliver a storage solution to us.

You have not done that in over a year (yes, this saga has many more chapters in it).

Other IT groups have done this successfully.

We understand that your price target was much lower than the other IT groups'. But time is money, and your project has lost any savings you promised to deliver.

It's time for us, as managers, to move on.

Monday, January 19, 2009

One victory

Today, we swapped the 64GB of RAM into a spare server we had purchased, and this server POSTs up 64GB! So we have a wonky motherboard in our other server. This new server follows the pattern of all but one of our X4240s in that BIOS control of the ILOM IP address does not work.

Maybe some more small victories will start coming our way as this week wears on. Or maybe our good fortune had something to do with the SUN being out all day for the first time in a week.

Hardware sucks

Sucks in many ways: the X4240 is sucking manpower like mad, it is sucking our reputation away with our customers, and it is sucking my patience to the limit.

The latest:

PCI kernel panic and reboot! Using the Myricom CX4 card (long story why this card) and doing some testing, here is what happened:

fault.io.pciex.device-interr dev:////pci@0,0/pci10de,377@a/pci14c1,9@0 faulted but still in service

and this is what is at the pci location:

/pci@0,0/pci10de,377@a/pci14c1,9@0
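If you need to pull the pieces out of fault lines like that one, it is easy enough. A sketch, with the format assumption based only on this single sample:

```python
import re

# Split an FMA fault line into fault class, device path, and status.
# The pattern is inferred from the one sample line above.
line = ("fault.io.pciex.device-interr "
        "dev:////pci@0,0/pci10de,377@a/pci14c1,9@0 faulted but still in service")

m = re.match(r"(\S+)\s+dev:///(/\S+)\s+(.*)", line)
if m:
    fault_class, dev_path, status = m.groups()
    print(fault_class)  # fault.io.pciex.device-interr
    print(dev_path)     # /pci@0,0/pci10de,377@a/pci14c1,9@0
    print(status)       # faulted but still in service
```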

So, big problems; we can't bring this hardware combination into production. We can't wait either, so we will have to roll back to our older PCI-X platform.

On a completely different server, but still an X4240, the BIOS posts 60GB of memory with 64GB installed. After the SUN tech swapped memory, the problem remained. He attempted to swap the motherboard, but the "new" MB failed to boot. We are still waiting for a resolution.

So, yet another server, another customer, can't go into production.

Oh my. I have often been accused of pulling the trigger too fast on switching vendors when things like this happen. I have been told that these things happen and we just have to work through them.

There is something to that. In the last five years I have tried the following vendors, and moved away from each of them due to intractable problems:

HP, IBM, Dell, Supermicro, Tyan and now SUN.

Is this the way it is? Am I cursed? Either way, it sucks.

p.s. As if this were not enough: on our Thumper platform, the X4500, we have 6 in production as iSCSI servers. These have been active for less than a year. That adds up to 288 one-TB SATA drives. We just had our second drive failure! That's a 0.7% failure rate, and a full year is not yet finished on these! It's a good thing we planned on multiple sources of redundancy here.
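For the record, the arithmetic. The annualized figure is my own back-of-the-envelope extrapolation, assuming roughly ten months in service:

```python
# Back-of-the-envelope drive failure arithmetic for the Thumpers.
drives = 6 * 48            # six X4500s, 48 one-TB SATA drives each = 288
failures = 2
months_in_service = 10     # "less than a year" -- an assumption

rate = failures / drives
print(f"observed failure rate: {rate:.2%}")        # ~0.69%

# Naive annualization, assuming failures arrive uniformly in time.
print(f"annualized: {rate * 12 / months_in_service:.2%}")   # ~0.83%
```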

Saturday, January 17, 2009

Isn't hardware fun - episode three.

Revenge of the SSDs.

Previously... our Intel X25-E and X25-M SSDs actually degraded our performance in ZFS.

But the SAS drives in the same backplane and on the same controller card also suffered performance degradation. So we thought there must be something to SUN's refusal to support mixed SAS and SATA in this system.
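For context, the usual way SSDs get used with ZFS is as a separate intent log (the fast SLC X25-E) and a read cache (the MLC X25-M), which is the kind of setup we were after. A sketch with placeholder pool and device names:

```python
import subprocess

# Add an SSD-backed intent log and read cache to an existing ZFS pool.
# "tank" and the cXtYdZ device names are placeholders, not our real ones.
POOL = "tank"

subprocess.run(["zpool", "add", POOL, "log", "c2t0d0"], check=True)    # X25-E
subprocess.run(["zpool", "add", POOL, "cache", "c2t1d0"], check=True)  # X25-M

# Confirm the new layout.
subprocess.run(["zpool", "status", POOL], check=True)
```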

Then the quest for a separate controller to run the SATA SSDs started. This quest was complicated by the fact that we needed low profile, PCI-E, and opensolaris support. I should have remembered that all I had to do was visit Joe Little's Little Notes blog here: http://jmlittle.blogspot.com/2008/06/recommended-disk-controllers-for-zfs.html . That got me to the LSI 3442ER.

Now the next quest began: what to put the 2.5" SSDs into? Not trusting the X4240 backplane (although it does have two SAS cables running to it, suggesting that it might be a split backplane), I started looking around for 2.5" SAS JBOD enclosures. This was harder than I thought; I ended up finding only two. One is by AIC, which is nice and inexpensive but only comes from resellers we have no previous relationship with, or *BANG THE DRUM* HP! Wow, HP has two JBODs, the MSA50 and the MSA70. Interestingly enough, under OS support HP claims Solaris works on the MSA70 but makes no such claim for the MSA50. A quick e-mail with our friendly HP storage engineer revealed that we could indeed install any 2.5" SATA/SAS drive we wanted, and that the carriers for such disks come with the empty JBOD. (I had to check; so many vendors make sure their stuff only works with drives supplied in carriers only available from them, and the carriers are not available separately. Hint to SUN.)

Well, when all this stuff arrives, and assuming it actually works on the X4240, I will post on whether our attempt to build an SSD-enabled ZFS system accomplished anything.

Oh, and for those who would ask why we don't just use the SUN Amber Road series, which already has this stuff working in it: well, we have already made a substantial commitment to older SUN hardware (Thumpers), and since SUN won't put their latest software developments on their legacy hardware, we thought we would give it a try by re-using an X4240 we already had as a NAS headend to the Thumpers.

Below zero and the datacenter

Isn't hardware fun, episode 2 - Attack of the cold zones.

Today we found out that all three of our diesel generators are reporting low fuel temperature, and one refuses to run at all. I suppose that our low of -14F had something to do with it. I wonder how widespread this kind of thing is? Of course, our vendor for the generators is just troubleshooting as normal, with no hint that perhaps we have the wrong installation for our climate.

The X4240 saga continues

Previously, on this channel: the Adaptec SAS controller in the X4240 (or perhaps the SAS backplane itself) can talk to both SAS and SATA at the same time, just with no performance whatsoever.


In this episode, another SUN X4240 with 64GB installed POSTs at 60GB. A memory swap doesn't change this, and a motherboard swap was a total failure: the new motherboard would not boot.

Isn't hardware fun.

Thursday, January 15, 2009

twitterfeed

Thanks to following John Halamka on Twitter, I discovered twitterfeed, which will take each entry from this blog, enter the title into Twitter, and provide a TinyURL back to the blog entry.
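Conceptually, what twitterfeed does is simple. A sketch of the idea, where the feed URL is a placeholder and the final posting step is just a print:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Read a blog's RSS feed, shorten each entry's link, compose a tweet.
# The feed URL is a placeholder; a real client would also post the
# result to Twitter and remember which entries it has already seen.
FEED_URL = "http://example.blogspot.com/feeds/posts/default?alt=rss"

rss = urllib.request.urlopen(FEED_URL).read()
for item in ET.fromstring(rss).iter("item"):
    title = item.findtext("title")
    link = item.findtext("link")
    # TinyURL's create API returns the shortened URL as plain text.
    short = urllib.request.urlopen(
        "http://tinyurl.com/api-create.php?url=" + urllib.parse.quote(link, safe="")
    ).read().decode()
    print(f"{title} {short}")
```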

It was either twitterfeed or switching over to the Flock browser. Flock still has lots of interesting things to offer, but in the never-ending browser shuffles it's getting harder and harder for me to leave Firefox. I have too many (probably really too many) extensions loaded, and I don't have the time to vet them all in Flock.

My earlier comments about OpenID were because twitterfeed is OpenID-enabled, and frankly I am tired of creating accounts right and left, seemingly any time I want to explore anything new.

Setting up a twitter feed

Well, this was supposed to be easy, but I got sidetracked with OpenID. I finally decided to create a myopenid ID because I want to make a distinction between what is called a custodial ID and other IDs. To choose a custodial ID, I actually decided to trust Microsoft. They list three OpenID providers for their HealthVault application, and myopenid was one of them.