Friday, June 12, 2009

Catch-up

Well, it was bound to happen. I have really slacked off in my postings. Am I just lazy, not inclined to on-line rambling, or something else?

Reason #1 is in the eye of the beholder.
Reason #2 is clearly not true; once I get started, I perhaps ramble on too much.
Reason #3 has some legs in it.

I never really made up my mind what I was going to be when I grew up. Here is my progression:

Starting off in college I was going to be an academic in abstract math and logic. During the idealistic late 60's I changed that to being an academic bent on changing the real world (ecology and resource planning). My research agenda started with the quite modest goal of understanding how humanity came to be where I perceived us to be: on the precipice of self-destruction. To understand humanity, I thought I needed to start with Australopithecus and work forward. Needless to say, I didn't get very far before being embroiled in the hot debates of the day:

Were Proanthropus and Australopithecus different species or just different ends of genetic expression?

Nature versus Nurture

Did language cause evolution or is it a consequence of it?

Gee, what fun; now I had to understand just what a species is. Start with Darwin and Lamarck, switch over to paleo-botany, and try to understand the Cretaceous angiosperm explosion.

Try to understand humanity from the inwards direction, study religion and psychology.

Go to talks by Noam Chomsky to find out about language but instead find out about another view of the politics of the day.

Okay, okay, enough already... but you can see the pattern there. Lots of research, forks in all different directions, and pretty soon you run out of patience and time. In today's academic world, you can't be an expert in the big picture; you can spend an entire career on a very narrow fork in the road.

So this continues to this day. I have a lot of things I would like to blog about, but I just have to do a little bit more research down this avenue; what I find there just might change everything I want to say... but the avenues never end, the side roads are abundant, and the obscure lanes and alleyways look enticing.

Stay tuned, I think I see some light up ahead.

Tuesday, May 26, 2009

The Obsolete Blues

After struggling for over a year to get a VMWare solution up and running robustly and then priced out for a recharge service, I find that our central IT group has done the job already and for not much more than we can do it for. Thus, there is really no reason to be in this particular business.

I did see this coming; it is pretty clear that cloud computing is the real next wave. VMWare is what established IT shops do, and will do, to make those 'private' clouds.

The real cloud computing, outside of the private world, will be based on higher-level abstractions than the operating system. Since I have been an advocate of abolishing any user interface into an operating system (after all, are we not really trying to perform application logic?), it should come as no surprise that I would also advocate for cloud computing using interfaces, or APIs, well above the operating system.

What people are going to want at the most primitive level is Ruby on Rails servers, PHP servers, Java servers or some other programming abstraction (Hadoop?). Then they are going to want data management, but not file management; I mean contextually relevant data management. That could be via database systems or it could be via something else. This is the developer-level access to clouds.

But what even more people (non-developers) are going to want are applications that just work and do useful things: contact management systems, billing systems, mail systems, customer relationship systems, social networking, media sharing and purchasing, etc.

So, my new career, should I choose to accept it, will most likely be trying to show end users in academic research how to get what they want out of clouds. I would like to think that I will be part of building a private/public cloud to facilitate the transition, but I don't think that a group of my size can ever effectively be in the infrastructure business. I don't know what I was ever really thinking in trying to do that anyway... vanity, maybe.

Friday, May 8, 2009

Everything you know is wrong (again)

More people are finding me on facebook now than twitter. I still don't really know how to effectively use facebook or twitter, so I continue to write this blog since writing lots of words, whether I succeed at communicating or not, is what I seem to be good at.

Yesterday, I signed the petition to encourage my federal representatives to support the use of VISTA as the core of a proposed new bill, the "Health Information Technology Public Utility Act of 2009".

Many in the technical communities I travel in find VISTA's core use of MUMPS reason enough to ignore it. This is because MUMPS is an 'old' technology, and everyone knows that old technology can't be as good as 'new' technology. Or at least we have built a market based on that, with computers 'lasting' just 3 to 5 years before they need to be replaced. That sort of technological-imperative thinking is just too simplistic for me anymore.

(Bet you were wondering if I would get back to the title of this post :)

So, what else do we know that's wrong? One thing that really strikes me is the nearly universal notion that 'we have to get this economy back on its feet'. I take that to mean, at its simplest, that we need to get back to the way things were! You can see this everywhere: banks are now making money, so those highly paid executives who created that innovative engine of growth, financial derivatives (say, sub-prime mortgages), need to be rewarded again. The automotive market simply has to restructure itself for lower operating costs, as if alternative living, working and transportation arrangements that are demonstrably better along many important dimensions (health, energy consumption) no longer need to be encouraged. Finally, we need to spend a lot more 'stimulus' money to get all those laggard health care practitioners to adopt the latest technology.

Thursday, April 30, 2009

Don't know where to start

Major re-thinks going on regarding the following:

Storage Services
Clouds for the academic research community
Operational costs of IT (OpEx)
Is green the new red?

Some random thoughts:

How come I just found out about the 'Prisoner' remake? It's in post-production right now.

No one really understands service levels.

No one ever thinks about risk in a way they can communicate about.

'IT' as a concept space has grown too large to manage. We already knew that as a terminology generator it exceeded the DOD some time ago. But now I find that I hear terms, read articles and analysts' reports, and come up with high-level internal architectural scaffolds with which I understand what I am talking about and asking for. It's just not likely that the people I am talking to, or requesting product from, use the same scaffold and model. The net result is that I am constantly either disappointed or feeling abused and/or ripped off.

Thursday, April 9, 2009

The little things strike again.

First, on the "my horse for a drink of water" front, our installation of close to a half million dollars in hardware has been held up for lack of a 1-meter-long fiber cable. Now, we have plenty of fiber cables, LC to LC and SC to SC, but what we don't have is SC to LC, and that is what we need. The vendor was supposed to ship us one, but they don't have them either!

If you have not guessed, LC and SC are connectors! SC are bigger than LC and seem to be prevalent in switching gear, at least what we get from Cisco and ProCurve. LC are prevalent at the server side of things, such as 10GbE cards or Fibre Channel cards. Yes, you guessed it, the 'enterprise' switch vendors are in an enterprise that consists of switching gear and no servers. How long have servers needed switches to connect them to the network? As another example of this, just recently ProCurve announced a new switch line designed for the machine room, which mainly means it has front-to-back cooling, with the front defined as where all the connectors are. Up until now, both Cisco and ProCurve switches had either side-to-back or back-to-front cooling, while all servers have front-to-back cooling. Try mixing that up in your racks for some turbulent air flow. Of course, this is all a matter of perspective, since while the 'front' of switches is the front of the rack for most installations, the 'front' of the server is bereft of any connectors aside from the occasional USB port. Servers put their connectors in the 'back'.
So, as a guy who installs servers, I really don't care which perspective is right or 'better', just that everyone agrees!

As a footnote, not all server vendors put their connectors on the back; a rather small vendor (Capricorn), which supplies the Internet Archive with its servers, puts all the connectors on the front. Seems like they must have visited a machine room...

Now, for something completely different......

The second thing is, you may remember our NFS problems? Look at past postings for more information. You may also remember our 10GbE issues (hardware, drivers) as well. We have now demonstrated, on the half of our 10 identical servers which can't transfer much NFS data, that they can't transfer much of any kind of data: FTP, SFTP, etc. They can, however, transfer data just fine over InfiniBand and link-aggregated 1 GbE links!

So we are certain that this problem is in the 10GbE path. Whether it's software, firmware, or chipset issues, we don't know yet.

Hey, didn't you say these servers were identical? Yes, they all came in on the same shipment, and their BIOS and firmware all seem to be the same. It doesn't matter which OS we load, official Solaris, OpenSolaris, or NexentaStor; the ones that have a problem have it with all three OS versions.

But, there must be some difference somewhere; more sleuthing is required. By the way, when these kinds of things happen, don't believe the vendors when they say they are ready to support you; you are on your own.

I know the enterprise switch vendors are trying to tell us that 10GbE is the server room fabric of the future, but personally, I find InfiniBand far more robust, with better performance to boot! It's also cheaper! But eventually we need to get to Ethernet to get on the Internet. Thanks, Cisco and ProCurve, for ignoring InfiniBand and leaving us small fry with endless grief.

Friday, April 3, 2009

LOST - in data storage confusion

Previously on LOST - The data center edition, we were having problems with our particular enterprise implementation of Active Directory and SUN's CIFS server. We are switching vendors in an attempt to keep this project on the rails as the derailment looming ahead would be the "BIG ONE" for us.

To accomplish that, we switched our storage-node interconnect protocol from iSCSI to NFS(v3). We, being completely paranoid by now, decided to re-run our stress tests on the X4500 storage nodes using NFS this time. Well, things have gone from bad to worse. So far, out of 8 X4500's, we have had only 2 pass the stress test. The ones that fail do so anywhere from 200GB to 385GB into the test. The two that work can transfer over a terabyte with no errors. What we see are client-side timeout errors which eventually lock the client-side machine, forcing a reboot of it.
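Our actual stress harness isn't shown here, but a minimal sketch of the kind of round-trip test we run (write a large stream in chunks, read it back, verify a checksum) could look like this; the sizes and the path are placeholders, not our real parameters:

```python
import hashlib
import os

def stress_test(path, total_bytes, chunk_size=1 << 20):
    """Write pseudo-random data to `path`, read it back, and verify a
    checksum. Returns True when the round trip is clean."""
    write_hash = hashlib.sha256()
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            write_hash.update(chunk)
            f.write(chunk)
            remaining -= len(chunk)
    read_hash = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_hash.update(chunk)
    return write_hash.hexdigest() == read_hash.hexdigest()
```

Point `path` at the NFS mount under test; in our case the failing nodes never reached a checksum mismatch, they simply hung the client partway through the write phase.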

(Fast cut to another scene, in another part of the data center)

The Oracle systems folks are starting to scream. Their systems are locking up, both client side and server side, and nothing, not even a trip to the remote management ILOM console or a mad dash to the data center to attach a serial cable, will bring these servers back to life. They must be power cycled. What's common here is that the Oracle servers are NFS-mounting X4500's.

(Cut to the last scene, a despondent group of IT folks in a darkened conference room, staring at log data.)

The leader says to everyone: does anyone know the difference between the servers that fail and those few that don't? The answer is: not yet. What are our plans? Well, we need to exercise the vendor support path (we have two vendors we can try here and three options). We also decide to take a stab at running the venerable and slow-moving official Solaris 10 release on one of our failing X4500's and re-run the tests. We also decide to try and swap in a SUN 10GbE card, but we need to buy one and get fiber transceivers and packs for our switches.

The scene fades out with no joy in IT land. Things are collapsing faster than we can send in the operatives to repair them... deadlines are looming, the clock is ticking, we can see the broken tracks in our headlights... to be continued.

Monday, March 30, 2009

UCS, Clouds and HPC

How many buzz terms can I squeeze into a blog title? I wanted to add a few more, like ARRA but enough is enough :)

Cisco's positioning of UCS is as a complete data center infrastructure. They say that clouds are driving data centers towards UCS. However, it's not really a complete data center infrastructure because it leaves out most of the facility components. That means it is designed to be dropped in place into existing data centers. I don't think that's where clouds are going, in fact, I think they are going to the companies that do the most vertical integration all the way to the power generation facility.

So, Cisco is trying to sell a radically new IT architecture into established IT shops (those that have data centers fully or partially populated now and are looking for incremental expansion or replacement).

Given the pre-recession resource realities (oil at $150/barrel and heading higher), the incremental improvements in operating expense might have been enough to sway these IT shops. But that world is now gone, at least for awhile; it will come back.

So the cloud vendors might use UCS but probably few will because if you want to survive in the cloud market you need to innovate from top to bottom (think power generation, think building construction) and drive costs as low as they can go.

Clouds and HPC have been the subject of a certain amount of academic debate. Most of the naysayers are those that want the last ounce of performance, whatever the cost. As you can tell from my previous paragraph, that's not in the clouds... (pun intended). So what is in the clouds is that researchers need to learn how to effectively use cloud resources. In one sense, they are already doing that with grids, like the TeraGrid. But in the large-scale roll-out world, I continually run into researchers and small research teams who follow the 'build it yourself' model of HPC and its close relative: hire an integrator to build it for you. This is the 100-to-1000-'core' market, and while some of it needs the last ounce of performance, the fact is that many of these users can't extract the maximum performance from what they have in the first place. Parallel programming for HPC is just too hard or too obscure. And here is where both the solution and the problem come for cloud vendors: showing these thousands of small research teams how to effectively use cloud resources. How to permanently move large data sets into the cloud infrastructure and thus avoid the nasty performance issues (not to mention billing issues) of multi-terabyte data set access. How to use functional programming models, such as MapReduce, to solve their algorithmic problems.
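To make the MapReduce point concrete, here is a toy word-count in plain Python (no cluster, and the function names are mine, not any vendor's API). The map phase works on each document independently, and the reduce phase merges partial results with an associative operation; that independence and associativity are exactly what let a cloud scheduler spread the work across machines:

```python
from collections import Counter
from functools import reduce

def map_phase(document):
    # Map: each document is processed independently, yielding partial counts.
    return Counter(document.lower().split())

def reduce_phase(left, right):
    # Reduce: merging partial counts is associative, so it can happen
    # anywhere, in any grouping -- the property MapReduce exploits.
    return left + right

def word_count(documents):
    return reduce(reduce_phase, map(map_phase, documents), Counter())
```

A researcher who can phrase a problem in this map/reduce shape gets distribution nearly for free; one who insists on fine-grained shared-memory parallelism does not.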

If this sounds more like consulting, then you are right; it is more like consulting and less like buying generic off-the-shelf X86 servers. And that is also a sign of the issues here, because the paradigm of buying generic, off-the-shelf, PC-inspired X86 servers is well cemented into the bulk of the research community that has not yet moved to HPC.

So far, my experience in trying to set up a software-focused HPC support team for bio-informatics has been met with apathy. Nearly everyone wants to talk hardware: which processor are you using? If I say it really doesn't matter, as long as you can get the job done in an acceptable time frame at acceptable costs, well, they turn away and go back to browsing the hardware vendor web sites. You just gotta have the latest processor. Not using Nehalem yet? Well, get with it, it's so much better.

I do think these things will change, and here is why. Scientific research is as competitive an arena as building and selling products or services. If cloud computing can deliver the productivity gains that I believe it can for research computing, then those who adopt cloud computing will soon be out-competing those that don't (of course, this assumes that at least some of the world-renowned researchers who get big grants sign on to this). It also means that, just like in IT, a small startup (think smaller school here) that is funded well enough to land a top scientist will probably come to the clouds first. That is because they are not hobbled with existing infrastructure, either in data centers or older HPC systems.

The big guys (think large academic research universities) will also 'get it' but claim to not see the demand for cloud computing. I suppose that in the more fully de-centralized universities, less-funded researchers might see the opportunity and use it to do research that at one time could have only been done at a few high-end places...

Hope springs eternal.

Friday, March 20, 2009

RDBMS, Scientific data and Open Source.

OK, I know I recently said I gave up on open source. I must confess that I meant that comment only in the context of US health care software, where powerful entities have too much vested in the current state of affairs. When you have a market that is very large, and health care software in the US is an $80-billion-plus industry, then it's going to get a lot of attention. But in research, especially academic research, things are different. Since the current focus of a lot of academic health care research is on 'translational' or bench-to-bedside work, i.e. technology transfer, one must wonder if market forces will stand in the way. But in a blog post on the ACM's website, I found another possible outcome: what will be needed for some of the largest scientific problems of our day simply will not be developed by the market!

It's been a long time since I let my ACM membership lapse. The press of getting things done with commercial systems sort of made most of CACM irrelevant. In time, good ideas should make their way from the lab to the product, but in this blog entry Michael Stonebraker argues that won't happen with scientific data management; furthermore, he thinks the problem is too big for any one academic institution, hence an organized open-source project.

http://www.cacm.acm.org/blogs/blog-cacm/22489-dbmss-for-science-applications-a-possible-solution/fulltext

For those who don't know Stonebraker, he is one of the "academic to industry" pioneers in RDBMS land, having been behind the creation of Ingres, whose technology later made its way, by way of Sybase, into Microsoft SQL Server.

Tuesday, March 17, 2009

Microsoft in the Clouds.

Interview with Dan Reed:
http://www.hpcwire.com/features/Twins-Separated-at-Birth-41173917.html?viewAll=y

The scoop from Microsoft:
http://research.microsoft.com/en-us/news/features/ccf-022409.aspx

I confess, I have not fully digested nor read through all of this.
But when Microsoft execs make statements like this:

"The specific aspect relative to HPC is that cloud services are game changers, just as commodity clusters were a decade ago and graphics accelerators have been recently. This is not the future, this is the present."

Now add in Azure, and just about anything written to .NET can be turned into software as a service.

Other than nagging doubts about security and robustness, which the big guys will just hammer away at with 'look how big we are, we can't be wrong' approaches, there really is the possibility that the PC as we know it has reached the end of its popular life, much like an Atari game console or an eight-track tape. They still work, but who cares?

Monday, March 16, 2009

eBay scams

Well, as if having to wait for my money to clear was not bad enough, I now have to wait for my buyer to be re-assured by eBay that I am not a fraud before he will pay me. This could take up to 7 days...
So, here is what happened. I listed an item, and it sold. Shortly thereafter, the buyer received an e-mail from someone else claiming to be the seller. That e-mail looked like this (I have removed the buyer's identifying info):

"Hi buyer, My name is Ramsey Connor, I am the seller of the Item#xyz.

Unfortunately, immediately after the listing closed I got a Notification from eBay and I must warn you that my account is unavailable and you cannot pay for the item through Paypal as usual. I sincerely apologize for the inconvenience this has caused and I would be more than willing to help you to complete the transaction immediately. Now I am located in United Kingdom because I have to sign a contract with a company and I know the mainspring of the paypal problem was caused because my account was accessed from different locations, while travelling to UK. Anyway we have a solution, payment approved by eBay for this transaction is Western Union Money Transfer so you can send the payment in a few minutes.

To send a wire transfer you need my full name: RAMSEY CONNOR and country: United Kingdom. You will deduct the transfer fees from the total and you will send the remaining amount. If you want to send the payment with your credit card check the following link for more details:

https://wumt.westernunion.com/WUCOMWEB/staticMid.do?"




So, this URL is embedded and seems to actually go to westernunion.com, which means someone is actually linked up to a WU account. I can only assume that they are setting up and tearing down accounts rapidly so that they can keep on doing this scam, and WU will know who this account belonged to, at least briefly. Maybe the account itself is hijacked... But somewhere the money trail via account numbers can be followed, and there is a person on the other end.

Social networking sites all blur together

Ok, so now that I have blogger linked to twitter, and I think I just linked twitter to facebook, I am hoping that the twitter-like part of facebook will actually come from twitter now!

In researching how to link all these things up, I ran into another strategy besides the feeder strategy that I am using now. That strategy is to sign up for a social networking 'mega' site. This is a web mashup, or something like a mashup, that will consume all your other social networking sites. I don't know, but there is something about these mega sites that reminds me of the days when your browser vendor or search engine vendor wanted to offer you a 'portal'. I finally had to turn all those dang portal pages off, as waiting for the weather in Katmandu to load was plain stupid :)

So, anyway, back to the serial linking exercise. Now, the link from blogger to twitter is done via a third party and uses OpenID. However, the link from facebook to twitter lives inside facebook (they must have an SDK or something) and asks you for your password. Now that is a very bad thing. Facebook, blogger and twitter themselves should all use OpenID, and then we could all stop storing our passwords around the internet with companies large and small, who might someday go out of business and find a going-away bonus for someone by selling a password database!

Tuesday, March 10, 2009

The little guys win again

Yeah, we finally had success joining the organization's multi-domain forest from a linux server. Now, the standard SAMBA stuff would probably work as well, but we found a company called LikeWise (likewise.com) that specializes in connecting unixy OS's to Active Directory. The core component has been open sourced, and I know the SAMBA people are looking to add many of its functions. It can really replace the stock Winbind and obsolete the idmapper. It can eliminate our complex configuration of needing to talk to both AD and a separate LDAP server, or to extend the AD schema for unix.

In other words, the LikeWise folks understand that many of us are stuck out in the leaves of the organization and really can't make changes to the core infrastructure.  We have had vendors ask us to change registry settings on the Domain controllers - Ha, fat chance of that happening without a full code review, security analysis and Microsoft's blessing (that's what big organizations do to cover themselves).  

So, there is still a lot of work left to do, but at least we can see light at the end of the Active Directory tunnel.  

Now for the bad news. We need to switch to linux and abandon OpenSolaris and ZFS. I won't be sorry to see OpenSolaris go, but ZFS is quite nice. Well, there is one other development that I can't talk about yet; a white knight might come riding in and take us out of build-it-yourself mode.

Tuesday, March 3, 2009

Interesting virtualization cost analysis

While I am not going to present the raw data and calculations (they are the private property of my employer), I am going to share the rough 'bottom line' from a customer perspective.

Bottom line: Physical server ~$4,000, equivalent virtual server ~$3,000.

(Update: We continue our memory analysis, and with the workloads we have (java app servers, web servers, vendor applications that used to run one per server) we are seeing, with 100 active slices, average memory utilization of 12% of our available real memory and a peak (several months of data so far) of just under 25% of real memory. That allows us to at least double our estimates of the number of slices; even if I leave a 20% safety margin, our costs can be lowered to about $270/slice/year! That changes the above VM comparison from $3,000 to ~$2,200.)


So two things need to be kept in mind here:

The costs are just about anything that can be identified as devoted to the servers, which includes machine room operating costs (these are actuals, since we are getting bills every month), labor and a 5-year equipment depreciation. The calculations are audited by a financial group outside of my organization, and they insist on pretty good documentation of the costs.

We are also getting some discounts on the hardware and software, but not so much that a very large ISP couldn't get the same or perhaps do even better.

So, without further ado:

An 8GB, quad-core, dual-socket server runs us a little under $4,100/year to keep running.

A 640MB, single CPU virtual server 'slice' costs a little under $370/year.

Cutting the main server's memory down to about 2 GB, which is what most applications we have seen typically need, will really not change the costs very much; say it goes to $4,000/yr. To make the equivalent of this server in 'slices' would take 3 slices to match the memory (which only gives us 3 CPUs) or 8 slices to match the CPUs (which gives us closer to a 4GB server). So let's use 8 as the number: 8 x $370 = $2,960.
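As a sanity check, the slice arithmetic above works out like this (the dollar figures are just the rounded estimates from this post, nothing more):

```python
PHYSICAL_PER_YEAR = 4000   # ~$ per year for the trimmed-down 2 GB server
SLICES_PER_SERVER = 8      # slices needed to match the CPU count
SLICE_PER_YEAR = 370       # ~$ per year for a 640 MB, single-CPU slice

# Cost of replacing one physical server with equivalent slices.
virtual_per_year = SLICES_PER_SERVER * SLICE_PER_YEAR   # 8 x $370 = $2,960

# With the memory-utilization update (~$270/slice/year), the same
# comparison drops further.
updated_virtual_per_year = SLICES_PER_SERVER * 270      # $2,160, i.e. ~$2,200
```

Either way the virtual configuration comes in well under the ~$4,000/yr physical box.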

So, it's pretty easy to see why virtualization is a big win in operating expense (OPEX). But virtualization has much more going for it than that: labor resources stretch very much further, and limited floor space is much better utilized...

SUN CIFS update

It's been about two weeks since I last posted. I have written several things up in my head, but the better part of caution has kept me from posting until now.

As you can find out by looking back into past postings, we have been unable to join our SUN CIFS server to our campus/enterprise Active Directory for going on 6 months now.

To do a sanity check, we decided to attempt a join to our own test Active Directory. That works just fine, so we can conclude that the problem lies in the enterprise configuration that we are trying to use.

Let's have a recap of the elements of this enterprise configuration, since I believe that many others may face this same environment. The AD we are working with is designed to allow distributed management by various units within the organization. It does this by using a hierarchical implementation whose units, in general jargon, are called OUs (organizational units). So our structure is basically this: O=Corporate_name ---- OU=departments. To compare, our local AD has no OUs, and its hierarchy is just one level, the root: O=department.
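For readers who don't live in AD land, the difference between the two join targets is just where the machine account lands in the directory tree. A sketch, using hypothetical names (the real forest uses the organization's own naming, not these):

```python
def machine_account_dn(hostname, container):
    # Distinguished name of the computer object a successful join creates.
    return "CN={0},{1}".format(hostname.upper(), container)

# Hypothetical names, for illustration only.
enterprise_root = "DC=corp,DC=example,DC=edu"
department_ou = "OU=OurDept," + enterprise_root

# A join at the forest root drops the machine account here:
root_join = machine_account_dn("filer01", "CN=Computers," + enterprise_root)

# A join at the delegated OU, which is all our department is allowed,
# has to target the deeper container:
ou_join = machine_account_dn("filer01", department_ou)
```

The one-level local AD is like joining at the root; the enterprise forest forces the OU-level target, and that is the case that trips things up.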

So, SUN's CIFS server can join at the root level, but not at the OU level.

I did take a look at the various posting threads in SUN's CIFS-discuss list. Besides our trail of woe, I also found a trail of woe from Indiana University.

So, do we have any workarounds? Well, there are two possibilities that spring to mind. They both involve taking our existing local AD and hooking it up to the enterprise AD. In one method, we join up as a 'forest' member, but enterprise policies won't allow that. The other method is to set up a one-way trust between our local AD and the enterprise AD. That is still under policy discussion, but should be OK: since the trust is from us to them and not the other way around, we presumably could not do anything 'bad' to compromise the enterprise...

Whether any of these workarounds will actually work is yet to be determined.

Tuesday, February 17, 2009

Raw insanity

Raw digital camera formats were created unique to each manufacturer, as if to drive highly technical photographers insane trying to find raw 'development' nirvana. About every six months I select a couple of my recent photographs and run them through all the available raw processors I can get demos for or own. I then 'pixel peep' the results, which means examining them in great detail at very large blow-ups, often to the point where one can see the pixels.

Since each of these raw developers is a software application, each embodies its engineer creators' vastly different ideas about how any of us want to process our photos. Then each engineer or software team has their own ideas as to what we understand and how they can deliver controls which will help us get optimal results.

I have two strategies for running each program's adjustments: I can take the program's 'default' settings, or I can try to make each program produce an optimal result by adjusting the numerous (and not always identically named) controls.

I am then left with multiple 'versions' of the same image. They are very hard to compare because, despite my best attempts to make them at least have the same level of brightness, contrast and dynamic range, they don't match. And then there is the process of 'comparing'. Absolutely the best software I have found for comparing is for Windows: the FastStone image viewer, which allows one to compare up to 4 photos. Your screen is split into even sections for each photo, and then you can scroll around and blow up the photos; each action is synced in each window, so you are always looking at the same thing in each of 4 different windows. FastStone is one of only two software packages I have found, Windows, Mac or Linux, that does this with up to 4 photos. The other is also Windows-based and is an asset management system called IdImager.

Back to the comparison strategies. There are arguments for either of them, depending upon what kind of photographer you are. If you take hundreds of shots at a time, which digital cameras really encourage, then doing the comparisons using 'default' settings might be a good strategy. However, if you either take a few photos or are very good at pruning your hundreds down to a few, then trying for 'optimal' results might be a good comparison approach. I do both, of course, since I can't make my mind up :)

If you are looking for some 'winner' from me, you are not going to get it. I find a different winner each time I do the tests, and sometimes a different winner with each photo. So maybe it doesn't really matter. But this is exactly the kind of endless technical manipulation inside a GUI, changing parameters sometimes understandably and sometimes not, but always with noticeable results, that is obsessive and ultimately leads to madness.

Oh well, got to go now and check out the latest 'gui' for dcraw....

Friday, February 13, 2009

The Empire Strikes Back, part 10

Or at least it feels that way. My previous post on our CIFS breakthrough would be analogous, in movie plot terms, to Star Wars: A New Hope. This week, the empire strikes back, as we still can't use our system. This is because of a so-far-unknown idmapper problem.

Of course, the empire here is our Active Directory. The reason I say part 10 is that (to continue the movie metaphors) we seem to be stuck in a plot that mixes Groundhog Day with the first two (chronologically released) Star Wars movies.

We start out as a small band of highly technically equipped rebels trying to restore the Republic by integrating all OS's into a commons of functional equality. We struggle through small setbacks, and then a big revelation is made to us that puts us into despair. And then it starts all over again. We never do get to play out "Return of the Jedi".

The "force" of course is computer technology, very powerful and fairly mysterious. Full of young masters who can manipulate it without quite knowing why, and old masters who think the why is embedded in the past events they were part of.

Ok, I have stretched this analogy about as far as it can go without becoming too ridiculous. One good thing did come out of this latest round: we found more people documenting what they are doing with OpenSolaris, CIFS and ZFS here:
http://tinyurl.com/b3uy7o and here: http://tinyurl.com/ak8mxg .

This is promising, as it shows the opensolaris ecosystem is growing. We still don't have anything like an Ubuntu or CentOS for it, but those took a long time to develop in the Linux ecosystem. It's a matter of numbers: how many people are using this stuff, and how many of them will contribute back to the community. In this regard SUN has to tread a fine line, keeping the licensing open enough to attract a fair portion of open source advocates (and putting back their own developments), while still finding a way to package software up such that customers are willing to pay for it, a la RedHat.

Friday, February 6, 2009

a last-second, half-court prayer...and drained it! Nothin' but net!

These are the words our technical guru used to describe Nexenta's delivery of a fix for the CIFS bug CR 6764696. Only one other detail remained: doing the Domain join using a newly created ID with full rights over a pre-created computer container for the server. Why the OU admin ID, which created this new ID, can't do the join itself is a mystery that maybe Nexenta will reveal.
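For anyone in a similar spot, the working sequence looks roughly like the sketch below. The account name, OU path, and domain are placeholders, not our actual values; the only commands I am asserting exist are `dsadd`, `smbadm join`, and `truss`:

```shell
# On the Windows/AD side: pre-create the computer object inside the
# delegated OU (DN is illustrative only).
#   dsadd computer "CN=nasserver,OU=Servers,OU=MedSchool,DC=bigcampus,DC=edu"

# On the OpenSolaris/Nexenta side: join the domain using the newly created
# ID that holds full rights over that pre-created container.
smbadm join -u joinaccount bigcampus.edu

# If the join fails, truss captures the exact call sequence and error
# responses (this is how the original bug was pinned down):
truss -f -o /tmp/join.trace smbadm join -u joinaccount bigcampus.edu
```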

So, congratulations and deep thanks go to Nexenta for this fix.

I didn't explain the last-second allusion, though. It was this: I had given our technical team one more week to look for solutions among the SAMBA and SUN CIFS server options. This morning, in about one hour, we are having a meeting (called by my boss) which is scheduled to discuss what we would do given failure......the options here were ugly.

Wednesday, January 28, 2009

We have seen the enemy in the mirror.

Yesterday, we had one of those events that ended up being a self-inflicted denial of service attack on our entire infrastructure.

Our network provider is very security conscious and monitors all outgoing traffic, turning off switch ports if a device starts exceeding a connections-per-second threshold. The idea here is that such an event is most likely a bad guy using one of our computers to launch an attack on someone else (you know, the zombie army scenario), and by turning off the port that attack gets shut down.

So we had a DNS mis-configuration that we didn't catch and it ended up generating enough network traffic to trigger the security filter. As it turns out, this particular switch port is where all our authentication traffic flows out to the authentication servers. So, nobody could log in to any of our services......
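An illustrative sketch (not our provider's actual system) of the kind of rate threshold involved: a misconfigured resolver retrying in a tight loop looks exactly like a compromised host to a simple connections-per-second monitor.

```python
# Sliding-window rate monitor: count connection attempts in the last
# `window` seconds and flag the port once a threshold is exceeded.

from collections import deque

class RateMonitor:
    def __init__(self, threshold, window=1.0):
        self.threshold = threshold   # max connections allowed per window
        self.window = window         # window length in seconds
        self.events = deque()

    def record(self, timestamp):
        """Record a connection attempt; return True if the port would be cut."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold

mon = RateMonitor(threshold=100)
# A broken resolver retrying every millisecond trips the filter fast:
tripped = any(mon.record(i / 1000.0) for i in range(200))
```

The asymmetry is the point: ordinary traffic never comes close to the threshold, so when something does trip it, the monitor has no way to tell a zombie from a DNS misconfiguration.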

I have no knowledge of other sites, but I know that this is not the first time a security measure has inflicted a wound on us. I am beginning to wonder if all these fancy traffic monitors and protections have actually prevented more damage than they have inflicted on us. This is rather hard to measure. As an example of how hard, you can tell that my asteroid collision shield has been working quite well. In fact, I think you should all send me some money to help defray my costs of preventing the earth from being destroyed.

And this comment leads me to a quote from another blog post that I mentioned on Twitter earlier today. The blog is Lev Gonick's, from Case Western Reserve University, and the URL is:
http://blog.case.edu/lev.gonick/

Top 10 IT Trends for Higher Education in 2009


The post starts off this way: "What happens when tough economic times combine with fatigue across the campus community hyping the latest 'killer app', and the growing intolerance of disruptions to services occasioned by security-related activities." So I think you can sort of see the relationship to my story above. This hyping of the latest killer app also rang a bell, because today I also read another blog post from the Burton Group. I won't put the URL for this one since it's behind a pay wall. But the topic was a pithy dialog between Vitruvius, the famous Roman architect, and Socrates, the famous Greek philosopher, about the death of SOA. Seems like SOA is the IT architect's latest killer app.

Despite the cynicism about trendy technology, it seems we cannot stop holding up the flag of the latest technology and leading the charge. Gonik believes that clouds are the real deal. I tend to believe it as well. I think that tough economic times will result in one of two reactions within IT:

The first kind of reaction will be retrenchment: stop projects which hold risk because they use newer technology, and make our old tried and true (and already purchased) technology last a few more years.

The second kind of reaction will be done by the few, mostly smaller entities, and they will see an opportunity to take a risk and change the competitive landscape by using a transformational technology.

And that leads me back to clouds and why we will be seeing an uptick in cloud computing despite the downturn. I think the last really transformational technology in IT was the PC and it's getting pretty long in the tooth now.

We are starting to see embedded computing in the consumer market taking advantage of broadband services. We have DVD players, gaming consoles and now TVs themselves that can connect to the Internet and stream movies and television shows. We see lots of creativity surrounding the iPhone and other broadband-capable phones and devices. You can go to Best Buy and get an Internet Radio. Digital cameras will soon all have WiFi and GPS built in and will be able to stream up to the web.

What will drive all these devices is basically services built and hosted in the cloud. I watched as the PC transformed IT shops, and it was not because of those of us on the inside; it was because employees acting as consumers could bring the technology in house themselves. And that phenomenon is exactly what is driving cloud services right now: employees acting as consumers are bringing the end points of cloud services into your organization.

So that's the main reason why I think cloud computing can be the next transformational technology in IT, because organized IT is not driving it!

Friday, January 23, 2009

Down the rabbit hole

It constantly amazes me when our IT works. The number of parts and the inter-dependencies among them is just staggering. Then there are the times when it does not work. I have been doing IT for over 25 years and it seems like the amount of failure has been growing. I have mostly been able to overcome and work around these failures, but I think I have met my match......in a few more weeks we will know for sure.

I have been mainly posting on our hardware travails. We install hardware in order to run IT systems. Lying behind most of the hardware troubles have been software troubles. Just to show you how inter-connected this all becomes, and how time just expands beyond the patience of even the most understanding of managers, let me tell you one part of the on-going saga of building a large scale NAS that is cheap enough to attract technically savvy bio-medical researchers.

This tale revolves around a feature of the NAS that we viewed as essential to attracting the largest pool of researchers. As you should understand from the way I am phrasing this, we have to recharge for storage and researchers are free to spend their grant money any way they like.

That feature was using our organization's central Active Directory (AD) for CIFS authentication (CIFS is a Microsoft network file service).

This sounds elementary for a NAS, no? So, the saga starts with us trying to join the NAS server software to the AD. That fails. We eventually discover (by using opensolaris truss debugging) the precise series of commands and failure responses. What it all boils down to is this:

We live within an OU which lies just below the O for the organization. This is a scenario that more than a few large, de-centralized organizations have. They have a structure that looks something like this:

O=BigCampus
OU=Medical School
OU=Engineering School
OU=Law School

and so on. (Our actual structure is a little more complicated because it is a true forest, but nevertheless, the parts that are important are the OU parts.)

Now, the central IT group, whose responsibility it is to run the Active Directory (AD), reserves administrative rights at the top level for themselves, and provides administrative rights to each OU which are limited to that OU.

In the common, simple AD case, all servers have their own entry in a computers sub-tree which lives within the top level O. In the delegated case, the servers have entries in a computers sub-tree which lives within the OU.
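The difference between the two cases is easiest to see in the distinguished names involved. A hedged illustration (the server name, OU names, and domain below are made up for the example, not our real tree):

```python
# Simple vs. delegated AD layouts, shown as the distinguished names (DNs)
# where the machine account for a server would live.

domain_dc = "DC=bigcampus,DC=edu"

# Simple case: the machine account lands in the default CN=Computers
# container directly under the domain root.
simple_dn = f"CN=nasserver,CN=Computers,{domain_dc}"

# Delegated case: the OU admin only has rights inside their own OU, so the
# computer object must live (and may need to be pre-created) in a container
# within that OU.
delegated_dn = f"CN=nasserver,OU=Servers,OU=MedSchool,{domain_dc}"
```

Software that hard-codes the simple-case path will never find, or never be allowed to create, the object in the delegated location, which is the root of the whole saga that follows.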

Enter SUN CIFS. We are using a variant of opensolaris that includes SUN CIFS. SUN CIFS was designed assuming the simple case, and does not work in the delegated case. Our join failed, we got precise error messages, and eventually a bug was filed on our behalf by our NAS software vendor at SUN. That bug is 6691539 and is discussed at length in the opensolaris forum here. As you can see, this was back in April of 2008, and a fix was provided in an opensolaris release. Our vendor backported this fix to an earlier version of opensolaris due to version release controls of their own.

As we discovered, that fix got us further along but still did not address the entire problem. It is now approaching February of 2009 and we still can't join our AD. There is currently one other open bug related to this, and I suspect that it is the next problem in the chain of problems that we encountered. If not, then there are more 'bugs' related to this use case.

I thought we did a good job of describing the overall scenario we have here. Such scenarios are often called use cases nowadays. So we had a use case. We had an initial fault in that use case; that specific fault was repaired, but the entire use case was not fixed, because one specific failure just led to the next specific failure, all related to the same use case.

Maybe that explains why it's going on a year now with no success. Maybe there are other explanations. All I know is that at least three other NAS software sources that we are playing with fully accommodate this use case.

Maybe the next release of OS X Server will finally have ZFS in it and we can switch off of SUN to an Xserve running OS X. After all, the hardware underpinnings of both are x86 platforms....

Maybe, but then maybe we would just restart the hardware woes and find other software problems. This is the rabbit hole.....

This also brings up another problem for an IT group: how long do you wait, when a fix is just another 'any day now' patch away? How many times can you allow that patch to lead to another bug which needs another patch due 'any day now'? Then complicate that with the onset of hardware problems.

When do you call it quits?

If you are a small IT shop and have committed a significant fraction of your budget, then calling it quits puts the shop in jeopardy of not having enough resources to continue.

This was my call, and I think I got it wrong. I am usually able to judge these things and cut our losses early. This time, since I had already cut loose from three previous vendors and one implementation project, I believed that I could not afford to change directions yet another time. Furthermore, I believed that because the core of what we were doing was open source, this would work, eventually. Eventually is not good enough; in the world of organizations, one needs a plan and an estimated due date. Not having one is a death blow. Most managers are used to IT projects being delayed and over budget, often significantly, but those projects always have a plan with expected dates. While we could provide dates based on optimism, after we failed to meet a series of them, our dates became meaningless. Furthermore, as you can see, we have no dates on the core problem resolution.....

From the perspective of management, this situation looks like this:

You (meaning the IT group) said you would deliver a storage solution to us.

You have not done that in over a year (yes this saga has many more chapters in it).

Other IT groups have done this successfully.

We understand that your price target was much lower than the other IT groups. But time is money and your project has lost any savings you promised to deliver.

It's time for us, as managers, to move on.

Monday, January 19, 2009

One victory

Today, we swapped the 64GB of RAM into a spare server we had purchased, and this server POSTs up 64GB! So we have a wonky MB in our other server. This new server follows the pattern of all but one of our X4240's in that the BIOS control of the ILOM IP address does not work.

Maybe some more small victories will start coming our way as this week wears on. Or maybe our good fortune had something to do with the SUN being out all day for the first time in a week.

Hardware sucks

Sucks in many ways. The X4240 is sucking manpower like mad, it is sucking our reputation away with our customers, and it is sucking my patience to the limit.

The latest:

PCI kernel panic and reboot! Using the Myricom CX4 card (long story why this card) and doing some testing, here is what happened:

fault.io.pciex.device-interr dev:////pci@0,0/pci10de,377@a/pci14c1,9@0 faulted but still in service

and this is what is at the pci location:

/pci@0,0/pci10de,377@a/pci14c1,9@0

So, big problems, can't bring this hardware combination into production. Can't wait either, will have to roll back to our older PCI-X platform.

On a completely different server, but still an X4240, the BIOS posts 60GB of memory with 64GB installed. After the SUN tech swapped memory, the problem remained. He attempted to swap the motherboard, but the "new" MB failed to boot. We are still waiting for a resolution.

So, yet another server, another customer, can't go into production.

Oh my. I have often been accused of pulling the trigger too fast on switching vendors when things like this happen. I have been told that these things happen and we just have to work through them.

There is something to that. In the last five years I have tried the following vendors, and moved away from each of them due to intractable problems:

HP, IBM, Dell, Supermicro, Tyan and now SUN.

Is this the way it is? Am I cursed? Either way, it sucks.

p.s. As if this was not enough: on our Thumper platform, the X4500, we have 6 in production as iSCSI servers. These have been active for less than a year. That adds up to 288 one-TB SATA drives. We just had our second drive failure! That's a 0.7% failure rate, and a full year is not yet finished on these! It's a good thing we planned on multiple sources of redundancy here.
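For the record, the failure-rate arithmetic checks out (48 drive bays per unit is the X4500's standard configuration):

```python
# Back-of-envelope check of the drive failure rate quoted above.
drives = 6 * 48          # six X4500 Thumpers, 48 one-TB SATA bays each
failures = 2
rate = failures / drives # fraction failed, inside the first year

print(drives)            # 288
print(f"{rate:.1%}")     # 0.7%
```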

Saturday, January 17, 2009

Isn't hardware fun - episode three.

Revenge of the SSDs.

Previously .... our Intel X25-E and X25-M SSDs actually degraded our performance in ZFS.

But, the SAS drives in the same backplane and on the same controller card also suffered performance degradation. So, we thought there must be something to SUN's refusal to support mixed SAS and SATA in this system.

Then the quest for a separate controller to run the SATA SSDs started. This quest was complicated by the fact that we needed low profile, PCI-E and opensolaris support. I should have remembered that all I had to do was visit Joe Little's Little Notes blog here: http://jmlittle.blogspot.com/2008/06/recommended-disk-controllers-for-zfs.html . That got me to the LSI 3442ER.

Now the next quest began: what to put the 2.5" SSDs into? Not trusting the X4240 backplane (although it does have two SAS cables running to it, suggesting that it might be a split backplane), I started looking around for 2.5" SAS JBOD enclosures. This was harder than I thought. I ended up finding only two. One is by AIC, which is nice and inexpensive but only comes from resellers we have no previous relationship with. The other is from, *BANG THE DRUM*, HP! Wow, HP has two JBODs, the MSA50 and the MSA70. Interestingly enough, under OS support HP claims Solaris works on the MSA70 but makes no such claim for the MSA50. A quick e-mail with our friendly HP storage engineer revealed that we could indeed install any 2.5" SATA/SAS drive we wanted, and that the carriers for such disks come with the empty JBOD. (I had to check; so many vendors make sure their stuff only works with drives supplied in carriers only available from them, and the carriers are not available separately. Hint to SUN.)

Well, when all this stuff arrives and assuming it actually works on the X4240, I will post on whether our attempt to build a SSD enabled ZFS system accomplished anything.

Oooh, for those who would say, why don't we just use the SUN Amber Road series, which already has this stuff working in it? Well, we have already made a substantial commitment to older SUN hardware (Thumpers), and since SUN won't put their latest software developments on their legacy hardware, we thought we would give it a try by re-using an X4240 we already had as a NAS headend to the Thumpers.

Below zero and the datacenter

Isn't hardware fun, episode 2 - Attack of the cold zones.

Today we found out that all three of our diesel generators are reporting low fuel temp and one refuses to run at all. I suppose that our low of -14F had something to do with it. I wonder how widespread this kind of thing is? Of course, our vendor for the generators is just troubleshooting as normal, with no hint that perhaps we have the wrong installation for our climate.

The X4240 saga continues

Previously, on this channel: the Adaptec SAS controller in the X4240 (or perhaps the SAS backplane itself) can talk to both SAS and SATA at the same time, just with no performance whatsoever.


In this episode, another SUN X4240 with 64GB installed POSTs at 60GB. A memory swap doesn't change this, and a motherboard swap was a total failure: the new MB would not boot.
Isn't hardware fun.

Thursday, January 15, 2009

twitterfeed

Thanks to following John Halamka's twitter, I discovered twitterfeed, which will take this blog entry, enter the title into twitter, and provide a tinyURL back to this blog entry.

It was either twitterfeed or switching over to the Flock browser. Flock still has lots of interesting things to offer, but in the never ending browser shuffles it's getting harder and harder for me to leave Firefox. I have too many, probably really too many, extensions loaded and I don't have the time to vet them in Flock.

My earlier comments about openid were because twitterfeed is openid enabled and frankly I am tired of creating accounts right and left, seemingly anytime I want to explore anything new.

Setting up a twitter feed

Well, this was supposed to be easy, but I got sidetracked with openid. I finally decided to create a myopenid ID because I want to make a distinction between what is called a custodial ID and other IDs. To choose a custodial ID I actually decided to trust Microsoft. They list three openid providers for their HealthVault application, and myopenid was one of them.