
Entries in Troubleshooting (3)

Wednesday, Sep 15, 2010

Please Stop This Thing, I Want to Get Off: Living the Merry-Go-Round of FAIL

There's this pattern of application failure I've ended up dealing with a lot over the years. Stop me if you've heard this one before.

Our scene opens with a multi-tiered client-server application, let's say, for the purpose of argument, that it's a web app. There are web servers in the front, usually with some sort of load balancer in front of them, then maybe a middle tier application server (SOAP, J2EE, that kind of thing), and some kind of shared state/storage at the back, let's call it a SQL database.

The app is functional, but not fast. Every once in a while, some developer might be assigned to try to make something perform a little better, but mostly everyone's piling on new features.

Then one day, the app just dies. No one can use it, customers are complaining, everyone starts looking around, people poke some, database traces are initiated, and eventually they get it to come back online. Everyone pats each other on the back, they send off an email to management explaining how they fixed the problem, and live happily ever after.

That is, until tomorrow, or that afternoon, or the next week, and then it happens again. And someone (often several someones) makes changes that seem like a good idea in the heat of the crisis - they weren't caught completely by surprise this time, so they notice it earlier and have more time to poke and change. The site comes back up. They declare victory and move on. Until it happens again. Each time, they make the change that fixes the problem. Each time, they assign a root cause and justify why we really fixed it this time. Each time it comes back again.

Here's what's happening:

The system is at what physicists call an unstable equilibrium. Many enterprise IT applications live their lives this way. Think of a pencil balanced on its point - it's possible for it to balance that way, but as soon as something blows on it, over it goes.

All it takes is one little push - a network hiccup, a SAN slowdown, a temporary traffic spike, some user running a report during prime time that they usually only run at night - it could be any one of dozens of things. This starts the wheel going around. Things start to get slower, and because there's a shared back-end, when the back-end gets slow, everything that touches the back-end (which is often everything) gets slow.

Next phase: normal things that used to be able to run just fine at the same time as each other, can't. Maybe the query runs longer, which means it's holding its rows locked longer, which causes the next query to block, etc. Maybe the network had to retransmit the data more than once because a pipe got full and a packet got dropped. Maybe the disk queue got too large, causing disk access to take too long, which caused the queue to get larger. It can be different every time. But what is happening at this point is called a positive feedback loop. The slower things get, the slower they make everything else. This is fatal.
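Here's a toy sketch of that feedback loop, in case seeing it run helps. Everything in it is invented for illustration: the `simulate` function, the arrival rate, the capacity, and the `contention` factor that makes throughput drop as the backlog grows. It is not a model of any real system, just the smallest thing I could write that tips over the same way.

```python
def simulate(spike, ticks=50, arrivals=90.0, base_capacity=100.0, contention=0.01):
    """Return the backlog after each tick.

    A one-off `spike` of extra requests arrives at tick 5. All numbers are
    made up; the only point is that capacity shrinks as the backlog grows.
    """
    backlog = 0.0
    history = []
    for tick in range(ticks):
        incoming = arrivals + (spike if tick == 5 else 0.0)
        # Contention: the more work is queued up, the less gets done per tick.
        capacity = base_capacity / (1.0 + contention * backlog)
        backlog = max(0.0, backlog + incoming - capacity)
        history.append(backlog)
    return history


small = simulate(spike=15)  # a push the system can absorb
large = simulate(spike=30)  # a push past the tipping point

print("small push, final backlog:", round(small[-1], 1))
print("large push, final backlog:", round(large[-1], 1))
```

With these numbers the small push drains away within a tick or two, while the large push leaves a backlog that keeps growing even though the spike itself lasted a single tick and normal traffic never changed. That's the pencil falling over.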

At some point in the downward spiral, a threshold is hit. Maybe the load balancer declares all the web servers dead and stops sending them traffic. Maybe the database gets slow enough that the web servers lose their connections to the DB. Maybe the DB reboots. Could be anything.

Now that there's no more traffic, everything that's fighting for a resource will either time out and die, or get the resource it wants. The resources are no longer being fought over. Things become calm again.

The front end servers reconnect. Everything seems fast again. People believe that whatever they did fixed the problem.

But usually, nothing really got fixed and it's just a matter of time before it starts again. And quite often, all the cowboys running around changing things during the outage without testing or evidence (often saying "well, we can't make it worse, right?") just make the equilibrium a little less stable, so it takes even less of a push to start it rolling again. Continue this for a while, and things will get so unstable that there's no equilibrium, and the system can't even take what used to be normal traffic.

The only long-term solution I've found is patience. When the site comes back up, collect whatever information you gathered during the outage and decide on 2 things:

1) What is the ONE AND ONLY ONE THING you are going to change if this happens again, and

2) What other data do you want to try to collect on the next outage?

Then you get your scripts ready and hope it's really fixed. If you're wrong and it happens again, you know you still have a problem: you make your SINGLE SOLITARY CHANGE, gather what data you can, and once it comes back up, you hope your one change fixed it. If not, you'd better have a new plan for your next change queued up by then. It might even get worse (you are measuring your app's performance, right?), in which case you need to reverse the one change that you made (you did make a back-out plan, right?).
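As a rough sketch of that discipline: one change, one measurement before and after, and a back-out that's written ahead of time. The `Change` class, `try_one_change`, the `pool_size` setting, and the fake latency curve are all hypothetical names and numbers made up for this example; your real measurement will be messier and will take longer than one function call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Change:
    name: str
    apply: Callable[[], None]
    back_out: Callable[[], None]  # written BEFORE the outage, not during it


def try_one_change(change, measure):
    """Apply exactly one change; back it out if the metric got worse.

    `measure` returns a latency-like number, so lower is better.
    """
    before = measure()
    change.apply()
    after = measure()
    kept = after <= before
    if not kept:
        change.back_out()
    return {"change": change.name, "before": before, "after": after, "kept": kept}


# Toy system: a single setting and a fake latency curve (both invented).
settings = {"pool_size": 10}


def fake_latency_ms():
    size = settings["pool_size"]
    # Bigger pools help until the pretend database is oversubscribed.
    return 400.0 / size if size <= 20 else 10.0 * size


bigger_pool = Change(
    name="raise pool_size to 50",
    apply=lambda: settings.update(pool_size=50),
    back_out=lambda: settings.update(pool_size=10),
)

result = try_one_change(bigger_pool, fake_latency_ms)
print(result)
print("pool_size is now", settings["pool_size"])
```

In this toy run the change makes the fake latency worse, so it gets backed out and `pool_size` ends up back at 10. The part worth copying is the shape: because only one thing changed, the before and after numbers actually tell you something.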

One more tip: Try not to focus on the thing that gave the system its initial push. A lot of teams spend hours or days trying to identify what caused the SAN hiccup. They think of this as the root cause - but it's not - it's just the catalyst. The root cause is the inherent instability of the system. That's less pleasant to believe, because it's much harder to fix, but the sooner you start facing reality, the sooner you can apply the resources to the right things and break the cycle.

This is a survivable situation, both as an individual and as a team. It sucks, and if you make changes all over the place, you'll have no control over whether things get better or worse. So calm down, think, measure, change ONE THING AT A TIME, measure again, and roll back if you need to. You CAN get off the Merry-Go-Round of FAIL - you just have to be patient and deliberate.

And whatever you do, try not to fall off the horsey, will you? It's unseemly.

 

Monday, Sep 13, 2010

The Rules - At Least As I See Them (Well, the First Two)

Since I've been dealing with computers, I've developed some rules of thumb.  The first rule seems obvious, although I'm constantly surprised by the people that break it.  It is:

Rule 1: Never run a command on a computer that affects the communications path through which you are connected to that machine.

This is slightly more complicated than it sounds - especially when configuring routing protocols in routers.  You change things such that you lose your routes from where you are to that machine, and it's time for Plan B.  However, for the most part, it's straightforward, which is why it became the first rule.

The second is less obvious, and more controversial, although it's potentially more important.  It is:

Rule 2: System add-ons included solely for the purpose of reducing downtime by means of failover or redundancy will cause more downtime due to bugs or misconfigurations than would have been caused by hardware failures if those add-ons had not been included, unless the level of diligence and effort is greatly increased.

Let's take that a piece at a time.  There are a lot of pieces of tech in the world that people use to protect against hardware failures, like SANs and clusters.  That function, protecting against hardware failures, is inherently complicated, and that means that those pieces of tech have to be inherently complicated.  And Murphy's Law (and experience) tell us that the more things there are that could go wrong, the more likely something will.  In fact, I contend that, unless you go to extraordinary efforts to test every last possible thing that can go wrong, problems with those many complicated bits will cause more problems than would have been caused if you'd just left them off.
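To put some arithmetic behind the "more things that could go wrong" part, here's a small sketch. It assumes each part misbehaves independently with the same probability, which is a simplification, and the 1% figure is purely illustrative, not a measurement of anything. The function name `p_any_failure` is mine.

```python
def p_any_failure(p_each, n_parts):
    """Chance that at least one of n independent parts misbehaves,
    when each one misbehaves with probability p_each."""
    return 1.0 - (1.0 - p_each) ** n_parts


# Same per-part odds, more and more parts in the system.
for n_parts in (1, 5, 20, 50):
    print(n_parts, "parts:", round(p_any_failure(0.01, n_parts), 3))
```

The per-part odds never change in that loop, only the part count does, and the chance of at least one problem climbs steadily anyway. An HA add-on is a way of adding parts.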

I've seen a lot of network outages in my time that were caused by routers that got confused and sent out (or listened to) the wrong routes.  Network engineers have many names for this phenomenon - my favorite is "flapping", and it's a very common happening.  I have seen many fewer network outages caused by router hardware that just dies - and most of those have been routers that spent time in places with very dirty electrical power.  Now of course, I have seen networks that lose routers without any hiccups at all, but those are generally the networks that require "pull tests" (where you unplug routers and make sure things fail over as you expect) after every non-trivial configuration change and on a regular schedule.
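The logic of a pull test is simple enough to sketch. This is a toy, not a tool: `ToyNetwork`, its `advertises_route` map, and the router names are all invented for the example, and a real pull test involves a real cable and a real person watching real traffic.

```python
class ToyNetwork:
    """Reachable while at least one powered-on router advertises the route."""

    def __init__(self, advertises_route):
        # Maps router name -> whether it is configured to advertise the route.
        self.advertises_route = dict(advertises_route)
        self.powered = {name: True for name in advertises_route}

    def reachable(self):
        return any(
            self.powered[name] and self.advertises_route[name]
            for name in self.powered
        )


def pull_test(network):
    """Unplug each router in turn; return the ones whose loss breaks things."""
    breaks_when_pulled = []
    for name in list(network.powered):
        network.powered[name] = False
        try:
            if not network.reachable():
                breaks_when_pulled.append(name)
        finally:
            network.powered[name] = True  # always plug it back in
    return breaks_when_pulled


well_configured = ToyNetwork({"r1": True, "r2": True})
misconfigured = ToyNetwork({"r1": True, "r2": False})  # r2 is redundant on paper only

print("well configured:", pull_test(well_configured))
print("misconfigured:", pull_test(misconfigured))
```

Both toy networks have two routers and both look redundant on a diagram. Only pulling each router in turn shows that the second one falls over when r1 goes away.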

Likewise, I've dealt a lot lately with a network that has regular issues due to "automatic spanning-tree reconfigurations" and a database cluster that blue-screens when the underlying SAN hiccups.

Think about it for a second - there are many different ways that a system can go wrong - many different pieces that can fail in different ways.  What are the odds that the code that is supposed to deal with that specific failure is going to behave exactly as you want it to the very first time that section of code is executed in your environment?

I'm not saying "never use any High-Availability add-on", I'm saying "if you use a High-Availability add-on, either spend far more effort configuring and testing it than you would spend on the non-HA version, or expect it to cause you more problems than you would have had if you'd gone with the non-HA version."

It's okay if you don't believe me.  A lot of vendors have spent a lot of money trying to get you to believe that it isn't true.  But think about it, and start paying more attention to what's causing your enterprise more problems.  After that, I think it will become clear.

Tuesday, Sep 07, 2010

A Tale of Two Table Views - my UISearchBar Race Condition that I finally found

OK, so I finally found the race condition I'd talked about here and here.

So, in my KidChart app, I have a UITableView that has a list of all behaviors that people can pick from:

 

 

and in the search box above, people can start typing to narrow down existing behaviors and then click on one so they don't have to scroll as much. As soon as the UISearchBar gets focus, it does this:

 

Now that tableview that has the 3 items that contain "o" is a different tableview than the one above. It's the tableview that belongs to the UISearchBar, as opposed to the one in my view.

Turns out, though, that under some circumstances, Core Data events get triggered. And those events cause the UITableView to get refreshed. That shouldn't have been the end of the world, except it causes *both* UITableViews to get refreshed: the filtered one that belongs to the UISearchBar, and the underlying, currently hidden one that was there before the UISearchBar got focus. And the mistake I made was that I was reusing the same NSFetchedResultsController behind the scenes for both UITableViews (since I naively expected only one to be updated at a time). Most of the time, it seems one would get completely refreshed before the other refresh started, but sometimes (and more often on the iPhone 4), the calls to cellForRowAtIndexPath would get interleaved, causing unpredictable results and the occasional crash.
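For anyone who wants to see the shape of the bug without an iPhone handy, here's a language-neutral model of it in Python. To be clear, this is not the KidChart code and not UIKit: `FetchController`, `refresh`, `run_interleaved`, and the item names are stand-ins I made up, and the interleaving is forced deterministically with generators rather than happening by accident.

```python
class FetchController:
    """Stand-in for a results controller: holds ONE filtered list at a time."""

    def __init__(self, items):
        self.items = list(items)
        self.results = list(items)

    def set_filter(self, text):
        self.results = [item for item in self.items if text in item]


def refresh(controller, filter_text, cells):
    """Refresh one table, pausing after every step so another can interleave."""
    controller.set_filter(filter_text)
    row_count = len(controller.results)  # plays the role of the row count call
    yield
    for row in range(row_count):
        cells.append(controller.results[row])  # plays cellForRowAtIndexPath
        yield


def run_interleaved(*tasks):
    """Advance each refresh one step at a time, round-robin, until all finish."""
    pending = list(tasks)
    while pending:
        for task in list(pending):
            try:
                next(task)
            except StopIteration:
                pending.remove(task)


ITEMS = ["Homework", "Chores", "Piano", "Bedtime"]  # invented behaviors


def shared_controller_crashes():
    shared = FetchController(ITEMS)
    try:
        # Full table (no filter) and search table ("o") share one controller.
        run_interleaved(refresh(shared, "", []), refresh(shared, "o", []))
    except IndexError:
        return True
    return False


def separate_controllers():
    full_cells, search_cells = [], []
    run_interleaved(
        refresh(FetchController(ITEMS), "", full_cells),
        refresh(FetchController(ITEMS), "o", search_cells),
    )
    return full_cells, search_cells


print("shared controller crashed:", shared_controller_crashes())
print("separate controllers:", separate_controllers())
```

With the shared controller, the full table counts four rows, the search refresh then shrinks the shared results to three, and the full table's request for its fourth row falls off the end. Giving each table its own controller makes the same interleaving harmless.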

Thank you, automated testing.

(And thank you Mike Uricaru, who suggested a way for me to be able to use UIASearchBar.setValue() in the UIAutomation JavaScript - without the race condition, it does work, and it's much faster to run than tapping each key by hand - although I still use a keystroke at a time to test that I've gotten the filtering behavior correct.)