Lessons in Application Performance (Part 1)

Reading time ~ 3 minutes

At 5PM one evening about 10 years ago, I had a call from the head of application performance at the company I was working for…

“Captain, I just thought you’d like to know you’re currently responsible for the most expensive piece of SQL in the company – worldwide…”

It was using about 95% of the CPU resource on 2 Superdomes. This was a production showstopper; no going home tonight…

Now this was embarrassing. One of my team and I were responsible for a recent complete refactoring of a key component on one of our global systems. We’d tuned the hell out of this thing. Despite it trawling a decade’s worth of live data (millions of records), it was so efficient that it returned in under half a second every time with minimal memory footprint, disk I/O and CPU load. We’d profiled it like crazy under all kinds of data shapes and volumes. We’d taken production transaction volumes, built in contingency and ploughed it through those and it was just screaming away, no problems. In fact it was one of the best bits of refactoring & performance tuning my co-developer and I had ever done.

Something was seriously wrong. There was no disputing the logs.

As always with these major production issues, my development partner and I pulled an all-nighter on calls to various application groups in the US.

We found an anomaly.

One query was being called about 2,000,000 times a minute!

All our profiling, production comparisons and performance data had been based on a peak load of about 1,000 hits a minute. As a key component of a worldwide order system we knew it would be heavily used and put a reasonable load on the system but this was way beyond our most unrealistic expectations.

All our tuning had been based on a known forecast usage through our standard order management system. The calculation engine was typically called once or twice for every order placed worldwide. (We’d forecast about 500 orders per minute and easily had capacity for twice that volume.)

Further investigation revealed the source.

Earlier that week, a new application from another team had been integrated into production.

There were rumors that the VPs of each team didn’t speak to each other. Either way, although we knew members of the other team, they were on the other side of the world and we didn’t collaborate very often.

This new application was dependent on the same calculation engine. We’d spent some time training their developers on how to interface to it and they were really pleased with the results they were seeing. Once our knowledge transfer was done, that was the end of the story as far as we were concerned.

What we didn’t know was that their initial efforts were so successful that they had integrated it into their product as an auto-calculation system.

Every time a user tabbed from one field to the next on the order line details, it was performing a recalculation.

Now a typical order at this company contained about 100 order lines. And each order line contained ~20 fields. Our calculation engine was being put under nearly 2,000 times more load than had ever been expected!
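The arithmetic behind that figure is worth spelling out. This is a rough sketch using only the numbers quoted in this story; the variable and function names are made up for illustration, and it assumes a user tabs through every field exactly once, which real users won't do precisely.

```python
# Figures quoted in the story; names are illustrative only.
ORDER_LINES_PER_ORDER = 100
FIELDS_PER_ORDER_LINE = 20
FORECAST_ORDERS_PER_MINUTE = 500
FORECAST_CALLS_PER_ORDER = 2  # "once or twice" per order, taking the upper bound
OBSERVED_CALLS_PER_MINUTE = 2_000_000

# What the tuning was based on: roughly 1,000 calls a minute at peak.
FORECAST_CALLS_PER_MINUTE = FORECAST_ORDERS_PER_MINUTE * FORECAST_CALLS_PER_ORDER


def calls_per_order_with_auto_calc(order_lines, fields_per_line):
    """Calls per order if every field change triggers one recalculation.

    Assumption: each field is visited exactly once.
    """
    return order_lines * fields_per_line


def observed_vs_forecast_ratio(observed, forecast):
    """How many times heavier the observed load is than the forecast."""
    return observed / forecast


if __name__ == "__main__":
    per_order = calls_per_order_with_auto_calc(
        ORDER_LINES_PER_ORDER, FIELDS_PER_ORDER_LINE
    )
    ratio = observed_vs_forecast_ratio(
        OBSERVED_CALLS_PER_MINUTE, FORECAST_CALLS_PER_MINUTE
    )
    print(f"calls per order with auto-calc: {per_order}")
    print(f"forecast calls per minute: {FORECAST_CALLS_PER_MINUTE}")
    print(f"observed / forecast: {ratio:.0f}x")
```

One call per field instead of one or two per order is what turns a comfortable forecast into a showstopper.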

Needless to say we had the team fix it fast.

They removed the “auto-price” feature and replaced it with a “recalculate” button on the shopping cart.

Lessons…

1: When you’re writing a system that will be integrated with multiple systems or uncontrolled third-parties, make sure there’s a mandatory part of the interface in place that requires the caller to be clearly identifiable and that your logging is user-friendly. Putting all blame aside, this immediately allows you to identify and isolate unique integration issues.
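As a minimal sketch of lesson 1, here is one way a mandatory caller identity could look in Python. Everything in it is hypothetical (the function name `calculate_price`, the `caller_id` parameter, the order line fields); it is not how our engine was actually built, just an illustration of refusing anonymous calls and keeping a per-caller tally you can read at 3AM.

```python
import logging
from collections import Counter

logger = logging.getLogger("calc_engine")  # hypothetical logger name

# Running tally of calls per caller, so the heaviest user is obvious at a glance.
calls_by_caller = Counter()


class MissingCallerError(ValueError):
    """Raised when a request does not say who is calling."""


def calculate_price(order_line, *, caller_id):
    """Price one order line; caller_id is keyword-only and mandatory."""
    if not caller_id or not caller_id.strip():
        raise MissingCallerError("caller_id is required on every request")
    caller = caller_id.strip()
    calls_by_caller[caller] += 1
    logger.info(
        "calculate_price caller=%s calls_so_far=%d",
        caller,
        calls_by_caller[caller],
    )
    # Stand-in for the real calculation.
    return order_line["quantity"] * order_line["unit_price"]


def busiest_callers(n=3):
    """Return the n callers with the most calls, heaviest first."""
    return calls_by_caller.most_common(n)
```

With something like this in place, the question "who is calling us 2,000,000 times a minute?" becomes a one-line lookup rather than an all-nighter.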

2: Don’t just train your users/customers how to use your system. Help review their approach and processes and once they’re up & running, get them to show you the results and walk you through what they did. Chances are they will try to do something exotic that you weren’t expecting and would make you squirm. Better to find it early and deal with it than allow it to become a production issue. Remember it’s up to you to be a role model in this collaboration.

3: Identify and set your performance constraints, expectations and limitations up-front. Consider even building them into the system in some configurable way and tell your users/customers what they are. In the days of denial-of-service attacks, secure coding requires us to put throttles into our systems to prevent them running away with resources. (Anyone ever tried to buy Glastonbury tickets online?) Even in smaller internal systems, it’s worth having some idea of volume and usage. Performance defects are notoriously difficult to resolve, are usually showstoppers and often require major architectural rework. We were lucky this time!
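To make the throttle idea in lesson 3 concrete, here is a small sketch of a configurable limit using a token bucket, which is one common technique among several. The class name, parameters and limits are assumptions for illustration; the clock is injectable purely so the behaviour can be tested without waiting.

```python
import time


class TokenBucket:
    """Allow roughly rate_per_minute calls, with short bursts up to `burst`."""

    def __init__(self, rate_per_minute, burst, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True if the call may proceed, False if it is throttled."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to the time passed, never above capacity.
        self.tokens = min(
            self.capacity, self.tokens + elapsed * self.rate_per_second
        )
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Hypothetical limit: the forecast 1,000 calls a minute, bursts of 50.
    bucket = TokenBucket(rate_per_minute=1000, burst=50)
    accepted = sum(1 for _ in range(200) if bucket.allow())
    print(f"accepted {accepted} of 200 back-to-back calls")
```

A rejected call is a conversation starter: the caller finds the limit on day one in testing, not on two Superdomes in production.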

4: Many defects are found when your software is used in ways that you and your requirements didn’t foresee. Often that flexibility is a major asset or differentiator but if it’s not, consider putting up safety rails to limit your system to intended usage only. If it “might be useful” to work another way in future, remember YAGNI: you aren’t gonna need it. If you feel you must have that potential flexibility, at least consider putting a lock on it that requires your customer or user to call up, or to understand and recognize the impact, first.
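The "lock" in lesson 4 could be as simple as the sketch below. It is a hypothetical design, not what we shipped: the class, the `triggered_by` values and the approval step are all invented for illustration. The point is only that the unusual mode exists but stays closed until someone has explicitly acknowledged the impact.

```python
class UsageNotApprovedError(PermissionError):
    """Raised when a caller uses a mode they have not been cleared for."""


class CalculationEngine:
    """Toy engine: on-request recalculation is open, per-field is locked."""

    def __init__(self):
        self._auto_recalc_approved = set()

    def approve_auto_recalc(self, caller_id, acknowledged_impact):
        """Unlock per-field recalculation once the impact is acknowledged."""
        if not acknowledged_impact:
            raise ValueError("impact must be acknowledged before unlocking")
        self._auto_recalc_approved.add(caller_id)

    def recalculate(self, order_lines, *, caller_id, triggered_by="user_request"):
        """Total the order; refuse per-field triggers from unapproved callers."""
        if (
            triggered_by == "field_change"
            and caller_id not in self._auto_recalc_approved
        ):
            raise UsageNotApprovedError(
                f"{caller_id} is not approved for per-field recalculation"
            )
        return sum(line["quantity"] * line["unit_price"] for line in order_lines)
```

The default path (a user pressing "recalculate") just works; the expensive path forces the phone call we never had.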

Blame

Reading time ~ < 1 minute

When your project goes wrong because you’re dependent on another team, whose “fault” is it really?

  • When did you identify the risk?
  • What did you do to mitigate it?
  • What relationship did you build with the team you’re dependent on?

When did you start really collaborating to resolve the dependency?

Too often we throw our problems over the wall and blame the people who aren’t there to catch them for our own laziness.

  • Are we too lazy?
  • Are we too busy? (what else is more important?)
  • Are we secretly looking for someone to take ownership of our problem?
  • Are we cynically looking for a scapegoat?
  • Are we just incompetent?

Think about government lobbyists – they spend their entire time fighting for what they need to achieve…

If something is that important to the success of your goals, why aren’t you sat next to the people you need it from making sure it’s on the top of their priority list too?

YOU ARE RESPONSIBLE FOR YOUR OWN COLLABORATION FAILURES

Agile 20xx – It Takes a Team

Reading time ~ 3 minutes

Inspired by this post I just read from Seth Godin. As an average joe, hob-nobbing with CEOs isn’t so likely for most of us but the interpersonal interaction certainly is worth making time for. However…

If you’re paying to learn and thinking of attending Agile 2011 (or any other really big conference) take a team!

For the last couple of years I’ve attended the big US Agile Alliance conferences. I love going although I miss my family; I really enjoy the social interaction, spending time with (on the surface) a bunch of smart, similar-minded people sharing a week of learning, collaboration and fun. There are of course undercurrents (see this old post from Jean Tabaka) but I still find attending these conferences a rewarding experience.

For anyone that’s not been, the Agile 20xx conferences are huge and seem to be getting bigger every year. The number of parallel tracks, all with great presenters, means you really have to make a plan.

Each year I’ve attended, I’ve had the luxury of going out as part of a team. My advice to future attendees is the same – take a team and collaborate on what you want to see.

My goal each year is to walk away with at least 3 key pieces of new thinking that would add value to my teams. The travel and conference fee for just 3 ideas might be excessive but as I said, I also go to learn, collaborate and have fun.

If even 1 of the ideas I bring back is sticky enough to be introduced successfully to a group of nearly a thousand engineers worldwide then I think that’s worthwhile!

So which sessions do you attend?

Personally I tend to steer away from the “rock star” sessions. Most of what they’re presenting is covered in their current or forthcoming books (and I read a lot of those already) so I don’t get much from them. Other members of my team don’t read so much and so will go along to some of these. Get the team to strike a balance but make sure the motivation is to learn and share, not just to meet “famous people”.

The stuff out there from people without book deals is rarer so I make the most of those sessions, but less experienced or lesser-known presenters may mean a higher risk of a not-so-good session. Also keep an eye out for the open space sessions and lightning talks. You’ll find a lot of up-and-coming talent and thinking out there but again, your mileage may vary.

I also tend to look for things that are relevant to my current context and future direction, not just my current role. A big part of my interest is in viewing things from other people’s and roles’ perspectives. For example, I’ve never formally had the job title of “tester” but “Agile Testing” is one of the best books I’ve read that really “gets” cross-role pairing and test strategies. It’s relevant to developers and managers, not just testers!

Others may prefer to spend the entire week in “UX” or “Testing” tracks but generally I strongly discourage spending the entire week in a single track in your own domain – you just don’t learn enough of the other interesting gems. Maybe it’s just me but agile isn’t just about your own specialist area, it’s about the whole team and organization.

Having presented at Agile Cambridge last year, my team and I have submitted a couple of proposals for Agile 2011 as we’re hoping we can start to give something back. There are a lot of presentations out there so our chances are slim. However, despite the number that don’t make it, there are hundreds of great ones that do.

When planning your sessions, take a team and go for a knowledge portfolio – balance risk & reward to get maximum return on your investment. You won’t regret it.