At 5PM one evening about 10 years ago I had a call from the head of application performance at the company I was working for…
“Captain, I just thought you’d like to know you’re currently responsible for the most expensive piece of SQL in the company – worldwide…”
It was using about 95% of the CPU resource on 2 Superdomes. This was a production showstopper; no going home tonight…
Now this was embarrassing. One of my team and I were responsible for a recent complete refactoring of a key component in one of our global systems. We’d tuned the hell out of this thing. Despite trawling a decade’s worth of live data – millions of records – it was so efficient that it returned in under half a second every time, with minimum memory footprint, minimum disk I/O and minimum CPU load. We’d profiled it like crazy under all kinds of data shapes and volumes. We’d taken production transaction volumes, built in contingency and ploughed it through those, and it was just screaming away, no problems. In fact it was one of the best bits of refactoring & performance tuning my co-developer and I had ever done.
Something was seriously wrong. There was no disputing the logs.
As always with these major production issues, my development partner and I pulled an all-nighter on calls to various application groups in the US.
We found an anomaly.
One query was being called about 2,000,000 times a minute!
All our profiling, production comparisons and performance data had been based on a peak load of about 1,000 hits a minute. As a key component of a worldwide order system we knew it would be heavily used and put a reasonable load on the system but this was way beyond our most unrealistic expectations.
All our tuning had been based on a known forecast usage through our standard order management system. The calculation engine was typically called once or twice for every order placed worldwide. (We’d forecast about 500 orders per minute and easily had capacity for twice that volume.)
Further investigation revealed the source.
Earlier that week, a new application from another team had been integrated into production.
There were rumors that the VPs of each team didn’t speak to each other. Either way, although we knew members of the other team, they were on the other side of the world and we didn’t collaborate very often.
This new application was dependent on the same calculation engine. We’d spent some time training their developers on how to interface to it and they were really pleased with the results they were seeing. Once our knowledge transfer was done, that was the end of the story as far as we were concerned.
What we didn’t know was that their initial efforts were so successful that they had integrated it into their product as an auto-calculation system.
Every time a user tabbed from one field to the next on the order line details, it was performing a recalculation.
Now a typical order at this company contained about 100 order lines. And each order line contained ~20 fields. Our calculation engine was being put under nearly 2,000 times more load than had ever been expected!
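As a rough sanity check, the numbers line up. The sketch below uses only the figures quoted above; the variable names are just for illustration.

```python
# Figures quoted in the story; names are illustrative only.
planned_peak_calls_per_minute = 1_000       # what all our profiling assumed
observed_calls_per_minute = 2_000_000       # what the logs showed

lines_per_order = 100                       # typical order
fields_per_line = 20                        # approximate

# One recalculation per field tabbed through on every order line.
recalcs_per_order = lines_per_order * fields_per_line

# How far beyond the planned peak the engine was being pushed.
load_multiplier = observed_calls_per_minute // planned_peak_calls_per_minute

print(recalcs_per_order)   # 2000
print(load_multiplier)     # 2000
```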
Needless to say we had the team fix it fast.
They removed the “auto-price” feature and replaced it with a “recalculate” button on the shopping cart.
1: When you’re writing a system that will be integrated with multiple systems or uncontrolled third-parties, make sure there’s a mandatory part of the interface in place that requires the caller to be clearly identifiable and that your logging is user-friendly. Putting all blame aside, this immediately allows you to identify and isolate unique integration issues.
2: Don’t just train your users/customers how to use your system. Help review their approach and processes and once they’re up & running, get them to show you the results and walk you through what they did. Chances are they will try to do something exotic that you weren’t expecting and would make you squirm. Better to find it early and deal with it than allow it to become a production issue. Remember it’s up to you to be a role model in this collaboration.
3: Identify and set your performance constraints, expectations and limitations up-front. Consider even building them into the system in some configurable way and tell your users/customers what they are. In the days of denial-of-service attacks, secure coding requires us to put throttles into our systems to prevent them running away with resources. (Anyone ever tried to buy Glastonbury tickets online?) Even in smaller internal systems, it’s worth having some idea of volume and usage. Performance defects are notoriously difficult to resolve, are usually showstoppers and often require major architectural rework – we were lucky this time!
4: Many defects are found when your software is used in ways that you and your requirements didn’t foresee. Often that flexibility is a major asset or differentiator but if it’s not, consider putting up safety rails to limit your system to intended usage only. If it “might be useful” to work another way in future, remember YAGNI – You aren’t gonna need it. If you feel you must have that potential flexibility, at least consider putting a lock on it that requires your customer or user to call up or understand and recognize the impact first.
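To make points 1 and 3 a little more concrete, here is a minimal sketch of what a mandatory caller ID combined with a configurable throttle might look like. It is not what we built at the time; `CallerThrottle`, `allow`, `usage` and the caller name `"order-entry-ui"` are all hypothetical, and a real system would more likely enforce this at a gateway or in shared infrastructure.

```python
import time
from collections import defaultdict, deque


class CallerThrottle:
    """Sliding-window call limit per identified caller (illustrative sketch)."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls            # configurable limit per caller
        self.window_seconds = window_seconds  # length of the sliding window
        self._clock = clock                   # injectable so it can be tested
        self._calls = defaultdict(deque)      # caller_id -> recent call times

    def _prune(self, caller_id, now):
        # Drop timestamps that have fallen out of the window.
        recent = self._calls[caller_id]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        return recent

    def allow(self, caller_id):
        """Return True if this caller may proceed, False if over its limit."""
        if not caller_id:
            # Point 1: no anonymous callers, so every call can be attributed.
            raise ValueError("caller_id is mandatory")
        now = self._clock()
        recent = self._prune(caller_id, now)
        if len(recent) >= self.max_calls:
            return False                      # Point 3: refuse instead of melting
        recent.append(now)
        return True

    def usage(self):
        """Calls inside the current window, per caller, for logs or dashboards."""
        now = self._clock()
        return {c: len(self._prune(c, now)) for c in list(self._calls)}


# Example: a caller limited to 3 calls per minute makes 5 in quick succession.
throttle = CallerThrottle(max_calls=3, window_seconds=60)
results = [throttle.allow("order-entry-ui") for _ in range(5)]
print(results)           # [True, True, True, False, False]
print(throttle.usage())  # {'order-entry-ui': 3}
```

With something like this in place, a runaway integration shows up as one named caller hitting its limit in the logs, rather than as an anonymous spike that takes an all-nighter to trace.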