Lessons in Application Performance (Part 2)

Reading time ~ 3 minutes

“Captain, we have a customer crisis, get your passport, we need you on the ground tomorrow.”

This would have been fine if it weren’t Thursday night, if I hadn’t already been pulling 14-hour days for about a month, and if I hadn’t promised to take my family out at the weekend to make up for it. Sounds like something out of The Goal, doesn’t it?

A couple of years after my first major lesson in application performance, and now working for a different company, I found that one of our key customers in South America was trying to get their staff through critical anti-money-laundering certifications before a fixed deadline. If the deadline was missed, huge fines were on the cards.

Trouble was, hardly anyone could get onto the system; it kept going down.

Our implementation team had taken our profile and scaling recommendations, added some contingency and specified the systems for the customer. The customer wasn’t happy with the cost, so they’d gone for a cheaper processor and database option, which made us uncomfortable, but this was a huge deal for us. (It was an early version of SQL Server on Windows 2000, and they were trying to scale to nearly 10,000 users.)

I arrived at the customer site. A really great, friendly bunch of people, and one of the most bizarre office locations I’d ever been to. The office had a transient community of something like 1,000 staff, with no other businesses anywhere nearby. Surrounding it was an entire cluster of lanchonetes (snack bars) that opened at lunchtime solely to serve this building.

They’d taken the system offline for the weekend and were running their own testing using LoadRunner. Having done scalability tests for customers before, this seemed pretty simple: ramp users up in waves to the desired levels, test, ramp up more, test again, tune, retest and so on. We recommended a transaction rate of about half a transaction per user per second as a “heavy” load for this type of application.
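
As a rough, hypothetical sketch of that sizing arithmetic (not the actual LoadRunner scenario we ran, and with made-up wave sizes), the target throughput per ramp-up wave works out like this:

```java
// Illustrative only: the ramp-up arithmetic behind a test plan, assuming
// our rule of thumb of ~0.5 transactions per user per second as a "heavy"
// load. The wave sizes are hypothetical round numbers.
public class RampPlan {
    public static void main(String[] args) {
        double txPerUserPerSec = 0.5;                 // "heavy load" rule of thumb
        int[] waves = {1_000, 2_500, 5_000, 10_000};  // hypothetical user waves
        for (int users : waves) {
            double targetTps = users * txPerUserPerSec;
            System.out.printf("%,6d users -> target ~%,.0f transactions/sec%n",
                    users, targetTps);
        }
    }
}
```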

We reviewed the log files: the system was falling over at about 1,000 users, way below where it should have been. I dug into their application servers and modified the JVM parameters for optimum heap size, garbage collection and so on. I’m pretty sure this will work.
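
For flavour, here’s a minimal sketch of the kind of sanity check involved; the flags in the comment are generic, era-appropriate examples rather than the customer’s actual settings:

```java
// Illustrative only: report what heap the running JVM actually received.
// Typical tuning knobs were the initial/maximum heap size and GC logging,
// e.g. starting the app server with something like:
//   java -Xms512m -Xmx512m -verbose:gc ...
// (generic examples, not the settings used on that engagement)
public class HeapReport {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024 * 1024;
        System.out.println("Max heap   : " + rt.maxMemory() / mb + " MB");
        System.out.println("Total heap : " + rt.totalMemory() / mb + " MB");
        System.out.println("Free heap  : " + rt.freeMemory() / mb + " MB");
    }
}
```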

Tests showed nothing abnormal. We’re looking good to move forward.

Let’s get the users back on…

10:30 AM Monday morning – The system goes down again.

We take a look at the logs.

Within the space of about a minute, nearly 5,000 users tried to simultaneously log into the system.

Back then, large-scale e-commerce event ticketing systems were not common. For an internal system like this, none of our team ever expected we’d have that kind of load just on login. Even 1,000 closely-packed logins seemed wildly unlikely.

What were they doing?

We talked to the managers about the users. We established that staff in all branches had two 30-minute breaks per day, in two patterns (half the staff at a time).

OK, so what…

This was the only time the staff were able to use their systems for this certification. The rest of the time they were serving customers!

These people’s breaks were precious: they had a deadline to meet for a mandatory certification, and this was the only time they could do it. With close to 10,000 staff on the system and half of them starting their break at the same moment, a near-simultaneous spike of around 5,000 logins was exactly what that schedule produced.

Solution: longer term, we fixed the performance of the login routines so it would never happen again, but to meet their short-term goals the management team staggered staff break times during the certification period. (Oh, and I’d got the ordering of one of the JVM HotSpot parameters mixed up, but that wasn’t so relevant.)

Problem solved!

Lessons:

1: When designing systems for performance & scalability (even internal ones), you need a good understanding of peak load based on real usage profiles – including extreme ones. This might seem obvious now, but back then it wasn’t obvious to us. We thought we knew how the system was going to be used.

2: Spend a day with the real users, not just the “customer”. Make a point of understanding their goals and constraints. If you’ve not done so before, I guarantee you’ll learn something unexpected.

3: Sometimes the best solution isn’t technical. In this case, a simple work-scheduling change saved our customer a potential fortune in hardware & licenses.

4: Just as in my first performance lesson: if load is problematic in some areas, build safety margins into the system.

Lessons in Application Performance (Part 1)

Reading time ~ 3 minutes

At 5 PM one evening, about 10 years ago, I had a call from the head of application performance at the company I was working for…

“Captain, I just thought you’d like to know you’re currently responsible for the most expensive piece of SQL in the company – worldwide… “

It was using about 95% of the CPU resource on 2 Superdomes. This was a production showstopper; no going home tonight…

Now this was embarrassing. One of my team and I were responsible for a recent complete refactoring of a key component on one of our global systems. We’d tuned the hell out of this thing. Despite trawling a decade’s worth of live data (millions of records), it was so efficient that it returned in under half a second every time, with a minimal memory footprint, minimal disk I/O and minimal CPU load. We’d profiled it like crazy under all kinds of data shapes and volumes. We’d taken production transaction volumes, built in contingency, ploughed them through it and it was just screaming away, no problems. In fact it was one of the best bits of refactoring & performance tuning my co-developer and I had ever done.

Something was seriously wrong. There was no disputing the logs.

As always with these major production issues, my development partner and I pulled an all-nighter on calls to various application groups in the US.

We found an anomaly.

One query was being called about 2,000,000 times a minute!

All our profiling, production comparisons and performance data had been based on a peak load of about 1,000 hits a minute. We knew that, as a key component of a worldwide order system, it would be heavily used and put a reasonable load on the system, but this was way beyond our wildest expectations.

All our tuning had been based on the forecast usage through our standard order management system. The calculation engine was typically called once or twice for every order placed worldwide. (We’d forecast about 500 orders per minute and easily had capacity for twice that volume.)

Further investigation revealed the source.

Earlier that week, a new application from another team had been integrated into production.

There were rumors that the VPs of each team didn’t speak to each other. Either way, although we knew members of the other team, they were on the other side of the world and we didn’t collaborate very often.

This new application was dependent on the same calculation engine. We’d spent some time training their developers on how to interface to it and they were really pleased with the results they were seeing. Once our knowledge transfer was done, that was the end of the story as far as we were concerned.

What we didn’t know was that their initial efforts were so successful that they had integrated it into their product as an auto-calculation system.

Every time a user tabbed from one field to the next on the order line details, it was performing a recalculation.

Now, a typical order at this company contained about 100 order lines, and each order line contained around 20 fields. Our calculation engine was being put under nearly 2,000 times more load than had ever been expected!
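
A quick back-of-the-envelope check with the round numbers above shows where that multiplier comes from:

```java
// Illustrative arithmetic only, using the round figures from the story:
// the engine was designed around one pricing call per order, but the
// auto-calculation fired on every field of every order line.
public class LoadMultiplier {
    public static void main(String[] args) {
        int linesPerOrder = 100;       // typical order size
        int fieldsPerLine = 20;        // fields tabbed through per line
        int designedCallsPerOrder = 1; // roughly "once or twice per order"

        int actualCallsPerOrder = linesPerOrder * fieldsPerLine; // ~2,000
        System.out.printf("~%,d engine calls per order instead of ~%d: roughly %,dx the expected load%n",
                actualCallsPerOrder, designedCallsPerOrder,
                actualCallsPerOrder / designedCallsPerOrder);
    }
}
```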

Needless to say we had the team fix it fast.

They removed the “auto-price” feature and replaced it with a “recalculate” button on the shopping cart.
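
Purely for illustration (their UI wasn’t written like this, and this isn’t their code), the difference amounts to wiring the pricing call to every field’s focus change versus wiring it to an explicit action:

```java
// Illustrative sketch only: recalculating on every focus change versus an
// explicit "Recalculate" action that calls the engine once per user request.
import javax.swing.*;
import java.awt.event.*;

public class OrderLinePanel {
    private final JTextField quantity = new JTextField(5);
    private final JButton recalculate = new JButton("Recalculate");

    public OrderLinePanel() {
        // What they had (in spirit): a pricing call every time focus left a field.
        quantity.addFocusListener(new FocusAdapter() {
            @Override public void focusLost(FocusEvent e) { callPricingEngine(); }
        });

        // What they moved to: one pricing call, when the user asks for it.
        recalculate.addActionListener(e -> callPricingEngine());
    }

    private void callPricingEngine() {
        // Stand-in for the shared calculation engine call.
        System.out.println("pricing engine called");
    }
}
```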

Lessons…

1: When you’re writing a system that will be integrated with multiple systems or uncontrolled third parties, make sure there’s a mandatory part of the interface that requires the caller to be clearly identifiable, and that your logging surfaces that identity in a readable way. Putting all blame aside, this immediately allows you to identify and isolate integration issues caller by caller. (There’s a rough sketch of this idea, combined with the throttling from lesson 3, at the end of this post.)

2: Don’t just train your users/customers how to use your system. Help review their approach and processes, and once they’re up & running, get them to show you the results and walk you through what they did. Chances are they’ll try to do something exotic that you weren’t expecting and that would make you squirm. Better to find it early and deal with it than allow it to become a production issue. Remember, it’s up to you to be a role model in this collaboration.

3: Identify and set your performance constraints, expectations and limitations up-front. Consider even building them into the system in some configurable way, and tell your users/customers what they are. In these days of denial-of-service attacks, secure coding requires us to put throttles into our systems to prevent them running away with resources. (Anyone ever tried to buy Glastonbury tickets online?) Even in smaller internal systems, it’s worth having some idea of volume and usage. Performance defects are notoriously difficult to resolve, are usually showstoppers and often require major architectural rework. We were lucky this time!

4: Many defects are found when your software is used in ways that you and your requirements didn’t foresee. Often that flexibility is a major asset or differentiator, but if it’s not, consider putting up safety rails to limit your system to its intended usage only. If it “might be useful” to work another way in future, remember YAGNI: you aren’t gonna need it. If you feel you must have that potential flexibility, at least consider putting a lock on it that requires your customer or user to call you up, or at least understand and acknowledge the impact, first.
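
To make lessons 1 and 3 concrete, here’s a minimal sketch under stated assumptions: a hypothetical CalculationGateway (not our original interface) where every caller must identify itself and gets a simple per-caller call budget, so a runaway integration fails fast, and visibly, instead of consuming the database:

```java
// A minimal sketch, not the original interface: callers must identify
// themselves, and each caller gets a simple per-minute call budget.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CalculationGateway {
    private static final int MAX_CALLS_PER_MINUTE = 1_000; // illustrative limit
    private final Map<String, AtomicInteger> callsThisMinute = new ConcurrentHashMap<>();

    public double calculate(String callerId, double[] orderLines) {
        if (callerId == null || callerId.isEmpty()) {
            throw new IllegalArgumentException("callerId is mandatory");
        }
        int calls = callsThisMinute
                .computeIfAbsent(callerId, k -> new AtomicInteger())
                .incrementAndGet();
        if (calls > MAX_CALLS_PER_MINUTE) {
            // In a real system: log the caller's identity, alert, then reject.
            throw new IllegalStateException("Call budget exceeded for caller " + callerId);
        }
        double total = 0;
        for (double line : orderLines) {
            total += line; // stand-in for the real pricing calculation
        }
        return total;
    }

    // A scheduled task would reset the per-minute counters; omitted for brevity.
}
```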