“Captain, we have a customer crisis, get your passport, we need you on the ground tomorrow.”
This would have been fine if it weren't Thursday night, if I hadn't already been pulling 14-hour days for about a month, and if I hadn't promised to take my family out at the weekend to make up for it. Sounds like something out of The Goal, doesn't it?
A couple of years after my first major lesson in application performance, and now working for a different company, I found one of our key customers in South America trying to get their staff through critical money-laundering certifications before a fixed deadline. If the deadline was missed, huge fines were on the cards.
Trouble was, hardly anyone could get onto the system; it kept going down.
Our implementation team had taken our profile and scaling recommendations, added some contingency, and specified the systems for the customer. The customer wasn't happy with the cost, so they'd gone for a cheaper processor and database option, which made us uncomfortable, but this was a huge deal for us. (It was an early version of SQL Server on Windows 2000, and they were trying to scale to nearly 10,000 users.)
I arrived at the customer site. A really great, friendly bunch of people, and one of the most bizarre office locations I'd ever been to. The office had a transient community of something like 1,000 staff, with no other businesses anywhere nearby. Surrounding it was an entire cluster of lanchonetes (snack bars) that opened at lunchtime solely to serve this building.
They'd taken the system offline for the weekend and were running their own testing using LoadRunner. Having done scalability tests for customers before, I thought this seemed pretty simple: ramp users up in waves to the desired levels, test, ramp up more, test again, tune, retest, and so on. We recommended a transaction rate of about half a transaction per user per second as a "heavy" load for this type of application.
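For illustration only (their testing was done with LoadRunner, not hand-rolled code), here's a minimal Java sketch of that ramp-in-waves idea: start a batch of virtual users, hold the load while you measure, then start the next batch. The wave size, hold time, and transaction cadence below are assumptions for the sketch, not our actual test plan.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RampLoadSketch {

    // Illustrative numbers only: 0.5 transactions per user per second was our
    // rough "heavy" target; the wave size and hold time are made up for the sketch.
    static final long TX_PERIOD_MS = 2_000;   // one transaction every 2 seconds per user
    static final int USERS_PER_WAVE = 500;
    static final int WAVES = 4;
    static final int HOLD_MINUTES = 5;

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(64);
        for (int wave = 1; wave <= WAVES; wave++) {
            // Ramp: start another batch of virtual users, each firing on a fixed cadence.
            for (int u = 0; u < USERS_PER_WAVE; u++) {
                pool.scheduleAtFixedRate(RampLoadSketch::doTransaction,
                        0, TX_PERIOD_MS, TimeUnit.MILLISECONDS);
            }
            System.out.printf("Wave %d: %d users active; measure, tune, then ramp again%n",
                    wave, wave * USERS_PER_WAVE);
            TimeUnit.MINUTES.sleep(HOLD_MINUTES); // hold the load and watch response times and errors
        }
        pool.shutdown();
    }

    static void doTransaction() {
        // Placeholder for a real scripted transaction (login, fetch a course page, etc.).
    }
}
```

The point is the shape of the test rather than the tool: each plateau gives you a stable window to compare response times and error rates before committing to more load.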
We reviewed the log files: it was dropping at about 1,000 users, way below where it should have been. I dug into their application servers and modified the JVM parameters for optimum heap size, garbage collection and so on. I'm pretty sure this will work.
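To give a flavour of that kind of tuning (these are generic HotSpot options of the era, not the actual parameters or values we set, and the app server class name is made up), the changes were along the lines of pinning the heap size and switching on GC logging so we could see what the collector was doing:

```sh
# Illustrative only: generic HotSpot options, not the customer's real settings.
# -Xms / -Xmx equal          -> pin the heap size so it never resizes under load
# -XX:NewSize / MaxNewSize   -> size the young generation explicitly
# -verbose:gc                -> log every collection so pause behaviour is visible
java -Xms512m -Xmx512m -XX:NewSize=128m -XX:MaxNewSize=128m -verbose:gc \
     -classpath appserver.jar com.example.AppServer
```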
Tests showed nothing abnormal. We’re looking good to move forward.
Let’s get the users back on…
10:30 AM Monday morning: the system goes down again.
We take a look at the logs.
Within the space of about a minute, nearly 5,000 users tried to simultaneously log into the system.
Back then, large-scale e-commerce event ticketing systems were not common. For an internal system like this, none of our team ever expected we’d have that kind of load just on login. Even 1,000 closely-packed logins seemed wildly unlikely.
What were they doing?
We talked to the managers about the users and established that staff in all branches had two 30-minute breaks per day, in two patterns (half the staff on break at a time).
OK, so what…
This was the only time the staff were able to use their systems for this certification. The rest of the time they were serving customers!
These people's breaks were precious: they had a deadline to meet for mandatory certification, and this was the only time they could do it. With roughly 10,000 staff and half of them starting a break at the same moment, close to 5,000 people were hitting the login screen within the space of a minute.
Solution: longer term, we fixed the performance of the login routines so it would never happen again, but to meet their short-term goals the management team staggered staff break times during the certification period. (Oh, and I'd got the sequencing of one of the JVM HotSpot parameters mixed up, but that wasn't so relevant.)
Problem solved!
Lessons:
1: When designing systems for performance & scalability (even internal ones), you need a good understanding of peak load based on real usage profiles, including extreme ones. This might seem obvious now, but to us back then it wasn't. We thought we knew how the system was going to be used.
2: Spend a day with the real users, not just the "customer". Make a point of understanding their goals and constraints. If you've not done so before, I guarantee you'll learn something unexpected.
3: Sometimes the best solution isn't technical. In this case, a simple work-scheduling change saved our customer a potential fortune in hardware & licenses.
4: Just as in my first performance lesson: if load is problematic in some areas, build safety margins into the system.