A few weeks ago the Company I work for celebrated its 13th birthday. As part of the celebrations we were each given the latest copy of the “BoRG” – the company book. On reading through the pages I found one of the teams I’m responsible for received an award.
This isn’t quite as positive as it sounds and I’m pretty sure the incident will become lgendary within the company. The lessons learned, what we did afterward and the forward thinking attitude of our senior management are however truly worth celebrating.
The (rough) Story
Over the summer one of our teams was working on some updates to our product deployment tools (we deploy upwards of 50 new releases across our product portfolio every month). Part of the automated process involves uploading a packaged installer for our software to a download location and updating our web site to point to the update.
Due to a mix-up between environments and configurations, one of our internal tests made it into the outside world. The problem was spotted and resolved fast but something was clearly wrong for this to have been possible.
This alone would have been rather embarrassing however this was about the 6 or 7th significant incident that had come from our operations teams in as many weeks. We’d recently restructured team ownership of parts of the codebase, were making a large number of significant infrastructure, library, test and build changes to our systems – mostly legacy code (code without sufficient tests). Moreover, we added a whole new team onto the codebase with a very different remit in terms of approach and pace, the volume of churn in the code had massively increased.
This was all during the height of holiday and conference season so many of us weren’t fully aware of the inner carnage that had been occurring. Handing over from one manager to the next repeatedly meant we’d not seen the bigger picture.
I returned from Agile 2012 to a couple of mails from my boss (who was now on holiday) to fill me in on what had happened with the (paraphrased) words: “Can I leave this safe in your hands”.
Over the first 2 hours of my return I was briefed by managers and team members on the situation. Everything was sorted, no problems were in the wild but the team’s credibility had taken a beating.
I’d seen similar things happen in other companies and had always been certain of the right course of action. This was the first time I actually felt safe to lead what I knew was right.
I donned my “Lean” hat and started my nemawashi campaign with our senior managers.
I spoke to each manager individually – they were already well aware of the problems which made things much easier. I simply said.
“These problems can’t continue, we’re going to ‘stop the line’. All projects are going to stop until we’re confident that we can progress again safely.”
I went a step further and set expectations on timescales.
We’d be stopping development work for nearly 20 staff for at least a week. We’d monitor progress daily, and approval to continue would be on the condition that we were confident problems would not reoccur.
By lunchtime I had unanimous support. It was described as a “brave” thing to do by our CEO but all agreed it was right.
A side-benefit of Lean is the shared language it provides. In every case when I approached our management team and explained that I wanted to “stop the line” they immediately understood what I meant plus the impact, value and message behind such an action.
Now of course you can’t prevent new problems with hindsight but you can identify patterns of failure and address these. In our case I had a good understanding of what had been going on.
Initially I was strongly against performing a full root-cause analysis. There were half a dozen independent incidents and a strong chance of finger-pointing if we’d gone through these. I was already “pretty sure” where our problems lay. The increased pace had led to a fall in technical discipline coupled with an increased pressure to deliver faster and a lack of sufficient safety net (insufficient smoke tests).
I divided the group into 3 teams to focus on 3 areas.
- “before release” – technical practices
- “at the point of release” – smoke tests
- “after release” – system monitoring
With an initial briefing and idea workshop I stepped back and left the 3 teams to deliver.
The technical practices team developed a team “technical charter”. We brought all participants together for a review, revised and then published this. Individuals have since signed up to follow this charter and we review it regularly to ensure it’s working.
The smoke testing team developed a battery of smoke tests for the most critical customer-facing areas (shopping cart, downloads etc). These are live and running daily.
The monitoring team developed a digital dashboard (that I can still see from my desk every day). This shows the status of the last run of smoke tests (and history), build status, system performance metrics and a series of alerts for key business metrics that would indicate a potential problem with the site – e.g. a tail-off in volume of downloads or invoices.
They also implemented some server-side status monitoring and alerts that we subscribe to via email.
Since these have been in place we *have* had a couple more incidents but in every case we’ve spotted and resolved it early.
Subsequently a couple of the teams have self-selected to perform a root-cause analysis on a couple of issues. This is exactly the behaviour I love to see, it’s not a management push, they simply wanted to ensure we’d pinned things down and done the right thing. Moreover, they published the results to the whole company.