Escaping the Oubliette (Part 3) – Bug Blitz

Reading time ~ 3 minutes

Every product team I’ve ever worked in has run a bug blitz at some point – often one every year or two.

There’s no arguing that a decent bug blitz is a powerful way of getting the numbers down, and clearing all the bugs in a good, solid drive feels great. But the need for one means bugs have been building up from somewhere.

If you find you’re needing a bug blitz on every release, take a look at your defect and debt prevention activities and make sure you’ve got some topping & tailing practices going on.

Bug blitzes are usually performed at the end of a project, but if you’ve not done so before, try having a 4-8 week blitz at the beginning instead. You’ll run quicker afterwards and, if you keep things under control, you won’t have to worry about finding time to mop up at the end.

Once you’ve got the numbers down, set your maximum defect threshold (or ratchet) at this new, lower level and sustain it for the remainder of the project.
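If your defect tracker can export an open count, holding that line can be as simple as a gate in your build or reporting pipeline. Here’s a minimal sketch – the threshold and the tracker query are illustrative placeholders, not any real tool’s API:

```python
# A minimal defect-ratchet gate: fail loudly when the open count
# creeps back above the post-blitz threshold.
import sys

MAX_OPEN_DEFECTS = 40   # illustrative: the level you reached after the blitz

def count_open_defects() -> int:
    # Hypothetical stub - swap in a real query against your tracker's API.
    return 37

if __name__ == "__main__":
    open_now = count_open_defects()
    if open_now > MAX_OPEN_DEFECTS:
        print(f"FAIL: {open_now} open defects breaches the ratchet of {MAX_OPEN_DEFECTS}")
        sys.exit(1)
    print(f"OK: {open_now} open defects is within the ratchet of {MAX_OPEN_DEFECTS}")
```

Run it on a schedule or as a build step; each time you ratchet down, lower the constant.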

What’s good about a bug blitz?

A blitz is particularly useful if you can focus on areas of the product you’ll be working with soon. It’ll get your team working together (particularly if they’re newly formed) and familiar with these areas before all the major work starts.

Couple this with writing some decent automated tests in those areas as part of the defect fixing and you’ll build a much safer scaffold for your new work, reducing regression risk in your next release.

You could take things further and perform some refactoring, but I suggest keeping different types of activity separate at this point and sticking to straight bug fixes. If you have a specific functional area needing a real overhaul, it doesn’t fit the bug blitz mold. I’ll cover this aspect in more depth when I talk about “sponsorship” for debt reduction.

I use bug blitz approaches when training up new staff. I review the defect backlog for a particularly grubby functional area and have a pair of staff take custody of it as caretakers. We pipeline the defects so that they can start with some simple introductory ones (usually low-severity, noise or cosmetic stuff) and, once they’re confident, expand out into adjacent areas – it usually takes a month or two for them to really get warmed up.

The bug blitz is also a great opportunity to start new release development with a clean slate. It means no mixing types of work during feature development and gives you an opportunity to scrub the grime out and get a few new scaffolding tests in place.

What about the downside?

First: they cost time and money. If you dedicate a team (or most of a team) to a blitz, you’re not delivering anything new. This is obvious but important. What’s the impact of a month’s delay to your next release (assuming you could ship with those bugs in)? And what will that delay do to your stakeholder relationships?

Be mindful of the impact a bug blitz can have on your customers. A high level of churn on existing functionality can be really dangerous. You might need to make a point of jumping through a few hoops to retain backwards compatibility.

Don’t be too hasty to refactor if customers are expecting certain things. If you don’t have decent automated regression tests, you really need to ensure you write at least a couple that hit the same area before you fix any defects. (I usually expect my developers to deliver at least 2 or 3 new automated tests with each bug fix).

Worse still, I’ve seen a batch of fixes radically change a functional area’s behavior “for the better” – according to the developers who raised the original internal defects – only for the team to discover that customers were depending on the existing behavior or had built their business processes around the issues.

Sometimes, even if you think it’s ugly, it might be that way for a good reason.

With enterprise software, you’re often looking at heavily customized implementations, some of which have taken years and millions of dollars to evolve.  Smacking these with a mountain of churn on existing functionality can be painful and expensive for your customers. Whilst it makes your numbers look good, consider whether some things should really be touched.

In summary: bug blitzes are a great way of starting with a cleaner work area and ramping up teams, but beware of the backwards-compatibility, customer-impact, time and cost pitfalls.

Read part 4 – “the litter patrol”.

Patterns For Collective Code Ownership

Reading time ~ 5 minutes

Following the somewhat schizophrenic challenge of dealing with the people issues of collective code ownership, here we focus on the practices and practical aspects. How do we technically achieve collective ownership within teams?

We’re using the “simple pattern” approach again. For each pattern we have a suitable “anchor name”, a brief description and nothing more. This should be plenty to get moving, but feel free to expand on these and provide feedback.

Here’s the basic set of patterns to consider:

Code Caretaker

Let’s make it a bit less personal: encourage the team to understand what the caretaker role for code entails. It’s not a single person’s code – in fact, the company paid us to write it for them. Code does still need occasional care and feeding – enter the “Caretaker”. Be wary, though, that caretakers may become owners.

Apprenticeship

Have your expert spend time teaching and explaining. An apprentice learns by doing – the old-fashioned way, under the tutelage of a master craftsman guiding them full-time. This is time- and effort-intensive for the pair involved but (unless you have a personality or performance problem) is a sure-fire way to get the knowledge shared.

If you need to expand to multiple team members in parallel,  skip forward to feature lead & tour of duty.

Ping-Pong Pairing

Try pair programming with an expert and a novice together. Develop tests, then code, then swap places. One person writes tests, the other codes. Depending on the confidence of the learner, they may focus more on writing the tests and understanding the code than on coding solutions. Unequal pairing is quite tricky, so monitor this carefully; your expert will need support in how to share, teach and coach.

Bug Squishing

Best with large backlogs of low-severity defects to start with. Cluster them into functional areas. Set a time-box to “learn by fixing” – delivery is measured not by volume fixed but by level of understanding (demonstrated by a high first-time pass rate on peer reviews). If someone needs help they learn to ask (or try, then ask) rather than handing over half-finished work. This approach works with individuals having peer-review support and successfully scales to multiple individuals operating on different areas in parallel.

A Third Pair of Eyes (Secondary Peer Review)

With either an “initiate” (new learner) or an experienced developer working on the coding, have both a learner and an expert participate in peer reviews on changes coming through – to learn and provide a safety net respectively. The newer developer should be encouraged to provide input and ask questions but has another expert as “backstop” for anything they might miss.

Sightseeing / Guided Tour

Run a guided walk-through for the team – typically a short round-table session. Tours are either led by the current SME (subject matter expert) or by a new learner (initiate). In the case of a new learner, the SME may wish to review their understanding first before encouraging them to lead a walk-through themselves.

Often an expert may fail to identify and share pieces of important but implicit information (tacit knowledge) whilst someone new to the code may spot these and emphasize different items or ask questions that an expert would not. Consider having your initiate lead a session with the expert as a backstop.

Feature Leads

The feature lead approach is one of my favourites. I’ve successfully acted as a feature lead on many areas with teams of all levels of experience where I needed to ramp up a large group on a functional area together.

By introducing a larger team, your expert will not have the capacity to remain hands-on and support the team at the same time. They must lead the team in a hands-off way, providing review and subject/domain expertise only. This also avoids the risk, seen with “apprenticeship”, of simply swapping one single point of failure for a new one.

This is also a potential approach when an expert or prior owner steps in too frequently to undo or overwrite other members’ work rather than coach them. By ensuring they don’t have the bandwidth to get too involved you may be able to encourage some backing-off, but use this approach with care in these situations to avoid personal flare-ups.

Cold Turkey

For extreme cases where “you can’t possibly survive” without a single point of failure, try cold turkey. Force yourself to work without them.

I read an article some years ago (unfortunately I can’t track it down now) where the author explained that when he joined a project as manager, the leaders told him he must have “Dave” on the project as nobody else knew what Dave did. He spent a week of sleepless nights trying to figure out how to get Dave off the project. Once Dave was removed, the team were forced to learn the bottleneck area themselves and delivered successfully.

(Apologies to any “Dave”s I know – this isn’t about any of you)

Business Rule Extraction

Get your initiate to study the code and write a short document, wiki article or blog defining the business rules in the order or hierarchy they’re hit. This is usually reserved for absolute new starters to learn the basics of an area and show they’ve understood it without damaging anything. It is however also a good precursor to a test retrofit.

Test Retrofit

A step up from business rule extraction whilst delivering some real value to the team. Encourage your learner to write a series of small functional tests for a specific functional area by prodding it, working out how to talk to it, what it does and what we believe it should do. Save the passing tests and get the results checked, peer reviewed and approved.

Chances are you might find some bugs too – decide whether to fix them now or later depending on your safety margin with your initiate.
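To make the idea concrete, here’s a minimal sketch of what a retrofitted characterization test might look like. The function under test is an inline stand-in for your real legacy code (in practice your learner would import the module they’re probing), and the pinned behaviors are invented for illustration:

```python
# Characterization tests: pin down what the code *actually* does today,
# before anyone changes it. "quote" is a stand-in for real legacy code.
import pytest

def quote(units: int, unit_price: float) -> float:
    """Stand-in legacy function: 10+ units earn a 5% discount."""
    if units <= 0:
        raise ValueError("units must be positive")
    total = units * unit_price
    return total * 0.95 if units >= 10 else total

def test_bulk_discount_applies_at_ten_units():
    # Pin the observed behavior: 10 units get the 5% discount.
    assert quote(10, 100.0) == 950.0

def test_no_discount_at_nine_units():
    # Pin the observed edge case - document it, don't "fix" it yet.
    assert quote(9, 100.0) == 900.0

def test_zero_units_rejected():
    # Pin the observed failure mode too.
    with pytest.raises(ValueError):
        quote(0, 100.0)
```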

Debt Payment

After a successful test retrofit, you can start refactoring and unit testing. As refactoring is not without risk, this is generally an approach for a more experienced developer, but it offers a sound way of prising out functionality and learning key areas in small parts.

This is also a natural point to start fixing any bugs found during a test retrofit.

National Service (also known as Jury Duty)

Have a couple of staff on short rotation (2-4 weeks) covering support and maintenance, even on areas they don’t know. This is generally reserved for experienced team members who can assimilate new areas quickly, but it’s a great “leveller” for cross-training your team in hot areas.

Risks are generally around decision-making, customer responsiveness, turnaround time and overall lowering of performance but these are all short-term. I’ve often used this approach on mature teams to cross-train into small knowledge gaps or newly acquired legacy functional areas.

My preferred approach is to stagger allocation so that you have one member on “primary” support whilst another provides “backup” each sprint. Whoever covered primary in sprint one moves to backup in sprint two (expand this to meet your required capacity).
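Here’s a minimal sketch of that stagger – the team members and sprint count are illustrative:

```python
# Staggered support rota: each sprint's primary drops to backup the
# sprint after. Names and sprint count are illustrative placeholders.

def support_rota(team: list[str], sprints: int) -> list[tuple[int, str, str]]:
    """Return (sprint, primary, backup) triples; backup wraps around in sprint 1."""
    rota = []
    for s in range(sprints):
        primary = team[s % len(team)]
        backup = team[(s - 1) % len(team)]  # last sprint's primary
        rota.append((s + 1, primary, backup))
    return rota

for sprint, primary, backup in support_rota(["Asha", "Ben", "Cal"], 6):
    print(f"Sprint {sprint}: primary={primary}, backup={backup}")
```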

3-6 months of a national service approach should be enough to cover the most critical functional gaps on your team, based on customer/user demand.

Tour of Duty

Have your staff take longer-term rotations (1-6 months) on a functional area or feature as part of a feature team. This couples up with the “feature lead” approach.

Again, this is an approach that is very useful for mature agile teams that understand and support working as feature teams but can be introduced on less-experienced teams without too much pain.

The tour of duty can also work beyond an individual role. For example a developer taking a 3-6 month rotation through support or testing will provide greater empathy and understanding of tester and user needs than simply rotating through feature delivery. (My time spent providing customer support permanently changed my attitude toward cosmetic defects as a developer)

Adjacency

This takes a lot more planning and management than the other approaches described but is the most comprehensive.

From the areas each team member knows, identify where functional or technical adjacencies lie, then use a subset of the approaches described here to build up skills in each adjacency.

Develop a knowledge growth plan for every team member sharing and leveraging their growth across areas with each other. Although this is an increase in coordination and planning, the value here is that your teams get to see the big picture on their growth and have clear direction.

All these patterns have their merits and pitfalls. Some won’t work in your situation and some may be more successful than others. There are undoubtedly more that could be applied.

Starting in your next sprint, try some of what’s provided here to develop shared ownership and knowledge transfer in your teams. Pick one or more patterns, figure out the impacts and give them a try.

(Coming soon “Building a Case for Collective Code Ownership” – when solving the technical side isn’t enough)

Escaping the Oubliette (Part 2) – Tailing

Reading time ~ 4 minutes

Following my previous articles on topping and debt prevention, we’ll now focus on the easier part: how to clear old defects and debt. We’ll cover four simple strategies – two for defects and two for debt (there are paired similarities across these).


Unlike prevention and stop-the-bleeding activities, which often require significant effort and commitment to accomplish, defect and debt removal have some simple, relatively low-effort options. There are some high-effort/high-impact alternatives that we’ll cover later.

Today we’ll look at…

Tailing & Ratcheting

When you have a build-up of defects over time, you tend to have a “tail” of really old defects and issues. These are usually low- or medium-severity/priority items (with the occasional blip) and are often in functionally gray areas requiring difficult decisions or significant rework.

They age because we don’t want to touch them and would rather forget about them (the oubliette again). A typical defect age distribution looks something like this:

[Figure: a typical defect age distribution – right-skewed beta distribution]

It’s one of the most common statistical patterns I see in software development (and I’ll return to it in future) but for the purposes of this article, I want to look at the “tail” of this curve – the last 5-10% of your defect population – all your oldest items.

Addressing the tail is pretty straightforward. You can either set yourself a target “maximum defect age by a given date” or simply focus on continuous improvement. Whilst the target approach gives you a clear goal, you risk setting yourself up for failure or under-commitment. Defects are considered notoriously hard to size (I’ll cover this myth in future), and chances are that if they’re old they’re probably a bit tricky too, so being predictable about dates is something you might prefer to avoid for starters.

Whether you aim for a specific target or improving every day/week/month you’ll need the same stepped approach.

Identify your oldest defect, pick it off the end of the queue and commit to closing it quickly and fairly.

Here’s the first thing to accept: closing doesn’t mean fixing in every case. Because you’ve not touched it for a while, chances are someone is expecting a solution, so you’ll be looking at a difficult conversation if you don’t fix it – but that might be far easier than “fixing” something that shouldn’t be changed or that will derail your team and product. Make open, fair, honest decisions in each case.

  • If it is something you think you should be fixing, get on with it.
  • If it’s not – close it

Sound familiar?

Aim to close at least one of your oldest defects every week or every sprint.

If you have a lot of items that are considered old, consider increasing your capacity on these in the short term to get the ball rolling. (at the cost of other delivery activities)

If we look back to your distribution of defects over time, when you do close out your oldest defects, put a ratchet mechanism in place that sets a continuously reducing maximum age.  In very bad weeks the ratchet may not improve but don’t let the numbers get worse again or the effort and good faith from your teams and management will have been wasted.

With a ratchet, remember each “notch” is a step change from the last point. For example, if your last point was 1,000 days(!), clearing everything older than that will still leave some defects at nearly 1,000 days old. Set the next ratchet point to 950 days (rather than 997), determine what falls in that next block (950-999 days) and fix those. Ensure each ratchet step is larger than your execution period, otherwise you’ll end up stationary – the surviving defects keep aging a day per day, so a weekly step of 7 days or less just treads water. (I usually go for a ratchet of 50 days’ improvement per week or sprint.)

Here’s what I mean…

[Figure: defects tail – ratcheting out the oldest defects each week]
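To see the arithmetic play out, here’s a rough sketch of the ratchet loop, assuming the weekly cadence and 50-day step suggested above; the defect ages are invented for illustration:

```python
# Age-ratchet sketch: each week the maximum allowed age drops by a fixed
# step and anything over the line must be closed. Ages are illustrative.
STEP_DAYS = 50          # must comfortably exceed the 7 days that pass each week
ages = [1012, 998, 974, 951, 940, 890, 730, 400, 120, 30]  # days open

ratchet = max(ages)     # start at your current worst case
week = 1
while ratchet > 100:    # stop once nothing counts as "too old" for you
    ratchet -= STEP_DAYS
    due = [a for a in ages if a >= ratchet]
    if due:
        print(f"Week {week}: close everything over {ratchet} days -> {due}")
    ages = [a + 7 for a in ages if a < ratchet]  # survivors keep aging
    week += 1
```

Note how the surviving defects gain seven days each week while the ratchet drops fifty – that 43-day net improvement is why the step size matters.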

3 months of this tailing practice every week will dramatically reduce the average and maximum age of issues in your queue. It’ll make you all feel better and it usually helps to draw the heat off on escalations for a while. (Bear in mind you should be doing this in addition to your “topping” approaches).

Occasionally you’ll reopen some old wounds this way but paying these some attention now stops them coming back up when the timing isn’t under your control.

Keep this up for six months or a year and you’ll reach a stable point where nothing gets too old and your team becomes adept at having difficult conversations with your customers early and backs them up with some successes as well.

Eventually, in order to keep the ratchet improving, you’ll have to commit more people. This probably means you’re approaching your natural level of control or entitlement, where defect aging levels off and items are addressed in a reasonably consistent manner. It’s up to you whether you aggressively pursue the numbers down further or sustain them at this level.

In case this sounds too obvious or simple to work, I’m working with a team right now that introduced this ratcheting approach less than a month ago. We’ve reduced our oldest defect age by over 25% and customer escalations are consistently lower already. Perhaps most important to the team, we’re no longer at the top of the charts when our leaders review the defect stats. It sounds harsh, but sometimes having someone else draw the heat for a while lets us get back to focusing on value and priorities.

In part 3 we’ll look at an old favourite – the “bug blitz” – or feel free to skip to part 4 – “the litter patrol”.

Escaping the Oubliette (Part 1) – Topping

Reading time ~ 2 minutes

As promised, how to Escape The Oubliette

So here you are, part of a team at the bottom of the defect & debt pit. What now?

The simple fact is, there’s no one answer, but there is a selection of tools and there is a key.

The key comes courtesy of a comment from Jim Highsmith during one of his sessions at Agile 2010 – my paraphrasing below…

“Teams facing technical debt need both debt reduction and debt prevention strategies.”

When teams face technical debt they pick away at it, removing it piecemeal. After all, you have to eat an elephant one bite at a time, right?

Trouble is, that only works if the elephant isn’t growing as fast as you’re chewing.

You need a two-pronged attack – what I like to call “topping and tailing”. The “top” is all the incoming issues and build-up of new debt. This is where you need to stop the bleeding. The “tail” is the backlog of existing debt. It’s already there – gently rotting in the corner of your product.

Topping

If you’ve read my post on stopping the bleeding, this is the crux of “topping”. You may have a variety of strategies for this, but here are the fundamentals.

Incoming Customer Defects

Ringfence enough capacity to address defects as soon as they come in. Make sure your burn rate matches your incoming rate. If you’re strapped, it’s your call whether you only cover high-severity issues, but for my money I’d hit them all. A multi-year build-up of low-severity issues makes a product ugly and no pleasure for its users, and a series of minor problems rapidly becomes a major burden and a big debt build-up.
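Sizing that ringfence is back-of-the-envelope arithmetic. A minimal sketch, where all three inputs are illustrative assumptions you’d replace with your own measurements:

```python
# Back-of-the-envelope ringfence sizing - all numbers are illustrative.
incoming_per_week = 12     # average new defects arriving per week
avg_fix_days = 0.75        # developer-days per fix, including tests
dev_days_per_week = 4.5    # productive days per developer per week

devs_needed = incoming_per_week * avg_fix_days / dev_days_per_week
print(f"Ringfence ~{devs_needed:.1f} developers to match the incoming rate")
# -> Ringfence ~2.0 developers to match the incoming rate
```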

Addressing every incoming defect doesn’t mean fixing every one. Whilst I generally recommend you do, when trying to dig yourself out of an oubliette the bigger issue is how to keep your goal achievable.

Be critical. The big, nasty bugs are usually pretty obvious but what about the rest?

    • Is it low severity and not important?
      • Is it quick & cheap to fix? – Just fix it.
      • If it’s not – Can you put a safety rail around it? Document it as a limitation or even ignore it?

Negotiate with your customers and make an explicit decision to fix now or never. Once decisions are made, communicated and agreed, close the defects in your backlog.

Getting these conversations happening immediately makes customers far happier than pretending they’ll just go away and never making a decision. (but not quite as happy as fixing their issues).

Incoming Internal Defects

The story here is the same. I’ve called it out separately, as many companies and teams do.

Those incoming internal defects speak volumes about your approach to quality.

Get the discipline in place to fix them when you find them or be brutal and admit which ones will never be fixed but don’t pussyfoot around!

  • If you provide a service to other internal teams, treat them as customer defects.
  • If they’re raised by your own team, are they on new code or exposing existing issues?
    • On new code, fix them! – No excuses.
    • On existing code, the problems are already there. You need to decide if what you’ve done has made it more visible or harder to work with than before and make a judgement call. Fix now or never, resolve or close.

In the next article in this series I’ll cover debt prevention.

Stop The Bleeding

Reading time ~ 2 minutes

One of my favourite verbal patterns is “Stop The Bleeding”. I heard the term come back to me in a management meeting last week in the right context and with the right response – Success!

I’m not exactly sure where I heard this first but it was sometime in the last 4 years. Mike Cohn references it in the context of agile testing but until I searched for the term just now I’d never read that particular article so I’m sure it’s also used elsewhere…

Name: “Stop The Bleeding” or “Plug The Hole”.

Concept: Consider the “flow” of work and issues. Many software activities generate or receive some sort of flow. Whilst you can use a Kanban approach to manage the flow, you need to address the site of the problem first to prevent it getting any worse, and look upstream to consider prevention activities.

Analogies: You have a patient on a stretcher with a series of broken bones and a torn major artery, where do you focus your attention first? – or- You’re in a boat with a leak in the hull. You either stop rowing and start bailing or start rowing and start sinking. Maybe it’s time to fix the leak?  Once you’ve fixed the leak, how about staying away from the rocks?

Usage: The most common times I’ve used this term are in test automation and defect management.

I’ve seen major investments in writing automated regression test suites for existing functionality or old manual tests. Skilled testers are usually in limited supply. Why are we using their valuable capacity on a static problem?

We need to take that capacity and apply it to the issues we continue to create every day – the ones where we’re still bleeding. Write automated tests for our new functionality and changes that we’re introducing right now. These become your regression tests as soon as they start passing!

In defect management a common challenge is customer escalations. These usually happen because we failed to act on a problem in a timely manner. Stopping the bleeding here means addressing new incoming defects promptly and properly. For example; try replicating issues immediately and then negotiating when your customers want a solution rather than just focusing on traditional severity / impact numbers.

Once your flow is under control and you’re working to when customers want a fix, escalations for delinquent issues will trail off fast. Now you can start addressing your other priorities.  Note: The “bleeding” in this second example is customer escalations, not incoming defects – Stopping incoming defects is a subject in itself!