The Joy of Peer Reviews (Part 1 – Code)

Reading time ~ 2 minutes

“Pair programming replaces peer reviews” is a myth in the same way that “agile projects have no documentation” is a myth.

From my experience peer reviews continue to hold a vital place in agile development and software craftsmanship. Unfortunately they are often misunderstood or misapplied.

“Humanizing Peer Reviews” by Karl Wiegers is the best primer I’ve read so far on peer reviews so I’m not going to duplicate Karl’s efforts – I strongly recommend a thorough read. In fact, print a copy and give one to every member of your delivery team to read and discuss.

Like everything else in modern software development, peer reviews are a collaborative team learning experience. Reviewing code properly means both the reviewer and reviewee walk away having learned something and improved their craft.

With code reviews, beyond reviewing for functional correctness (the simplest, most obvious and potentially quickest part of a review), there’s a selection of considerations I expect reviewers to look for. (There are plenty more, and a short illustrative sketch follows the list.)

  1. No code without tests
  2. Good variable naming
  3. Correct use of classes and interfaces
  4. Small methods
  5. Adherence to standards and conventions
  6. Style consistency across the team
  7. Readability
  8. Good test coverage from small tests
  9. Test code follows the same quality standards as production code
  10. Tests describe expected behaviours
  11. All tests pass
  12. Evidence of some test preparation to consider boundary cases, failure modes and exceptions
  13. Clear exception handling, failure modes and explicit boundaries
  14. Sensible error messages
  15. Code smells
  16. Performance risks or tuning opportunities
  17. Security or other “ility” issues.
  18. Opportunities to learn new tricks
  19. New good practices & patterns
  20. Functional correctness
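
To make a few of these concrete, here’s a hypothetical Java sketch of the kind of thing I’d hope to see arrive at review time. The class, the limit and the scenario are invented for illustration, but they touch on items 2, 4, 10 and 14 above: descriptive names, a small single-purpose method, a sensible error message and a test named after the expected behaviour.

    import static org.junit.jupiter.api.Assertions.assertThrows;

    import java.math.BigDecimal;

    import org.junit.jupiter.api.Test;

    // A small, single-purpose method with descriptive names and a clear error message.
    class WithdrawalLimitCheck {

        private static final BigDecimal DAILY_LIMIT = new BigDecimal("2500.00");

        /** Rejects a withdrawal that would take today's total over the daily limit. */
        void validate(BigDecimal withdrawnSoFarToday, BigDecimal requestedAmount) {
            BigDecimal projectedTotal = withdrawnSoFarToday.add(requestedAmount);
            if (projectedTotal.compareTo(DAILY_LIMIT) > 0) {
                throw new IllegalArgumentException("Withdrawal of " + requestedAmount
                        + " would exceed the daily limit of " + DAILY_LIMIT);
            }
        }
    }

    // The test is named after the expected behaviour, not the method it happens to call.
    class WithdrawalLimitCheckTest {

        @Test
        void rejectsWithdrawalThatWouldExceedTheDailyLimit() {
            WithdrawalLimitCheck check = new WithdrawalLimitCheck();
            assertThrows(IllegalArgumentException.class,
                    () -> check.validate(new BigDecimal("2400.00"), new BigDecimal("200.00")));
        }
    }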

Once you have cultural acceptance of the breadth of technical peer reviews, develop your own checklist that everyone supports, and remember that the goal of a review is to share improvement opportunities, not for lazy coders to have someone else find their bugs for them or for staff to step on each other’s toes.

Now…

If you think peer review of code is problematic for your teams, I guarantee that document peer reviews will be a whole lot worse! – see part 2.

The Oubliette

Reading time ~ 2 minutes

An oubliette is a particularly unpleasant dungeon characterized by a well-like opening. During medieval times these were used “to forget” (oublier, in French) about “unwanted guests”.

Imagine how you’d feel stuck at the bottom of an oubliette.

It’s probably one of the least inspiring places to be. Expect feelings of despair & hopelessness. You might find yourself asking “What do I do?”, “Why do I bother?”, “It’d be easier to just curl up and let it end”.

I’ve seen development teams end up in similar situations. Untamed defects build up around them over a period of years until it’s too late. The backlog is so deep there’s no hope of getting out. Even committing the entire team to just fixing bugs for over a year won’t save them.

It starts with a brittle codebase: a prior combination of poor architecture, lack of clear standards, lack of debt management and years of too much corner cutting. Often the underlying culprit is repetitive “business pressure” on a team that has not been empowered to say no. This mountain of cut corners and poor decisions offends the sensibilities of all but the most cavalier of developers, but once the pit starts to get deep and squishy, what incentive is there to improve?

Your business cannot see delivery of features sacrificed to refactor the codebase; in their eyes, it adds no business value!

The gradual drip, drip of quality problems continues as your ability to keep up with demand for new features slowly leaches away. The team slows down further, and it costs more and takes longer to develop, while the business piles on the pressure for even more new features in less time.

The oubliette spirals towards oblivion and like a prisoner at the bottom of the well, your team and codebase become starved of quality.

Don’t let this happen to you.

(Here’s the first part of how to escape and prevent the oubliette)

Stop Working With Blunt Tools

Reading time ~ 2 minutes

Clarke Ching introduced me to a story of two woodcutters: one worked furiously but finished late, whilst the other stopped frequently to sharpen his tools and finished early. (He tells it better than I do.) I think it’s actually based on an Abraham Lincoln quote:

“If I had eight hours to chop down a tree, I’d spend six hours sharpening my ax.”

Let’s think more about developing software the Lincoln way.

What tool sharpening should you do before you start chopping?

If you paid off or prevented some of the debt you were facing before new work started, would it enable your project to run faster, more smoothly, with reduced risk, a lower chance of defects or a lower maintenance cost?

A previous employer had a great strategy. Every new release of the product had two top-priority named features on the priority list: “cleaning up” and “levelling up”.

Cleaning up:

  1. Get all unit and regression tests passing (and keep them there).
  2. Address all build failures and warnings (and keep them under control).
  3. Delete all functionality, code and tests that will have been deprecated for more than 3 releases, and add alerts to functionality that will be removed in the next release (see the code sketch after this list).
  4. Fix all defects that put us below releasable quality before we’ve even started (and keep them there).
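
Step 3 is easier to keep on top of when the deprecation policy is visible in the code itself. Here’s a minimal Java sketch of the idea – the class name and release numbers are invented for illustration:

    import java.util.logging.Logger;

    public class LegacyCsvExport {

        private static final Logger LOG = Logger.getLogger(LegacyCsvExport.class.getName());

        /**
         * @deprecated since release 6.0, scheduled for removal in release 9.0;
         *             use the streaming export instead.
         */
        @Deprecated
        public void export(String outputPath) {
            // The "alert" from step 3: warn loudly while the old path still works.
            LOG.warning("LegacyCsvExport.export is deprecated and will be removed in release 9.0");
            // ... existing export behaviour ...
        }
    }

Compiler deprecation warnings then feed straight into step 2, and the removal list for the next release is already written down in the code.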

Levelling up:

  1. Raise the tools we use to those that are best supported, newest in the market or offer improvements to our working conditions.
  2. Raise the libraries we use to the latest supported versions and address any issues.
  3. Raise the platform versions we’ll support when the product ships and address any issues. (and remove support for obsolete or out of date platforms).

No major release started full-tilt on functional work until these were cleared.

Like all good practices, this isn’t new thinking; it correlates to 3 components of the lean 5S strategy: sort (seiri), straighten (seiton) and shine (seiso). Rather than just describing what was done, here are some tangible benefits of the clean-up & level-up approach…

  1. There were no unpleasant surprises for our customers on a new release. We had a standard platform, support and deprecation policy and kept to it. Our customers liked it when we did predictable things.
  2. For the development teams, levelling up was a common risk. Addressing this at the beginning of a project was a valuable de-risking activity. Where we hit critical problems, we could make clear, early upgrade decisions, and where there were fewer issues we would develop full-time on updated versions throughout the project, with regression on older supported platforms available from prior development cycles.
  3. Removing old parts of the product made life significantly easier for the teams. The reduced testing, regression and maintenance load allowed us to speed up development – much like scraping barnacles off the hull of a boat to help it run faster. Cleaning up also allowed us to take some sensible baseline code metrics before any new work started.
  4. Giving teams space and time to clean and level up before starting functional work had a positive impact on morale. We felt that we were trusted to “do the right thing” rather than “just ship it”. This empowered us to continue doing the right thing throughout the rest of our work.

Give your teams some time out to sharpen their tools and sort, straighten & shine the workshop before new work starts. It will make a difference to the performance of your team and the quality of the end result.

Lessons in Application Performance (Part 2)

Reading time ~ 3 minutes

“Captain, we have a customer crisis, get your passport, we need you on the ground tomorrow.”

This would have been fine if it weren’t Thursday night, if I hadn’t been pulling 14-hour days for about a month already, and if I hadn’t promised to take my family out at the weekend to make up for it. Sounds like something out of The Goal, doesn’t it?

A couple of years after my first major lesson in application performance, and now working for a different company, one of our key customers in South America was trying to get their staff through critical money-laundering certifications before a fixed deadline. If the deadline was missed, huge fines were on the cards.

Trouble was, hardly anyone could get onto the system; it kept going down.

Our implementation team had taken our profile and scaling recommendations, added some contingency and specified the systems for the customer. The customer wasn’t happy with the cost, so they’d gone for a cheaper processor and database option, which made us uncomfortable, but this was a huge deal for us. (It was an early version of SQL Server on Windows 2000, and they were trying to scale to nearly 10,000 users.)

I arrived at the customer site. A really great, friendly bunch of people and one of the most bizarre office locations I’d ever been to. The office had a transient community of something like 1,000 staff with no other businesses anywhere nearby. Surrounding it was an entire cluster of lanchonetes (small snack bars) that opened at lunchtime – solely to serve this building.

They’d taken the system offline for the weekend and were running their own testing using LoadRunner. Having done scalability tests for customers before, this seemed pretty simple: ramp users up in waves to desired levels, test, ramp up more, test again, tune, retest and so on. We recommended a transaction rate of about half a transaction per user per second as a “heavy” load for this type of application.

We reviewed the log files; it was dropping at about 1,000 users – way below where it should have been. I took a dig into their application servers and modified the JVM parameters for optimum heap size, garbage collection and so on. I’m pretty sure this will work.

Tests showed nothing abnormal. We’re looking good to move forward.

Let’s get the users back on…

10:30 AM Monday morning – The system goes down again.

We take a look at the logs.

Within the space of about a minute, nearly 5,000 users tried to simultaneously log into the system.

Back then, large-scale e-commerce event ticketing systems were not common. For an internal system like this, none of our team ever expected we’d have that kind of load just on login. Even 1,000 closely-packed logins seemed wildly unlikely.

What were they doing?

We talked to the managers about the users. We established that staff in all branches had two 30-minute breaks per day, in two patterns (half the staff at a time).

OK, so what…

This was the only time the staff were able to use their systems for this certification. The rest of the time they were serving customers!

These people’s breaks were precious; they had a deadline to meet for mandatory certification, and this was the only time they could do it.

Solution: longer term we fixed the performance of the login routines so it would never happen again, but in order to achieve their short-term goals, the management team staggered break times for staff during the certification period. (Oh, and I’d got the sequencing of one of the JVM HotSpot parameters mixed up, but that wasn’t so relevant.)
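
As an aside, that kind of parameter mix-up is easier to catch if you ask the running JVM what it actually resolved rather than trusting the startup scripts. A minimal sketch (not from the original system) using the standard management beans:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    public class JvmSettingsCheck {

        public static void main(String[] args) {
            // The heap sizes the JVM actually resolved from -Xms / -Xmx.
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            System.out.println("Heap initial: " + memory.getHeapMemoryUsage().getInit());
            System.out.println("Heap max:     " + memory.getHeapMemoryUsage().getMax());

            // The collectors in use - confirms whether the GC options took effect.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println("GC: " + gc.getName());
            }

            // The startup flags in the order the JVM received them.
            System.out.println("JVM args: " + ManagementFactory.getRuntimeMXBean().getInputArguments());
        }
    }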

Problem solved!

Lessons:

1: When designing systems for performance & scalability (even internal ones), you need a good understanding of peak load based on real usage profiles – including extreme ones. This might seem obvious now but to us back then it wasn’t. We thought we knew how the system was going to be used.

2: Spend a day with the real users, not just the “customer”. Make a point of understanding their goals and constraints. If you’ve not done so before, I guarantee you’ll learn something unexpected.

3: Sometimes the best solution isn’t technical. In this case, a simple work-scheduling change saved our customer a potential fortune in hardware & licenses.

4: Just as in my first performance lesson: if load is problematic in some areas, build safety margins into the system.
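
For the login storm above, a safety margin could be as simple as capping concurrent logins and failing fast with a friendly “please try again shortly” message instead of letting the whole system topple over. A rough sketch of the idea – the limit and timeout are invented figures, not the values we used:

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public class LoginGate {

        // Sized from measured capacity; 500 is an invented figure for illustration.
        private final Semaphore slots = new Semaphore(500);

        /** Returns false if the system is too busy to take another login right now. */
        public boolean tryLogin(Runnable doLogin) throws InterruptedException {
            if (!slots.tryAcquire(2, TimeUnit.SECONDS)) {
                return false; // caller shows a friendly "we're busy, try again shortly" page
            }
            try {
                doLogin.run();
                return true;
            } finally {
                slots.release();
            }
        }
    }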

Lessons in Application Performance (Part 1)

Reading time ~ 3 minutes

At 5PM one evening, about 10 years ago, I had a call from the head of application performance at the company I was working for…

“Captain, I just thought you’d like to know you’re currently responsible for the most expensive piece of SQL in the company – worldwide… “

It was using about 95% of the CPU resource on 2 Superdomes. This was a production showstopper; no going home tonight…

Now this was embarrassing. I and one of my team were responsible for a recent complete refactoring of a key component on one of our global systems. We’d tuned the hell out of this thing. Despite it trawling a decade’s worth of live data – millions of records – it was so efficient that it returned in under half a second every time, with minimum memory footprint, minimum disk I/O and minimum CPU load. We’d profiled it like crazy under all kinds of data shapes and volumes. We’d taken production transaction volumes, built in contingency, ploughed it through those, and it was just screaming away, no problems. In fact it was one of the best bits of refactoring & performance tuning I and my co-developer had ever done.

Something was seriously wrong. There was no disputing the logs.

As always with these major production issues, I and my development partner pulled an all-nighter on calls to various application groups in the US.

We found an anomaly.

One query was being called about 2,000,000 times a minute!

All our profiling, production comparisons and performance data had been based on a peak load of about 1,000 hits a minute. As a key component of a worldwide order system we knew it would be heavily used and put a reasonable load on the system but this was way beyond our most unrealistic expectations.

All our tuning had been based on a known forecast usage through our standard order management system. The calculation engine was typically called once or twice for every order placed worldwide. (We’d forecast about 500 orders per minute and easily had capacity for twice that volume)

Further investigation revealed the source.

Earlier that week, a new application from another team had been integrated into production.

There were rumors that the VPs of each team didn’t speak to each other. Either way, although we knew members of the other team, they were on the other side of the world and we didn’t collaborate very often.

This new application was dependent on the same calculation engine. We’d spent some time training their developers on how to interface to it and they were really pleased with the results they were seeing. Once our knowledge transfer was done, that was the end of the story as far as we were concerned.

What we didn’t know was that their initial efforts were so successful that they had integrated it into their product as an auto-calculation system.

Every time a user tabbed from one field to the next on the order line details, it was performing a recalculation.

Now a typical order at this company contained about 100 order lines, and each order line contained ~20 fields, so a single order could trigger roughly 2,000 recalculations instead of one or two. Our calculation engine was being put under nearly 2,000 times more load than had ever been expected!

Needless to say we had the team fix it fast.

They removed the “auto-price” feature and replaced it with a “recalculate” button on the shopping cart.

Lessons…

1: When you’re writing a system that will be integrated with multiple systems or uncontrolled third parties, make sure there’s a mandatory part of the interface that requires the caller to be clearly identifiable, and that your logging is user-friendly. Putting all blame aside, this immediately allows you to identify and isolate unique integration issues.
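
A rough sketch of what that mandatory caller identification might look like – the engine, method and caller names are invented, not the original system’s API:

    import java.math.BigDecimal;
    import java.util.logging.Logger;

    public class CalculationEngine {

        private static final Logger LOG = Logger.getLogger(CalculationEngine.class.getName());

        /** Every caller must identify itself; anonymous requests are refused outright. */
        public BigDecimal priceOrderLine(String callerId, BigDecimal listPrice, BigDecimal discountPercent) {
            if (callerId == null || callerId.isEmpty()) {
                throw new IllegalArgumentException(
                        "callerId is mandatory, e.g. \"order-entry-ui\" or \"batch-repricer\"");
            }
            // User-friendly logging: when load spikes, the culprit is one grep away.
            LOG.info(() -> "Pricing request from " + callerId);

            BigDecimal discount = listPrice.multiply(discountPercent).movePointLeft(2);
            return listPrice.subtract(discount);
        }
    }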

2: Don’t just train your users/customers how to use your system. Help review their approach and processes, and once they’re up & running, get them to show you the results and walk you through what they did. Chances are they will try to do something exotic that you weren’t expecting and that would make you squirm. Better to find it early and deal with it than allow it to become a production issue. Remember it’s up to you to be a role model in this collaboration.

3: Identify and set your performance constraints, expectations and limitations up-front. Consider even building them into the system in some configurable way, and tell your users/customers what they are. In the days of denial-of-service attacks, secure coding requires us to put throttles into our systems to prevent them running away with resources (anyone ever tried to buy Glastonbury tickets online?). Even in smaller internal systems, it’s worth having some idea of volume and usage. Performance defects are notoriously difficult to resolve, are usually showstoppers and often require major architectural rework – we were lucky this time!

4: Many defects are found when your software is used in ways that you and your requirements didn’t foresee. Often that flexibility is a major asset or differentiator, but if it’s not, consider putting up safety rails to limit your system to intended usage only. If it “might be useful” to work another way in future, remember YAGNI – you aren’t gonna need it. If you feel you must have that potential flexibility, at least consider putting a lock on it that requires your customer or user to call you up, or to understand and recognize the impact, first.
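
In the auto-calculation case, a safety rail could have been as simple as shipping the high-volume mode locked behind configuration, so that turning it on becomes a conscious, visible decision rather than an accidental one. A hypothetical sketch – the property name and default are invented:

    import java.util.Properties;

    public class RecalculationPolicy {

        private final boolean autoRecalculateEnabled;

        public RecalculationPolicy(Properties config) {
            // Locked by default: the system only does what it was designed and sized for.
            this.autoRecalculateEnabled = Boolean.parseBoolean(
                    config.getProperty("pricing.auto-recalculate.enabled", "false"));
        }

        /** Recalculate on explicit request, or when the locked mode has been deliberately enabled. */
        public boolean shouldRecalculate(boolean userRequestedRecalculation) {
            return userRequestedRecalculation || autoRecalculateEnabled;
        }
    }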

10 Minute Test Plans

Reading time ~ 2 minutes

Struggling to introduce TDD or acceptance testing? Here’s a really powerful, simple first step that you can perform immediately: how to get your development teams “thinking” about tests, and how to test, with little or no negative impact and only a few seconds of setup time.

This assumes you’ve already broken out your story/feature/activity into developer-sized tasks.

  1. Push the keyboard away and get a sheet of A4 or Letter paper and a pen or pencil.
  2. Draw 2 lines to divide the paper into 3 roughly similar-sized areas.
  3. Pair up with the developer who’s going to work on a task.
  4. Start a timer –  you have 10 minutes.
  5. Spend 3 minutes listing out as many “pass” cases as you can think of between you – all those simple “happy path” things.
  6. Spend 3 minutes listing out as many “fail” cases as you can think of – bad inputs, exceptions etc.
  7. Spend 4 minutes listing all the boundary cases you can think of – e.g. date/time/time zone, rounding, location etc. Often these boundaries depend on your problem domain.
  8. That’s it – you’re done!

OK, this isn’t going to give you perfect code and it does take a bit of practice. Try not to agonize too much about whether these are unit, functional or acceptance tests for now. We’re still learning to fly here.

Much as standing up and collaborating at a whiteboard uses a different part of your brain to coding, pairing up, brainstorming and writing with a good old pencil & paper does a similar job. The discipline of not touching the code for a short period of time and thinking about how you’re going to test, and what could go right or wrong, will significantly improve the quality of what you produce – it usually makes the coding easier too.

Better still – you should now have your first unit tests to write! Pick the simplest ones you have on there and get them working. (I usually start with input validation. I know it’s not directly related to end-user value, but once I get a good grasp on inputs the rest fits in my head more easily.)
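
To make that concrete, here’s what a first batch of JUnit tests lifted straight off the three columns of the sheet might look like – the validator and its limits are invented for illustration:

    import static org.junit.jupiter.api.Assertions.assertFalse;
    import static org.junit.jupiter.api.Assertions.assertTrue;

    import org.junit.jupiter.api.Test;

    class AgeValidatorTest {

        // "Pass" column: the plain happy path.
        @Test
        void acceptsATypicalAdultAge() {
            assertTrue(AgeValidator.isValid(34));
        }

        // "Fail" column: obviously bad input.
        @Test
        void rejectsANegativeAge() {
            assertFalse(AgeValidator.isValid(-1));
        }

        // "Boundary" column: the edges from the last four minutes of brainstorming.
        @Test
        void acceptsTheLowerBoundaryOfZero() {
            assertTrue(AgeValidator.isValid(0));
        }

        @Test
        void rejectsJustAboveTheUpperBoundary() {
            assertFalse(AgeValidator.isValid(131));
        }
    }

    // A deliberately tiny validator, just enough to make the tests runnable.
    class AgeValidator {
        static boolean isValid(int age) {
            return age >= 0 && age <= 130;
        }
    }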

Just Enough Design

Reading time ~ 3 minutes

Some years ago at a prior employer I had the luxury of working with a team delivering a large green-field Java & Oracle project. The requirements were complex and the interfaces, APIs and business logic all needed some pretty exotic thinking to make everything work.

Prior to that project we’d delivered plenty of relatively simple work and been through requirements, design, code, unit test, integration, system test, documentation etc many times. We were generally a “pretty good” team.

We hired a new member – a very experienced, hands-on architect. He brought to the team a whole load of knowledge we were looking for, and more…

After being on board for about 2 weeks he called a meeting with the entire team, hauled us into a room and pointed out just how poor we were at proper design. Moreover, he took control of the situation: he developed a series of design spec templates, guidance and examples, and got the team fully ramped up on UML, capturing design decisions, practices, patterns – the works.

Using our new design knowledge and tools, we moved onto the first critical phase of our green-field project in 2 groups.

Group 1 had to get a working proof of concept to the customer in a matter of weeks; group 2 needed to start designing the way more complex second round of features.

For group 1 (a small pilot team of 2), one of the team did about a week’s research, wrote up the basics and hit the ground running (no real design). Group 2 were not allowed to touch a line of code until the designs were complete!

From memory, start to finish, that first phase took about 3 months.

After the initial work was completed, both groups 1 & 2 progressed onto the next round of features based on the design efforts group 2 had completed.

After about 2 weeks we realized we were having to sacrifice one of the team (our feature lead!) almost full-time to maintain the designs. Coding was completed in a total of about 6 weeks – the fastest coding turnaround we’d ever had for something of this scale, and the functionality was way harder than the first round of work.

After our crash-course in the pain of full-on software design,  our architect reconvened the team to lead a design practices brainstorming session.

“OK, now you know how to do proper software design; of the tools, practices and documents you used, which do you want to keep and which do you want to ditch?”

Our management seemed to have had the foresight to allow our architect this social experiment, knowing full well that the net result would be a major overall team improvement. (The same manager also helped us develop successful business cases for major refactoring efforts – a pretty forward-thinking guy.)

So what did we keep and what was our philosophy?

Philosophy first:

The greatest value in design after the fact is not in what was implemented but why we chose to do it that way (and why not another way).

The second greatest value in design after the fact is for team members (especially new joiners or maintainers) to get a foothold into the codebase and be able to navigate around safely.

With these cornerstones in mind we kept a few things…

1: High level architecture – a verbal or pictorial summary of the general concept and approach. Often just a photo of some legible whiteboard sketches. – our first foothold

2: Top level flow – a sequence diagram defining the overall flow of responsibility between actors. – our main “ladder” into the codebase.

3: Design decisions and rejections (in a wiki/threaded discussion) – why we chose to do some things and why we chose not to do others. – since learning the “why & why not” approach we’ve saved days of ramp-up and maintenance pain on projects.

4: Complex algorithm annotations – for the really gnarly bits. (Avoid this where possible) – draw pictures for these where you can.

5: Public interfaces – as peer- and tech-author-reviewed Javadoc, post-implementation (see the sketch below). I like public interfaces – they’re a great long-term commitment to communicate in a given stable way. In this case they were also a commitment to our customer. Doing a decent job on these saved a world of support pain later.

6: Unit & functional tests – yes, these are design too!
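
For item 5, the level of detail we aimed for looked something like this – the interface and its contract are invented for illustration, not our actual product API:

    import java.math.BigDecimal;

    /**
     * Calculates shipping quotes for a customer order.
     *
     * <p>This is part of the public API: implementations may change between
     * releases, but the contract described here will not.</p>
     */
    public interface ShippingQuoteService {

        /**
         * Returns the cheapest available quote for the given destination and weight.
         *
         * @param destinationCountryCode ISO 3166-1 alpha-2 country code, e.g. "BR"
         * @param weightInKilograms      must be greater than zero
         * @return the quoted price in the account's billing currency
         * @throws IllegalArgumentException if the country code is unknown or the
         *                                  weight is not positive
         */
        BigDecimal cheapestQuote(String destinationCountryCode, double weightInKilograms);
    }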

That’s it! – we ditched a whole world of class diagram hell and parameter definitions. We could still sketch out basic class diagrams when needed, but not at the level of depth needed to generate code from a CASE tool. We ditched all the noise and blurb, and we made it clear why the product was written and behaved the way it did.

So – give your teams an easy leg-up into your code and then explain why it does things rather than telling people what it should do.