This is the third in a series about TDD’s value to humans, following

  1. Keeping my job, in which I lost my manager’s trust, and
  2. A last-minute feature, in which I regained some

Learning the domain

The software I was hired to develop acted as user-facing glue to several non-user-facing infrastructural systems in a highly complex environment. On arrival, I didn’t see the whole picture, and knew I couldn’t understand the implications of changing any given bit of code. Even setting aside the systems outside ours, I didn’t understand what our own software was aiming to do. The Unix command-line tool’s “create” subcommand would have made sense, except that there was also an “insert” which was slightly different. I thought I grokked “update” until I noticed other subcommands for specific kinds of updates. And I was about to believe I had a handle on “assert” when I discovered that “accuse” was a way to speed it up. Yikes!

To (be able to) go fast, (be able to) go well

Making things worse, the code I and my lone teammate had inherited was freshly minted legacy code. If we’d been instructed to go fast, we’d have made dangerous mistakes, and we still wouldn’t be going anywhere near fast enough to make those mistakes tolerable. If we’d been instructed to go safely, we wouldn’t have been able to deliver much, plus some dangerous mistakes would still have slipped through. If we wanted to move quickly and safely, we needed to make a concerted effort to change the properties of our code. In A last-minute feature, we did, and it began to pay off.

Now please go fast

The big refactoring-etc. in that post had been driven in part by the anticipated business need for a graphical UI for non-Unix users, which in turn derived from the demonstrated business need to manage identities for non-Unix platforms in the same way our successful proof of concept was managing them for Unix. For reasons I may never have been privy to, and certainly don’t remember, the need to manage mainframe identities congealed into a hard deadline.

We had 1.5 weeks.

The code was built on Unix- and Kerberos-specific assumptions scattered throughout — some of them hard-coded, some of them invisible, and as yet relatively few of them documented by any kind of automated tests. Discovering and reacting to all those assumptions by building support for our first additional platform, while having it be an unfamiliar platform with its own weird yet-to-be-discovered rules, appeared rather unlikely to be doable at all, and exceedingly unlikely to be doable safely.

And don’t screw up

But maybe. And we had to try. So I notified my teammate and our manager that, in order to avoid office distractions and enable strategic napping, I’d be working from home, around the clock, until further notice. At midnight, I emailed them a summary of the day’s progress and my tasks for the following day which, if completed, would keep the deadline from becoming obviously un-meetable. On any given day, things might not have gone according to plan, and that would’ve been the end of that; but every night, I was able to report that they had, and to spell out what had to happen tomorrow to keep the streak going. A week later, with a couple of days to go, we were still on track, against all reasonable expectation for a schedule with zero slack.

The next day, I went into the office to knowledge-share with my teammate, who had researched the important platform-specific behavior we’d need to implement. I took my understanding, turned it into unit tests, and asked Robert (as I’d asked Bill about that last-minute feature) to validate that they were testing for the desired behavior and only the desired behavior. He caught and fixed a few mistakes in the assertions. Then I went off and made the tests pass.
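The kind of test involved might have looked something like this minimal sketch. The naming rule shown here (uppercase, at most eight characters) and the helper `to_mainframe_id` are my own assumptions for illustration; the post doesn’t say what the real platform-specific behavior was.

```python
import unittest

# Hypothetical sketch: the real platform rules aren't given in the post.
# Assume, purely for illustration, that the mainframe requires identities
# to be uppercase and at most eight characters long.
def to_mainframe_id(username):
    return username.upper()[:8]

class MainframeIdTests(unittest.TestCase):
    # Tests like these are what a reviewer can sanity-check before any
    # implementation exists: do the assertions encode the desired
    # behavior, and only the desired behavior?
    def test_uppercases(self):
        self.assertEqual(to_mainframe_id("alice"), "ALICE")

    def test_truncates_to_eight_characters(self):
        self.assertEqual(to_mainframe_id("christopher"), "CHRISTOP")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(MainframeIdTests)
result = unittest.TextTestRunner().run(suite)
```

The point of the review step is that the tests are the specification: a second pair of eyes on the assertions catches mistakes before they get baked into passing code.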

What size shoe, and when?

That night, our release scheduled for the following evening, I slept deeply. We’d done everything in our power. We would go into production with a credible effort that might prove good enough. Even so, I was worried that it might not. I knew many of my refactorings had outstripped our still-limited test coverage, for which I’d attempted to compensate by carefully making one small mechanical change at a time, reasoning about what to check, and doing lots of manual checking. I was pretty sure the system as a whole wasn’t fundamentally broken, but I was also pretty sure it wouldn’t be thoroughly fine. When we got to production, there was no question whether the other shoe would drop; we hadn’t worked with sufficient discipline to avoid it. The only questions were what size, and when.

A small shoe, immediately

During the Operations team’s usual post-release system checks, they found a big regression: one of my less rigorous refactorings had broken a common usage in an important third-party client I’d forgotten to check. I immediately recalled a smell I’d ignored while doing that refactoring and knew an expedient way to fix the problem. A half hour later, our updated release had indeed fixed the regression, and I’d made myself a note to add a few integration tests with that client.

A big shoe, later

I felt certain we hadn’t seen the last of the fallout. Some months later, it hit us. I don’t remember how we found it, but one invisible assumption remained in the code that hadn’t been flushed out and fixed, and it was a potentially very bad one: a SQL DELETE statement with an insufficiently constraining WHERE clause. Not only had we forced Ops to scour several months of application logs and database backups for signs of wrongly deleted data, but we’d also forced the business to operate for several months, unknowingly, without the data integrity they were relying on. Against all reasonable expectation, again, the actual damage was minimal.
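To make that failure mode concrete, here is a minimal sketch of how an under-constrained DELETE can go wrong. The schema, the `identities` table, and both helper functions are my own invention; the post doesn’t show the actual statement.

```python
import sqlite3

# Hypothetical sketch -- the post doesn't show the real statement or schema.
# Assume an identity-mapping table keyed by both user and platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE identities (user TEXT, platform TEXT, principal TEXT)")
conn.executemany(
    "INSERT INTO identities VALUES (?, ?, ?)",
    [("alice", "unix", "alice@REALM"),
     ("alice", "mainframe", "ALICE01"),
     ("bob", "unix", "bob@REALM")],
)

def delete_identity_buggy(user):
    # Under-constrained: written back when "user" implied "Unix user",
    # this now silently deletes the user's identities on EVERY platform.
    conn.execute("DELETE FROM identities WHERE user = ?", (user,))

def delete_identity_fixed(user, platform):
    # Constrained: the WHERE clause names everything that identifies the row.
    conn.execute("DELETE FROM identities WHERE user = ? AND platform = ?",
                 (user, platform))

delete_identity_buggy("alice")
survivors = conn.execute(
    "SELECT COUNT(*) FROM identities WHERE user = 'alice'").fetchone()[0]
print(survivors)  # 0: the Unix identity was deleted along with the mainframe one
```

A test asserting that unrelated rows survive a delete is exactly the kind of invisible assumption a test suite can flush out before production does.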


It proved to have been crucial that, by the time this deadline popped up, we’d already greatly improved the health of our code and the extent of our tests and knowledge. If either had been any worse, we’d have missed the deadline, made more costly mistakes, or both. Instead I was able to test-drive sometimes, manually test fairly effectively, and tolerate — as a time-boxed exception — working mostly alone at an unsustainable pace with an unsustainable lack of discipline. And because we had made ourselves able to do those things, we delivered what we understood the business to need, when they needed it.

But only because we lucked out. The unseen, unmanaged risk we incurred by working in this way, even for a short time, could have come back to bite us much harder than it did; since it happened not to, I earned more trust. When I was given responsibility for managing the product half a year later, I was told that this particular outcome had been the one that sealed the deal. In other words, the career opportunity to learn about software product development from an entirely different perspective came as a direct result of having chosen to pursue TDD. And when I landed in that role, I looked for ways to manage risk without having to be so lucky.

Test-driven ways.