software development

Test Trade-Offs

TL;DR: Software developers often decide what tests to write based on technical considerations. Instead they should decide what tests to write based on what feedback is missing. The Test Trade-Offs Model can be used to make better decisions around what tests to write. Useful dimensions to use when deciding what type of tests you should write are primarily: speed of feedback, coverage and variation.

The current thinking in the software development industry is to have a lot of low-level unit tests, fewer integration tests, and even fewer higher-level tests like system or end-to-end tests. The Test Pyramid, shown below, is a common model used to describe the relative amounts or ratios of the different types of tests we should aim for.

Traditional Test Pyramid

This kind of thinking generally focuses on how quickly the tests run – i.e. speed of feedback – and also how easy it is to write the different types of tests. Both of these are technical considerations. The problem I have with this thinking is it ignores the primary reason we have tests in the first place – to get feedback about our system. If technical considerations govern the types of tests we have, there may be a large number of tests we will never write, and thus a lot of feedback we’re not getting. For example, having lots of low-level unit tests doesn’t give us any information about how the system works as a whole. Evidence of this phenomenon is the multitude of memes around unit testing not being enough. Some of my favourites (click the pictures for the original Tweets):

Unit testers be like

Focusing on technical considerations only leads us to make blind trade-offs: we’re not even aware of other dimensions we should be considering when deciding which tests to write. The Test Trade-Offs Model was developed so that teams trade-offs when deciding which tests to write, by making other trade-of dimensions explicit. The model is predicated on the idea that different tests are valuable to different audiences at different times, for different reasons.

The dimensions currently in the model are:

  • Speed: How quickly does the test execute? How long do we have to wait to get the feedback the test gives us?
  • Coverage: How much of the system (vertically) does the test exercise? In general, the higher the coverage, the more confident we are about the behaviour of the system as whole, since more of the system is being exercised. Coverage is also known as scope or depth.
  • Variation: How many near-identical variations of the test are there? E.g. if a test has lots of inputs, there may be very many combinations of inputs, with each combination requiring its own test. (This article is useful for more on this idea.)

In an ideal world, our tests would execute instantaneously, cover the entire system, and would deal every combination of inputs and states as well. Therefore, the ideal test would score very highly in all dimensions. Unfortunately this is not possible in the real world since some of the dimensions have an inverse affect on others. The image below is a causal loop diagram showing the causal relationships between dimensions.

Causal Loop Diagram
  • An increase in Coverage generally leads to a decrease in speed of feedback. This is because the more of the system covered by the test, the longer the test takes to run.
  • An increase in Variation typically leads to a decrease in coverage. With high variation, there is usually a very high number of tests. If the suite of tests is to complete running in a reasonable timeframe, we usually decrease the coverage of these tests.

As the model shows, no test can ever maximise for all dimensions. Any test will compromise on some of the dimensions. We therefore need to choose which dimension to prioritise for a test. This is the trade-off. Each test should prioritise one of the dimensions. The trade-off of priorities should be based on what feedback about the system we need.

For example, if we need tests that give us information about the behaviour of the whole system, which will be valuable for a long time, we’re most likely willing to compromise on speed of execution and variation. The trade-off is now explicit and deliberate. Traditionally we would have ruled out such a test immediately because it would take too long to run.

The way I’d like to see the model being used is for teams to decide what system feedback they’re missing, decide what trade-offs to make, and then what kind of tests to write.

I believe this to be the the first iteration of the model, I expect it to evolve. I’m certain there are other dimensions I haven’t yet included, perhaps even more important dimensions. What dimensions do you use when deciding what type of tests to write? What dimensions do you think should be added to the model?

I would like thank Louise Perold, Jacques de Vos and Cindy Carless who helped me refine my thinking around this model and who helped improve this article.

software development

The States, Interactions and Outcomes Model

TL;DR: The States, Interactions and Outcomes model provides a way for cross-functional teams to collaboratively explore, specify and document expected system behaviour.

Specification by Example (SbE) and Behaviour-Driven Development (BDD) can be an incredibly effective way for teams to explore and define their expectations for the behaviour of a system. The States, Interactions and Outcomes Model provides a set of steps, and a lightweight documentation structure for teams to use SbE and BDD more effectively. The best way of conveying the model is through a worked example.

Worked example
To demonstrate the model and the process, I will take you through applying it to a problem I use frequently in coaching and training. Imagine we are creating software to calculate the total cost of purchased items at a point of sale. (This problem is inspired by Dave Thomas’ Supermarket Pricing Kata and here.) You walk up to a till at a supermarket, hand the check-out person your items one-by-one, and the checkout person starts calculating the total of the items you want to purchase. The total is updated each time the checkout person records an item for purchase.

We would like to include a number of different ways of calculating the total price for purchased items, since the supermarket will want to run promotions from time to time. Some of the pricing methods we would like to include are:

  • Simple Pricing: the total cost is calculated simply by adding up the cost of each individual item recorded at the point of sale.
  • Three-for-Two Promotion: By three of any particular item, pay for only two. This promotion is specific to the type of item being sold. For example, buy three loaves of Brand-X bread, pay for only two.
  • Combo Deal: A discount is applied when a specific combination of items is purchased.
  • Bulk Discount: A discount is applied when more than a specific number of a particular item is purchased.

In this article I will deal with only ‘Simple Pricing’ and ‘Three-for-Two Promotion’. I will deal first with ‘Simple Pricing’ completely, and then start with ‘Three-for-Two Promotion’.

Simple Pricing

  • System boundaries: We are concerned only with the way the total for the purchased items is calculated. We are not concerned with things like how the cost of an item is acquired (e.g. barcode scanning), accepting payment etc.
  • Types of inputs: For Simple Pricing, the only input is the price of the item being recorded – item price.
  • Types of state: What affects calculating the total price besides item price? For Simple Pricing, the total after recording an item – the new total – is determined by both the price of the captured item, as well as the total before the item is captured. Therefore state consists of current total.
  • Outcome dimensions: For Simple Pricing, the outcome consists only of the total calculated as a result of capturing an item – new total.
  • Possible values for state types: Current total is an integer, which can be negative, 0, or positive.
  • Possible values for inputs: Item price is an integer, which can be negative, 0, or positive.

Expected outcomes for combinations of state and inputs:

State Interaction Outcome Scenario Name
Current total Capture item that costs New total Error
0 0 0 Free first item
0 10 10 First item
10 10 20 Second item
0 -10 ERROR – item price can’t be negative First item with negative price
10 -10 ERROR – item price can’t be negative Second item with negative price
10 ABCDEF ERROR – invalid input Text input

Three-for-Two Promotion

  • System boundaries: The system boundaries don’t change compared to Simple Pricing.
  • Types of inputs: For Three-for-Two Promotion the type or name of the item is now also required as an input – item type.
  • Types of state: The outcome is now also affected by two other types of state: the types of items already captured – already captured items; and the type of Promotion currently active – Active Promotion.
  • Outcome dimensions: For Three-for-Two Promotion, the outcome consists of new total, as well as the new list of items that have been captured – new captured items.
  • Possible values for state types: Current total is an integer, which can be negative, 0, or positive. Active Promotion is a complex type. It can be ‘none’ or a promotion for a specific type of item, e.g. ‘Buy 3 Cokes, pay for 2’.
  • Possible values for inputs: Item price is an integer, which can be negative, 0, or positive. Already captured items specifies the quantity and types of items already captured.

Expected outcomes for combinations of state and inputs:

State Interaction Outcome Scenario Name
Active promotion Current total Items already captured Capture That costs New total New captured items Error
20 2 Cokes Coke 10 30 3 Cokes 3rd item with no promotion
Buy 3 Cokes pay for 2 20 2 Cokes Coke 10 20 3 Cokes 3rd qualifying item with 3 for 2 promotion
Buy 3 Cokes pay for 2 20 1 Coke, 1 bread Coke 10 30 2 Cokes, 1 bread 3rd item doesn’t trigger promotion

There are several interesting things about the specifications above to which I’d like to draw particular attention:

  • All the words and concepts used are domain-level words and concepts. There are no implementation or software-specific words.
  • The specification describes the transactions and outcomes only, not how the work should be done.
  • The things that determine the outcome of a transaction are super-obvious and explicit. This makes it easier to detect and discuss edge cases.
  • Invalid states and interactions are easy to see.
  • The path to any particular state is clear and obvious
  • Should we want to, it would be easy to automate the verification of a system which should satisfy these specifications.

As mentioned above, I developed and use this model during my coaching and training. It has proven very effective for quickly exploring and documenting system behaviour. In some BDD Bootcamps, we have explored and specified legacy systems running in productions in about 3 hours. One of the ways this has proven useful is people in the bootcamp who have not worked on those particular systems gained a very thorough high-level overview of the intention of the system.

The worked example above follows these steps:
1. Explicitly define and bound the system under specification. What is included, what is excluded?
2. What are the different inputs to the system?
3. What are the types of state that the system can have? Another way to ask this: Besides the inputs, what can affect the outcome of an interaction?
4. What constitutes system outcome? Is any output returned to the user? Note that an outcome must, by definition, include all states as identified above. Outcome can also include error conditions.
5. For each type of state, what are the possible values?
6. For each type of input, what are the possible values?
7. For each combination of state and interaction, what is the expected outcome (including all dimensions)?

The Thinking Behind The Model
The idea behind the model is that the outcome of a system interaction is a function of the interaction and the state of the system at the time of interaction. We can develop a complete and comprehensive specification of expected system behaviour by describing the expected outcome for every possible combination of state and interaction.

Specification by Example and Behaviour-Driven Development
The model and the steps are largely based on the concepts of Specification by Example and Behaviour-Driven Development. Specification by Example (SBE) is the practice of specifying expected system behaviour using concrete values instead of natural-language descriptions. For more on Specification by Example,you can’t do better than Gojko Adzic’s book. Behaviour-Driven Development (BDD) uses SBE. One of the reasons I use SBE is that it allows us to work with something tangible, instead of ‘invisible ideas’. Some of the benefits of using BDD and SBE are:

  • Getting feedback on the work from a wider audience earlier in the process.
  • Making edge cases more obvious.

Ordinarily, we would need to write some software to achieve these things. By using BDD and SBE we can get these benefits before writing any software. However it is not always easy to get started with these techniques.

A common challenge teams face when they start using BDD and SBE is the need to make every aspect of expected externally-observable system behaviour completely explicit. That is, all the factors which affect the behaviour of the system must be identified and made explicit. If any of these factors are missing or unknown, we cannot specify expected system behaviour completely and comprehensively – we will have gaps. It is difficult to develop a successful software product if there are gaps or inconsistencies in what we expect the software to do.

Understanding systems
The steps above are designed to help a team understand the system they’re dealing with. The simplest way we can understand the behaviour of a system is as a simple transaction: some entity is stimulated or exercised in a particular way, and the entity does some work. The simplest way of modeling a transaction is by stating that the input to a system determines the output.


In this view, the system output is determined only by the input to the system. I have come to use the terms ‘Interaction’ and ‘Outcome’ instead of ‘input’ and ‘output’ respectively, because they are closer to the way most people think about working with software products: “I interact with a system to achieve some outcome”.


However, it is important to understand that the outcome of an interaction with a system is determined not only by the interaction, but also by the state of the system at the time of the interaction.


The introduction of state into the picture often causes some challenges. The first challenge is differentiating between interaction and state. The easiest way to distinguish between them is by asking the question What determines the outcome of an interaction besides the input?.

The next challenge is understanding that system state is generally not described by a single value. System state is typically made up of multiple dimensions or types, and therefore must be expressed as a set of concrete values, one value per dimension. The same applies to values supplied to the system as part of an interaction.

Once a team begins thinking in terms of states, interactions and outcomes, they’re generally able to have more effective conversations around what behaviour they expect from their system.

software development

Jidoka and Multi-tasking

I’ve recently done a fair amount of research into the application of lean manufacturing techniques to software development. Its mentioned in a lot of places that the Toyota Production System is based on JIT and Jidoka. (Personally I think Kaizen should fit in here as well, as a governing philosophy.)

Essentially, jidoka means:

  • automatically stopping the line when a defect is detected
  • fixing the defect
  • instituting counter-measures to prevent further defects (implies root-cause analysis)

By instituting these counter-measures in the system immediately, you’re building quality into the system.

In my opinion, jidoka resonates with the ‘Boy-Scout Principle’ (leave it better than when you found it) and the Pragmatic Programmers’ ‘Don’t Live with Broken Windows’.

From my interpretation, jidoka means that when you find a defect in your software development process, you stop it there and then, and fix it. Broadly this would include bugs, ‘bad’ or flawed code, broken builds etc. (Please challenge me on these in the comments, I’m not 100% sure if all of these fit in.)

If I extrapolate a bit, this implies that if I’m doing a code review of one of my report’s work and I find some badly written or designed code, I should immediately pull all the developers in my team off of what they’re doing, fix the bad code, and have a session on why the code is bad and how it should be written in future (the counter-measure).

This is where my difficulty begins. It is now relatively well documented that multitasking causes delays in inefficiencies in the process. I know from personal experience that the context switch involved in changing tasks, at any granularity, is expensive and disruptive.

Then, given that interrupting all the members in my team will cause a major context switch, how do I satisfy the demands of jidoka?

If a bug is reported by the QA team or by an end-user, does the developer (or pair) who originally worked on that feature/code stop what he’s doing right now and fix the bug?

Maybe jidoka is less applicable to software development as it is to manufacturing: how much context is involved in the case of a worker attending an assembly line? (I don’t know, I haven’t worked as one before.)

I am led to another (off-topic) question: in the case of a bug report, which causes less of a context switch:

  • the developer moving to work on the bug right away, while the context of his original work on the code is still fresh at the expense of the context of the current task
  • the developer moving to work on the bug only once his current task is complete, thereby retaining the context of the current task, but losing context of the buggy code

How does one achieve a good balance between satisfying jidoka and disrupting the team as little as possible?

When should the knowledge created by the bug fix be disseminated across the team?

Should teams have a scheduled weekly or fortnightly code review/knowledge dissemination session?


Applying TDD in principle if not in practice

My previous post was about describing the essence of TDD as a specification and direction tool, using non-technical examples. A quick recap: the point of TDD is to know where you going, and how to know when you’ve achieved your goal.

Several people have told me that TDD is all well and good, but it doesn’t apply to them, for various reasons. One chap told me he can’t to TDD because he’s a ‘SQL developer’ (his words). A friend told me he works on a vendor system, which is configured with XSLT files, and there is no tooling to support TDD. Its all very well to use whatever flavour of xUnit you want, but what about those who don’t have frameworks? What can the SQL, ASP and XSLT progammers and configurers out there do?

This is why it is important to understand the principle of TDD, and not only its practice (and why I think most demonstrations of TDD that I’ve seen focus too much on the practice).

As long as you have some way to set up a desired outcome, and can compare your actual outcome to your desired, you can perform TDD. You may not get some side benefits like automated regression testing and test coverage, but as mentioned in my previous post, those aren’t primary benefits. You also might not get the benefits of a decoupled design, since this may not make sense in your application.

You can still convey your intent. You will still have direction and focus. You will still be able to answer the question ‘am I done?’

A nice example of not getting the primary benefit of a decoupled design is the case of TDD in the SQL context. The team in which I’m currently working is still strongly focused on stored procedure and dataset-based programming. Therefore I’ve done a fair amount of thinking and playing in the TDD for SQL space. I’ve come up with a framework based on nUnit and TSQLUnit, but that’s another point.

I’ve used this framework to do TDD for stored procedures. One major shortcoming of T-SQL as a language (I feel dirty now) is that there’s no concept of abstraction. There is no way to break dependencies. If stored proceudre 1 (SP1) depends on user-defined function a (UDFa), there is no way I can test SP1 in isolation to UDFa. I therefore don’t get the benefit of a decoupled design, but I can still express the intent of my stored procedure, and I can prove when I’ve satisfied the requirment.

This is a clear case of the tooling failing the technique, but the technique can still be applied in principle. The same should be applicable in many other instances where there either isn’t tooling for the product (e.g. a vendor’s product which is configured).

Again, as long as you can create an expectation, and compare your actual outcome to that expectation, you can apply TDD in principle. So if you’ve previously had a look at TDD and wanted to apply it, but felt a lack of tooling for your particular case, think about how you can apply it in principle.


Test-Driven Development – Part 1

I’ve found Test-Driven Development (TDD) to be pretty much revolutionary in my own software development efforts. Yet I find it is often very poorly explained, and explanations focus too much on the technical aspects of the technique, instead of the ‘philosophy’ or principle behind it. I place a lot more value on principles and philosophies over practices, since often practices can’t be implemented exactly as described, whereas there can still be a takeaway from the philosophy.

In my opinion, TDD is unfortunately named, since it confuses people. The intent of TDD really has nothing to do with testing. Testing is merely the mechanism through which one gains the primary benefits.

My primary TDD mantra is ‘TDD’s not about testing, its about specification’. In my interpretation, TDD has 2 primary benefits:

  1. It forces you to specify the problem you’re trying to solve very narrowly, and very explicitly.
  2. It promotes a loosely-coupled, modular design, since testable designs are generally loosely-coupled by nature.

Most people will state that the aim of TDD is a large proportion of test coverage and automated regression tests. This is not true. Both of these goals are readily achievable without doing test-first TDD; i.e. they can be easily achieved by writing tests after writing code. Test coverage and automated regression tests are very nice by-products of TDD (when its done correctly), but to my mind they are certainly not the primary benefit nor aim of TDD.

I’d like to explain the first of these primary benefits of  TDD with a non-software related, everyday (somewhat contrived) example. (I’ll need to think of another contrived example to explain the second.) The reason for a non-software related example is I want you to be able to explain the concept to a non-technical person, like your mother or your CEO (or maybe even your team lead 😛 ).

Imagine that I get a crazy idea in my head that I want to become more ‘eco friendly’ in my day-to-day life, and contribute less to global warming. My statement of intent, or vision may be something like ‘I want to make a personal contribution to the lessening of global warming.’ This statement makes one warm and fuzzy, but on its own, its quite difficult to implement. I call this my strategy.

Good management textbooks and courses will tell you one has to implement strategy through strategic objectives. So in order to implement my strategy of being more eco friendly, I come up with the ‘strategic objective’ of reducing my daily carbon emissions.

After some thought about this objective being a little ‘hand-wavy’ and hard to measure, I can refine it by phrasing it as ‘I will decrease my daily commute petrol usage by 10% by the end of this month’. The objective now is measurable, and has a concrete time frame. I have a concrete goal towards which I can work, and measure daily progress (my car has a trip computer which indicates current and average fuel consumption). I can easily see whether I have met (or exceeded) this goal next time I fill up my car.

I have progressed from the warm-and-fuzzy but somewhat hand-wavy statement of ‘I want to do my bit to decrease global warming’ to something tangible to which I can work towards. I know when I have achieved my goal, without a shadow of a doubt, and I can prove it. I can now brag to my friends about how wonderful I am, and maybe I can even get a tax break or something.

Now that the example is concluded, I’d like to point out some things about strategic objectives that I learned at university. Strategic objectives (SOs) must be phrased so that they are measurable, and have a time frame. It says only what must be achieved, but not how. The fact that its measurable means that you know when its done.  You don’t know how to do it (that comes later, at the tactical and operations levels), but you know when its achieved.

If an SO is not measurable, it is considered bad and must be rephrased and rethough. This is akin to designing the SO and implicitly the fulfilment of the SO.

So, an SO is a statement of intent, its a description of a requirement. In x months’ time, we must have sold y units. The fact that it is measurable means  you know when you’re done, but you can also manage it, analyse it, report on it etc. If its not measurable, rephrase it.

If an SO is not a strong enough statement of intent,will those members at the tactical and operations levels  – who don’t operate at all at the strategic level – be able to interpret it and implement it?

If an SO is not measurable, how can you prove to your board that the strategy has been fulfilled?

The testing aspect of this example is simply asking oneself: ‘its now x months down the line, have I sold y units? Its a yes or no answer. Its comparing the actual outcome to the desired outcome.

This example is not test oriented at all. We don’t create strategic objectives in order to run a somewhat trivial test at the end of the prescribed period. The SO exists as a direction, as a statement of intent, as a specification of the where the business needs to be at some point. Sure the measurment aspect is a big part of the SO, but we don’t write SOs for the sake of measuring them.

TDD is no different. The point of TDD is not to write tests. TDD uses the test as a mechanism for feedback. Are we done yet? You can know if you’re done only you know where you’re supposed to be. I don’t know if I’ve made a meaningful contribution to decreasing my emissions until I’m at the petrol pump. A company doesn’t know if its fulfiled its strategy until it can prove it sold x units by a certain date.

TDD forces you to make that end-goal explicit, and to start with that end-goal. Businesses don’t write down strategic sales targets after they’ve made the sales. They start with the target, then put measures in place to achieve that target. TDD in software development is the same.

Start with your end-goal, stated in the form of tests, usually unit tests. You then know where you going. You can then start to decide how you’re going to get there.

To use another example, developing without TDD is like driving around with a sat-nav device, ending up somewhere, and post-haste setting that point as your destination.

Unfortunately, the examples I’ve discussed here don’t really explain the second primary goal of TDD, a loosely-coupled design. I will try think of an example for this benefit and blog it.

Some more notes:

1: As mentioned, businesses start with strategy and strategic objectives, and implement through tactical and operational implementations. They don’t run for a couple of months then retrofit their strategy to where they land up. One can state that ‘its more important to know where you’re going, than how to get there’.

One might also be able to state that ‘its more important to know where you’re going, than actually getting there’. This statement arises from the following: according to agile principles, one delivers the highest value tasks first. Since TDD tests are written before the code to satisfy them, one might argue that the test is more important than the code written to pass the test.

This can be reinforced by considering software product maintenance. The majority of these efforts is spent in understanding the code written by other developers. TDD tests can be a shortcut to understanding the intent of the code.

Also, as businesses changed, an application’s code is likely to change too (as should the TDD tests). Essentially, the intent of the code (conveyed by the test) is likely to outlive the code itself.

2: Behaviour-Driven Design (BDD) is an evolution of TDD, which basically has the notion of ‘executable specifications’ as its holy grail. Instead of developers having to interpret the end-goal presented to them in some type of specification and translating that into code, the customer of the software (business) should be able to codify their specifications in a language as close to their own natural language which can be executed as code.

3: Good TDD tests should be statements of intent. This is often what is meant as ‘tests documenting the code’.