Estimating Chaos: Complexity Levels

One of the most important questions in software development (hopefully) right after “what is it that we are building?” is the question, “when is it going to be done?”
Unfortunately, “when is it going to be done?” isn’t a particularly easy question to answer for any reasonably complex or interesting software project. What’s worse is that software development itself is the most unpredictable activity on a software project. Compounding the problem further, while one might expect software projects to both overestimate and underestimate, in fact they experience a chronic underestimation problem. This underestimation problem is frequently revealed as projects are decomposed into smaller and smaller pieces. Every time a task is decomposed and better understood, it tends to grow, but rarely does it shrink.
It is important to note that no single estimation technique is appropriate for every project, or even every stage of a project. What is true of any effective estimation technique, though, is that at its core it is a function that takes a set of inputs and produces an immutable output. In other words, the only way to change the estimate is to change the inputs. You can’t argue with the estimate.

Time based estimation

Time based estimation is one of the more common traditional approaches. The projects in which time based estimation is most helpful are those with very little research content (mostly development on well understood, existing systems) and systems where there are few differentiating capabilities. For example, on a maintenance project on a legacy system in which tasks are well understood “cookbook” style upgrades, relying on historical records of how long a similar task took isn’t a terrible approach. Generally speaking though, time based estimation has been one of the more problematic approaches in many more active software projects, especially greenfield and commercial development.
Some of the problems with time based estimation:

It gives the impression of precision where there is none. By the time you add the hours up, you have a number that looks like 300.25 days, but in actual fact you are probably looking at 160-600 days of work, and closer to 600.
It’s personal. Time based estimates devolve quickly into who is doing the work and how long it should take one person over another. It puts the individual at the forefront rather than the team. Approaches like “Ideal hours” and “Ideal days” attempt to remedy this, but further confuse the issue by having hours that aren’t actually hours.
It is more subject to pressure. Instead of determining the relative difficulty of a task, the development team and management argue about whether something should *really* take 8 hours or not, or whether it would take Joe 6 hours and Bob 8 hours. With relative estimation, the focus is put simply on whether a particular task is more or less difficult than another.
Time requires constant revision with “fudge factors” to accommodate meetings, lunch, distractions, discussions, communication… All the things a healthy team should be doing become “overhead penalties.” The project’s tasks contain time data that doesn’t correspond to the actual project duration.
Time assumes all resources are equal and all tasks are equal over time. Losing certain people will have a more dramatic effect on schedule than others, and adding people may actually cost you. The costs of disruptions all have to be adjusted manually.

So if time based estimation has all these problems, what should we use? Over the past decade, point based estimation has been used to significantly improve predictability in software projects. Many agile development processes have adopted complexity points as the overall recommended estimation technique across the project. Point based estimation especially helps projects in which the requirements are not well understood or are expected to change over the course of the project, and projects in which the desired capabilities are unique and have little or no historical data supporting them. Well disciplined point based estimation also accounts for and reports the effects of gaining and losing resources, training, time off, and even employee morale.

Complexity Points

Points are an arbitrary scale of numbers which typically follow an ever increasing sequence. One of the most important features of points is that they decouple estimation from time, so the numbers should not represent hours or days. What is important is to establish a relative scale of increasing effort and difficulty that the team all understands and agrees upon. (A sign that the team has internalized the scale is that they all come up with the same complexity point number for a given task in planning poker.) While time based estimates are very subject to interpretation and give the illusion of precision, point estimates are specifically and intentionally abstract, establishing the relative difficulty of tasks within the context of the project and the people on the project. Because they calibrate to the team and project, they tend to become very accurate very quickly when used correctly. Typically the calibration only takes two iterations.
A healthy software project is a cooperative game that relies heavily on teamwork to be fast, efficient, and successful, so the estimation approach and tasking should serve that purpose and be based around teams rather than collections of people. The Team is the abstraction that produces estimates and is assigned work. The members of the team are an implementation detail. Points excel at predicting team performance over individual performance and support a more advanced model of collective code ownership among the team, eliminating bottlenecks and knowledge silos.

Complexity Levels

What I describe here is an augmented form of point based estimation designed to improve the precision of pointing, allow for faster and more effective resource transition onto the team, and improve the accuracy of the overall project estimate. It should be compatible with any agile process that supports points, and because Scrum doesn’t prescribe any particular estimation technique (other than that it not be time based), it works fine with Scrum as well.
Major goals:

Answer the question: When will it be done?
Increase the accuracy of the answer to the question.
Increase the precision of the answer to the question.
Monitor project health.
Show schedule slippage as soon as possible.
Estimate, measure, and task against a team, not a person.
Understand and account for all the effort expended indirectly on a project. Whether it’s meetings, lunch, holidays, overtime, training, losing or gaining a team member, the estimation approach should capture this and show the effect on the schedule.

So far, points mostly address the goals, but they’ve also got some drawbacks that can sometimes result in losing these advantages. One problem with points is that they often become a proxy for time. As people estimate with numbers, especially if they’ve done time based estimation in the past, they tend to start associating the points with hours or days. 2 points becomes 2 hours, and so on. It would be better to have a scale that isn’t as easy to confuse with time. To overcome this, we’ll borrow a page from fuzzy logic estimation by assigning names to the levels of difficulty, even though we’ll still translate to points “under the hood.”

Designing the complexity scale

What should we be estimating? We should be estimating stories or features. What we are tasking are capabilities or features of the system. There are many activities that support the creation of the product (design, testing, meetings, support, teamwork), and it is these very activities that will be captured in the velocity of the project. In typical agile form, tasks should result in completed, tested, demonstrable functionality with all code and tests written and accounted for. The idea here is that the code should be in production quality shape at the end of each iteration, and we are not separately estimating the overhead required to make these stories or features reality. Tasks that are not directly tied to a shippable feature, such as “support Developer in this task” or “have a meeting with Paul”, should not be included or estimated as tasks. Why not? Because these tasks do not result in executable code, they can skew the results dramatically while not informing what the actual execution of a feature will require. They are the things we do as part of accomplishing the task, and so that is how they are factored into the estimate. In the end, we want to know how long it will be before a feature is working, not how long it took to have a meeting to talk about the feature. All of these things still have an effect on the schedule, but they have that effect behind the abstraction of velocity.
Because we humans have a tendency to underestimate large, complex systems, the scale should be designed to prevent that. On one end, the scale should reflect the simplest features of the system, the kind of features that would take less than a day. On the other end of the scale should be something that will be difficult, but can reasonably be expected to be finished within the iteration. If tasks that take 5 minutes (rename a file) are included with tasks that produce features or functional code, it can create an overestimation problem (aka sandbagging) by inflating lots of small cleanup to the level of a feature.
Five complexity levels is a good starting point, particularly for a team not accustomed to this estimation approach. Before you become an expert, five levels is about the limit of discrimination that a human can effectively perform. As you become better and more accurate, you can introduce finer grained levels and other terms, but I recommend starting simpler and increasing complexity only as it proves itself to be necessary, otherwise you’ll end up spending more time trying to understand the minor differences between 10 or 15 levels than estimating.
In a pure fuzzy logic based approach, we could pick any set of terms we liked, so long as the team all understands the relative size and approach. This could be T-Shirt sizes, dog breeds, drink sizes (short, tall, grande, venti, big gulp), or whatever else the team bonds with. So long as the team understands the relative and increasing level of difficulty, you can go with any terminology.
For the sake of example, this is a list of terms I have used in the past (inspiration obvious…):

Trivial
Simple
I can do it
Ouch
Hurt me, plenty!

Even though we can technically use any terminology, we can increase the effectiveness of our estimation by giving the terms objective characteristics. This serves three purposes: it gives context and objective criteria to new team members who haven’t been “in the know” about what a particular dog breed represents to the team, it clearly identifies the skill level needed for a task, and it helps defeat pressure-based influence on the estimates.

Trivial – A trivial task is one in which the approach and design are well understood, no research is expected to be needed, and the effort is not large.
Simple – This requires some thought, perhaps evaluation of a few different approaches, but it’s mostly just coding without having to break new ground.
I can do it – This is the first feature that could be considered challenging, but the expectation is that it won’t be too difficult, and you’re still excited. It will require some research, perhaps prototyping, and design work. This is a fun problem to solve.
Ouch – These tasks need significant research, design, prototyping, and are known to have snags and gotchas. There are definite unknowns, and you wouldn’t want to assign more than a couple of these to any one person.
Hurt me, plenty – The most complex task that anyone believes could reasonably be accomplished in an iteration. There are enough unknowns that these seem risky, but not so risky as to need to be broken down further. As a calibration, the most senior developers on the team should be able to accomplish tasks of this complexity, though not without focus and effort.

To achieve the level of accuracy and precision we want, it is critical that the scale not include anything we don’t expect could be accomplished in an iteration. If you are scheduling “Hurt me, plenty” stories and they aren’t getting done, your bracketing difficulty level will be lost and “Ouch” will become the new upper bound, reducing the effectiveness of your estimates. If a complexity level ever exceeds the team’s ability to accomplish it, then our ability to predict the end date will suffer, or become entirely meaningless, as large unknowns with unreliable numbers are included. Once we don’t know how large a task really is, it could be twice as hard as an “Ouch”, or it could be fifteen times as hard. Typically, tasks should not be carried over into the next iteration. If they are, the team is either signing up for too many points, or the scale is flawed. With discipline, our velocity accuracy and precision should increase.
Velocity is a common metric in iterative agile development. Velocity is a remarkably simple, yet remarkably accurate measure of how fast a team can produce product capabilities and features. To calculate velocity, you just add up the number of points the team accomplished in an iteration, and compare that with how many points you have to go. If you complete 10 points in an iteration, an iteration is 2 weeks, and you have 90 points to go, you can expect to be done in 9 more iterations (18 weeks). Also, velocity measures the gestalt performance of a team, not individuals. Adding a member to the team could reduce the velocity (as readers of Mythical Man Month might guess), and removing a (problematic) team member could increase the velocity. Many non-intuitive discoveries about productivity have been made by companies calculating accurate velocity. This is something velocity and points can identify that time based approaches are simply incapable of doing.
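As a minimal sketch of the velocity arithmetic described above (the function names are illustrative, not part of any particular agile tool), the forecast is just the remaining points divided by the points completed per iteration, rounded up to whole iterations:

```python
import math


def iterations_remaining(points_per_iteration, points_to_go):
    """Whole iterations needed to finish the remaining points at this velocity."""
    if points_per_iteration <= 0:
        raise ValueError("velocity must be positive to forecast an end date")
    # A partially used iteration still has to be run, so round up.
    return math.ceil(points_to_go / points_per_iteration)


def weeks_remaining(points_per_iteration, points_to_go, weeks_per_iteration):
    """Calendar weeks left, given a fixed iteration length."""
    return iterations_remaining(points_per_iteration, points_to_go) * weeks_per_iteration


# The example from the text: 10 points per 2-week iteration, 90 points to go.
print(iterations_remaining(10, 90))  # 9
print(weeks_remaining(10, 90, 2))    # 18
```

Note that nothing in the calculation refers to individual team members; the only inputs are what the team as a whole finished and what is left.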
Gestalt is a German word that means form or shape. In English usage, it refers to a conceptual “wholeness.” It is often said that the gestalt of something is “different from and greater than the sum of its parts.” When a team becomes a high performance team, a new entity emerges that performs orders of magnitude better than the individuals on that team could ever perform independently. Particularly among experienced developers, high performance teams tend to create themselves when permitted. Most simply form in the absence of management interference, so gestalt teams are not actually difficult to create.
Now, so far, we haven’t assigned points to our complexity levels. As with complexity points, the points should be an ever increasing scale, and reflect the magnitude of difference between the problems. For example, if we use powers of two, will a “Hurt me, plenty” take 16 times as long as a “Trivial”, and could I reasonably expect to do 16 “Trivial” tasks or 1 “Hurt me, plenty” task in an iteration? This is also why decoupling the complexity level from the numeric values will allow us to tweak the scale as we start measuring our performance. It may be that the Fibonacci series more accurately reflects the actual results, or it could be that we determine the best fit mathematically from historical project performance data.
Think of the iteration as a truck, and the complexity points as boxes of various sizes. When you plan the iteration, you look at the moving average of the previous iterations, and attempt to fill the truck to capacity for the next iteration. So, if the team has been accomplishing 20 points of complexity per iteration, they should be able to do 20 1-point tasks, or 1 16-point task and 1 4-point task, and so on. The question is, does this prove to be true? What if the team can only do 10 1-point tasks, but can do 2 8-point tasks? In that case, it may be that the 1-point tasks are best described as 2 points, while the 8-point tasks are fairly accurate, and the Fibonacci sequence beginning at 2 is a better fit for where estimates are landing.
Powers of 2: 1, 2, 4, 8, 16
Fibonacci: 1, 2, 3, 5, 8
Fibonacci starting at 2: 2, 3, 5, 8, 13
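The truck analogy and the three candidate scales can be sketched in a few lines. This is only an illustration under stated assumptions: the level names are the example terms used earlier, the three-iteration moving average window is an arbitrary choice, and the velocity history is made up.

```python
# Level names are kept separate from the numbers so the scale can be swapped later.
LEVELS = ["Trivial", "Simple", "I can do it", "Ouch", "Hurt me, plenty"]

SCALES = {
    "powers_of_2": [1, 2, 4, 8, 16],
    "fibonacci": [1, 2, 3, 5, 8],
    "fibonacci_from_2": [2, 3, 5, 8, 13],
}


def points_for(level, scale_name):
    """Translate a named complexity level to points 'under the hood'."""
    return SCALES[scale_name][LEVELS.index(level)]


def capacity(recent_velocities, window=3):
    """Truck size: moving average of the last `window` iteration velocities."""
    recent = recent_velocities[-window:]
    return sum(recent) / len(recent)


def load(planned_levels, scale_name):
    """Total points of the boxes we want to put on the truck."""
    return sum(points_for(level, scale_name) for level in planned_levels)


def fits(planned_levels, scale_name, recent_velocities, window=3):
    """True if the planned work does not exceed the team's recent capacity."""
    return load(planned_levels, scale_name) <= capacity(recent_velocities, window)


velocities = [18, 20, 22]                  # hypothetical past iterations
plan = ["Hurt me, plenty", "I can do it"]  # 16 + 4 on the powers-of-2 scale
print(capacity(velocities))                      # 20.0
print(load(plan, "powers_of_2"))                 # 20
print(fits(plan, "powers_of_2", velocities))     # True
```

Because the plan is expressed in level names, switching the scale changes only the lookup table, not the estimates the team made.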
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
Think about the “midpoint” of any software project you’ve been on. If you said at that time that you thought it could be completed in a week, most everyone would think you’re mad. If you said a month, you’d get an incredulous reaction. By the time you start talking in a three to six month timeframe though, nobody can really conceive of the problem at that level of detail, and it seems like it might even be plausible. This is at the heart of the problem with underestimation in software. Once we are far enough from the details of the problem, or there are too many details to consider, everything just starts looking a little simpler…
Our scale is forcing us to get the problem broken down into chunks small enough that they can be envisioned within a few weeks (an iteration). Until they are broken down, the law of large numbers does not apply to our estimates because they are all skewed by a lack of knowledge and understanding. This is why it is important to exclude the huge unknowns from the estimation process entirely and bound our scale with what we can actually accomplish. If the problem can be broken down enough that we’ve avoided the underestimation problem, we should be able to rely on the law of large numbers to move towards an average over time. What do you do with the huge unknowns? As long as you still have them, you are still in the envisioning or speculation phase of the project, too far in the front end of the cone of uncertainty, and not ready to provide a responsible estimate for the project (at least not with any level of precision). Once you have broken down your epic stories and tasks to points that fit in an iteration, you can begin to have confidence in a fairly precise end date.
If you haven’t used points before, this is why the scale matters and how it works. The points are tied to fixed length iterations, which form the unit by which the time estimate is calculated. At the beginning of the project, choose an iteration length and then stick with it. If a month is too long (and it usually is), try going to 3 weeks. If your project supports it, you can even try two weeks (more difficult if you are creating a lot of differentiable features, easier on a project with a lot of well known solutions). Once you find a comfortable length, don’t change it. Sometimes people may want to change it because of holidays, or because there’s a series of meetings. Do not do it. These are the very events that provide invaluable information. For example, you might find that your velocity goes up over the holidays and find out in the retrospective that developers had a lot more time to focus because of fewer meetings. I’ve met many people who get a tremendous amount done on holidays due to the reduced noise. What’s more, you now know the effect of a major event on the team and project.
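A small simulation can illustrate why the distinction between unbiased noise and systematic skew matters here. It is a sketch under an assumed noise model that is not from the original: each task’s true cost is its estimate times a random factor between 0.5 and 1.5, optionally multiplied by a `bias` factor that stands in for chronic underestimation.

```python
import random


def total_error_ratio(num_tasks, rng, bias=1.0):
    """Ratio of total actual effort to total estimated effort over num_tasks tasks."""
    scale = [1, 2, 4, 8, 16]
    estimated = 0.0
    actual = 0.0
    for _ in range(num_tasks):
        points = rng.choice(scale)
        # Assumed model: random error centered on 1.0, scaled by a systematic bias.
        actual += points * bias * rng.uniform(0.5, 1.5)
        estimated += points
    return actual / estimated


rng = random.Random(42)
for n in (5, 50, 5000):
    # Unbiased estimates: the ratio drifts toward 1.0 as the number of tasks grows.
    print(n, round(total_error_ratio(n, rng), 3))

# Skewed estimates: more tasks only make the ratio settle on the bias, not on 1.0.
print(round(total_error_ratio(5000, rng, bias=2.0), 3))
```

With many small, unbiased estimates the individual errors largely cancel. With skewed estimates they do not, which is the case the scale is designed to keep out of the plan.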
By focusing on difficulty terms rather than points, it is possible to analyze the best fitting values for these difficulty levels over time to improve the accuracy and precision of the overall release date. Since the terms are not arbitrary, but reflect specific characteristics that make the tasks more difficult, they are less subjective than arbitrary terms, and new terms can be added over time. For example, you might choose to add “Irritating” and “Tedious” with values of 4 and 8 to describe work that isn’t particularly difficult, but is time consuming.
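One possible way to analyze the best fitting values is sketched below. The heuristic is an assumption of this example, not a method prescribed above: if a stable team delivers roughly constant effort per iteration, the scale that fits best is the one under which velocity varies least from iteration to iteration. The history data is invented for illustration.

```python
from statistics import mean, pstdev

LEVELS = ["Trivial", "Simple", "I can do it", "Ouch", "Hurt me, plenty"]

CANDIDATE_SCALES = {
    "powers_of_2": [1, 2, 4, 8, 16],
    "fibonacci": [1, 2, 3, 5, 8],
    "fibonacci_from_2": [2, 3, 5, 8, 13],
}


def iteration_points(completed, scale):
    """`completed` maps a level name to the number of tasks finished in one iteration."""
    return sum(scale[LEVELS.index(level)] * count for level, count in completed.items())


def velocity_spread(history, scale):
    """Coefficient of variation of velocity across iterations (lower is steadier).

    Assumes at least one iteration with completed work, so the mean is non-zero.
    """
    totals = [iteration_points(completed, scale) for completed in history]
    return pstdev(totals) / mean(totals)


def best_fit_scale(history, candidates):
    """Name of the candidate scale under which velocity is most consistent."""
    return min(candidates, key=lambda name: velocity_spread(history, candidates[name]))


# Hypothetical record of what the team actually finished in three iterations.
history = [
    {"Trivial": 10},
    {"I can do it": 4},
    {"Hurt me, plenty": 1, "Simple": 1, "Trivial": 2},
]
print(best_fit_scale(history, CANDIDATE_SCALES))  # fibonacci_from_2
```

This only works because the history is recorded in level names. Had the team recorded raw points, there would be nothing left to re-fit.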
Something else we’ve improved upon is the ability to manage by skill level. Giving a junior developer a “Hurt me plenty” task is sure to have a detrimental effect on both velocity and morale because either the junior developer is going to need considerable attention from the more experienced developers who then cannot concentrate on complex tasks, or the new developer will fail to complete the work and have reduced morale. People are not interchangeable parts, and this estimation technique highlights difficulty in an unambiguous way.
This isn’t far from the point based estimation used and proven time and again on agile projects over the past decade. It’s a minor tweak that provides three major benefits. First, by decoupling the estimation process from points, we make the estimates in terms that are less abstract to us humans. Second, we are able to refine the points easily during the project to better fit reality as it emerges. Third, because we are describing the tasks in terms of the nature of the difficulty of the assignment, we can use this to improve efficiency on teams with very mixed skill levels and understand the effect on the schedule.
References:

Steve McConnell, Software Estimation: Demystifying the Black Art
Jim Highsmith, Agile Project Management
Code Ownership
Law of Large Numbers
Teamicide