# Why Do We Use Fibonacci Numbers to Estimate User Stories?

Frequently there are great debates about the use of the Fibonacci sequence for estimating user stories. Estimation is at best a flawed tool but one that is necessary for planning work.

User story estimation is based on Department of Defense-funded research at the RAND Corporation starting in 1948 that developed the Delphi technique. The technique was classified until the 1960s (there are dozens of papers on the topic at rand.org). Basically, the RAND researchers wanted to avoid the pressure toward group conformity that typically led to bad estimates, so they determined that estimates had to be made in secret. Initially the estimates would be far apart because people had different perceptions of the problem, so after estimating in secret the group would discuss the highs and lows, then estimate in secret again. The original papers at rand.org demonstrate that the estimates converge.

RAND researchers then studied the effect of the set of numbers estimators can choose from and found that a linear sequence gave worse estimates than an exponentially increasing set of numbers. There are some recent mathematical arguments for this for those interested. The question then, if you want the statistically provable best estimate, is which exponentially increasing series to use. The Fibonacci is almost, but not quite exponential, and has the advantage that it is the growth pattern seen in many organic systems (see "Why does the Fibonacci sequence repeat in nature?"). So people are very familiar with it and use it constantly, for example in choosing sizes of clothes: tee shirt sizes follow a Fibonacci-like progression. Since some developers are averse to numbers (a really strange phenomenon for those working with computers), they can use tee shirt sizes, and their estimates are easily translated to numbers.
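To make the tee-shirt translation concrete, here is a minimal Python sketch; the exact size-to-points mapping is an illustrative team convention, not a standard.

```python
# Illustrative mapping from tee shirt sizes to Fibonacci story points.
# The correspondence is a team convention; adjust to your own scale.
TSHIRT_TO_POINTS = {"XS": 1, "S": 2, "M": 3, "L": 5, "XL": 8, "XXL": 13}

def to_points(size: str) -> int:
    """Translate a tee shirt size estimate into a Fibonacci point value."""
    return TSHIRT_TO_POINTS[size.upper()]
```

Number-averse estimators can then think purely in sizes while the team's tooling works in points.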

Microsoft repeated this research in recent years in an award-winning IEEE paper and, as a result, has abandoned hourly estimation on projects. See Laurie Williams, Gabe Brown, Adam Meltzer, and Nachiappan Nagappan (2012), *Scrum + Engineering Practices: Experiences of Three Microsoft Teams*, IEEE Best Industry Paper Award, 2011 International Symposium on Empirical Software Engineering and Measurement.

So the Agile community has converged on the Fibonacci as the sequence to use. Unfortunately, many agile teams do not use it properly: they try to get everyone to agree on one Fibonacci number, which gives mathematically and experientially provable bad estimates through forced group conformity. This is the very thing the RAND researchers invented the Delphi technique to avoid.

Over and over again, researchers have shown that hourly estimates have very high error rates. This is true even when the estimator is an expert. It's the tool that's the problem. If you want to practice based on evidence, relative size estimates simply deliver a much more accurate estimate.

Hi Jeff,

Another great article. After reading it and watching that YouTube video, I think nature has always given us the answer as to why estimating in hours is wrong and against human thinking.

Thanks,

Andrew

Hi Andrew,

Which video is that? Can you provide the link?

Mohan

Thanks for sharing Jeff.

“The Fibonacci is almost, but not quite exponential.”

This statement is not correct. Fibonacci is exponential.

More here: https://en.wikipedia.org/wiki/Random_Fibonacci_sequence

Great link. As it states, “Johannes Kepler discovered that as n increases, the ratio of the successive terms of the Fibonacci sequence F(n) approaches the golden ratio φ = (1 + √5)/2, which is approximately 1.61803. In 1765, Leonhard Euler published an explicit formula, known today as the Binet formula.”

However, in a Scrum sense we are focused on small values of n, e.g. 0, 1, 1, 2, 3, 5, 8. The ratio between successive terms is not close to constant for small values of n.
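The point about small n is easy to check numerically; a short Python sketch:

```python
# Ratios of successive Fibonacci terms: far from the golden ratio
# (~1.618) at small n, which is the range Scrum teams actually use.
def fib(n: int) -> int:
    """Return the n-th Fibonacci number (fib(1) == fib(2) == 1)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Successive ratios F(n+1)/F(n) for n = 1..7:
# 1.0, 2.0, 1.5, 1.666..., 1.6, 1.625, 1.615...
ratios = [fib(n + 1) / fib(n) for n in range(1, 8)]
```

The ratios oscillate noticeably before settling toward 1.618, which is exactly the "almost, but not quite exponential" behavior in the estimation range.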

Agreeing on one number is wrong? What is the right way?

The best choice is to average the estimates (rounding to the nearest whole number). The mean is the maximum likelihood estimator for normally distributed errors and will generally be more accurate than the mode or median.

It also is very quick to average everyone’s estimates whereas forcing everyone to agree on a specific number slows down the entire estimation process.
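A minimal sketch of this averaging step in Python (the function name is illustrative):

```python
def combine_estimates(estimates: list[int]) -> int:
    """Average the team's secret estimates and round to the nearest
    whole number. Note: Python's round() uses round-half-to-even,
    so e.g. 4.5 rounds to 4 and 5.5 rounds to 6."""
    return round(sum(estimates) / len(estimates))
```

For example, estimates of 3, 3, 3, 5, 5 average to 3.8, which rounds to 4; no forced consensus discussion is needed.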

I’ve done some reading on this; in order to sample a normal distribution successfully, the sample size needs to be at least 30. I haven’t looked at whether your statement applies with the same validity to Student’s t-distributions. Have you seen any work on this? I’d be curious to read it.

Hi Anthony,

There isn’t exactly a minimum sample size; what is true is that the bigger your sample, the more accurate the data. The “30” rule of thumb is generally based on trying to make statistical claims at a specific level of certainty or p-value. The academic standard for a p-value is 0.05, which in practice often aligns with about 30 data points, depending on the null hypothesis, the signal-to-noise ratio in the data, etc. (https://www.researchgate.net/post/What_is_the_rationale_behind_the_magic_number_30_in_statistics).

However, regardless of the number of datapoints, the mean minimizes the MSE (mean squared error) assuming that estimation errors are normally distributed (https://math.stackexchange.com/questions/967138/formal-proof-that-mean-minimize-squared-error-function). This is true with 2 data points or 2 million.

In reality, the estimation errors when using Fibonacci estimation are probably not quite normally distributed. However, practically speaking, the mean is very quick to calculate and is a concept that everyone will easily understand, so I view it as having the right balance of statistical integrity and practical application to make it the strongest choice for most teams. In theory, if you knew the exact team-specific estimation error distribution, you could increase accuracy a little by adopting that for an individual team.
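The claim that the mean minimizes squared error can be checked numerically; a small sketch using an illustrative set of estimates:

```python
# For any sample, the mean minimizes the sum of squared deviations.
# Check numerically against nearby candidates, the mode, and the median.
data = [3, 3, 5, 5, 8]
mean = sum(data) / len(data)  # 4.8

def sse(center: float, xs: list[int]) -> float:
    """Sum of squared deviations of xs from center."""
    return sum((x - center) ** 2 for x in xs)

candidates = [mean - 0.5, mean, mean + 0.5, 3, 5]  # 3 = mode, 5 = median
best = min(candidates, key=lambda c: sse(c, data))
```

Among the candidates, the mean has the smallest squared error, which is what "minimizes MSE" means in practice, whatever the sample size.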

I was taught to average the numbers only if the team was within one “level” of each other on the Fibonacci sequence. For example: a team of five members estimates a backlog item, three team members estimate “3,” and two team members estimate “5.” You add it all up and average: 19/5 = 3.8, rounded up to 4.

Second example: a team of five, where three team members estimate “3,” one estimates “5,” and one estimates “8.” The people who estimated “3” and “8” have a discussion. Then another vote takes place, and if they are all within one level, add the total and average.

Is that how you do it, or do you just take the average of the entire vote regardless of how far apart the differences are in the sequence?

Hi Dean,

I asked Jeff if he was aware of a study specifically on the “one level apart” system that you described vs. a simple average, and he said that as far as he knows there isn’t an existing study; we’d have to do a research experiment to test it scientifically.

In our view, a “one level apart” rule is overkill and significantly slower. A reasonable structure is this: when estimates are more than 2 levels apart, have the people with the lowest and highest estimates explain, in 30 seconds or less, why they estimated what they did, then have everyone re-estimate. Being 3 levels apart can be an indicator that one person misunderstood the user story. I’ve done this on some of my teams. Among an experienced and stable team, estimates that are 3 levels apart are quite rare, so this system can be implemented with minimal time cost.
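The "more than 2 levels apart" trigger described above can be sketched in a few lines of Python; the scale, threshold default, and function name are illustrative assumptions:

```python
# Sketch of the "discuss and re-estimate when more than 2 levels apart"
# rule. Estimates must be values from the team's Fibonacci scale.
FIB_SCALE = [1, 2, 3, 5, 8, 13, 21]

def needs_discussion(estimates: list[int], max_gap: int = 2) -> bool:
    """True when the lowest and highest estimates are more than
    max_gap positions apart on the Fibonacci scale."""
    levels = [FIB_SCALE.index(e) for e in estimates]
    return max(levels) - min(levels) > max_gap
```

For example, estimates of 3, 3, 3, 13 span three levels and trigger a discussion, while 3, 5, 8 span only two and would simply be averaged.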

Without a study it is hard to be certain so take the above as you will.

We definitely have a 3-level limit with our voting, and you’re right, that’s very rare. But it gives each person voting a voice to explain why their estimate is high or low, and gives the team a chance to discuss it. I think the biggest challenge for my team (especially people new to Scrum) is how to think in Fibonacci numbers rather than thinking about story points in increments of time. So that team discussion, when numbers are so varied, is actually a great learning experience for everyone.

Seems to be a lot of discussion about something that matters little.

Estimates are often wrong, so what.

If our pointing suggests we average 40 points a sprint as opposed to 44, so what.

Estimates are qualitative whereas actuals are quantitative. So you’re comparing apples and oranges anyway.

Let’s move on to something more important like delivering value.