At Reflection, we want to be as transparent as possible about what we do and the data we provide. However, as we already discussed here, providing a meaningful accuracy measure for our estimates is a tricky business. We don’t believe a simple averaged error value would be reliable: it would be an oversimplification of a complex problem, and more misleading than anything else. So we developed a “Confidence Level” system that is (hopefully) simple, intuitive and meaningful.
In a nutshell, it’s a simple rating system with five levels, C, B, A, A+ and A++, where A++ marks the estimates whose accuracy we are most confident in. It is meant to guide our users on how much trust they should put in a specific estimate, since accuracy varies so much.
I will explain what the levels mean and how we came up with them below, but first let’s sum up the reasons why this is a complex problem.
The range in downloads/revenue is so large that any average is misleading. The app market roughly follows a power law (downloads or revenue vs. rank), so it’s very unequal. This is one of the key aspects, but I’ll be brief on it since we covered it in detail in our article “How Does Reflection Model the App Market So Well?”.
How to measure the error on our estimates?
- Absolute error (e.g. we over- or underestimated by 20 downloads). This would be the simplest, but due to the wide range of values (from 0 to 100,000+) it is not very indicative of how bad the estimate really was. See in the table below how the absolute error increases with the value. Also, if we took the average, it would be totally dominated by the top few apps with massive numbers, while all the smaller apps would contribute almost no error.
- Percentage error (e.g. we were 20% off):
Note: We want it to be the same whether we over- or underestimate, so we take the absolute error as the numerator and the average of the Estimate and the RealValue as the denominator (instead of simply the RealValue).
This makes more sense with such a large range of values. But the app market is so flooded with apps that many get very few downloads. In particular for Paid apps, the vast majority get only a handful of downloads per day per country. So if we estimate 3 downloads but the app really had only 1, we overestimated by 100%, which looks really bad percentage-wise, but in reality it’s only 2 downloads off. Considering the range of values we are dealing with here, 2 downloads is not really significant.
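The percentage error described in the note above can be sketched in a few lines of Python. The function name is purely illustrative, and returning 0 when both values are zero is an assumption made here to avoid dividing by zero; the text does not say how that case is handled.

```python
def symmetric_percentage_error(estimate: float, real_value: float) -> float:
    """Absolute error divided by the average of estimate and real value, in percent."""
    average = (estimate + real_value) / 2
    if average == 0:
        # Assumption: both values are zero, so we treat the estimate as exact.
        return 0.0
    return 100 * abs(estimate - real_value) / average


# The example from the text: we estimated 3 downloads, the app really had 1.
print(symmetric_percentage_error(3, 1))  # 100.0
# Swapping the two values gives the same result, which is the point of the note.
print(symmetric_percentage_error(1, 3))  # 100.0
```

Using the average as the denominator is what makes the measure identical for over- and underestimates. Dividing by the RealValue alone would give 200% in the first case and about 67% in the second.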
To be able to measure our accuracy reliably, we need enough data. Let’s consider a worst-case scenario: if all of our partners’ apps were ranked particularly badly on a given day, say from 100 to 500, how could we estimate our accuracy in the top 100? We simply can’t. Our averaged accuracy would also look great, because we would only have apps in lower ranks, which are much easier to estimate… very misleading.
Now, how do we address those issues to provide a meaningful measure of how much we should “trust” our estimates?
- Market inequality: We break down the wide range of ranks into different groups and measure the accuracy for each range of ranks individually, so that we are comparing similar apps with each other. The ranges are distributed to be as uniform as possible on a logarithmic scale, since this is the nature of our data. This image might help explain this …
- Error measure: We realised that both the percentage AND the absolute error need to be large for an estimate* to be effectively “bad”. So we came up with a custom error metric that mimics well what a human (ok, maybe just statisticians…) would intuitively judge as good or bad. We combined (multiplied) the percentage error and the log of the absolute error; taking the log here simply makes the values more linearly distributed. We also added a constant parameter (α) to the denominator of the percentage error, so that an error of only a few downloads (or dollars) has less impact.
The end result is a metric, conveniently ranging from 0 (perfect) to 10 (bad), that we find a lot more intuitive, but we will let you be the judge:
- Coverage: The last relevant piece is the coverage, because for an error measure to be statistically significant we need many measurements. For this, we consider how much partner data we have on average per day for each range of ranks. We define a ceiling above which any more data wouldn’t help the accuracy. This gives us a metric between 0 (no data) and 1 (ceiling).
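The three ingredients above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual Reflection formulas, which are not given in the text. In particular: the function names are hypothetical, the value `alpha=5.0` is made up, `log1p` (the log of 1 + error) is used so that a zero error gives a zero score, coverage is assumed to grow linearly up to the ceiling, and no attempt is made to reproduce the scaling to the 0 to 10 range.

```python
import math


def log_spaced_rank_ranges(max_rank, n_groups):
    """Split ranks 1..max_rank into groups of roughly equal width on a log scale."""
    edges = [round(max_rank ** (i / n_groups)) for i in range(n_groups + 1)]
    ranges = []
    start = 1
    for edge in edges[1:]:
        end = max(edge, start)  # guard against empty ranges after rounding
        ranges.append((start, end))
        start = end + 1
    return ranges


def combined_error(estimate, real_value, alpha=5.0):
    """Percentage error (damped by alpha) multiplied by the log of the absolute error.

    alpha is a hypothetical value; log1p is an assumption to handle a zero error.
    """
    abs_err = abs(estimate - real_value)
    pct_err = abs_err / ((estimate + real_value) / 2 + alpha)
    return pct_err * math.log1p(abs_err)


def coverage(avg_measurements_per_day, ceiling):
    """Share of the ceiling reached, clamped to [0, 1] (linear scaling is assumed)."""
    if ceiling <= 0:
        raise ValueError("ceiling must be positive")
    return max(0.0, min(avg_measurements_per_day / ceiling, 1.0))


print(log_spaced_rank_ranges(1000, 3))  # [(1, 10), (11, 100), (101, 1000)]
print(combined_error(3, 1))             # small: only 2 downloads off
print(combined_error(30000, 10000))     # large: 20,000 downloads off
print(coverage(10, ceiling=20))         # 0.5
```

With these assumed settings, an estimate of 3 against a real value of 1 scores about 0.3, while 30,000 against 10,000 scores about 9.9, even though both are off by the same percentage. That is the behaviour described above: the error only counts as bad when it is large both in percentage and in absolute terms.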
Putting it all together in a Confidence Level metric:
The output of this is a single number (a percentage) representing the level of confidence we should have in each estimate, considering all the aspects mentioned above. Now, we don’t want our users to misinterpret this metric and confuse it with a percentage of error or whatnot. So we simply break it down into 5 categories, where the separations are percentile based, and we label them C, B, A, A+ and A++.
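The percentile-based labelling can be sketched as below. The text does not say which percentiles separate the five categories, so equal 20% bands are assumed here, and the function names and sample values are made up for illustration. How the confidence number itself is computed is not reproduced.

```python
from bisect import bisect_left

LABELS = ["C", "B", "A", "A+", "A++"]


def percentile_cutoffs(confidences, percentiles=(20, 40, 60, 80)):
    """Confidence values at the given percentiles (nearest-rank method).

    The 20/40/60/80 split is an assumption, not a documented choice.
    """
    ordered = sorted(confidences)
    n = len(ordered)
    cutoffs = []
    for p in percentiles:
        rank = (p * n + 99) // 100  # integer ceiling of p * n / 100
        cutoffs.append(ordered[max(rank - 1, 0)])
    return cutoffs


def confidence_label(confidence, cutoffs):
    """Map a confidence value to a label; values at a cutoff stay in the lower band."""
    return LABELS[bisect_left(cutoffs, confidence)]


# Made-up confidence values, for illustration only.
sample = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
cutoffs = percentile_cutoffs(sample)
print([confidence_label(c, cutoffs) for c in sample])
# ['C', 'C', 'B', 'B', 'A', 'A', 'A+', 'A+', 'A++', 'A++']
```

Because the separations are percentiles rather than fixed thresholds, a label says where an estimate sits relative to all other estimates, which is why it should not be read as an error percentage.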
Where to Find the Confidence Level on Reflection
Confidence Levels are available on app revenue and download estimates for each country for Pro users. To see them, view the “Revenue and Downloads” tab on any app page, and overlay countries using the drop-down. The confidence levels will appear in the chart tooltip as you hover over the chart, as long as you have the “Confidence Scores” checkbox selected.
View the Revenue and Downloads tab for an app (Pro users only) and overlay countries on the chart using the drop-down:
Make sure you have the “Confidence Scores” checkbox selected:
Hover over the chart to see confidence scores for each data point. The two columns are the confidence scores for phone and tablet when “All Devices” is selected for an iOS app:
We want to show as many estimates as possible, because gaps in data make any analysis much more difficult. At the same time, we need to ensure our estimates are accurate. Hence the need, in our eyes, to provide such labelling. We hope that you appreciate the effort at transparency!
* When measuring the error for an app, we use Estimate_loo, an estimate computed without using that specific app in our modelling. This technique is called leave-one-out.
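For readers unfamiliar with leave-one-out, here is a minimal Python sketch of the idea. The model used is a stand-in chosen for illustration (a straight-line fit of log downloads against log rank, in line with the rough power law mentioned earlier); it is not Reflection's actual model, and the function names and sample data are made up.

```python
import math


def fit_power_law(points):
    """Least-squares fit of log(downloads) = a + b * log(rank).

    points is a list of (rank, downloads) pairs with downloads > 0.
    """
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(downloads) for _, downloads in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def leave_one_out_estimates(points):
    """For each app, fit the model on all the other apps, then estimate that app."""
    estimates = []
    for i, (rank, _) in enumerate(points):
        others = points[:i] + points[i + 1:]  # the app itself is excluded
        a, b = fit_power_law(others)
        estimates.append(math.exp(a + b * math.log(rank)))
    return estimates


# Made-up (rank, downloads) pairs, for illustration only.
apps = [(1, 1000.0), (2, 480.0), (5, 210.0), (10, 95.0), (50, 21.0)]
for (rank, real), estimate in zip(apps, leave_one_out_estimates(apps)):
    print(f"rank {rank}: real {real:.0f}, leave-one-out estimate {estimate:.1f}")
```

The error is then measured between each real value and its leave-one-out estimate, so an app's own data never helps to predict itself.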
PS. If you didn’t get a word of this and you don’t see an A++ next to the estimates for your app, just make sure you link your account with us. That’s all you really need to know! 😉