What?
In the kind of research that I do, and that most of the students who will visit this site do, the main task that requires statistics is determining whether or not one algorithm (say, algorithm A) is better than another algorithm (algorithm B). E.g. suppose we have 10 results for each of these algorithms as follows:
A   | B
3.0 | 3.1
2.8 | 2.7
1.4 | 1.5
3.5 | 3.4
2.8 | 2.7
1.1 | 1.2
9.8 | 9.9
2.6 | 2.7
3.2 | 2.9
4.5 | 4.4
Which is better?
I'll assume all students know what the mean, median, variance and standard deviation are, and how to calculate them. If you don't know, you can easily find out; e.g. see http://www.fmi.uni-sofia.bg/vesta/virtual_labs/freq/freq2.html. Generally we may find that A has a better mean than B (for now, we just assume better means higher). But this doesn't mean algorithm A is better than B; in fact, we can never be sure about such a claim. Basically, a statistical test plugs in numbers like the above and comes up with a p-value: the probability that we would see numbers like this even if A and B were really about the same. Naturally, this probability is lower if the mean of A is much higher than the mean of B, but higher if the numbers tend to have a high variance. Anyway, see below for a basic step-by-step.
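If you would rather compute these basic statistics yourself, here is a minimal sketch in Python (standard library only), using the A/B table above as the data:

    from statistics import mean, median, variance, stdev

    a = [3.0, 2.8, 1.4, 3.5, 2.8, 1.1, 9.8, 2.6, 3.2, 4.5]
    b = [3.1, 2.7, 1.5, 3.4, 2.7, 1.2, 9.9, 2.7, 2.9, 4.4]

    for name, xs in [("A", a), ("B", b)]:
        # variance and stdev here are the sample (n-1) versions
        print(name, mean(xs), median(xs), variance(xs), stdev(xs))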
I will try to add to this page gradually with tutorial material and more tests. But for now we have some very quick and simple "how to"s.
How?
We can go a long way just using this site, which automates the process for us.
http://www.mirrorservice.org/sites/home.ubalt.edu/ntsbarsh/Business-stat/otherapplets/TwoPopTest.htm
Here is a real case. In Ridzuan Daud's PhD, we are developing a ruleset optimisation algorithm called ORGA. Is ORGA better than C4.5 on the horse-colic dataset? Well, the results of 10 runs of each are as follows:
ORGA: 76.4, 77.334, 80.667, 78.334, 82.667, 78.667, 77.667, 76.0, 79.0, 78.334
C4.5: 72.0, 73.0, 72.667, 73.333, 73.0, 72.333, 71.667, 74.667, 74.0, 73.0
Higher is better (these are percentage accuracy results). Actually, in this case it seems completely clear that ORGA is better. But for proper scientific reporting we should do the statistics anyway: a difference can look clear and yet still not be statistically significant. E.g. suppose we had just 2 results for each?
Anyway, go here: http://www.mirrorservice.org/sites/home.ubalt.edu/ntsbarsh/Business-stat/otherapplets/TwoPopTest.htm and enter those numbers: ORGA in one box and C4.5 in the other. Then click "Calculate test for means difference".
Since there are fewer than 30 samples per algorithm in this case, this page calculates the t-statistic, and then works out the p-value for it. (With 30 or more samples per algorithm it would have used the z-statistic instead, which is appropriate for larger samples.) Notice that the p-value in this case is 0.00025 -- you can think of this as 100*(1-0.00025) = 99.975% confidence that ORGA is better than C4.5 on this dataset. Usually we are happy with p < 0.1; with p >= 0.1, we have to say there is no clear evidence that A is really better than B. If we get this situation, then we might do more experiments (e.g. get 20 runs of each, or 50 runs of each, etc.) and we may then be able to find a statistically significant difference (but we may not).
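If you prefer to run the test offline, here is a minimal sketch in Python using scipy, under the assumption that the site performs a pooled-variance two-sample t-test with a one-sided p-value (that assumption matches the numbers reported on this page); for a difference as clear-cut as this one, scipy may report an even smaller p-value than the site's 0.00025:

    from scipy.stats import ttest_ind

    orga = [76.4, 77.334, 80.667, 78.334, 82.667, 78.667, 77.667, 76.0, 79.0, 78.334]
    c45  = [72.0, 73.0, 72.667, 73.333, 73.0, 72.333, 71.667, 74.667, 74.0, 73.0]

    # equal_var=True (the default) gives the pooled-variance t-test;
    # alternative="greater" tests "mean of orga > mean of c45"
    t, p = ttest_ind(orga, c45, alternative="greater")
    print(f"t = {t:.2f}, p = {p:.2g}")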
The story on the Wisconsin breast cancer dataset is a little different. Here the raw results are:
ORGA: 94.1, 93.629, 87.303, 86.416, 93.997, 94.287, 94.075, 92.838, 90.179, 93.579
C4.5: 93.1, 95.606, 92.618, 95.079, 92.618, 94.903, 93.848, 94.376, 93.848, 94.903
In this case we can see (by using the above site) that the C4.5 mean is better, and the p-value is around 0.032. Of course, the p-value now attaches to the statement "C4.5 is better than ORGA on this dataset" -- and we can indeed be quite confident in it, i.e. 96.8% confident.
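The same sketch applies here, with the arguments swapped since the question is now whether C4.5's mean is higher; under the same assumption about the site's test, the p-value should come out near the 0.032 above:

    from scipy.stats import ttest_ind

    orga = [94.1, 93.629, 87.303, 86.416, 93.997, 94.287, 94.075, 92.838, 90.179, 93.579]
    c45  = [93.1, 95.606, 92.618, 95.079, 92.618, 94.903, 93.848, 94.376, 93.848, 94.903]

    # alternative="greater" now tests "mean of c45 > mean of orga"
    t, p = ttest_ind(c45, orga, alternative="greater")
    print(f"t = {t:.2f}, p = {p:.3f}")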
Now, just for the purposes of illustration, I will add 2.16 to each of the ORGA results and create results for a pretend algorithm "PRET" -- this leads to a mean for PRET that is very slightly better than the mean for C4.5:
PRET: 96.26, 95.789, 89.463, 88.576, 96.157, 96.447, 96.235, 94.998, 92.339, 95.739
C4.5: 93.1, 95.606, 92.618, 95.079, 92.618, 94.903, 93.848, 94.376, 93.848, 94.903
The site tells us that PRET now has the better mean, 94.2 against 94.09 -- however, the p-value is 0.45739, which is nowhere near any indication of confidence in the notion that PRET is better than C4.5. Note that in this case it is the variances that lead to the lack of confidence. PRET (just like ORGA here) shows a very high variance (occasionally giving results like 86 and 87), which saps any confidence we can have in the mean of a few samples. In contrast, if we compared 10 results which were all 94.2 give or take 0.001 against 10 results which were all 94.09 give or take 0.001, these would show the same means, but the p-value for the first being better than the second would indicate much more confidence. I just tried it and it showed p ≈ 0.00025.
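Both illustrations are easy to reproduce with the same sketch: PRET is just ORGA shifted up by 2.16, and the tight samples in the second comparison are hypothetical jittered values, made up here purely for illustration:

    import random
    from scipy.stats import ttest_ind

    orga = [94.1, 93.629, 87.303, 86.416, 93.997, 94.287, 94.075, 92.838, 90.179, 93.579]
    c45  = [93.1, 95.606, 92.618, 95.079, 92.618, 94.903, 93.848, 94.376, 93.848, 94.903]
    pret = [x + 2.16 for x in orga]  # same variance as ORGA, mean shifted up

    # High variance swamps the tiny difference in means: p is large
    print(ttest_ind(pret, c45, alternative="greater").pvalue)

    # Hypothetical tight samples: same means, give or take 0.001
    random.seed(0)
    hi = [94.2 + random.uniform(-0.001, 0.001) for _ in range(10)]
    lo = [94.09 + random.uniform(-0.001, 0.001) for _ in range(10)]
    print(ttest_ind(hi, lo, alternative="greater").pvalue)  # tiny p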