SFB 876 - News

Rapid Soccer Mining: Euro 2012

Who will win the European Soccer Championship? Many supporters care about their team and its probability of winning the championship. A small team of graduates of the collaborative research center SFB 876 tries to answer this question right from the beginning of the tournament. Follow their posts, which not only predict the winners but also demonstrate the whole process from data retrieval via model learning up to the prognosis.

And that makes...

Dr. Ruhe, Tim , 13.7.2012

A few weeks after the triumph of the Spanish squad it is about time to draw some conclusions. However, we will not draw conclusions on the performance of the individual teams; instead we will concentrate on the performance of our model.

Doing so it is of course interesting which model got the most games right. In this category we have a clear winner who dominated almost as much as the Spanish dominated Italy. With 18 out of 31 matches correct, the winner is: Oddset.

Second place goes to Jan-Hendrik with a temporary success in the everlasting match men vs. machine. 16 correctly predicted games speak for themselves. Congratulations.

That makes the third and last place for our model. But that does not make us too sad. At least not really. Having almost 50% of the matches correct, the model performed just as expected. Sure, winning against Oddset would have been nice, but it would have given a totally wrong impression of our model. It would have been pure luck and the influence of statistical fluctuations. Nothing less, nothing more.

Differences in the B-note

Looking at the three different ways of predicting soccer games, one might come up with the idea of betting. Whether one would do that in reality depends – at least to a great deal – on one's personal opinion of betting and the willingness to take a certain risk. But... you know... in principle...?

To make it short: We did calculate the possible winnings for each of the three models, based on the assumption that 1 Euro is bet on every game. The results in this subcategory are somewhat surprising. Placing your bet on the teams favored by Oddset will give you a plus of 85 cents. Not much, but a win at least.

Things look really bad for Jan-Hendrik, though. Despite the fact that he had 16 matches correct, he would be losing 5.55 Euro. We strongly recommend that he not bet on soccer matches.

But now for our machine learning model. It actually wins 2.25 Euro. That does not seem that much either but try to get that much at your bank!

The question of how this happened can be answered quite easily. It turns out that our model did have some kind of inside knowledge about the wins of underdogs. How did it get that? We don't know! Maybe we were just lucky. Although, soccer has nothing to do with luck, does it?
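
The bookkeeping behind these numbers is simple enough to sketch. The following is a minimal illustration, assuming decimal odds (a correct tip pays stake times odds, stake included) and using a made-up list of four tips; neither the function name nor the toy values come from our actual evaluation.

```python
def betting_balance(bets, stake=1.0):
    """Net win or loss for a list of (decimal_odds, tip_was_correct) pairs."""
    balance = 0.0
    for odds, tip_was_correct in bets:
        balance -= stake                 # the stake is paid for every game
        if tip_was_correct:
            balance += stake * odds      # payout includes the stake (assumed convention)
    return balance


# Hypothetical tips: two correct, two wrong
toy_bets = [(1.80, True), (3.45, False), (2.75, True), (1.25, False)]
print(round(betting_balance(toy_bets), 2))  # -> 0.55
```

The sketch also shows why fewer correct tips can still pay more: one correct tip on an underdog with high odds outweighs several correct tips on clear favorites.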

Last game coming up on Sunday

Dr. Ruhe, Tim , 29.6.2012

Spain - Italy

Looks like a sure thing for Spain and the third title in a row. Betting odds see this match the same way but a little less clear. Only our local expert Jan-Hendrik believes in a draw.

Win Spain Win Italy Draw
Jan-Hendrik X
Betting odds(Oddset) 2.00 3.00 2.75
Our prediction 69.7% 18.4% 12.0%
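
To put odds and percentages side by side, the odds can be turned into rough probabilities. This is only a sketch, assuming the quoted values are decimal odds and that the bookmaker's margin is spread evenly over the three outcomes; both are assumptions on our side, and the function name is made up for illustration.

```python
def implied_probabilities(odds):
    """Convert decimal odds into probabilities that sum to one."""
    raw = [1.0 / o for o in odds]      # inverse odds, still including the margin
    total = sum(raw)
    return [r / total for r in raw]    # normalise the margin away


# Odds for Win Spain, Win Italy, Draw as quoted in the table above
probs = implied_probabilities([2.00, 3.00, 2.75])
print([round(p, 3) for p in probs])  # -> [0.418, 0.278, 0.304]
```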

Who will make it to the final?

Dr. Ruhe, Tim , 27.6.2012

Portugal – Spain

Spain slightly ahead of Portugal. We guess Ronaldo does not agree with that.

Win Portugal Win Spain Draw
Jan-Hendrik X
Betting odds(Oddset) 3.55 1.80 2.80
Our prediction 40.3% 45.9% 13.8%

Germany - Italy

Sure thing! After Greece, Germany is going to send Italy home as well.

Win Germany Win Italy Draw
Jan-Hendrik X
Betting odds(Oddset) 1.80 3.55 2.80
Our prediction 57.5% 24.0% 18.5%

Wanted! Opponent of the German team

Dr. Ruhe, Tim , 24.6.2012

England - Italy

Who’s going to meet the Germans in the semifinal on Thursday? The betting odds do not have a clear statement. Our model, however, predicts a classic: Germany vs. England!

Win England Win Italy Draw
Jan-Hendrik X
Betting odds(Oddset) 2.40 2.40 2.75
Our prediction 48.2% 29.8% 22.0%

Number three

Dr. Ruhe, Tim , 23.6.2012

Spain - France

This one seems to be pretty clear: Spain sending the French home. Can Les Bleus regain the strength of former days?

Win Spain Win France Draw
Jan-Hendrik X
Betting odds(Oddset) 1.65 4.10 2.90
Our prediction 41.5% 35.1% 23.4%

Second quarterfinal today!

Dr. Ruhe, Tim , 22.6.2012

Germany - Greece

According to Sami Khedira a loss against Greece does not play any role in the heads of the German players. Our algorithm agrees with that. Greece is no more than an intermediate stop for the Germans.

Win Germany Win Greece Draw
Jan-Hendrik X
Betting odds(Oddset) 1.25 7.00 4.00
Our prediction 43.6% 24.5% 31.9%

Quarterfinals!


Dr. Ruhe, Tim , 21.6.2012

Czech Republic - Portugal

A promising first match. According to our algorithm the teams will draw after 90 minutes and take it into overtime. Maybe even penalties will have to decide. Who’s in better shape mentally?

Win Czech Republic Win Portugal Draw
Jan-Hendrik X
Betting odds(Oddset) 4.10 1.65 2.90
Our prediction 23.8% 34.9% 41.3%

Note: The quoted probabilities correspond to the result after 90 minutes!

Last matches of the group phase

Dr. Ruhe, Tim , 19.6.2012

England - Ukraine

Close match between England and Ukraine. Will Ukraine use the advantage of playing at home or will the second host leave the tournament in an early phase?

Win England Win Ukraine Draw
Jan-Hendrik X
Betting odds(Oddset) 2.00 3.00 2.75
Our prediction 35.9% 37.3% 26.8%

Sweden - France

In contrast to the betting sites our model does see an advantage for Sweden. Will the Vikings leave the tournament with a win?

Win Sweden Win France Draw
Jan-Hendrik X
Betting odds(Oddset) 4.50 1.55 3.05
Our prediction 48.6% 28.4% 23.0%

Group C: Italy, Spain or Croatia

Dr. Ruhe, Tim , 18.6.2012

Croatia – Spain

Spain does have a clear advantage against Croatia. Although our model does not see the Spanish as far ahead as the betting odds.

Win Croatia Win Spain Draw
Jan-Hendrik X
Betting odds(Oddset) 7.00 1.50 2.60
Our prediction 34.2% 42.4% 23.4%

Italy - Ireland

The Squadra Azzurra also makes it into the quarterfinals.

Win Italy Win Ireland Draw
Jan-Hendrik X
Betting odds(Oddset) 1.25 7.00 4.00
Our prediction 61.3% 18.9% 19.8%

A mouse took a stroll through the deep dark wood

Dr. Ruhe, Tim , 18.6.2012

Trees are awesome. They provide shade, protect you from rain and transform carbon dioxide into oxygen. A larger group of trees is called a forest. So much for German folk wisdom.

It is not only forest rangers who have to deal with trees. Data miners also work with trees on a regular basis. In many cases models used for predictions are based on trees. Decision trees. But just like regular trees, decision trees do not just fall from the sky; they have to grow. Although for decision trees it’s more that they are grown.

Predicting means ordering

In principle, growing a decision tree is nothing more than one big sorting procedure, proceeding from the root towards the leaves. Anything can be sorted.

Imagine you were looking at a pile of soccer games that needed to be sorted by their results. The criteria used for sorting are up to you. The only thing you have to keep in mind is that you need to use the same criteria for every game. And: the actual result of the game cannot be used for sorting.

Now you’re good to go. You might start by looking at the FIFA ranking of the away team. All games where this particular parameter is greater than 63 go to a newly made left pile, while the rest go to the right one. By doing so you have already achieved quite a good separation.

But the separation can get better! Starting from each of the piles you can create two new ones. That also will have to happen according to a certain criterion. Maybe the trend – up, down or equal.

This little game of creating piles from piles from piles continues until no more piles can be created. This is the case when either all games in a pile already have the same result or the games are so similar that the pile cannot be split up any further. By analogy with real trees, these final, small piles are called leaves.

The pattern evolving during this procedure of sorting and piling gets wider on top. Just like a tree. So, what was grown with your help is a decision tree.
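
The piling game can be written down in a few lines. This is a minimal sketch under stated assumptions: each game is a small dictionary, the field names and the four toy games are invented for illustration, and only the threshold of 63 and the trend criterion are taken from the text. The leaf check covers only the first stopping condition (all results equal).

```python
def split_pile(pile, criterion):
    """Split one pile into a left pile (criterion holds) and a right pile (the rest)."""
    left = [game for game in pile if criterion(game)]
    right = [game for game in pile if not criterion(game)]
    return left, right


def is_leaf(pile):
    """A pile is finished when all of its games have the same result."""
    return len({game["result"] for game in pile}) <= 1


# Invented toy games; result: 1 = home win, 2 = away win, 0 = draw
games = [
    {"away_rank": 70, "trend": "up", "result": 1},
    {"away_rank": 80, "trend": "down", "result": 1},
    {"away_rank": 12, "trend": "up", "result": 2},
    {"away_rank": 30, "trend": "equal", "result": 0},
]

# First split: FIFA ranking of the away team greater than 63
left, right = split_pile(games, lambda g: g["away_rank"] > 63)

# The right pile is still mixed, so it is split again, this time by trend
right_up, right_rest = split_pile(right, lambda g: g["trend"] == "up")
```

Note that the result is only used to check whether a pile is finished, never as a sorting criterion.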

With the help of your decision tree you can now start classifying matches that are still about to happen. Spain vs. Croatia, for example. According to its parameters this game is passed along the branches of the decision tree until it ends up in a leaf. To come up with a prediction you now have to take a look at all the matches used to create the leaf. If all of them have the same result, the prediction for Spain – Croatia will be this very result. Simple.

Things are not quite as simple when the games inside the leaf have different results. But in this case it is also possible to come up with a prediction. You might take the result of the majority of the matches in the leaf for example. Or you might just calculate the probability of the individual results.
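
Both options for a mixed leaf fit into one small function. A minimal sketch, assuming the results of the games in a leaf are given as a plain list of labels (1 = home win, 2 = away win, 0 = draw); the function name and the example leaf are made up.

```python
from collections import Counter


def leaf_prediction(leaf_results):
    """Return the majority result of a leaf and the probability of each result."""
    counts = Counter(leaf_results)
    total = sum(counts.values())
    probabilities = {label: n / total for label, n in counts.items()}
    majority = counts.most_common(1)[0][0]
    return majority, probabilities


# Hypothetical leaf with five training games
majority, probs = leaf_prediction([1, 1, 1, 0, 2])
print(majority, probs)
```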

More trees are better

Since Hermann the Cheruscan it has been well known that you have a clear advantage if you are familiar with forests. That is true not only in the case of a Roman invasion led by a certain Quintilius Varus but also in the field of machine learning.

Using a larger number of trees in general has the effect of improving the prediction.

This learning algorithm, called Random Forest, was developed by Leo Breiman and basically works like the ask-the-audience lifeline on Who Wants to Be a Millionaire. All of the trees are created independently, and each comes up with an individual result on how a particular match ends. The final prediction is then obtained by averaging over all predictions. Totally democratic.

The advantage of this method is that it will not only return a certain prediction but also the probability it calculated for a certain result.
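
The averaging step might look like the sketch below. It assumes that every tree reports a probability for each of the three outcomes; the four toy trees, the outcome names and the function name are made up for illustration.

```python
def forest_prediction(tree_probabilities):
    """Average the per-tree probabilities for home win, away win and draw."""
    labels = ("home", "away", "draw")
    n_trees = len(tree_probabilities)
    return {
        label: sum(tree.get(label, 0.0) for tree in tree_probabilities) / n_trees
        for label in labels
    }


# Hypothetical output of four trees for one match
trees = [
    {"home": 1.0, "away": 0.0, "draw": 0.0},
    {"home": 0.5, "away": 0.5, "draw": 0.0},
    {"home": 0.0, "away": 0.0, "draw": 1.0},
    {"home": 0.5, "away": 0.0, "draw": 0.5},
]
average = forest_prediction(trees)
winner = max(average, key=average.get)
print(winner, average)
```

The averaged values are exactly the kind of percentages quoted in our prediction tables, and the outcome with the largest value is the forest's tip.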

To make such a forest work it has to be ensured that all the trees are different. This is done by drawing a subset of parameters at random when splitting up a certain pile. From the selected attributes, the one with the best separation power is chosen. By doing so a Random Forest is generated in which no two trees are the same and each tree can come up with an individual decision on how a match will end.
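
How a single split with a random subset of attributes could work is sketched below. The text does not say how separation power is measured; Gini impurity is used here as one common choice and is an assumption, as are the attribute names and the toy games. Drawing a random sample of the training games for every tree, which is also part of Breiman's algorithm, is left out to keep the sketch short.

```python
import random
from collections import Counter


def gini(pile):
    """Gini impurity of a pile: 0.0 means all games have the same result."""
    if not pile:
        return 0.0
    counts = Counter(game["result"] for game in pile)
    return 1.0 - sum((n / len(pile)) ** 2 for n in counts.values())


def split_quality(pile, attribute, threshold):
    """Size-weighted impurity of the two piles after the split (lower is better)."""
    left = [g for g in pile if g[attribute] > threshold]
    right = [g for g in pile if g[attribute] <= threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(pile)


def choose_split(pile, attributes, n_candidates, rng):
    """Draw a random subset of attributes and return the best (attribute, threshold)."""
    best = None
    for attribute in rng.sample(attributes, n_candidates):
        for threshold in sorted({g[attribute] for g in pile}):
            quality = split_quality(pile, attribute, threshold)
            if best is None or quality < best[0]:
                best = (quality, attribute, threshold)
    return best[1], best[2]


# Invented toy games; result: 1 = home win, 2 = away win
games = [
    {"rank_diff": -40, "points_diff": 300, "away_rank": 70, "result": 1},
    {"rank_diff": -25, "points_diff": 200, "away_rank": 65, "result": 1},
    {"rank_diff": 30, "points_diff": -250, "away_rank": 10, "result": 2},
    {"rank_diff": 45, "points_diff": -310, "away_rank": 5, "result": 2},
]
attributes = ["rank_diff", "points_diff", "away_rank"]
attribute, threshold = choose_split(games, attributes, 2, random.Random(0))
print(attribute, threshold)
```

Because every split only sees a random part of the attributes, two trees grown on the same games will usually end up looking different.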

Group B: Who will make it to the quarterfinals?

Dr. Ruhe, Tim , 17.6.2012

Denmark - Germany

The predicted strength of the Danish for all games so far seems to be more like a natural phenomenon. Can they surprise again?

Win Denmark Win Germany Draw
Jan-Hendrik X
Betting odds(Oddset) 4.50 3.25 1.50
Our prediction 38.8% 31.5% 29.7%

Portugal - Netherlands

Things are close in this group. Will the Dutch still make it to the quarterfinals?

Win Portugal Win Netherlands Draw
Jan-Hendrik X
Betting odds(Oddset) 2.40 2.35 2.80
Our prediction 48.1% 18.4% 33.5%

Last matches of the group phase starting today

Dr. Ruhe, Tim , 16.6.2012

Czech Republic - Poland

Our model does see some advantage for the Czech team but the psychological effects of playing at home are hard to predict. So we’ll see how the Polish perform.

Win Czech Republic Win Poland Draw
Jan-Hendrik X
Betting odds(Oddset) 2.80 2.10 2.75
Our prediction 46.2% 27.4% 26.4%

Greece - Russia

Sometimes model predictions are weird. Looking at the performance of the teams so far, the clear advantage of Greece seems a little absurd.

Win Greece Win Russia Draw
Jan-Hendrik X
Betting odds(Oddset) 4.00 1.70 2.80
Our prediction 50.3% 25.9% 33.6%

England, France or Ukraine?

Dr. Ruhe, Tim , 15.6.2012

Ukraine - France

Close match between Ukraine and France. Slight advantage for France though.

Win Ukraine Win France Draw
Jan-Hendrik X
Betting odds(Oddset) 3.20 1.90 2.80
Our prediction 35.6% 38.7% 25.7%

Sweden - England

The predicted draw puts the English into a difficult position. Will they make it to the quarterfinals?

Win Sweden Win England Draw
Jan-Hendrik X
Betting odds(Oddset) 3.20 1.90 2.80
Our prediction 17.2% 34.8% 48.0%

The next two matches of group C

Dr. Ruhe, Tim , 14.6.2012

Italy - Croatia

Advantages for Italy versus Croatia. Still a tough race between the three teams from Southern Europe.

Win Italy Win Croatia Draw
Jan-Hendrik X
Betting odds(Oddset) 2.00 3.00 2.75
Our prediction 57.0% 22.9% 20.1%

Spain - Ireland

Spain with a clear advantage against the fighting Irish. Will they make a giant leap towards the quarterfinals?

Win Spain Win Ireland Draw
Jan-Hendrik X
Betting odds(Oddset) 1.25 7.00 4.00
Our prediction 43.0% 17.2% 39.8%

New! Today's odds from an improved model

Dr. Ruhe, Tim , 13.6.2012

Right on time we improved our model. Here are the brand new predictions for today's matches.

Denmark - Portugal

Will the Danish surprise again? Once more our predictions and the betting odds disagree. With a win Denmark enters the quarterfinals and Ronaldo's chances are merely theoretical.

Win Denmark Win Portugal Draw
Jan-Hendrik X
Betting odds(Oddset) 3.30 1.85 2.85
Our prediction 57.0% 23.9% 19.1%

Netherlands - Germany

Great news for the Germans! The predicted draw does not help the Dutch but gets Germany a lot closer to reaching the quarterfinals.

Win Netherlands Win Germany Draw
Jan-Hendrik X
Betting odds(Oddset) 2.80 2.10 2.75
Our prediction 27.% 20.% 51.%

Data Mining? What the...

Dr. Ruhe, Tim , 12.6.2012

A statistician, a computer scientist and a physicist meet at a party. Says the physicist: “...

What starts out like a joke is really the beginning of a very successful collaboration at TU Dortmund University. This collaboration is organized within the Collaborative Research Center SFB 876. In principle this works as follows: Statisticians come up with a new method, which is then efficiently implemented by the computer scientists and applied to large data sets by the physicists. And all of this goes under the name of data mining.

Data mining? What the...? Mining, that does sound like tradition. Like the work of an actual craftsman. Well, we do not carry around lamps and do not go underground. And helmets are only worn on very special occasions.

But still, data mining can be compared to regular mining. Both are a professionalised search: the search for gold, diamonds or coal on the one hand and the search for information on the other hand.

But the similarities continue. Just like coal or oil, information might be hidden at different depths. One day it lies directly at the surface and can be found with a simple Google search. Another time it might be hidden under huge amounts of useless data one has to dig through.

Soccer data are somewhere in between. The results of games can be accessed easily. Thanks to Wikipedia the history of a lot of teams can be traced back to the beginning of time. But the result itself does not tell us anything about the circumstances of the game, except for those who actually watched it. And the exact circumstances of the 1:0 win of the Germans against Switzerland on November 22nd 1950 are probably only remembered by a handful of people.

Information might also be scattered and washed down the river as small nuggets. In this case it needs to be gathered from different sources.

But it might also come in huge layers. Data from eBay, Facebook or Amazon can only be mined using industrial techniques or – in our case – computer clusters.

In data mining as well as in regular mining one needs tools, although the tools in data mining are purely digital, consisting of algorithms and computer programs. But one thing remains true: There is a specific tool for every task. Choosing the wrong algorithm in data mining can be compared to digging a hole with a screwdriver.

Automatic algorithms can be used to crawl the internet. But there are easier tasks as well. A simple sorting of data according to a certain criterion is done by an algorithm. So is the filtering of useless information.

Once the right tool has been found to dig for ore, oil or diamonds, these need to be processed. Diamonds have to be polished, ore has to be smelted and oil needs to be refined. This is similar for information. After extraction, data need to be refined and interpreted. This can be done using simple diagrams or in the form of predictions. From the known results and circumstances of previous events the outcome of future events is predicted. From the well-known constellations of previous soccer games a prediction for upcoming ones can be made. By using data mining.

Glück Auf!

Greece – Czech Republic

Close match where neither team seems to have an advantage. Draw, says our model.

Win Greece Win Czech Republic Draw
Jan-Hendrik X
Betting odds(Oddset) 2.55 2.25 2.75
Our prediction 32.2% 30.3% 37.5%

Poland - Russia

Will the Russians make it straight to the quarterfinals? Yes, says our algorithm. After the 1:1 versus Greece it’s almost all or nothing for Poland.

Win Poland Win Russia Draw
Jan-Hendrik X
Betting odds(Oddset) 2.80 2.05 2.85
Our prediction 36.6% 45.5% 17.9%

France vs. England - extremely hard to predict

Dr. Ruhe, Tim , 11.6.2012

France - England

Tricky question and very hard to predict. Betting odds see France slightly ahead of England. For our algorithm it’s the other way around.

Win France Win England Draw
Jan-Hendrik X
Betting odds(Oddset) 2.35 2.45 2.75
Our prediction 32.0% 39.0% 29.0%

Ukraine - Sweden

The advantage of playing at home will not be enough for Ukraine to beat the Swedish.

Win Ukraine Win Sweden Draw
Jan-Hendrik X
Betting odds(Oddset) 2.15 2.65 2.80
Our prediction 36.0% 47.2% 16.8%

The World Cup winner enters the stage

Dr. Ruhe, Tim , 10.6.2012

Spain - Italy

World Cup winner and title holder Spain enters the stage against Italy. Looking good for Spain in this very match. Can the Squadra Azzurra fight back?

Win Spain Win Italy Draw
Jan-Hendrik X
Betting odds(Oddset) 1.75 3.65 3.00
Our prediction 44.0% 31.7% 24.3%

Ireland - Croatia

Croatia versus the fighting Irish. Is it going to be a good start to the tournament?

Win Ireland Win Croatia Draw
Jan-Hendrik X
Betting odds(Oddset) 3.00 2.00 2.75
Our prediction 23.4% 47.5% 29.1%

Finally! Predictions for the hardest group of the tournament

Dr. Ruhe, Tim , 9.6.2012

Netherlands - Denmark

Very interesting match! Especially because our prediction does not match with the betting odds. Denmark is not as clear an underdog as assumed. Is this going to be the next surprise?

Win Netherlands Win Denmark Draw
Jan-Hendrik X
Betting odds(Oddset) 1.50 5.00 3.05
Our prediction 36.6% 38.6% 24.8%

Germany - Portugal

Clear advantage for the Germans. Ronaldo and others, however, might not agree with that.

Win Germany Win Portugal Draw
Jan-Hendrik X
Betting odds(Oddset) 1.75 3.45 3.00
Our prediction 56.0% 22.7% 21.3%

Scrambled Eggs

Dr. Ruhe, Tim , 8.6.2012

Scrambled eggs are tasty and will give you the necessary strength to make it all the way to the final of the Euro 2012. When preparing the eggs there’s a strict separation between egg-white and yolk.

Somehow that’s similar to the home and away teams. There’s a strict line, too, and in most cases it is very clear which one of the teams is playing at home. In a European Championship this is not as clear. Well, there are official home and away teams so that everyone knows what to wear, but a real advantage of the home team does not exist.

As already stressed in the previous post, this is completely different for the matches of the qualifying stage. Here the home team does have a significant advantage. In fact it looks like soccer players actually play better with their own supporters behind them. This fact, however, comes with some serious problems that need to be avoided in order to reliably predict the outcome of the games of the Euro.

The simplest solution would be to use matches from European championships only. But that also comes with serious caveats: There are just too few of them. And the histories of those games are a lot more difficult to access. The FIFA ranking, for example, has only existed since 1993.

So what do we do? Simple: Scramble. As long as it takes for egg-white and yolk to form one yellowish substance. The only thing that’s different here is how we scramble. What we basically do is a simulated coin toss. Heads: Everything stays the way it was. Tails: Home and away team are switched.
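
The coin toss might be sketched as follows. The match layout and field names are invented for illustration, and the labels follow the betting convention 1 = home win, 2 = away win, 0 = draw. That the label has to be switched together with the teams is our reading of the procedure, not something spelled out above.

```python
import random

# When the teams are switched, a home win becomes an away win and vice versa
SWAPPED_LABEL = {1: 2, 2: 1, 0: 0}


def scramble(match, rng):
    """Heads: return the match unchanged. Tails: switch home and away team."""
    if rng.random() < 0.5:   # heads
        return dict(match)
    return {                 # tails
        "home": match["away"],
        "away": match["home"],
        "home_rank": match["away_rank"],
        "away_rank": match["home_rank"],
        "result": SWAPPED_LABEL[match["result"]],
    }


# Hypothetical qualifying match: Team A won at home against Team B
match = {"home": "Team A", "away": "Team B",
         "home_rank": 8, "away_rank": 35, "result": 1}
scrambled = scramble(match, random.Random(2012))
print(scrambled)
```

Applied to all training games, roughly half of them end up with switched roles, so the learner can no longer rely on the home position alone.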

We find that with this scrambling the matches of the Euro `08 can be predicted much more precisely. Well, we are still unable to predict draws, but overall two thirds of the games are classified correctly. Eight out of eleven home team wins and seven out of ten away team wins are identified correctly.

To summarize things: This model does work accurately enough. And now things are getting interesting. We’ll apply the model to the matches of the Euro 2012.

About favourites and underdogs

Within the tournament things are constantly changing. Favourites disappoint the audience by playing uninspired and boring soccer, the Germans as usual need a few games to discover their potential and some underdogs are doing surprisingly well. This is the reason why we are going to update our predictions on a daily basis. Recent results can and need to be taken into account. Despite all that, we might just take a look into the crystal ball.

Group A: Poland using its advantage

In fact there’s not much to worry about for the Polish team. The advantage of playing at home, which was taken into account for Poland and Ukraine, will get Poland straight to the quarter finals. Second place is going to be a tough race between Russia and the Czech Republic, with the Russians having a slight advantage.

Group B: A real tough one

I cannot help stating it: This is the toughest group of the tournament. In turn this makes it hard to predict. In fact the differences are so tiny that it might be of importance who is just slightly better on that very day. Hard competition between Denmark, the Netherlands and Germany to the very last minute. Portugal: disappointing.

Group C: Two out of three from Southern Europe

Italy, Croatia and Spain will be up for two places in the quarter finals. Slight advantages for Croatia and Spain. The matches against Ireland might make the difference in a tight group.

Group D: Ukraine or France?

It will come down to Ukraine or France in group D. In case Ukraine manages to use the advantage of playing at home, it seems possible that they make it to the quarter finals. England seems to be slightly ahead, especially against France. But can they do it without Rooney? We’ll see.

Drumroll please: The first predictions

Of course we are going to compare our results with others. This is why we will post not only our prediction but also the betting odds and the enlightened predictions of our local expert Jan-Hendrik.

Poland - Greece

No doubt: Poland is going to win its first match. Or can the Greeks come up with the big surprise just like in 2004? Highly unlikely according to our algorithm.

Win Poland Win Greece Draw
Jan-Hendrik X
Betting odds(Oddset) 1.80 3.45 2.85
Our prediction 82.0% 7.5% 10.5%

Russia – Czech Republic

Betting odds and our predictions agree on a tight match. Maybe some tiny advantages for the Russians.

Win Russia Win Czech Republic Draw
Jan-Hendrik X
Betting odds(Oddset) 1.90 3.85 2.85
Our prediction 31.3% 23.8% 44.9%

Fail?!? The Euro `08 as a test

Dr. Ruhe, Tim , 7.6.2012

As a boy I was not very fond of modeling. My fingers are all thumbs, which is why I kept breaking the small pieces that were supposed to become ships and airplanes. On a rainy Sunday in April 1992 I even glued my thumb to the desk. The end of a short and unsuccessful career in model building.

But still, I am building models today. Or better, I have them built by the computer. How does that work?

About ships and aircraft

Imagine you had never seen an airplane in your life. Nor had you ever seen a ship. Now imagine someone took you by the hand and kept showing you airplanes: passenger planes, biplanes, military jets and cargo aircraft. After that, someone would show you ships: cruise liners, oil tankers, container ships, fishing boats and yachts. If you now had to decide about an unknown vehicle, you could say with certainty: Ship. Or airplane. Despite all the differences between the individual types of vehicles, you would have learned from the examples.

The computer does the exact same thing. It learns from examples. It creates rules. One of those rules could be: A ship does not have wings. How those rules are created can be very different from algorithm to algorithm. But what comes out of a learning algorithm is always the same. And it’s called a model. Using this model the computer is able to decide whether an unknown vehicle is a ship or an aircraft.

For the case of soccer matches learning is a little more difficult. The algorithm has to distinguish between three instead of two possible outcomes. And soccer always comes up with surprises. There are underdog wins, early red cards, offside goals, unjustified penalties and Arjen Robben.

Nevertheless, one can use the computer to learn a model from examples. The question, however, is whether the model can be used for predictions. If one follows that track, one ends up with the question: When is a model a good model? And the answer is: When it models reality accurately enough.

That is not very precise, I know. But what accurately enough means depends strongly on what the model is going to be used for. A boat folded from paper will slowly drift along on the river while my authentic and very accurate model of the Titanic is at home impressing the neighbours. If anyone at all.

The same thing is true for our models. They also have to describe reality to some extent. In order to evaluate whether they can be used for prediction they have to be tested. Testing is done using vehicles for which we already know whether they are ships or aircraft but that were not used for learning.

The Euro `08 as a test

We decided to use the Euro 2008 in Austria and Switzerland as a test. The outcomes of all games of the group phase are known and can now be compared to the predictions of our algorithms.

Doing that, one finds that nine out of ten wins of the home teams are classified correctly. That is astonishing. This does not hold, however, for the wins of the away teams. Only three out of eleven are recognized correctly. For the draws things are even worse. None of those is classified correctly.
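
Counting hits per outcome, as done above, takes only a few lines. This is a minimal sketch with an invented list of six games; the labels follow the betting convention 1 = home win, 2 = away win, 0 = draw, and the function name is made up. The numbers are not our Euro `08 results.

```python
from collections import defaultdict


def hits_per_class(actual, predicted):
    """Return {label: (correctly predicted, total)} for every actual outcome."""
    stats = defaultdict(lambda: [0, 0])
    for true_label, predicted_label in zip(actual, predicted):
        stats[true_label][1] += 1
        if true_label == predicted_label:
            stats[true_label][0] += 1
    return {label: tuple(counts) for label, counts in stats.items()}


# Invented toy test set
actual = [1, 1, 2, 2, 0, 1]
predicted = [1, 1, 1, 2, 1, 1]
summary = hits_per_class(actual, predicted)
print(summary)
```

Looking at each outcome separately matters: an overall hit rate would hide the fact that one class, here the draw, is never predicted correctly.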

But individual matches are interesting as well. The loss of the German team against Croatia is predicted correctly. That’s nice but still a little surprising. Looking at the individual parameters for both teams one finds that Germany beats Croatia in every single one. Same thing for the win of the Swiss over Portugal. Correctly classified as well. And again the Portuguese seemed to be better in every single parameter.

What happened here? Is this the end of our idea of predicting the Euro? Did the algorithm develop some kind of intuition or sixth sense?

To address these questions one has to take a look at the predictions and at the examples that were used for learning.

Starting with the draws, one finds that there are only three draws present in our test case. That is way too few to make any statements. More matches, especially more draws, are needed.

But also the examples the learner was trained on are important. Only 20% of those are draws. So, probably more draws are needed for learning as well.

For wins of the home and away teams things are a little more complicated. Most of the home wins are recognized as such, but many more home wins are predicted as well, 19 overall. Compared to only four predicted wins of the away team that is a lot. So what happened?

Thinking about the games we looked at, one immediately finds that the usual advantage of the home team is not present in a European championship. Except for the hosting countries. For the examples we used for learning – the games of the qualifying stage for the Euros `04 and `08 – this advantage was present. As a consequence, in case the difference between the two teams is not very large, the model considers it more likely that the home team will win. To put it into an example: the chance of a home team win is much larger for the match Croatia – Germany than for Azerbaijan – Germany.

Such an effect is called a bias and will of course influence the prediction. Wins of the home team will be predicted much more often. Especially if the difference in strength is not very big. Like in a Euro.
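
The size of the bias can be read off by comparing how often an outcome is predicted with how often it actually occurred. The counts below are the ones quoted in this post (10 actual and 19 predicted home wins, 11 actual and 4 predicted away wins); the function name is made up, and draws are left out because the number of predicted draws is not stated.

```python
def prediction_ratio(actual, predicted):
    """Predicted count divided by actual count; 1.0 would mean no over- or under-prediction."""
    return {label: predicted[label] / actual[label] for label in actual}


actual_wins = {"home": 10, "away": 11}
predicted_wins = {"home": 19, "away": 4}

ratio = prediction_ratio(actual_wins, predicted_wins)
print({label: round(value, 2) for label, value in ratio.items()})
# -> {'home': 1.9, 'away': 0.36}
```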

The important question is now: Can we account for the bias? The answer is yes.

How? That will be presented in the next article.

Strawberries



Dr. Ruhe, Tim , 6.6.2012

Apart from soccer, June is known for something else: Strawberries. My Grandma taught me how to make jam out of those. Two pounds of fruit, one pound of sugar, cook for five minutes and fill into a jar while still hot. It is as simple as it sounds.

Collecting historical soccer results is just as simple. Within minutes a search on Google can find you basically anything you want. Histories of all participants as well as the results of all of the final tournaments can be accessed via various websites. And of course there are still friendlies and the World Cup. It’s a little like being at a market with all the fruit vendors lining up nicely.

At this point one has to decide. Organic strawberries? A little more expensive but well, organic. Or the cheap ones on sale.

My Grandma always says: “Not every strawberry has what it takes.” And she’s right. They have to be just perfect. Red and fruity and full of taste.

This is true for data as well. Criteria are different though. In our case it was important that all of the basic parameters (circumstances of the game) could be accessed easily. This website proved itself to be very useful.

Pretty good so far. But two pounds of strawberries don’t just miraculously turn into jam. So better get some sugar. Or in our case: Go to the FIFA website. The FIFA rankings can be accessed back to 1993. The rankings at the time of the game are an important parameter for the actual strength of the individual teams. So is the trend. Did a nation improve? Was it downgraded?

Strawberries and sugar make a pretty good jam. But Grandma would not be Grandma if she couldn’t come up with some secret ingredients. A little lime juice and a shot of Cointreau make a taste that only Grandmas are capable of.

The same thing is true for data extracted from the FIFA ranking. They can be improved enormously using a simple trick. Not only the absolute ranking is of interest but also the difference in ranks. So is the difference in points.
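
Turning the ranking data into ingredients for the learner might look like the sketch below. The field names, the sign convention (home minus away) and the way the trend is derived from the previous rank are assumptions made for illustration, and both teams are invented.

```python
def trend(rank_now, rank_before):
    """'up' if the team climbed in the ranking (smaller rank number), else 'down' or 'equal'."""
    if rank_now < rank_before:
        return "up"
    if rank_now > rank_before:
        return "down"
    return "equal"


def ranking_features(home, away):
    """Combine the ranking data of both teams into the parameters of one match."""
    return {
        "home_rank": home["rank"],
        "away_rank": away["rank"],
        "rank_diff": home["rank"] - away["rank"],        # negative: home team ranked better
        "points_diff": home["points"] - away["points"],  # positive: home team has more points
        "home_trend": trend(home["rank"], home["previous_rank"]),
        "away_trend": trend(away["rank"], away["previous_rank"]),
    }


# Hypothetical ranking entries at the time of the game
team_a = {"rank": 5, "previous_rank": 8, "points": 1200}
team_b = {"rank": 40, "previous_rank": 40, "points": 650}
features = ranking_features(team_a, team_b)
print(features)
```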

Now all ingredients just have to be cooked and filled in jars. But that’s some work for the computer to do. And instead of jars we’ll be using files. Nevertheless, one is now enabled to predict the outcome of soccer games. Or to make jam.


What we do and why

Dr. Ruhe, Tim , 6.6.2012

Sometimes you cannot avoid the question of what you are doing for a living. I usually stay as vague as possible, saying something about having a job at TU Dortmund University and working on my PhD thesis. Adding that I am doing this in physics causes appreciation (sometimes) or surprise (most of the time).

One day however, I received a reaction I was not ready for: “How do I have to visualise that?”

Different from what you see on TV was the best I could come up with. There's not a single blinking light in my office. Really. But if I try to nail it down to a single point this is what I do: I classify events from a neutrino detector at the South Pole. Usually, only physicists understand that without asking further questions. So maybe I have to simplify things. The actual location of the detector does not really matter. So I could say: I classify events from a neutrino detector.

That's better. But still it's not nice. Now I would have to explain what a neutrino detector does. Difficult without taking the scenic route and mentioning Wolfgang Pauli and his invention of the neutrino. Without describing the discovery of the neutrino within the famous Poltergeist experiment. Even without a short introduction to the standard model of particle physics. You see where this is going?

So simplify further? That boils it down to a simple: I classify events. A lot better.

But what is an event? In principle an event can be anything: Tomorrow's gas price, the outcome of the next election or the outcome of a soccer match.

But how can a soccer match be similar to a physical event? If you won't settle for obvious things like the trajectory of the ball, it can be summarized like this: Both can be described by certain parameters. Parameters are basically the circumstances of an event. Who played whom? How are the teams ranked? Who's the favorite for the bookies? Who has the better offense? Who the better defense?

And both events will get a label. In soccer a label is simply the result of the game. In classic German sports betting that would be: Home team wins (1), away team wins (2), draw (0).
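
In code, such a labeled event could be represented as in the following sketch. Only the label convention (1, 2, 0) is taken from the text; the class, its field names and the example values are made up for illustration.

```python
from dataclasses import dataclass, field

# Label convention of classic German sports betting
HOME_WIN, AWAY_WIN, DRAW = 1, 2, 0


def label_from_score(home_goals, away_goals):
    """Translate the final score into the label 1, 2 or 0."""
    if home_goals > away_goals:
        return HOME_WIN
    if home_goals < away_goals:
        return AWAY_WIN
    return DRAW


@dataclass
class Event:
    """One soccer match: its circumstances (parameters) and its result (label)."""
    home: str
    away: str
    parameters: dict = field(default_factory=dict)
    label: int = DRAW


# Hypothetical match that ended 1:0 for the home team
event = Event("Team A", "Team B",
              parameters={"home_rank": 8, "away_rank": 35},
              label=label_from_score(1, 0))
print(event)
```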

Once parameters and labels are known, one might get really brave and start making predictions. For league games, friendlies or the EURO. And that is exactly what we are going to do.

We will try to predict the games of the EURO 2012 and in a series of articles we will explain how we did that. Starting with collecting the data and ending with an interpretation of the predictions all steps will be explained in this blog.

Can we do that? To be honest we don't know. But that's the exciting part. In soccer as well as in research. A prediction is true? Great! A prediction is not true? Also great! Because the real challenge is not the correct prediction but the understanding of surprising results.

Follow us!

13.7. - And that makes...
29.6. - Last game coming up on Sunday
27.6. - Who will make it to the final?
24.6. - Wanted! Opponent of the German team
23.6. - Number three
22.6. - Second quarterfinal today!
21.6. - Quarterfinals!
19.6. - Last matches of the group phase
18.6. - Group C: Italy, Spain or Croatia
18.6. - A mouse took a stroll through the deep dark wood
17.6. - Group B: Who will make it to the quarterfinals?
16.6. - Last matches of the group phase starting today
15.6. - England, France or Ukraine?
14.6. - The next two matches of group C
13.6. - New! Today's odds from an improved model
12.6. - Data Mining? What the...
11.6. - France vs. England - extremely hard to predict
10.6. - The World Cup winner enters the stage
9.6. - Finally! Predictions for the hardest group of the tournament
8.6. - Scrambled Eggs
7.6. - Fail?!? The Euro `08 as a test
6.6. - Strawberries
6.6. - What we do and why