Georg Ruß' PhD Blog — R, clustering, regression, all on spatial data, hence it's:

January 8th, 2008

The “squared” in “mean squared error”

The latest experiments with the sports science data consistently produced outrageously high errors, which most of the time missed the scale of the original attribute by orders of magnitude. After conducting some experiments that led to negative conclusions like:

  • the error is too high, therefore the network cannot be trained on these data or
  • there is some tendency to overfitting when the network size and the learning rate increase, but the error is way too large anyway,

I presented the examples to the sports science people, who were also quite surprised by the error’s order of magnitude. On returning to my office, I suddenly realized what I had actually shown in the graphs: the mean squared error (MSE). After taking the square root of the MSE I ended up with what I had actually wanted to show in the plots: the absolute error, i.e. the absolute value of the difference between the network’s predicted result and the actual result. (Strictly speaking, the square root of the MSE is the root mean squared error; it equals the absolute error only when it is computed over a single record.) It’s somewhat embarrassing, but at least I know the cause of the high-error issue.
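A minimal sketch of the mix-up, written in Python rather than MATLAB and using made-up numbers (only the target range of roughly 841 to 1004 comes from this post). It shows how a squared error can land on the same scale as the target itself, while its square root is a modest deviation:

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def mean_absolute_error(predictions, targets):
    """Average of the absolute differences between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical single test record: the network is off by 30 units.
actual = [900.0]
predicted = [870.0]

mse_single = mean_squared_error(predicted, actual)    # 30 squared = 900
rmse_single = math.sqrt(mse_single)                   # back to 30
mae_single = mean_absolute_error(predicted, actual)   # also 30

# With more than one record, sqrt(MSE) and the mean absolute error differ.
actual_many = [900.0, 900.0, 900.0]
predicted_many = [870.0, 910.0, 950.0]
rmse_many = math.sqrt(mean_squared_error(predicted_many, actual_many))
mae_many = mean_absolute_error(predicted_many, actual_many)

print(mse_single, rmse_single, mae_single)
print(rmse_many, mae_many)
```

With a single held-out record the square root of the MSE and the absolute error coincide, which is the case that matters when every test set contains exactly one record.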

For the plot below I also followed some advice from the sports science people and took out some redundant data (four columns out of 24 are more or less sums of three others each). Now the error is in a more reasonable region, given that the target attribute is expected to range from a minimum of 841 to a maximum of 1004. The plot shows the error vs. the sizes of the first and second hidden layers of the feed-forward network.
[Figure: neuro12, absolute error instead of MSE]
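The redundant columns mentioned above were pointed out by the domain experts, not found automatically. Purely as an illustration, here is a hedged Python sketch of how columns that are sums of three others could be detected by brute force. The function name, the tolerance and the toy table are all assumptions, not part of the original scripts:

```python
from itertools import combinations

def find_sum_columns(rows, n_terms=3, tol=1e-6):
    """Map each column index to a tuple of n_terms other columns whose
    row-wise sum reproduces it (within tol) on every row."""
    n_cols = len(rows[0])
    found = {}
    for target in range(n_cols):
        others = [c for c in range(n_cols) if c != target]
        for combo in combinations(others, n_terms):
            if all(abs(row[target] - sum(row[c] for c in combo)) <= tol
                   for row in rows):
                found[target] = combo
                break  # one explanation per column is enough
    return found

# Made-up toy table: column 3 is the sum of columns 0, 1 and 2.
rows = [
    [1.0, 2.0, 3.0, 6.0, 5.0],
    [4.0, 0.5, 1.5, 6.0, 2.0],
    [2.0, 2.0, 7.0, 11.0, 9.0],
]

redundant = find_sum_columns(rows)

# Drop the redundant columns before training.
reduced = [[v for i, v in enumerate(row) if i not in redundant] for row in rows]
print(redundant, reduced[0])
```

For columns that are only "more or less" sums, the tolerance would have to be loosened, and with few records a match may be a coincidence, so the expert advice remains the safer guide.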

December 4th, 2007

Overfitting vs. Cross-Validation

One more experiment with the sports science data clearly shows the issue of overfitting of the neural network. I devised a script that automatically generates networks with two hidden layers of neurons and systematically varies each layer’s size from 2 to 16. No matter how the parameters for the learning process were set, the mean squared error plotted against the two layer sizes (right: first hidden layer, left: second hidden layer) drops to zero as soon as the networks get larger:
[Figure: Without cross-validation, the error is zero with larger networks]
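The systematic variation of the two layer sizes can be sketched as follows. This is a Python stand-in, not the MATLAB script itself; `placeholder_evaluate` is a hypothetical hook where building, training and scoring one network would go:

```python
from itertools import product

def layer_size_grid(min_size=2, max_size=16):
    """All (first hidden layer, second hidden layer) size pairs."""
    sizes = range(min_size, max_size + 1)
    return list(product(sizes, repeat=2))

def error_surface(evaluate, grid):
    """Evaluate every configuration and collect the errors for plotting."""
    return {(first, second): evaluate(first, second) for first, second in grid}

def placeholder_evaluate(first, second):
    # Hypothetical stand-in: a real version would train a network with
    # these layer sizes and return its error. Here it returns a dummy value.
    return 0.0

grid = layer_size_grid()
surface = error_surface(placeholder_evaluate, grid)
print(len(grid))  # 15 sizes per layer give 225 configurations
```

Each of the 225 entries corresponds to one point of the error surface shown in the plots.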

However, when applying cross-validation (test set: 1 record, validation set: 1 record, training set: 38 records), the error rises, especially towards the larger layer sizes. This is a clear sign of overfitting (same scales as in the previous figure):
[Figure: With cross-validation, one can see the effects of overfitting quite clearly]
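A hedged sketch of the 38/1/1 split in Python. The split sizes imply 40 records; how the validation record is chosen is not stated above, so picking the record after the test record is an assumption. To keep the example self-contained, a nearest-neighbour lookup on synthetic data stands in for the neural network: like an oversized network, it memorises the training data perfectly.

```python
def leave_one_out_splits(n_records):
    """Yield (train indices, validation index, test index) per test record."""
    for test in range(n_records):
        val = (test + 1) % n_records  # assumption: next record, cyclically
        train = [i for i in range(n_records) if i not in (test, val)]
        yield train, val, test

def nearest_neighbour_predict(train_x, train_y, x):
    """Stand-in model: return the target of the closest training input."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Synthetic data, for illustration only.
n = 40
xs = [float(i) for i in range(n)]
ys = [900.0 + (7 * i) % 13 for i in range(n)]

# Error on the data the model has seen: every record finds itself.
training_error = sum(
    abs(nearest_neighbour_predict(xs, ys, x) - y) for x, y in zip(xs, ys)
) / n

# Error on records the model has not seen.
held_out_errors = []
for train_idx, val_idx, test_idx in leave_one_out_splits(n):
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    prediction = nearest_neighbour_predict(train_x, train_y, xs[test_idx])
    held_out_errors.append(abs(prediction - ys[test_idx]))
cross_validated_error = sum(held_out_errors) / n

print(training_error, cross_validated_error)
```

The training error is zero while the cross-validated error is not, which is the same pattern as in the two figures. The validation index is produced but unused here; in network training it would typically drive early stopping.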

As usual, here is the MATLAB script for this entry, which doesn’t differ much from the previous ones: neuro10-new_plot.m.

November 2nd, 2007

Normalization in the context of sports science data

At the moment I’m somewhat split between teaching “Intelligent Systems” courses and thinking about the agriculture data as well as the sports science data. In yesterday’s meeting, we discussed the three data blocks that are available from the sports scientists. I’ll receive the purged data soon. One thing that was mentioned was the issue of normalization.
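Which normalization was discussed is not stated in this excerpt. As one common possibility, here is a hedged Python sketch of min-max scaling to the interval [0, 1] and back. The function names are made up, and the example values merely reuse the target range of 841 to 1004 mentioned in a later post:

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map values so that min(values) -> lo and max(values) -> hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]  # constant column: nothing to scale
    return [lo + (v - v_min) / (v_max - v_min) * (hi - lo) for v in values]

def min_max_unscale(scaled, v_min, v_max, lo=0.0, hi=1.0):
    """Invert min_max_scale, e.g. to read predictions in original units."""
    return [v_min + (s - lo) / (hi - lo) * (v_max - v_min) for s in scaled]

targets = [841.0, 900.0, 1004.0]
scaled = min_max_scale(targets)
restored = min_max_unscale(scaled, min(targets), max(targets))
print(scaled, restored)
```

Errors computed on scaled values are in scaled units, so they need to be mapped back before being compared with the original attribute.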

October 30th, 2007

Highly recommended: Matlab’s nnet manual

Diving deeper into MATLAB’s endless built-in functions, I discovered (i.e. read) MathWorks’ nnet manual. I usually abhor user manuals for specific programming languages, but MathWorks has made it an enjoyable read.

October 2nd, 2007

Some more results for the sports science data

I finally ended up simplifying the whole task and starting from the very beginning. I had two data sets from two athletes with the same training attributes (data columns). The earlier MATLAB script did some sort of pre-training with one data set and some sort of main training and cross-validation with the second data set. Remember, I am still trying to reproduce the results from the paper (which were generated with Data Engine) using MATLAB.

October 1st, 2007

Some results for the sports science data

The prediction capabilities of the neural network that was coded in the last post do not seem to be as good as expected, at least not in the standard configuration. When I fed the data set (which I will not publish here) through the network and the cross-validation, the results were as follows:
Read the rest of this entry »

September 28th, 2007

MatLab script v1 for the sports science data

A well-commented script that tries to model the sports scientists’ data mining process is online.
Below is a quick screenshot for reading; the script can be downloaded here.

There are several steps (two of them main steps) for training the network:

  • since there is not much data available for training, additional data is taken from another athlete,
  • the network is initialized once and stored in a variable,
  • the network is pre-trained: it is assumed that it can then better adapt to the actual training data,
  • the main training is performed starting from the pre-trained network,
  • this is repeated (number of data records) times and cross-validation is carried out.
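The steps above can be sketched roughly as follows. This is not the MATLAB script: a tiny linear model trained by gradient descent stands in for the neural network, leave-one-out is assumed as the cross-validation scheme, and all names and data are made up for illustration.

```python
import copy
import random

def init_model(n_inputs, seed=0):
    """Initialise the stand-in model once; callers keep it in a variable."""
    rng = random.Random(seed)
    return {"w": [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)], "b": 0.0}

def predict(model, x):
    return model["b"] + sum(w * xi for w, xi in zip(model["w"], x))

def train(model, xs, ys, epochs=200, lr=0.05):
    """Return a trained copy; the model passed in is left untouched."""
    model = copy.deepcopy(model)
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = predict(model, x) - y
            model["w"] = [w - lr * err * xi for w, xi in zip(model["w"], x)]
            model["b"] -= lr * err
    return model

def pretrain_then_cross_validate(pre_x, pre_y, main_x, main_y):
    initial = init_model(len(main_x[0]))       # initialised once and stored
    pretrained = train(initial, pre_x, pre_y)  # pre-training, other athlete
    errors = []
    for held_out in range(len(main_x)):        # once per data record
        train_x = main_x[:held_out] + main_x[held_out + 1:]
        train_y = main_y[:held_out] + main_y[held_out + 1:]
        model = train(pretrained, train_x, train_y)  # main training
        errors.append(abs(predict(model, main_x[held_out]) - main_y[held_out]))
    return errors

# Made-up, already normalised data for two hypothetical athletes.
pre_x = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.9, 0.2]]
pre_y = [0.5 * a + 0.2 * b for a, b in pre_x]
main_x = [[0.2, 0.8], [0.3, 0.5], [0.6, 0.4], [0.8, 0.1], [0.5, 0.5]]
main_y = [0.5 * a + 0.2 * b + 0.05 for a, b in main_x]

errors = pretrain_then_cross_validate(pre_x, pre_y, main_x, main_y)
print(errors)
```

Because `train` works on a copy, every cross-validation round starts from the same pre-trained state, which mirrors storing the initialised network in a variable and reusing it.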


September 20th, 2007

Details on the sports science data mining process

The current areas of application of the sports science data mining are

  • Olympic swimming
  • archery
  • disabled swimming

When it comes to the research targets, we are trying to

  • model the effects of different training strategies on the outcome of an upcoming tournament,
  • predict the tournament time (or any standardized measure of success) at the Olympic Games.


September 18th, 2007

Prediction using sports science data

This project ties in with earlier work done by Jürgen Edelmann-Nusser and Nico Ganter: predicting athletes’ tournament swimming times using only their training data. It works as follows:

  • During the athletes’ training sessions, their amount of training in different disciplines (running, strength, stamina) is recorded.
  • The athletes complete a tournament and their results are recorded as well.
  • These data, consisting of training times and fields and the respective tournament result, can be used to train one or more neural networks.
  • Once the neural networks are trained, one can predict or try to predict the outcome of the upcoming tournament.
  • Furthermore, one could adapt the athletes’ training strategy by varying the training parameters and applying the strategy with the best predicted tournament result.
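The last step, choosing the strategy with the best predicted result, can be sketched in a few lines of Python. Everything here is hypothetical: `toy_predict_time` is a made-up formula standing in for a trained network, and the candidate strategies are invented numbers.

```python
def best_strategy(predict_time, candidate_strategies):
    """Return the candidate with the lowest predicted tournament time."""
    return min(candidate_strategies, key=predict_time)

def toy_predict_time(strategy):
    # Made-up stand-in for a trained network. A strategy is a tuple of
    # training amounts: (running, strength, stamina).
    running, strength, stamina = strategy
    return 1000.0 - 2.0 * running - 3.0 * strength - 1.0 * stamina

candidates = [
    (10.0, 5.0, 5.0),
    (5.0, 10.0, 5.0),
    (5.0, 5.0, 10.0),
]

chosen = best_strategy(toy_predict_time, candidates)
print(chosen, toy_predict_time(chosen))
```

For swimming times lower is better, hence `min`; for a success measure where higher is better, `max` would take its place. Any such recommendation is only as trustworthy as the predictor behind it.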

Presumably, this work will be done using MATLAB and its Neural Network Toolbox (nnet). Since I’m on the application side of the work, I will probably be scripting the neural network stuff in MATLAB and publishing the scripts here.