Georg Ruß' PhD Blog — R, clustering, regression, all on spatial data, hence it's:

December 4th, 2009

Slides for my talk at the ATO

I’ll be giving a talk at the Australian Taxation Office, likely to take place on the 11th of December at 11:00 local time (Canberra, ACT). The slides can be obtained here: slides-russ2009ato.pdf. The abstract is as follows:

Data Mining in Agriculture

In recent years, due to new and affordable technological advances, data
collection has turned into an everyday task. Nowadays, especially with the
advent of the global positioning system and modern farming vehicles, sensors
and equipment, even agriculture has turned into a data-driven discipline
called precision agriculture. However, as in numerous other research and
production areas, collecting data is not sufficient for economic or ecological
well-being. The collected data have to be mined and turned into usable
knowledge.

Therefore, this talk presents some approaches towards data mining in
agriculture. The talk will begin with a short overview of the origins of
the actual agriculture data. The difference between spatial and non-spatial
approaches will be emphasised using an example of yield prediction. Some of
the non-spatial techniques such as clustering, regression and feature
selection may be carried over to spatial approaches. Most of the presented
work considers very recent issues which remain unsolved in this discipline so
far (at least to the speaker’s knowledge). Furthermore, the presented work is
an excerpt of what is going to be the speaker’s PhD thesis, which is likely to
cover Data Mining in Agriculture from a computer scientist’s perspective.

November 19th, 2009

Research Update

At the end of the summer term and into the first few weeks of our university’s winter term, I have been able to continue my research, unlike in recent years, when I shifted back to teaching activities. I’ve been able to fill a complete chapter of what is likely to become my dissertation.

September 18th, 2009

Workshop Invitation: Data Mining in Agriculture

I have been invited by Petra Perner, the head of the ibai institute which organises the ICDM and MLDM conferences, to hold a workshop on “data mining in agriculture” at next year’s ICDM conference, which will take place in July in Berlin.

The website is currently under construction: http://dma2010.de. The important details are already there and the PDF call for papers will be published soon.

July 27th, 2009

Report: MLDM 2009

Last week I also participated in the MLDM 2009, which is a biennial conference for Machine Learning and Data Mining, organised by the same team as the ICDM series. My paper was accepted as a poster presentation and I also chaired a session on association rules, which happens to be strongly related to my diploma thesis. The conference was a bit larger than the ICDM, with around 60 scheduled talks, of which 48 took place due to dropouts. It was a bit more theoretical than the ICDM, but still really worth it since usually the data mining problems were closely motivated by real-world problems.

July 27th, 2009

Report: ICDM 2009

As I mentioned some time ago, I got a paper accepted at the ICDM 2009 conference, held in Leipzig, Germany. I really liked this small type of conference last year and it was even better this year. The organisers had scheduled 32 presentations over three days, with no parallel sessions and 25 minutes of talking time for every presenting author. At least from my point of view, this conference was very useful since it wasn’t so much about the theory of data mining or machine learning, but focused instead on the practical point of view. There were lots of industry people who simply had their data problems and applied data mining to them. Theory is important, but practical applications are what make the world go round. The invited talks by Claus Weihs and Andrea Ahlemeyer-Stubbe were really good examples of theory and practice. Claus Weihs could even remember that he had seen my data mining problem before, at the IFCS 2009 in Dresden, where there were a lot more presentations than at the ICDM.

July 17th, 2009

ICDM2009 / MLDM2009

This week saw me busy preparing for next week’s two conferences, ICDM2009 and MLDM2009, which take place consecutively at the same location in Leipzig.

July 7th, 2009

Paper for SGAI AI-2009 accepted

The paper which I mentioned in the previous post has been accepted for publication at the SGAI AI-2009 conference. The reviewers were rather confident about the paper contents and it seems that my work is quite interesting for computer scientists.

Nevertheless, I’ve started digging somewhat deeper into the issue of spatial autocorrelation, which is likely to exist in the georeferenced data sets I’m using. So far, it has usually been neglected, which might lead to biased results when regression is carried out. My main idea for my PhD contribution is to develop or find a regression model which does take the spatial autocorrelation into account.
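One common way to check whether a georeferenced variable is spatially autocorrelated is Moran's I. The following is only a minimal Python sketch, not the R code used for this work: the binary distance-band weights, the `radius` parameter and the 4x4 toy grid are assumptions made for illustration. Values near +1 mean that neighbouring points carry similar values, values near -1 that they alternate.

```python
import math


def morans_i(coords, values, radius):
    """Moran's I with binary weights: w_ij = 1 if i != j and dist(i, j) <= radius."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]

    weight_sum = 0.0   # sum of all w_ij
    cross_sum = 0.0    # sum of w_ij * (x_i - mean) * (x_j - mean)
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                weight_sum += 1.0
                cross_sum += deviations[i] * deviations[j]

    squared_sum = sum(d * d for d in deviations)
    return (n / weight_sum) * (cross_sum / squared_sum)


# Hypothetical toy data: a 4x4 grid of sampling points (not real field data).
grid = [(x, y) for x in range(4) for y in range(4)]
smooth = [x + y for x, y in grid]               # values change gradually across the grid
checkerboard = [(x + y) % 2 for x, y in grid]   # direct neighbours always differ

i_smooth = morans_i(grid, smooth, radius=1.0)
i_checker = morans_i(grid, checkerboard, radius=1.0)
print(i_smooth, i_checker)
```

The smooth surface yields a clearly positive value and the checkerboard a negative one. A regression model that treats the points of the smooth surface as independent observations would ignore exactly this structure.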

To give you an idea of the data sets and fields I’m working with, here’s a georeferenced plot of the N2 fertilizer on one of the fields during 2007:

Figure: N2 dressing on one of the fields in 2007

R is really great for working with (georeferenced) shapefiles.

May 28th, 2009

Publication submitted for SGAI AI-2009

Since the series of AI conferences by the BCS Specialist Group on AI has been useful the last two times, I decided to submit yet another paper there. Again, I’m currently working with the agriculture data for yield prediction.

The question this time is: Which of the features in the data sets I have are actually useful for yield prediction? In recent publications I have done some research into different regression models which enable yield prediction. Taking this a step further, I’m now looking at feature selection. There’s a lot of research in that area (but mostly on classification, not regression) and I’m not quite finished getting to grips with it. Nevertheless, the central story line is emerging. In the SGAI AI-2009 paper I have developed or adapted a feature selection approach which uses forward selection (i.e. starting with an empty set of features and subsequently adding the most promising ones). The subsets are then evaluated using support vector regression and regression trees. In the end, a ranking of the features is presented which points out important (relevant) features and also shows the ones that are redundant or irrelevant.
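The forward selection loop described above can be sketched roughly as follows. This is an illustration in Python and not the script from the paper: a leave-one-out k-nearest-neighbour regressor stands in for support vector regression and regression trees as the subset evaluator, and the function names, `k=3` and the toy data are all hypothetical.

```python
import math


def knn_loo_rmse(rows, targets, features, k=3):
    """Leave-one-out RMSE of a k-nearest-neighbour regressor on a feature subset."""
    n = len(rows)
    squared_errors = []
    for i in range(n):
        point = [rows[i][f] for f in features]
        # Distances from sample i to all other samples, using only the chosen features.
        neighbours = sorted(
            (math.dist(point, [rows[j][f] for f in features]), j)
            for j in range(n) if j != i
        )[:k]
        prediction = sum(targets[j] for _, j in neighbours) / len(neighbours)
        squared_errors.append((targets[i] - prediction) ** 2)
    return math.sqrt(sum(squared_errors) / n)


def forward_selection(rows, targets, n_features, evaluate=knn_loo_rmse):
    """Greedy forward selection; returns (feature, error) in the order of addition."""
    selected, ranking = [], []
    remaining = list(range(n_features))
    while remaining:
        # Try each remaining feature together with the ones already selected.
        scored = [(evaluate(rows, targets, selected + [f]), f) for f in remaining]
        best_error, best_feature = min(scored)
        selected.append(best_feature)
        remaining.remove(best_feature)
        ranking.append((best_feature, best_error))
    return ranking


# Hypothetical toy data: feature 0 drives the target, feature 1 is unrelated,
# feature 2 is a rescaled (redundant) copy of feature 0.
rows = [[float(i), float((i * 7) % 5), 0.5 * i] for i in range(12)]
targets = [2.0 * i + 1.0 for i in range(12)]

ranking = forward_selection(rows, targets, n_features=3)
for feature, error in ranking:
    print(f"added feature {feature}, error of the subset so far: {error:.3f}")
```

The order in which features are added gives the ranking. A feature whose addition does not lower the error of the subset is a candidate for being redundant or irrelevant.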

As mentioned before, I’m now using R for computing, so the scripting language will change as well. I won’t be converting the existing scripts from matlab to R, though. The script for the above SGAI AI-2009 paper (titled: Feature Selection for Wheat Yield Prediction) can be found here: 2009sgai-featuresel.R (or use the script overview)

It also seems as if the central theme of my dissertation will be along the above lines.

April 16th, 2009

ICDM/MLDM 2009, both accepted

Both papers that I submitted to the ICDM and MLDM conferences have been accepted: the first one with two overall positive reviews, and the second one with just one review, which admits it to the poster session.

Apart from this, there’s not much to report, except that I’m currently converting (the scripts) to R for doing the computing stuff. Seems even more high-level and more abstract than matlab. And: it’s GNU and I can use vim for script editing and running.

March 23rd, 2009

Matlab script for IFCS/GfKl/ADAC 2009 article

I’ll submit an extended version of what was planned for the IFCS2009 conference to the ADAC journal (also published by Springer). It addresses some of the issues raised during the IFCS conference.

New baselines for the regression model comparison are computed:

  • a simple linear regression
  • a naive prediction: giving previous year’s yield as prediction

The result is that support vector regression outperforms MLP, RBF, RegTree and the two baseline predictors above.
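To make the two baselines concrete, here is a minimal Python sketch (the actual scripts are in matlab and are not reproduced here). It assumes that the linear regression uses the previous year's yield as its only predictor, which the post does not state, and all yield numbers are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def fit_simple_linear(x, y):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


# Hypothetical yields in t/ha for a handful of sites (not the real data).
train_prev = [6.1, 7.4, 8.0, 5.2, 9.1, 6.8]
train_curr = [6.5, 7.9, 8.2, 5.9, 9.4, 7.0]
test_prev = [7.0, 5.6, 8.7]
test_curr = [7.3, 6.2, 9.0]

# Baseline 1: simple linear regression, fitted on the training sites.
intercept, slope = fit_simple_linear(train_prev, train_curr)
linear_pred = [intercept + slope * p for p in test_prev]

# Baseline 2: naive prediction, the previous year's yield is used unchanged.
naive_pred = list(test_prev)

print("linear RMSE:", rmse(test_curr, linear_pred))
print("naive RMSE: ", rmse(test_curr, naive_pred))
```

Any more elaborate model would then be compared against these two error values on the same test sites.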

The matlab scripts (an outer one for the data set selection and an inner one for the actual model comparison) are on-line: