The following is a paper summary for the ICDM 2010 conference, which will be held in Berlin in July. It mainly deals with the issue of spatial autocorrelation in the agriculture data I’m using. It refers to my previous publications (2008, 2009) at this conference, where I presented standard regression approaches using different techniques for the task of yield prediction. It seems that these techniques considerably underestimate the prediction error because of spatial autocorrelation. I therefore developed an approach based on k-means clustering to enable yield prediction on spatial data sets. The conference reports from the previous years are here: 2008, 2009.

The abstract for the paper titled (for now): Regression Models for Spatial Data: An Example from Agriculture

The term precision agriculture refers to the application of
state-of-the-art GPS technology in connection with small-scale, sensor-based
treatment of the crop. This data-driven approach to agriculture poses a
number of data mining problems. One of those is also an obviously important
task in agriculture: yield prediction. Given a precise, geographically
annotated data set for a certain field, can a season’s yield be predicted?

Numerous approaches to solving this problem have been proposed. In the past,
classical regression models for non-spatial data have been used, such as
regression trees, neural networks and support vector machines. However, in a
cross-validation learning approach, issues with the assumption of statistical
independence of the data records appear. Therefore, the geographical location
of data records should clearly be considered while employing a regression
model. This paper gives a short overview of the available data, points out
the issues with the classical learning approaches and presents a novel
spatial cross-validation technique to overcome the problems and solve the
aforementioned yield prediction task.
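
As a rough illustration of the idea, and not the exact procedure from the paper, the sketch below shows one way spatially disjoint cross-validation folds might be built by running k-means on the record coordinates and holding out one cluster at a time. Everything in it is an assumption made for the example: the function names (kmeans, spatial_cv_rmse, fit_mean), the synthetic 10 x 10 grid with a made-up yield trend, and the trivial mean predictor that stands in for a real regression model.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on (x, y) tuples; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct records as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def spatial_cv_rmse(points, yields, fit, k=5, seed=0):
    """Hold out one spatial cluster per fold and return the per-fold RMSE."""
    labels = kmeans(points, k, seed=seed)
    errors = []
    for fold in sorted(set(labels)):
        train = [i for i, lab in enumerate(labels) if lab != fold]
        test = [i for i, lab in enumerate(labels) if lab == fold]
        predict = fit([points[i] for i in train], [yields[i] for i in train])
        squared = [(predict(points[i]) - yields[i]) ** 2 for i in test]
        errors.append(math.sqrt(sum(squared) / len(squared)))
    return labels, errors


def fit_mean(train_points, train_yields):
    """Placeholder model (assumption): always predicts the mean training yield."""
    mean = sum(train_yields) / len(train_yields)
    return lambda point: mean


# Synthetic field (assumption): a 10 x 10 grid with a smooth spatial yield trend.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
yields = [5.0 + 0.3 * x + 0.1 * y for x, y in points]

labels, fold_rmse = spatial_cv_rmse(points, yields, fit_mean, k=5)
print(fold_rmse)
```

Because every test fold is a contiguous region of the field, none of its records has a direct neighbour in the training set, which is the property a random split lacks. Any regression model could be passed in place of fit_mean, as long as it follows the same "fit returns a predict function" convention used here.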