Georg Ruß' PhD Blog — R, clustering, regression, all on spatial data, hence it's:

August 7th, 2012

HACC-spatial scripts

I’ve been asked quite often recently to publish the scripts for HACC-spatial, the algorithm I developed in my PhD thesis, so this post lists them and gives a few explanations.

For now, here’s the zip file with R scripts:

From the thesis abstract:

The second task is concerned with management zone delineation. Based on a literature
review of existing approaches, a lack of exploratory algorithms for this task is concluded, in
both the precision agriculture and the computer science domains. Hence, a novel algorithm
(HACC-spatial) is developed, fulfilling the requirements posed in the literature. It is based
on hierarchical agglomerative clustering incorporating a spatial constraint. The spatial
contiguity of the management zones is the key parameter in this approach. Furthermore,
hierarchical clustering offers a simple and appealing way to explore the data sets under
study, which is one of the main goals of data mining.

The thesis itself can be found here: PhD thesis (32MB pdf). The algorithm is described on pdf page 124 (print page 114): hacc-spatial-algorithm.pdf.

Further explanations and shorter descriptions can be found in two publications, both available in full text: Exploratory Hierarchical Clustering for Management Zone Delineation in Precision Agriculture and Machine Learning Methods for Spatial Clustering on Precision Agriculture Data.

Let me know if there are questions, comments or even successful results when applying the algorithm to your data sets.

There are also two YouTube videos of the clustering (with an additional pre-clustering step, the “initial phase”): F440-REIP32-movie.avi and F611-REIP32-movie.avi. The end of both videos is probably where it gets interesting. For comparison, see the plots for the REIP32 variable of the F440 and F611 data sets in the thesis (F440: pdf page 185, with the clustering on page 138; F611: pdf page 195).

Important points

  • The algorithm was designed to work with spatial data sets: each data record/point in the data set represents a vector of values which also has a location in space (2D/3D).
  • The data points should be roughly uniformly distributed in space (probably with high density, although that doesn’t really matter). That is, the algorithm does not and cannot rely on density differences in the geospatial data distribution.
  • The input structure for the R scripts is a SpatialPointsDataFrame with variables. The algorithm (the function) lets you select the variable(s) to cluster on, so you may use several variables at once.
  • The algorithm is definitely not optimized for speed. It served my purposes well, but may take a while to run on your data.
  • The contiguity factor cf is subject to experimentation.
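
To make the points above more concrete, here is a minimal sketch of hierarchical agglomerative clustering under a spatial constraint, in base R. It is not the HACC-spatial code from the zip file: the function name constrained_hac, the radius parameter and the hard "only merge neighbouring clusters" rule are assumptions made for illustration, and the contiguity factor cf is not implemented at all. Like the original scripts, it is not written for speed.

```r
# Illustrative sketch only: constrained_hac() and `radius` are hypothetical,
# and the hard adjacency rule stands in for the tunable contiguity factor cf.
constrained_hac <- function(coords, values, k, radius) {
  coords <- as.matrix(coords)
  values <- as.matrix(values)
  labels <- seq_len(nrow(coords))            # every point starts as its own cluster
  near <- as.matrix(dist(coords)) <= radius  # which points count as spatial neighbours
  repeat {
    ids <- unique(labels)
    if (length(ids) <= k) break
    # cluster centroids in attribute space (one row per cluster)
    cent <- do.call(rbind, lapply(ids, function(i)
      colMeans(values[labels == i, , drop = FALSE])))
    d <- as.matrix(dist(cent))
    best <- NULL
    best_d <- Inf
    for (a in seq_along(ids)[-length(ids)]) {
      for (b in (a + 1):length(ids)) {
        # two clusters touch if any pair of their points are neighbours
        touching <- any(near[labels == ids[a], labels == ids[b]])
        if (touching && d[a, b] < best_d) {
          best_d <- d[a, b]
          best <- c(ids[a], ids[b])
        }
      }
    }
    if (is.null(best)) break                 # no spatially adjacent clusters left
    labels[labels == best[2]] <- best[1]     # merge the closest adjacent pair
  }
  match(labels, unique(labels))              # relabel clusters as 1, 2, ...
}

# Toy data: a 6 x 6 grid whose left and right halves differ strongly in value.
grid <- expand.grid(x = 1:6, y = 1:6)
vals <- ifelse(grid$x <= 3, 0, 10) + 0.01 * grid$y
zones <- constrained_hac(grid, vals, k = 2, radius = 1)
print(table(zones, left_half = grid$x <= 3))
```

On this toy grid the two resulting zones should coincide with the left and right halves. With a SpatialPointsDataFrame from the sp package, the two inputs would be coordinates(obj) and the selected columns of obj@data.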

Apart from that, there’s not much to comment on (yet). Let me know about questions or issues and I may be able to fix them or list further requirements here.


November 29th, 2011

Environmental Data Mining

It just occurred to me that I should probably further develop my research profile and find an appropriate umbrella term that best covers my research interests. A quick suggestion, including a definition, would be Environmental Data Mining: the task of finding interesting, novel and potentially useful knowledge (=data mining) in georeferenced (spatial) and temporal multi-layered data sets (=environmental data). I haven’t done any research on this umbrella term yet (search engines provided but a few hits), but if I stay in research, this is probably where I’d try to head. Computer science is (to me) an ancillary science that needs specific applications and builds/provides solutions to specific tasks based on actual data sets collected in practice. And R is the best tool for this :-)

(this merits a new category at the top level)

February 28th, 2011

Slides for talk at MLU Halle

Tomorrow’s going to be my second talk in German at MLU Halle. Here are the slides: russ2011mlu-slides.

And there’s also a video of the clustering here: In short, the video compares two spatial clusterings of the precision agriculture data I have, both using four variables (P, pH, Mg, K), one with low spatial contiguity (left) and one with high spatial contiguity (right). The clustering is hierarchical agglomerative, starting from an initial tessellation of the field into 250 clusters which are subsequently merged. It has been implemented in R (generating a .png file for each plot), with subsequent video encoding by ImageMagick (convert) and MPlayer (mencoder). Nice demo, I guess.
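
The frame-generation step can be sketched roughly as follows. This is an assumption-laden stand-in, not the script behind the video: it uses random points and plain hclust() instead of the spatial clustering, and the output directory and file naming scheme are made up. It only shows the idea of writing one numbered .png per merge level so that an external encoder can pick the frames up in order.

```r
# Stand-in data and clustering; the real video used the field data
# and the spatially constrained algorithm.
set.seed(1)
pts <- data.frame(x = runif(200), y = runif(200))
tree <- hclust(dist(pts))

# Hypothetical output location and naming scheme.
out_dir <- file.path(tempdir(), "frames")
dir.create(out_dir, showWarnings = FALSE)

ks <- 6:2  # number of clusters shown in successive frames
for (i in seq_along(ks)) {
  cl <- cutree(tree, k = ks[i])
  # zero-padded frame numbers keep the files in playback order
  png(file.path(out_dir, sprintf("frame-%03d.png", i)), width = 480, height = 480)
  plot(pts$x, pts$y, col = cl, pch = 19, xlab = "x", ylab = "y",
       main = sprintf("%d clusters", ks[i]))
  dev.off()
}
frames <- list.files(out_dir, pattern = "\\.png$")
```

The numbered files in out_dir are then what gets handed to the external encoding tools.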

January 17th, 2011

Two really useful R books

Looking back on the work I’ve done so far (finding a thesis topic, finding data, finding tools), I can definitely recommend the two books below. They’re R-related and contain a lot of examples which still help me implement the ideas I have. The first is Modern Applied Statistics with S (Venables/Ripley) and the other is Applied Spatial Data Analysis with R (Bivand/Pebesma/Gómez-Rubio) from the “Use R!” series. Both are perfect for looking up things you might need in your current implementation. Besides, there are still the R mailing lists for asking questions, and the authors of the above books are typically present on those lists.

If you prefer a bookstore, look out for these on the shelves:

December 9th, 2010

[R] and nnet.default

I’ve recently been active on the R-help mailing list because I had some issues
with the default implementation of neural networks (nnet). It seems the
mailing list solved my problem, or at least pointed me towards a solution. The
nnet function seems somewhat strict about its arguments.
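
For reference, here is a minimal sketch of the two ways of calling nnet on a small regression problem. The post does not say which arguments caused trouble, so this does not reproduce the issue. The toy data and the settings (size = 3, linout = TRUE) are assumptions chosen only to show the x/y interface next to the formula interface.

```r
library(nnet)

set.seed(42)
# Toy regression data: predictors as a numeric matrix, target as a one-column matrix.
x <- matrix(runif(200), ncol = 2, dimnames = list(NULL, c("a", "b")))
y <- matrix(x[, "a"] + x[, "b"], ncol = 1)

# x/y interface (dispatches to nnet.default)
fit_xy <- nnet(x, y, size = 3, linout = TRUE, trace = FALSE)

# formula interface on the same data
d <- data.frame(x, target = y[, 1])
fit_f <- nnet(target ~ a + b, data = d, size = 3, linout = TRUE, trace = FALSE)

# predictions come back as a one-column matrix, one row per input row
pred <- predict(fit_xy, x)
```

linout = TRUE switches the output unit to linear, which is the usual choice for a numeric target.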