Georg Ruß' PhD Blog — R, clustering, regression, all on spatial data, hence it's:

August 7th, 2012

HACC-spatial scripts

Since I’ve recently been asked quite often to publish the scripts for the HACC-spatial algorithm I developed in my PhD thesis, this post lists them and gives a few explanations.

For now, here’s the zip file with R scripts: hacc-spatial.zip.

From the thesis abstract:

The second task is concerned with management zone delineation. Based on a literature
review of existing approaches, a lack of exploratory algorithms for this task is concluded, in
both the precision agriculture and the computer science domains. Hence, a novel algorithm
(HACC-spatial) is developed, fulfilling the requirements posed in the literature. It is based
on hierarchical agglomerative clustering incorporating a spatial constraint. The spatial
contiguity of the management zones is the key parameter in this approach. Furthermore,
hierarchical clustering offers a simple and appealing way to explore the data sets under
study, which is one of the main goals of data mining.

The thesis itself can be found here: PhD thesis (32MB pdf). The algorithm is described on pdf page 124 (print page 114): hacc-spatial-algorithm.pdf.

Further explanations and shorter descriptions can be found in two publications, both available in full text: Exploratory Hierarchical Clustering for Management Zone Delineation in Precision Agriculture and Machine Learning Methods for Spatial Clustering on Precision Agriculture Data.

Let me know if there are questions, comments or even successful results when applying the algorithm to your data sets.

There are also two YouTube videos of the clustering (with an additional pre-clustering step, the “initial phase”): F440-REIP32-movie.avi and F611-REIP32-movie.avi. The interesting part is probably the end of both videos. Compare the plots for the REIP32 variable of the F440 and F611 data sets (F440: PhD pdf page 185 (clustering on page 138) and F611: PhD pdf page 195).

Important points

  • The algorithm was designed to work with spatial data sets: each data record/point in the data set represents a vector of values which also has a location in space (2D/3D).
  • The data points should be roughly uniformly distributed in space (probably with high density, although that doesn’t really matter). That is, the algorithm does not and cannot rely on density differences in the geospatial data distribution.
  • The input structure for the R scripts is a SpatialPointsDataFrame with variables. The algorithm (the function) lets you select one or more of these variables for clustering.
  • The algorithm is definitely not optimized for speed. It served my purposes well, but may take a while to run on your data.
  • The contiguity factor cf is subject to experimentation.
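
To illustrate the general idea behind the points above, here is a minimal sketch of hierarchical agglomerative clustering with a hard spatial constraint. It is written in Python for brevity and is not the code from hacc-spatial.zip. The function name `constrained_agglomerative`, the `radius` adjacency rule and the centroid distance are assumptions made for this example, and the contiguity factor cf is not modelled at all.

```python
import math
from itertools import combinations


def constrained_agglomerative(coords, values, n_clusters, radius):
    """Bottom-up clustering in which only spatially adjacent clusters may merge.

    coords: list of (x, y) locations; values: list of attribute vectors.
    `radius` is a hypothetical adjacency rule, not a parameter of HACC-spatial.
    """
    # Each cluster is a set of point indices; start with one point per cluster.
    clusters = {i: {i} for i in range(len(coords))}

    def centroid(members):
        dim = len(values[0])
        return [sum(values[m][d] for m in members) / len(members) for d in range(dim)]

    def adjacent(a, b):
        # Two clusters touch if any pair of their points lies within `radius`.
        return any(math.dist(coords[i], coords[j]) <= radius
                   for i in clusters[a] for j in clusters[b])

    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            if not adjacent(a, b):
                continue  # spatial constraint: skip non-neighbouring clusters
            d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        if best is None:
            break  # no adjacent pair left to merge
        _, a, b = best
        clusters[a] |= clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())


# Six points on a line; the two ends have similar values but are far apart.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
values = [[1.0], [1.1], [5.0], [5.2], [1.05], [0.9]]
zones = constrained_agglomerative(coords, values, n_clusters=3, radius=1.0)
print(zones)
```

Without the adjacency check, the low-valued points at both ends of the line would end up in one cluster. With it, they stay in separate, spatially contiguous zones.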

Apart from that, there’s not much to comment on (yet). Let me know about questions or issues and I may be able to fix them or list further requirements here.

mail: researchblog@georgruss.ch

March 13th, 2012

Photos of the defense

Photos of the defense are here: Album Dissertationsverteidigung (thesis defense album)

The thesis has now been published by the library and can also be found online here: http://blog.georgruss.de/?page_id=358

November 29th, 2011

Environmental Data Mining

It just occurred to me that I should probably develop my research profile further and find an appropriate umbrella term that best covers my research interests. A quick suggestion including a definition would be Environmental Data Mining, to describe the task of finding interesting, novel and potentially useful knowledge (=data mining) in georeferenced (spatial) and temporal multi-layered data sets (=environmental data). I haven’t done any research on this umbrella term yet (search engines returned only a few hits), but if I stay in research, this is probably where I’d try to head. Computer science is (to me) an ancillary science that needs specific applications and builds/provides solutions to specific tasks based on actual data sets collected in practice. And R is the best tool for this :-)

(this merits a new category at the top level)

November 21st, 2011

Thesis submitted

"Spatial Data Mining in Precision Agriculture"

 

With the official date of 23.11.2011, I submitted my thesis today. Now it’s the faculty council’s turn, then the reviewers’, and if everything goes smoothly, it’s my turn at the defense. Subject to approval by the faculty council, the defense will take place on 23.02.2011, at 3 pm, in 29-301. The thesis title matches the heading of this blog.

By the way, I paid 42 EUR for the binding. That can’t be a coincidence!

August 26th, 2011

ICDM and DMA workshop in NYC next week

Just before I head off into the weekend, the latest update on where I’ll be next week:

Industrial Conference on Data Mining, taking place from Tuesday, August 30th, until Saturday, September 3rd, in New York City (actually, it’s near Newark Airport [EWR] in New Jersey, but it’s close enough). I’ll be presenting a continuation of my work on HACC-spatial (the hierarchical agglomerative spatially constrained clustering), which I showed at my workshop and at the ICDM 2010 conference last year.

Therefore, my talks’ content will be along similar lines, with similar, but updated slides:

The second talk for my workshop will also contain a few slides about the joint paper with Antonio Mucherino, who won’t be able to come personally, but who contributed a nice survey for my 2nd Workshop on Data Mining in Agriculture.

June 15th, 2011

First thesis draft submitted

Last week I handed in the first 228-page draft of what’s probably going to be
my thesis. Let’s see what the reviewers say. I hope there are not too many
fundamental issues with that draft.

April 27th, 2011

Thesis status

My thesis is proceeding as expected and planned. The second main chapter is finished and off to the first reviewer, while the first main chapter is currently being written. The experiments are currently running on the lab machines (which run R much faster than half a year ago, thanks to new hardware) and the plots will be generated soon. Time for applications. Deadlines seem to work :-)

My two latest papers have been accepted at SCAI and ICDM. There’s also another upcoming journal article (likely for GeoInformatica), as well as the upcoming book of our working group on Computational Intelligence.

Those were the days …

March 31st, 2011

Yet another talk at MLU

On Tuesday I gave another talk at the MLU to a differently mixed audience and
received a lot of additional input for my work and my PhD thesis. There’s a lot
of geospatial analysis to be done, as pointed out by Joachim Spilke.

The two main tasks clarified on Tuesday for the first half of my thesis revolve
around the continuation of Georg Weigert’s work on the yield (potential)
prediction. The first is whether it’s actually necessary to consider the
spatial information in the regression, i.e. whether the spatial
cross-validation I’ve developed is necessary and useful in practice. The second
is which regression model is to be chosen in a practical setup. Currently, it’s
a neural network, but if a different technique turns out to produce better
(whatever “better” means) predictions, that might be tried.
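
To make the first question more concrete, here is a minimal sketch of one common way to do spatial cross-validation: points are assigned to folds by the grid block they fall into, so that test points are not interleaved with training points. It is written in Python and is only an illustration under assumptions. The block-based scheme and the names `spatial_folds`, `train_test_splits`, `block_size` and `n_folds` are hypothetical and not taken from my own implementation.

```python
def spatial_folds(coords, block_size, n_folds):
    """Assign each point to a fold according to the grid block it lies in."""
    folds = [[] for _ in range(n_folds)]
    block_to_fold = {}
    for idx, (x, y) in enumerate(coords):
        block = (int(x // block_size), int(y // block_size))
        # Blocks are handed out to folds round-robin, in order of first appearance.
        if block not in block_to_fold:
            block_to_fold[block] = len(block_to_fold) % n_folds
        folds[block_to_fold[block]].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists, holding out one spatial fold at a time."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# A 4 x 4 grid of sample locations, split into 2 x 2 blocks.
coords = [(x, y) for x in range(4) for y in range(4)]
folds = spatial_folds(coords, block_size=2, n_folds=4)
for train, test in train_test_splits(folds):
    print(len(train), "training points,", "test block:", [coords[i] for i in test])
```

Comparing the prediction error from these splits with the error from ordinary random folds gives a rough indication of whether the spatial information matters for the regression.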

February 28th, 2011

Slides for talk at MLU Halle

Tomorrow’s going to be my second talk in German at MLU Halle. Here are the slides: russ2011mlu-slides.

There’s also a video of the clustering here: http://www.youtube.com/watch?v=Xk7eT4-F2Fg. In short, the video compares two spatial clusterings of my precision agriculture data, using four variables (P, pH, Mg, K): one with low spatial contiguity (left) and one with high spatial contiguity (right). The clustering is hierarchical agglomerative, with an initial tessellation of the field into 250 clusters which are subsequently merged. The clustering has been implemented in R (generating .png files of each plot), with subsequent video encoding using ImageMagick (convert) and MPlayer (mencoder). Nice demo, I guess.
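
The initial tessellation can be pictured as a plain clustering on the coordinates alone. The following is a minimal sketch in Python using Lloyd's k-means, which is an assumption chosen for illustration and not necessarily the method used in the R scripts. The function name `tessellate` and its parameters are hypothetical.

```python
import math
import random


def tessellate(coords, k, iterations=20, seed=0):
    """Split points into k spatially compact groups (Lloyd's k-means on coordinates)."""
    rng = random.Random(seed)
    centers = rng.sample(coords, k)  # start from k distinct data points
    labels = [0] * len(coords)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in coords]
        # Update step: each center moves to the mean of its points.
        for c in range(k):
            members = [p for p, label in zip(coords, labels) if label == c]
            if members:  # keep the old center if a group runs empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


# Two well-separated patches of four points each.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = tessellate(coords, k=2)
print(labels)
```

For a real field one would call this with k=250 on the point coordinates and then hand the resulting groups to the merging phase.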

February 7th, 2011

Third thesis reviewer is set

My thesis’ third reviewer is fixed: Prof. Peter Wagner from Martin-Luther-Universität Halle-Wittenberg, from whom I indirectly received my data via Martin Schneider, who’s now at Agri Con GmbH, Jahna.