Georg Ruß' PhD Blog — R, clustering, regression, all on spatial data, hence it's:

August 7th, 2012

HACC-spatial scripts

I’ve been asked quite often recently to publish the scripts for HACC-spatial, the algorithm I developed in my PhD thesis, so this post lists them and gives a few explanations.

For now, here’s the zip file with R scripts:

From the thesis abstract:

The second task is concerned with management zone delineation. Based on a literature
review of existing approaches, a lack of exploratory algorithms for this task is concluded, in
both the precision agriculture and the computer science domains. Hence, a novel algorithm
(HACC-spatial) is developed, fulfilling the requirements posed in the literature. It is based
on hierarchical agglomerative clustering incorporating a spatial constraint. The spatial
contiguity of the management zones is the key parameter in this approach. Furthermore,
hierarchical clustering offers a simple and appealing way to explore the data sets under
study, which is one of the main goals of data mining.

The thesis itself can be found here: PhD thesis (32MB pdf). The algorithm is described on pdf page 124 (print page 114): hacc-spatial-algorithm.pdf.

Further explanations and shorter descriptions are to be found in two publications, available in fulltext: Exploratory Hierarchical Clustering for Management Zone Delineation in Precision Agriculture and Machine Learning Methods for Spatial Clustering on Precision Agriculture Data.

Let me know if there are questions, comments or even successful results when applying the algorithm to your data sets.

There are also two YouTube videos of the clustering (with an additional pre-clustering step, the “initial phase”): F440-REIP32-movie.avi and F611-REIP32-movie.avi. It’s probably the end of both videos where it gets interesting. Compare the plots for the REIP32 variable of the F440 and F611 data sets (F440: PhD pdf page 185 (clustering on page 138) and F611: PhD pdf page 195).

Important points

  • The algorithm was designed to work with spatial data sets: each data record/point in the data set represents a vector of values which also has a location in space (2D/3D).
  • The data points should be roughly uniformly distributed in space (probably with high density, although that doesn’t really matter). That is, the algorithm does not and cannot rely on density differences in the geospatial data distribution.
  • The input structure for the R scripts is a SpatialPointsDataFrame with variables. The algorithm (the function) lets you select one or more of these variables for clustering.
  • The algorithm is definitely not optimized for speed. It served my purposes well, but may take a while to run on your data.
  • The contiguity factor cf is subject to experimentation.
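
To make the idea of a spatial constraint a bit more concrete, here is a toy sketch in Python. It is not one of the R scripts from the zip file and it is not HACC-spatial itself: it is a plain agglomerative clustering in which two clusters may only be merged if they are spatial neighbours. The function name, the `radius` neighbourhood rule and the use of centroid distance are all assumptions made for illustration, and the contiguity factor cf is not modelled at all.

```python
import math
from itertools import combinations


def constrained_agglomerative(points, values, n_clusters, radius):
    """Toy spatially constrained agglomerative clustering (illustrative only).

    points     -- list of (x, y) locations
    values     -- list of feature tuples, one per point (the clustering variables)
    n_clusters -- stop once this many clusters remain
    radius     -- hypothetical rule: two clusters are neighbours if any pair
                  of their points lies within this distance
    """
    # Start with every point in its own cluster.
    clusters = {i: [i] for i in range(len(points))}

    def centroid(members):
        dim = len(values[0])
        return tuple(
            sum(values[m][d] for m in members) / len(members) for d in range(dim)
        )

    def adjacent(a, b):
        return any(
            math.dist(points[i], points[j]) <= radius
            for i in clusters[a]
            for j in clusters[b]
        )

    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(clusters, 2):
            if not adjacent(a, b):
                continue  # the spatial constraint: skip non-neighbouring clusters
            d = math.dist(centroid(clusters[a]), centroid(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        if best is None:
            break  # no spatially adjacent pair is left to merge
        _, a, b = best
        clusters[a].extend(clusters.pop(b))

    return sorted(sorted(members) for members in clusters.values())


# Four points on a line; point 3 is most similar to point 0 in value,
# but it is not a spatial neighbour of point 0.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [(1.0,), (1.3,), (5.0,), (1.1,)]
print(constrained_agglomerative(points, values, n_clusters=3, radius=1.0))
```

In this example an unconstrained clustering would merge points 0 and 3 first, because their values are closest. With the constraint, the first merge is between the neighbours 0 and 1, and point 3 stays on its own.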

Apart from that, there’s not much to comment on (yet). Let me know about questions or issues and I may be able to fix them or list further requirements here.


December 20th, 2011

Star X18i rooted

My new Android smartphone, a Star X18i (ordered from that particular shop) running Android 2.3.4, has been rooted using the instructions in this thread: (method 2b, using the zergRush exploit, v3). The script works nicely; I just followed the instructions step by step. The phone itself seems to be a Sony Xperia X10 clone.

November 29th, 2011

Environmental Data Mining

It just occurred to me that I should probably further develop my research profile and find an appropriate umbrella term that best covers my research interests. A quick suggestion including a definition would be Environmental Data Mining to describe the task of finding interesting, novel and potentially useful knowledge (=data mining) in georeferenced (spatial) and temporal multi-layered data sets (=environmental data). I haven’t done any research on this umbrella term yet (search engines provided but a few hits), but if I stay in research, this is probably where I’d try to be headed. Computer science is (to me) an ancillary science that needs specific applications and builds/provides solutions to specific tasks based on actual data sets collected in practice. And R is the best tool for this :-)

(this merits a new category at the top level)

August 26th, 2011

ICDM and DMA workshop in NYC next week

Just before I head off into the weekend, the latest update on where I’ll be next week:

Industrial Conference on Data Mining, taking place from Tuesday August 30th until Saturday, September 3rd, in New York City (actually, it’s near Newark Airport [EWR] in New Jersey, but it’s close enough). I’ll be presenting a continuation of my work on HACC-spatial (the hierarchical agglomerative spatially constrained clustering) which I showed at my workshop and the ICDM 2010 conference last year.

Therefore, my talks’ content will be along similar lines, with similar, but updated slides:

The second talk for my workshop will also contain a few slides about the joint paper with Antonio Mucherino, who won’t be able to come personally, but who contributed a nice survey for my 2nd Workshop on Data Mining in Agriculture.

August 17th, 2011

Navin MiniHomer, gpsbabel

I recently bought a Navin Minihomer for geocaching, geo-logging and wayfinding. Really nice device, and I got it to work under Linux with gpsbabel by following these steps:

  • Zeroth, see if there’s a /dev/ttyUSB* node created when plugging in the device. If not, compile the respective kernel module; it’s under USB support — USB serial converter — Prolific …, the module is called pl2303.
  • First, get the gpsbabel sources (currently 1.4.2) from or grab the CVS version
  • For the source (non-cvs) version, apply the patch written by Josef Reisinger and linked in this thread:
  • compile and install
  • have a look at the sources (the patch) to see what functionality is available. That is, look at the files prefixed with miniHomer in the xmldoc directory.
  • Feel free to use the bash script below to use the functions of the Navin Minihomer.
  • Drop me an email with comments, if necessary, email address is in the bash script.

The script below supports

  • minihomertool erase
  • minihomertool set [1-5] latitude longitude
  • minihomertool read
  • minihomertool init
  • minihomertool dump

The first command erases the log, the second can set the appropriate waypoint in the order they appear when cycling through the miniHomer’s menu (House to Bar) with lat/long in decimal degrees separated by spaces, the third reads the device’s log and splits it by day, and the fourth initializes the device to a certain speed (didn’t have to use it so far). The bash script requires setting the path to the gpsbabel (patched) binary and the USB device. It certainly works for me, except that gpsbabel produces strange gpx files where the dates of the points are set to sometime in the year 2031. I don’t care at the moment, it seems to be just a fixed shift. The last (dump) command just grabs the log dump from the logger, writes this to a file and processes it further, even correcting for the somewhat strange date by setting a negative offset of -172032 hours. Gpsbabel segfaults first, but still writes the log (but misses the waypoints in the dump, which I don’t need anyway).
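
A side note on that offset: 172032 hours is exactly 1024 weeks (1024 × 7 × 24), which is also the period after which the 10-bit GPS week counter wraps around. So the shift could well be a week-rollover issue, but that is a guess and has not been checked against the gpsbabel sources or the device. The small Python sketch below only redoes the arithmetic; the function name `correct_timestamp` is made up for illustration and is not part of the bash script.

```python
from datetime import datetime, timedelta

# The negative offset used by the dump command, in hours.
OFFSET_HOURS = 172032

# 1024 weeks expressed in hours: 1024 * 7 days * 24 hours.
assert OFFSET_HOURS == 1024 * 7 * 24


def correct_timestamp(ts):
    """Shift a logged timestamp back by the fixed 172032-hour offset."""
    return ts - timedelta(hours=OFFSET_HOURS)


# A point logged with a date in 2031 ends up in 2011 after the shift.
logged = datetime(2031, 4, 2, 12, 0, 0)
print(correct_timestamp(logged))
```

The same shift of roughly 19.6 years would explain why the points show up in the year 2031.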

Here’s the script: minihomertool bash script. It’s certainly not perfect, doesn’t care about errors and could clearly be more elegant, but whoever wants to can customize it.

There’s more information on the German znex site:

April 27th, 2011

Thesis status

My thesis proceeds as expected and planned. The second main chapter is finished and off to the first reviewer, while the first main chapter is currently being written. The experiments are currently running on the lab machines (which are much quicker now than half a year ago using R — new hardware) and the plots will be generated soon. Time for applications. Deadlines seem to work :-)

The two latest papers of mine have been accepted at SCAI and ICDM. And there’s another upcoming journal article for (likely) GeoInformatica and the upcoming book of our working group on Computational Intelligence.

Those were the days …

March 31st, 2011

Yet another talk at MLU

On Tuesday I gave another talk at the MLU to a somewhat different audience and I
received a lot of additional input for my work and my PhD thesis. There’s a lot
of geospatial analysis to be done, as pointed out by Joachim Spilke.

The two main tasks clarified on Tuesday for the first half of my thesis revolve
around the continuation of Georg Weigert’s work on the yield (potential)
prediction. The first is whether it’s actually necessary to consider the
spatial information in the regression, i.e. whether the spatial
cross-validation I’ve developed is necessary and useful in practice. The second
is which regression model is to be chosen in a practical setup. Currently, it’s
a neural network, but if a different technique turns out to produce better
(whatever “better” means) predictions, that might be tried.
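
For readers who haven’t come across the idea: in a spatial cross-validation the folds are built from spatially coherent groups of points instead of random samples, so that neighbouring (and hence similar) points don’t end up in both the training and the test set. The Python sketch below shows one generic way of doing that with a regular grid. It is only an illustration under that assumption, not the procedure from the thesis, and the function name `spatial_folds` and the `cell_size` parameter are made up.

```python
def spatial_folds(points, cell_size):
    """Yield (train, test) index lists, holding out one grid cell at a time.

    points    -- list of (x, y) locations
    cell_size -- edge length of the square grid cells (hypothetical parameter)
    """
    # Group point indices by the grid cell they fall into.
    cells = {}
    for idx, (x, y) in enumerate(points):
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append(idx)

    # Each cell becomes the test set once; all other points form the training set.
    for key in sorted(cells):
        test = cells[key]
        held_out = set(test)
        train = [i for i in range(len(points)) if i not in held_out]
        yield train, test


points = [(0.5, 0.5), (1.5, 0.5), (10.2, 0.1), (10.8, 0.9)]
for train, test in spatial_folds(points, cell_size=5):
    print("train:", train, "test:", test)
```

Comparing the prediction error from such folds with the error from ordinary random folds is one way to check whether the spatial information matters for a given regression model.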

February 28th, 2011

Slides for talk at MLU Halle

Tomorrow’s going to be my second talk in German at MLU Halle. Here are the slides: russ2011mlu-slides.

And there’s also a video of the clustering here: In short, the video compares a spatial clustering on the precision agriculture data I have, using four variables (P, pH, Mg, K) and low spatial contiguity (left) as well as high spatial contiguity (right). The clustering is hierarchical agglomerative with an initial tessellation of the field into 250 clusters which are subsequently merged. The clustering has been implemented in R (generating .png files of each plot) with subsequent video encoding with ImageMagick (convert) and Mplayer (mencoder). Nice demo, I guess.

February 7th, 2011

Third thesis reviewer is set

My thesis’ third reviewer is fixed: Prof. Peter Wagner from Martin-Luther-Universität Halle-Wittenberg, whom I indirectly received my data from, via Martin Schneider, who’s now at Agri Con GmbH, Jahna.

January 17th, 2011

Two really useful R books

Looking back on the work I’ve done so far (finding a thesis topic, finding data, finding tools) I can definitely recommend the two books below. They’re R-related and they contain a lot of examples which still help in implementing the ideas I have. The first is Modern Applied Statistics with S (Venables/Ripley) and the other one is Applied Spatial Data Analysis with R (Bivand/Pebesma/Gómez-Rubio) from the “Use R!” series. They’re just perfect for looking up things you might need in your current implementation. Besides, there are still the R mailing lists for asking questions, and the authors of the above books are typically present on those lists.

If you prefer a bookstore, look out for these on the shelves: