Since I’ve been asked quite often recently to publish the HACC-spatial scripts for the respective algorithm I developed in my PhD thesis, this post will list them and give a few explanations.
For now, here’s the zip file with R scripts: hacc-spatial.zip.
From the thesis abstract:
The second task is concerned with management zone delineation. Based on a literature
review of existing approaches, a lack of exploratory algorithms for this task is concluded, in
both the precision agriculture and the computer science domains. Hence, a novel algorithm
(HACC-spatial) is developed, fulfilling the requirements posed in the literature. It is based
on hierarchical agglomerative clustering incorporating a spatial constraint. The spatial
contiguity of the management zones is the key parameter in this approach. Furthermore,
hierarchical clustering offers a simple and appealing way to explore the data sets under
study, which is one of the main goals of data mining.
The thesis itself can be found here: PhD thesis (32MB pdf), the algorithm is described on pdf page 124 (print page 114): hacc-spatial-algorithm.pdf.
Further explanations and shorter descriptions are to be found in two publications, available in fulltext: Exploratory Hierarchical Clustering for Management Zone Delineation in Precision Agriculture and Machine Learning Methods for Spatial Clustering on Precision Agriculture Data.
Let me know if there are questions, comments or even successful results when applying the algorithm to your data sets.
There are also two youtube videos of the clustering (with an additional pre-clustering step, the „inital phase“): F440-REIP32-movie.avi and F611-REIP32-movie.avi. It’s probably the end of both videos where it gets interesting. Compare the plots for the REIP32 variable of the F440 and F611 data sets (F440: PhD pdf page 185 (clustering on page 138) and F611: PhD pdf page 195).
Important points
- The algorithm was designed to work with spatial data sets: each data record/point in the data set represents a vector of values which also has a location in space (2D/3D).
- The data points should be spatially roughly uniformly distributed (probably with high density, although that doesn’t really matter). That is, it does not and cannot rely on density differences in the geospatial data distribution.
- The input structure for the R scripts is a spatialPointsDataFrame with variables. The algorithm (the function) allows to select particular variable(s) for clustering. I.e. you may use multiple variables for clustering.
- The algorithm is definitely not optimized for speed. It served my purposes well, but may take a while to run on your data.
- The contiguity factor cf is subject to experimentation.
Apart from that, there’s not much to comment (yet). Let me know about questions or issues and I may be able to fix them or list further requirements here.
mail: researchblog@georgruss.ch