Many outlier detection methods exist in the literature; in this case study we focus on unsupervised methods based on local density metrics of points. The rationale behind the two tested methods is that, in large datasets consisting of 3D points, the number of outliers is much lower than the number of correct points. Correct points are also more tightly clustered than outliers, so outliers can be detected with metrics that represent the mutual distance between neighbouring points. The two methods are described in depth in the following sub-sections.
In the definition by Hawkins [5], “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”. In point datasets from laser scanners, SfM and other spatial sensors, outliers can be produced by incorrect processing, multipath effects or unwanted objects [4], such as birds or dust particles. In SfM in particular, which provides the first dataset, outliers can arise from mismatches of keypoint descriptors, which can be common when few or no targets are used (e.g. with smartphones) or where the image geometry is suboptimal [19, 20].
Outliers can be removed with unsupervised, semi-supervised or even manual methods. Many users still prefer to remove outliers manually [21], but this implementation targets a high degree of automation; therefore the two methods tested are unsupervised.
In this implementation we tested two methods: (i) statistical outlier removal (SOR) and (ii) local outlier factor (LOF). Four predictors, two per method, were tested: the SOR and LOF values of each point, and the absolute differences of the SOR and LOF values from their respective median values, referred to as SOR2 and LOF2 respectively. The hypothesis behind these last two predictors is that most points are correct, so the median of the distribution of SOR and LOF values reflects correctness, and points whose SOR or LOF values are distant from the median are likely to be outliers. The threshold for optimal results is calculated using ROC curves and applied to flag outliers.
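The median-based predictors and the ROC-derived threshold can be illustrated with a minimal sketch. The function names, the sample values and the labels are hypothetical. The text does not state which criterion is used to pick the optimal point on the ROC curve, so maximising Youden's J (true positive rate minus false positive rate) is an assumption made here for illustration only.

```python
from statistics import median

def median_deviation(scores):
    """Absolute difference of each score from the median of all scores."""
    m = median(scores)
    return [abs(s - m) for s in scores]

def roc_threshold(scores, labels):
    """Return the threshold that maximises Youden's J = TPR - FPR.

    labels: 1 for a known outlier, 0 for a correct point.
    A point is flagged when its score is >= the threshold.
    Youden's J is an assumed criterion, not one stated in the text.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both outliers and correct points are needed")
    best_j, best_t = -1.0, None
    # Every distinct score is a candidate threshold (one ROC point each).
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Hypothetical SOR values for eight points; the last two are labelled outliers.
sor = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 4.0, 5.0]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

sor2 = median_deviation(sor)             # SOR2-style predictor
threshold = roc_threshold(sor2, labels)  # threshold chosen on labelled data
flags = [d >= threshold for d in sor2]   # True marks a flagged outlier
```

The same two functions apply unchanged to LOF values to obtain a LOF2-style predictor.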
All the methods described above require finding a number K of nearest neighbours (KNN) for each point. The creation of a metric structure over large point sets is critical for finding the KNN in an acceptable time. K-d tree structures and approximate nearest-neighbour search are used implicitly in the implementations of the methods, through libnabo for R (CRAN) [22] and the Fast Library for Approximate Nearest Neighbours (FLANN) [23] in the Point Cloud Library (PCL) [24].
Statistical outlier removal (SOR)
The SOR method is a distance-based approach that assigns to each point a probability of being an outlier by comparing its distances to its neighbours. The statistic used in this case is the local density, calculated by measuring the distances to a user-defined number K of nearest neighbours [8] (referred to in this paper as KNN). By definition, outliers should be significantly distant from the main distribution of other points (see Fig. 4). The SOR filter for this work was implemented as an R function using the nabor package [22] for fast calculation of KNN distances. A SOR filter is also fully implemented as part of PCL.
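As an illustration of a KNN-distance statistic of this kind, the sketch below computes, for each point, the mean distance to its K nearest neighbours. It is not the R function used in this work: the choice of the mean as the summary statistic, the function names and the sample points are assumptions, and a brute-force search stands in for the k-d tree search provided by nabor.

```python
import math

def knn_distances(points, k):
    """Sorted distances from each point to its k nearest neighbours.

    Brute force, O(n^2): adequate for a small illustration only.
    """
    result = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[:k])
    return result

def sor_values(points, k):
    """Mean distance from each point to its k nearest neighbours (assumed statistic)."""
    return [sum(d) / len(d) for d in knn_distances(points, k)]

# Hypothetical 3D points: the corners of a unit cube plus one distant point.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
points = cube + [(10.0, 10.0, 10.0)]

values = sor_values(points, k=3)
```

In this toy set every cube corner has three neighbours at unit distance, while the isolated point has a much larger mean KNN distance, which is the behaviour the filter relies on.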
Local outlier factor (LOF)
The local outlier factor (LOF) algorithm, as described in [25, 26], is an unsupervised method that assigns a score to each point by computing the deviation of its local density from that of its neighbours in a cluster. An outlier, or a group of outliers, has a substantially lower density than its neighbours, and thus a LOF value significantly greater than the rest (see Eq. 1–4).
The number of neighbours chosen is typically greater than the minimum number of points a cluster can contain, so that other points can be local outliers relative to this cluster. In practice, such information can be available if the user is knowledgeable about the data. Such a situation is likely in the two presented cases, as the SfM point density and the GNSS recording rate can provide an estimate of the respective point density. The LOF method also has the advantage of limiting statistical fluctuations [27].
Fundamentally, three steps are necessary to extract the LOF value of each point. First, for each point Pi, the distances to its K nearest neighbours are calculated and the K-distance (K.dist) is defined:
$$ K.{dist}_{i,j}= dist\left({P}_i,{P}_j\right) $$
(1)
where the K-distance of point Pi is the distance between Pi and its Kth nearest neighbour, Pj.
The second step calculates the reachability distance (R.dist) between every point and each of its K neighbours. The reachability distance is the maximum of two values: the K.dist of the considered point and of the considered neighbour, for each of the other KNN points (see Fig. 4).
$$ R. dist\left({P}_i,{P}_{K^{th}}\right)=\max \left(K.{dist}_{K^{th}}\left({P}_{K^{th}}\right);K.{dist}_i\right) $$
(2)
The local reachability density (LRD) is then defined for each point as the inverse of the average reachability distance of point Pi. In the equation below, the numerator is the cardinality of the set of K nearest neighbours.
$$ LRD\left({P}_i\right)=\frac{\left\Vert {N}_k\left({P}_i\right)\right\Vert }{\sum_{P_j\in {N}_k\left({P}_i\right)}R. dist\left({P}_i,{P}_j\right)} $$
(3)
The last step calculates the LOF value of each point by comparing the LRD value of the point with the LRD values of its K neighbours.
$$ LOF\left({P}_i\right)=\frac{\sum_{P_j\in {N}_k\left({P}_i\right)}\frac{LRD\left({P}_j\right)}{LRD\left({P}_i\right)}}{\left\Vert {N}_k\left({P}_i\right)\right\Vert } $$
(4)
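The steps in Eq. 1–4 can be traced end to end with a short sketch. It is not the PCL filter developed in this work. The function name and the sample points are hypothetical, a brute-force neighbour search replaces the k-d tree, and ties at the Kth distance are broken by index so that every neighbourhood holds exactly K points. Eq. 2 is read here in its commonly used form, the maximum of the neighbour's K-distance and the actual distance between the two points; that reading is an assumption.

```python
import math

def lof_values(points, k):
    """LOF score per point, following the K.dist, R.dist, LRD and LOF steps."""
    n = len(points)

    # Eq. 1: the k nearest neighbours of each point and its K-distance,
    # i.e. the distance to the k-th nearest neighbour.
    neighbours, k_dist = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        neighbours.append(ranked)
        k_dist.append(ranked[-1][0])

    # Eq. 2 and 3: reachability distances, then the local reachability
    # density as the neighbour count over the sum of reachability distances.
    # Duplicate points would make the sum zero; they are not handled here.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # Eq. 4: mean ratio between the neighbours' LRD and the point's own LRD.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]

# Hypothetical 3D points: the corners of a unit cube plus one distant point.
cube = [(x, y, z) for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
points = cube + [(10.0, 10.0, 10.0)]

scores = lof_values(points, k=3)
```

Points inside the regular cube receive a LOF close to 1, since their density matches that of their neighbours, while the isolated point receives a much larger value.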
In this work the LOF method is implemented as a new filter in the Point Cloud Library (PCL). The source code is available in a GitHub repository for inclusion in PCL builds [28]. PCL is a “standalone, large scale, open project for 2D/3D image and point cloud processing. PCL is released under the terms of the BSD license, and thus free for commercial and research use” [24, 29]. PCL provides a suitable framework for processing large point datasets. Finding nearest neighbours is an essential step in these methods. Spatial metric structures allow approximate nearest-neighbour matching with binary trees and are implemented in PCL via the FLANN library [30, 31].