Implementation and assessment of two density-based outlier detection methods over large spatial point clouds
© The Author(s). 2018
- Received: 30 April 2018
- Accepted: 22 August 2018
- Published: 10 September 2018
Several technologies provide datasets consisting of a large number of spatial points, commonly referred to as point-clouds. These point datasets provide spatial information regarding the phenomenon that is to be investigated, adding value through knowledge of forms and spatial relationships. Accurate automatic outlier detection is a key step in processing such datasets. In this note we use a completely open-source workflow to assess two outlier detection methods, the statistical outlier removal (SOR) filter and the local outlier factor (LOF) filter. The latter was implemented ex novo for this work using the Point Cloud Library (PCL) environment. Source code is available in a GitHub repository for inclusion in PCL builds.
Two very different spatial point datasets are used for accuracy assessment. One is obtained from dense image matching of a photogrammetric survey (SfM) and the other from floating car data (FCD) coming from a smart-city mobility framework that records one position per second along two public transportation bus tracks.
Outliers were simulated in the SfM dataset, and manually detected and selected in the FCD dataset. Simulation in SfM was carried out in order to create a controlled set with two classes of outliers: clustered points (up to 30 points per cluster) and isolated points, in both cases at random distances from the other points. The optimal number of nearest neighbours (KNN) and the optimal thresholds of SOR and LOF values were defined using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Absolute differences from the median values of LOF and SOR (defined as LOF2 and SOR2) were also tested as metrics for detecting outliers, with optimal thresholds again defined through the AUC of ROC curves.
Results show a strong dependency on the point distribution in the dataset and on the local density fluctuations. In the SfM dataset the LOF2 and SOR2 methods performed best, with an optimal KNN value of 60; the LOF2 approach gave a slightly better result when considering clustered outliers (true positive rate: LOF2 = 59.7%, SOR2 = 53%). For FCD, SOR with low KNN values performed better for one of the two bus tracks, and LOF with high KNN values for the other; these differences are due to very different local point density. We conclude that the choice of outlier detection algorithm depends strongly on the characteristics of the dataset's point distribution, and that no single solution fits all cases. The conclusions indicate which characteristics of a dataset can help in choosing the optimal method and KNN value.
Technologies related to the acquisition of spatial data have grown exponentially and are still following this trend today. Spatial data are enabled when information recorded by the sensor is linked to a conventional spatial reference system, usually cartographically defined as a coordinate reference system (CRS). Such information is referred to as geoinformation. This allows information to be mapped from the CRS to the real world and vice versa. Global Navigation Satellite Systems (GNSS), previously available solely for military applications through the United States' Global Positioning System (GPS) constellation, are now publicly accessible from several providers and with unprecedented accuracy. Accurate GNSS, along with a trend towards lighter, less expensive and metrically more accurate sensors, produces high volumes of geospatial data. Crowd-sourcing solutions and sensors distributed in smart cities create and use large volumes of spatial data. Datasets with unstructured points are a common direct or indirect output from such technologies.
Analysis of point-clouds has become a focus of scientific investigation also due to laser scanner technology. Laser scanners, from fixed, mobile or airborne platforms, can acquire several thousands of points per second, sampling objects and creating 3D representations. Technology in laser-derived 3D measurements is still improving at a fast rate; an example is the introduction of single photon-count sensors, which multiply the number of measurements that a sensor can provide in a unit of time, potentially providing even larger datasets. Datasets with a large number of unstructured points can also be produced in a photogrammetric workflow, e.g. via dense matching after aligning images using structure from motion (SfM). The datasets analysed in this paper are derived from photogrammetry and from direct GNSS measurements, but the approach can also be applied to datasets from laser scanners.
In this scenario, outliers play an important role in the first phases of processing. A point dataset must be rid of outliers for the following modelling steps to be successful. Optimal outlier removal has been thoroughly investigated [4–8], and is still a subject of investigation in many fields, such as fraud detection, medicine, pattern recognition and measurement error detection. Methods can be divided into supervised and unsupervised; the two methods tested here belong to the unsupervised category.
Many spatially-enabled sensors can now produce massive datasets that easily contain millions of points with attributes. In this study two quite different examples of such surveys were tested. One dataset is the product of a photogrammetric procedure (SfM) for creating a 3D model using overlapping imagery taken from a remotely piloted airborne system (RPAS). The second dataset consists of trajectory data collected from vehicles every second via GNSS. This type of data is commonly referred to as Floating Car Data (FCD) and is becoming a very important part of smart-city frameworks.
SfM point dataset
Artificial outliers were created to define a final control dataset (Fig. 2). Two types of outliers were created: (i) randomly positioned single points at a distance between 1 and 200 m from the DSM and (ii) randomly positioned clusters of points, with 2 to 30 points per cluster, with the cluster centre randomly positioned between 2 and 200 m above the DSM (Fig. 2, in red and blue respectively). R (CRAN) was used to simulate and add the outliers to the dataset by randomly picking a non-outlier point and transforming its position according to the rules described above.
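The simulation rules above can be sketched in a few lines. The study used R; the Python below is only an illustration under stated assumptions: the function name `simulate_outliers`, the `cluster_spread` parameter and the choice of a purely vertical offset for single points are hypothetical and not taken from the original workflow.

```python
import random

def simulate_outliers(points, n_single, n_clusters, seed=0, cluster_spread=1.0):
    """Return (all_points, labels) with simulated outliers appended.

    points: list of (x, y, z) inliers standing in for the DSM.
    cluster_spread: hypothetical standard deviation (m) of points around
    a cluster centre; the text does not give this value.
    """
    rng = random.Random(seed)
    outliers, labels = [], []

    # (i) Isolated outliers: pick a random inlier and lift it by 1-200 m
    # (assumption: the offset is applied along Z only).
    for _ in range(n_single):
        x, y, z = rng.choice(points)
        outliers.append((x, y, z + rng.uniform(1.0, 200.0)))
        labels.append("single")

    # (ii) Clustered outliers: centre 2-200 m above a random inlier,
    # with 2 to 30 points scattered around that centre.
    for _ in range(n_clusters):
        x, y, z = rng.choice(points)
        cz = z + rng.uniform(2.0, 200.0)
        for _ in range(rng.randint(2, 30)):
            outliers.append((x + rng.gauss(0.0, cluster_spread),
                             y + rng.gauss(0.0, cluster_spread),
                             cz + rng.gauss(0.0, cluster_spread)))
            labels.append("cluster")

    return list(points) + outliers, ["inlier"] * len(points) + labels
```

Keeping the labels alongside the points gives the ground truth needed later to count true and false positives.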
FCD – Floating Car data
Most movement in an urban environment is constrained to the road network. Thanks to the recent development of navigation technologies, GNSS sensors are now a low-cost, efficient and already widespread tool for collecting such movement information from different types of objects, including pedestrians and vehicles (cars, bicycles, buses, etc.), especially when compared with more traditional traffic monitoring methods such as loop detectors or automatic plate number recognition. GNSS sensors are capable of recording at a high rate, e.g. 1 position per second of the tracked object, so that its continuous movement is recorded as a trajectory containing a sequence of sampled points. This type of surveying is extremely important in estimating hazard situations, e.g. integrated with remote sensing or with geographic information systems (GIS) [17, 18].
This type of data is gaining importance as new paradigms are implemented in real scenarios. Big data processing for smart cities can be applied to high volumes of data from multiple sensors, which are analysed to obtain in-depth information on the multiple dynamic aspects of mobility and other factors.
To test outlier detection, the FCD from two bus lines, line 11 and line 39, were used. The methods were applied to 2D and 3D data: the two spatial dimensions were the geospatial positions, i.e. latitude and longitude provided by GNSS, and the third dimension was the estimated velocity of the vehicle at each point.
There are many outlier detection methods in the literature; in this study we focus on unsupervised methods based on local density metrics of points. The rationale behind the two tested methods is that in large datasets consisting of 3D points the number of outliers is much lower than the number of correct points. The correct points are also clustered with respect to outliers, and therefore outliers can be detected by metrics that represent mutual distance between neighbouring points. In the next sub-sections the two methods are described in depth.
According to the definition by Hawkins, "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism". In point datasets from laser scanners, SfM and other spatial sensors, outliers can be produced by incorrect processing, multipath or unwanted objects, such as birds or dust particles. In SfM in particular, which represents the first dataset, outliers can result from mismatches of keypoint descriptors, which can be common when using a small number of targets or none at all (e.g. with smartphones), or where the image geometry is below optimal [19, 20].
There are several ways to remove outliers, with unsupervised, semi-supervised or even manual methods. Many users still prefer to remove outliers manually, but in this implementation the target is a high degree of automation; therefore the two methods that were tested are unsupervised.
In this implementation we tested two methods: (i) statistical outlier removal (SOR) and (ii) local outlier factor (LOF). Four predictors, two for each method, were tested: the SOR and LOF values for each point, and the absolute differences of the SOR and LOF values from their respective median values, referred to as SOR2 and LOF2 respectively. The hypothesis behind these last two predictors is that most points will be correct, and the median of the distribution of SOR and LOF values will reflect correctness; thus points with values of SOR or LOF distant from the median will likely be outliers. The threshold for optimal results is calculated using ROC curves and applied to flag outliers.
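The SOR2 and LOF2 predictors are a simple transformation of the per-point scores, which a short sketch makes concrete. The function names below are hypothetical, and flagging a point when its value exceeds the threshold is an assumption about how the threshold is applied.

```python
from statistics import median

def abs_diff_from_median(scores):
    """Absolute difference of each score from the median of all scores.

    Applied to LOF values this gives LOF2; applied to SOR values, SOR2.
    """
    m = median(scores)
    return [abs(s - m) for s in scores]

def flag_outliers(values, threshold):
    """Flag a point as outlier when its value exceeds the threshold
    (assumed convention)."""
    return [v > threshold for v in values]

# Toy LOF values: four points near 1.0 and one clearly deviating point.
lof = [1.0, 1.1, 0.9, 1.0, 3.5]
lof2 = abs_diff_from_median(lof)
flags = flag_outliers(lof2, threshold=0.5)
```

In this toy case only the last point is flagged, because it is the only one far from the median score.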
All the methods described above require detecting a number K of nearest neighbours (KNN) for each point. The creation of a metric structure in large point sets is critical for detecting KNN in an acceptable time span. K-d tree structures and methods for approximate nearest neighbour search are implicitly used in the implementation of the methods: libnabo for R (CRAN) and the fast library for approximate nearest neighbours (FLANN) in the Point Cloud Library (PCL).
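As a minimal illustration of the neighbour search that both methods depend on, the sketch below uses a brute-force search in place of the k-d tree structures provided by libnabo or FLANN, so it is only suitable for small samples. The SOR-style score is assumed here to be the mean distance from a point to its K nearest neighbours, which is the quantity that PCL's StatisticalOutlierRemoval filter thresholds; the function names are hypothetical.

```python
import math

def knn(points, k):
    """Brute-force K nearest neighbours.

    Returns, for each point, a list of (distance, index) pairs sorted by
    distance and excluding the point itself. Cost is quadratic in the
    number of points; a k-d tree would replace this in practice.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result

def mean_knn_distance(points, k):
    """SOR-style score (assumed definition): mean distance to the K
    nearest neighbours. Isolated points get large values."""
    return [sum(d for d, _ in nbrs) / len(nbrs) for nbrs in knn(points, k)]
```

For example, `mean_knn_distance([(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)], k=2)` gives its largest value for the last, isolated point.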
Statistical outlier removal (SOR)
Local outlier factor (LOF)
The local outlier factor (LOF) algorithm as described by [25, 26] is an unsupervised method which assigns a score to each point by computing its local density deviation with respect to its neighbours in a cluster. An outlier, or a group of outliers, has a substantially lower density than its neighbours, and thus a LOF value significantly greater than the rest (see Eq. 1–4).
The number of neighbours chosen is typically greater than the minimum number of points a cluster can contain, so that other points can be local outliers relative to this cluster. In practice, such information can be available if the user is knowledgeable about the data. Such a situation is likely in the two presented cases, as the SfM point density and the GNSS recording rate provide an estimate of the respective point density. The LOF method also has the advantage of limiting statistical fluctuations.
where the K-distance of point Pi is the distance between Pi and its Kth nearest point, Pj.
In this work the LOF method is implemented as a new filter in the Point Cloud Library (PCL). The source code is available in a GitHub repository for inclusion in PCL builds. PCL is a “standalone, large scale, open project for 2D/3D image and point cloud processing. PCL is released under the terms of the BSD license, and thus free for commercial and research use” [24, 29]. PCL provides the ideal framework to process large point datasets. In these methods finding nearest neighbours is an essential step. Spatial metric structures allow approximate nearest neighbour matching with binary trees and are implemented in PCL via the FLANN library [30, 31].
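The K-distance defined above is the building block of the LOF score. The sketch below follows the standard formulation of Breunig et al. [25] (reachability distance, local reachability density, then the ratio of densities); it is not the PCL filter written for this work, it uses brute-force neighbour search, and it ignores distance ties and duplicate points for simplicity. The function name `lof_scores` is hypothetical.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of every point, standard formulation.

    Assumes no group of k or more coincident points (which would give a
    zero reachability sum).
    """
    n = len(points)

    # K nearest neighbours of each point as (distance, index) pairs.
    nbrs = []
    for i, p in enumerate(points):
        d = sorted((math.dist(p, q), j)
                   for j, q in enumerate(points) if j != i)
        nbrs.append(d[:k])

    # K-distance: distance from a point to its K-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in nbrs]

    # Local reachability density: inverse of the mean reachability
    # distance, where reach(p, o) = max(K-distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in nbrs[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio between the neighbours' density and the point's own.
    return [sum(lrd[j] for _, j in nbrs[i]) / (len(nbrs[i]) * lrd[i])
            for i in range(n)]
```

Points inside a region of uniform density score close to 1, while an isolated point scores well above 1 because its own density is much lower than that of its neighbours.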
Clearly, the best method and combination of parameters (KNN and threshold) must have the highest number of true positives and true negatives and the lowest number of false positives and false negatives. In this investigation we consider detecting points which are outliers; therefore positives are the outliers and negatives are the inliers. Two types of error can occur when predicting a binary response (inliers vs. outliers): false outliers (i.e. type I error, measured by the false positive rate, FPR) and false inliers (i.e. type II error or missed outliers, measured by the false negative rate, FNR). In this investigation particular attention is given to false inliers (FN), i.e. points which are outliers but are incorrectly assigned as inliers. This is because, for further processing of point datasets, this type of error leads to worse consequences than false outliers. The Receiver Operating Characteristic (ROC) curve is used to assess overall performance and to define the best-performing threshold, and the false negative rate is analysed in depth.
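The error rates named above follow directly from the four confusion counts. A minimal sketch, with outliers as the positive class and a hypothetical function name:

```python
def confusion_rates(is_outlier, flagged):
    """TPR, FPR and FNR with outliers as the positive class.

    is_outlier: ground truth per point (True = outlier).
    flagged:    prediction per point (True = flagged as outlier).
    Assumes at least one true outlier and one true inlier.
    """
    pairs = list(zip(is_outlier, flagged))
    tp = sum(1 for t, f in pairs if t and f)          # detected outliers
    fn = sum(1 for t, f in pairs if t and not f)      # false inliers
    fp = sum(1 for t, f in pairs if not t and f)      # false outliers
    tn = sum(1 for t, f in pairs if not t and not f)  # correct inliers
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
    }
```

Since FNR = 1 - TPR, a low false negative rate and a high true positive rate describe the same property of the detector.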
where V is the value of LOF, LOF2, SOR or SOR2, Vmin is the lowest and Vmax is the highest value in the set.
Since the threshold T has to be determined, we plot TPR as a function of FPR for all possible values of V. This is applied to the SfM dataset and to the two FCD datasets to determine the optimal value of T in all cases. The optimal T is the value that provides the highest area under the curve (AUC). The AUC is a single combined measure of sensitivity and specificity, allowing effective comparison between results. Specific results for the two datasets are reported in the next sections.
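The threshold sweep can be sketched as follows. The ROC curve is built by using every distinct score as a candidate threshold, and the AUC is obtained with the trapezoidal rule. The last function picks a single threshold with Youden's J statistic (TPR minus FPR); this selection rule is an assumption made for illustration and is not necessarily the rule applied in this study. All function names are hypothetical.

```python
def roc_curve(scores, is_outlier):
    """ROC points as (FPR, TPR, threshold), from strictest to loosest.

    A point is flagged as outlier when its score is >= the threshold.
    Assumes at least one outlier and one inlier in the ground truth.
    """
    pos = sum(1 for y in is_outlier if y)
    neg = len(is_outlier) - pos
    curve = [(0.0, 0.0, float("inf"))]  # nothing flagged
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_outlier) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, is_outlier) if s >= t and not y)
        curve.append((fp / neg, tp / pos, t))
    return curve

def auc(curve):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0, _), (x1, y1, _) in zip(curve, curve[1:]))

def best_threshold(curve):
    """Threshold maximising Youden's J = TPR - FPR (assumed criterion)."""
    return max(curve[1:], key=lambda p: p[1] - p[0])[2]
```

Running the sweep once per predictor (LOF, LOF2, SOR, SOR2) and per KNN value yields the AUC values that are compared in the results.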
SfM point dataset
LOF2 performed close to SOR2 and both outperformed LOF and SOR. This indicates that assigning to each point a metric based on the absolute difference from the median improves the ability to discern outliers from inliers.
FCD – Floating Car data
Two objectives were reached in the presented investigation: the implementation of the LOF method in the PCL open-source library with its integration in a GUI, and the testing of the LOF method against the SOR method using two datasets that are very diverse in terms of technology, point density and distribution. It is worth noting that research on outlier detection methods remains a topic of high interest, due to the many technologies that provide datasets with a large number of unstructured points.
Results are mixed, with the two datasets giving their best performances with different methods and threshold types. This indicates that, very likely, the type of point distribution, i.e. the local density fluctuation, influences the choice of method for detecting outliers. In the SfM point dataset LOF2 clearly performed close to SOR2, both with high KNN values, and both outperformed LOF and SOR. This indicates that assigning to each point a metric based on the absolute difference from the median improves the ability to discern outliers from inliers. This is quite different from the FCD datasets, which showed the opposite behaviour. The best results were given by low KNN values for all datasets except the 2D dataset of line 39, for which the highest KNN performed best. SOR performed best for line 11, whereas for line 39 SOR2 performed best on the 3D dataset and LOF on the 2D dataset, with the lowest and highest KNN respectively. This seemingly erratic behaviour reflects the very different datasets chosen for testing, which was one of the objectives of this investigation. As mentioned, SfM has a much more consistent density, whereas FCD has higher density fluctuations. This can explain why, for the SfM dataset, thresholds on absolute differences from the median (SOR2 and LOF2) outperformed thresholds on the LOF and SOR values themselves, whereas this was not the case for the FCD datasets. It is worth mentioning that points at the border of a dataset can be perceived as outliers, but this "margin" effect can be ignored in most cases because the objects of interest in a survey are usually not at its margin; this should be considered when planning a survey.
An aspect worth noting is that in the SfM dataset the AUC value for the best methods (LOF2 and SOR2) levels out at higher KNN values. This is important because it indicates that the result at the best KNN = 60 is not particularly better than the result at KNN = 20. Considering that processing is much faster at the latter value of KNN, users can choose it instead of the higher value. Another interesting point is that at and above KNN = 20 results are good and seem to stabilize, i.e. they do not deteriorate with higher KNN values. Experimentation stopped at KNN = 70, also due to long processing time; future tests might increase KNN to see if, and when, there is a deterioration. This behaviour is likely related to the median value of LOF (Fig. 8, right), which becomes stable at KNN ≥ 20, meaning that at least 20 neighbours are necessary, for the SfM dataset, to represent the local fluctuation. In the SfM dataset, while LOF2 improves with KNN, LOF is constant at KNN 10–20 and deteriorates at KNN > 20. In this dataset the difference between LOF and LOF2 emerges at KNN values in the 10–20 range, likely due to the way the different thresholds are calculated: using the absolute difference from the median LOF as threshold improves the efficiency of the method, whereas the LOF value alone is not enough to discriminate outliers from inliers.
Other practical considerations are necessary to select the proper approach for removing outliers. The dataset must be analysed to understand whether there are any systematic ways to model either outliers or inliers. For example, SfM datasets are more prone to outliers along the Z axis, whereas the floating car dataset has outliers that appear as planar offsets, due to vehicles taking routes that differ from the usual track. Therefore a careful evaluation of the dataset source will help to identify which descriptors can be added to improve results. In the FCD point dataset the third dimension is velocity, but this feature did not improve results with respect to using only planar 2D spatial coordinates. It is very likely that better results can be achieved with specific descriptors extracted from the dataset. For example, the floating car dataset has a linear characteristic; therefore, a degree of linearity of neighbouring points can be added as a descriptor and will likely improve results. The focus of this paper is to assess two generic algorithms and not to evaluate specific use cases, but it is worth reporting that specific descriptors can help in detecting outliers.
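One possible form of such a linearity descriptor, given here purely as an illustration and not tested in this study, is the eigenvalue ratio of the covariance matrix of a point's planar neighbourhood: values near 1 indicate neighbours lying along a line (as expected on a bus track), values near 0 indicate an isotropic scatter. The function name and the choice of this particular ratio are assumptions.

```python
import math

def linearity_2d(neighbours):
    """Degree of linearity of a set of 2D points, in [0, 1].

    Computed as (l1 - l2) / l1, where l1 >= l2 are the eigenvalues of the
    2x2 covariance matrix of the neighbourhood (hypothetical descriptor).
    """
    n = len(neighbours)
    mx = sum(x for x, _ in neighbours) / n
    my = sum(y for _, y in neighbours) / n
    sxx = sum((x - mx) ** 2 for x, _ in neighbours) / n
    syy = sum((y - my) ** 2 for _, y in neighbours) / n
    sxy = sum((x - mx) * (y - my) for x, y in neighbours) / n

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    det = sxx * syy - sxy ** 2
    root = math.sqrt(max(half_trace ** 2 - det, 0.0))
    l1, l2 = half_trace + root, half_trace - root

    # All neighbours coincident: no direction can be defined.
    return 0.0 if l1 == 0.0 else (l1 - l2) / l1
```

Such a value could be appended to each point as an additional dimension before computing SOR or LOF, in the same way velocity was used as a third dimension for the FCD data.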
The bottom line of the results is that no single method suits all cases, and that there is no single best number of nearest neighbours (KNN) for these two methods. The best KNN values strongly depend on the local density of points. As mentioned, to choose the ideal KNN, enough neighbours must be used to represent local fluctuations. This seems trivial, but is important to keep in mind. Differences in AUC and TPR values show that the ideal combination of method and KNN must be chosen depending on the characteristics of the dataset and on the type of outliers that are expected (clustered or not).
The data on public transportation (bus trajectories) were provided by Prof. Piero Boccardo, Polytechnic of Torino, as part of the URBAN-GEO BIG DATA project which funded this investigation (see section on funding).
This work is funded and supported by URBAN-GEO BIG DATA, a Project of National Interest (PRIN) funded by the Italian Ministry of Education, University and Research (MIUR) – id. 20159CNLW8.
Availability of data and materials
Data are available upon request to the corresponding author. The original source code of the LOF implementation in PCL is available as open source (GNU GPL) on GitHub.
FP created the main idea and the structure of methods, RR organized and implemented the method in the FCD dataset, FF contributed to discussion and review, AM verified algorithms and supported review. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. Brovelli MA, Minghini M, Zamboni G. New generation platforms for exploration of crowdsourced geo-data. In: Earth Observation Open Science Innovation. Cham: Springer International Publishing; 2018. p. 219–43. Available from: https://doi.org/10.1007/978-3-319-65633-5_9.
2. Swatantran A, Tang H, Barrett T, DeCola P, Dubayah R. Rapid, High-Resolution Forest Structure and Terrain Mapping over Large Areas using Single Photon Lidar. Sci Rep. 2016;6:28277. Available from: http://www.nature.com/articles/srep28277
3. Remondino F, Barazzetti L, Nex F, Scaioni M, Sarazzi D. UAV photogrammetry for mapping and 3D modeling – current status and future perspectives. Int Arch Photogramm Remote Sens Spat Inf Sci. 2011;38:14–6.
4. Sotoodeh S. Outlier Detection in Laser Scanner Point Clouds. Int Arch Photogramm Remote Sens Spat Inf Sci. 2006;36:297–302. Available from: http://www.isprs.org/proceedings/XXXVI/part5/paper/SOTO_653.pdf
5. Hawkins DM. Identification of Outliers. Dordrecht: Springer Netherlands; 1980. https://doi.org/10.1007/978-94-015-3994-4.
6. Hodge VJ, Austin J. A survey of outlier detection Methodologies. Artif Intell Rev. 2004;22:85–126. Available from: http://link.springer.com/article/10.1007/s10462-004-4304-y
7. Atanassov R, Bose P, Couture M, Maheshwari A, Morin P, Paquette M, et al. Algorithms for optimal outlier removal. J Discret Algorithms. 2009;7:239–48. Available from: http://linkinghub.elsevier.com/retrieve/pii/S1570866709000021
8. Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. ACM Sigmod Rec. 2000;29:427–38. Available from: http://dl.acm.org/citation.cfm?id=335437
9. Pirotti F, Sunar F, Piragnolo M. Benchmark Of Machine Learning Methods for Classification of a Sentinel-2 Image. Int Arch Photogramm Remote Sens Spat Inf Sci. 2016;41:335–40. Available from: http://www.int-arch-photogramm-remote-sens-spatial-inf-sci.net/XLI-B7/335/2016/
10. Masiero A, Fissore F, Pirotti F, Guarnieri A, Vettore A. Toward the use of smartphones for mobile mapping. Geo-spatial Inf Sci. 2016;19:210–21.
11. Pirotti F, Neteler M, Rocchini D. Preface to the special issue “Open Science for earth remote sensing: latest developments in software and data.”. Open Geospatial Data Softw Stand. 2017;2:26. Available from: http://opengeospatialdata.springeropen.com/articles/10.1186/s40965-017-0039-y
12. Girardeau-Montaut D. CloudCompare (version 2.9) [GPL software] [Internet]. 2017. Available from: http://www.cloudcompare.org/. Accessed 01 Jan 2018.
13. Bivand RS, Pebesma E, Gomez-Rubio V. Applied spatial data analysis with R. 2nd ed. New York: Springer; 2013.
14. Guarnieri A, Pirotti F, Vettore A. Low-cost MEMS sensors and vision system for motion and position estimation of a scooter. Sensors. 2013;13:1510–22. Available from: http://www.mdpi.com/1424-8220/13/2/1510/
15. Yang C, Gidófalvi G. Mining and visual exploration of closed contiguous sequential patterns in trajectories. Int J Geogr Inf Sci. 2018;32(7):1282–304.
16. Boccardo P, Tonolo FG. Remote sensing role in emergency mapping for disaster response. In: Eng. Geol. Soc. Territ. - Vol. 5 Urban Geol. Sustain. Plan. Landsc. Exploit; 2015.
17. Pirotti F, Brovelli MA, Prestifilippo G, Zamboni G, Kilsedar CE, Piragnolo M, et al. An open source virtual globe rendering engine for 3D applications: NASA World Wind. Open Geospatial Data Softw Stand. 2017;2:4. Available from: http://opengeospatialdata.springeropen.com/articles/10.1186/s40965-017-0016-5
18. Piragnolo M, Pirotti F, Guarnieri A, Vettore A, Salogni G. Geo-Spatial Support for Assessment of Anthropic Impact on Biodiversity. ISPRS Int J Geo-Information. 2014;3:599–618. [cited 2014 Apr 26]. Available from: http://www.mdpi.com/2220-9964/3/2/599
19. Barazzetti L, Remondino F, Scaioni M. Automation in 3D reconstructing results on different kinds of close-range blocks. Int Arch Photogramm Remote Sens Spat Inf Sci. 2010;38:55–61.
20. Scaioni M, Feng T, Barazzetti L, Previtali M, Lu P, Qiao G, et al. Some applications of 2-D and 3-D photogrammetry during laboratory experiments for hydrogeological risk assessment. Geomatics Nat Hazards Risk. 2014 [cited 2014 Jun 28]:1–24. Available from: http://www.tandfonline.com/doi/abs/10.1080/19475705.2014.885090
21. Westoby MJ, Brasington J, Glasser NF, Hambrey MJ, Reynolds JM. ‘Structure-from-motion’ photogrammetry: a low-cost, effective tool for geoscience applications. Geomorphology. 2012;179:300–14. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0169555X12004217
22. Elseberg J, Magnenat S, Siegwart R, Nüchter A. Comparison of nearest-neighbor-search strategies and implementations for efficient shape registration. J Softw Eng Robot. 2012;3:2–12.
23. Muja M, Lowe DG. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. Int Conf Comput Vis Theory Appl Viss. 2009:331–40.
24. PCL Point Cloud Library. 2017. Available from: http://pointclouds.org/. Accessed 01 Jan 2018.
25. Breunig MM, Kriegel H-P, Ng RT, Sander J. LOF: Identifying Density-Based Local Outliers. In: Proc. 2000 Acm Sigmod Int. Conf. Manag. Data; 2000. p. 1–12. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.8948.
26. Kriegel H-P, Kröger P, Schubert E, Zimek A. LoOP: local outlier probabilities. In: Proc. 18th ACM Conf. Inf. Knowl. Manag; 2009. p. 1649–52. Available from: http://doi.acm.org/10.1145/1645953.1646195.
27. Goldstein M, Uchida S. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS One. 2016;11:e0152173.
28. Pirotti F. PCL LOF Filter Implementation. 2018 [cited 2018 Jan 1]. Available from: https://github.com/fpirotti/PCL-LOFFilter
29. Rusu RB, Cousins S. 3D is here: point cloud library (PCL). Shanghai, China: IEEE Int. Conf. Robot. Autom; 2011.
30. Muja M, Lowe DG. Fast Matching of Binary Features. In: Computer and Robot Vision (CRV); 2012. p. 404–10.
31. Muja M, Lowe DG. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Trans Pattern Anal Mach Intell. 2014;36(11):2227–40.
32. Isenburg M. LASlib (with LASzip). 2017. Available from: https://github.com/LAStools/LAStools/tree/master/LASlib. Accessed 01 Jan 2018.
33. SQLite library [Internet]. 2018. Available from: https://www.sqlite.org/about.html. Accessed 01 Jan 2018.
34. Sachs MC. Generate ROC Curve charts for print and interactive use [Internet]. 2017. Available from: https://cran.r-project.org/web/packages/plotROC/. Accessed 01 Jan 2018.
35. Fawcett T. ROC Graphs: notes and practical considerations for researchers. ReCALL. 2004;31:1–38. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.9777&rep=rep1&type=pdf
36. Ling CX, Huang J, Zhang H. AUC: A statistically consistent and more discriminating measure than accuracy. Int Jt Conf Artif Intell. 2003:519–24.