Evaluating two freely available geocoding tools for geographical inconsistencies and geocoding errors
Open Geospatial Data, Software and Standards volume 2, Article number: 11 (2017)
Geocoding is highly prone to error for various reasons. This paper examines the geographical inconsistencies associated with geocoding errors seen when using two freely available geocoding tools, Google Sheets and ggmap.
Two hundred restaurants, all recipients of California’s Center of Excellence award, were selected for the analysis. The geocoded addresses were plotted on maps using QGIS, Google Maps, OpenStreetMap (OSM), and Google Earth for visualization, comparison, and validation. A stepwise method of analyzing the geographical inconsistencies is provided that can be adapted for any locational analytics.
Results and discussion
Both Google Sheets and ggmap were able to successfully geocode all 200 addresses, but ggmap incorrectly geocoded eight addresses as being more than 2,000 miles from their actual location. Addresses containing the ampersand character, &, caused ggmap to incorrectly geocode their location. After replacing the ampersand with the word and, ggmap was able to correctly geocode those addresses. The corrected locations plotted on Google Maps and OSM were similar, and they exactly matched the actual locations when plotted on Google Earth.
Both Google Sheets and ggmap are equally capable of geocoding physical locations, but R users are advised that addresses for geocoding must be free of the ampersand character if correct results are to be obtained. In addition, geocoded outputs should be plotted on a map using QGIS, ArcGIS, Google Maps, OSM, R, or any other such mapping tools for visualization and validation. This will ensure a high-quality geospatial analysis of places or events when locational information is vital for decision-making.
Geographic location plays a vital role in a variety of socioeconomic and environmental decisions, such as in selecting sites for new businesses or providing location-based services [1, 2]; detecting, valuing, and defining protected marine areas; locating prospective areas for grid-connected offshore wind power development; identifying disease-prone areas; responding to crime and natural hazards; and locating customer-friendly shopping malls. Geocoding, the process of assigning coordinates (latitude and longitude) to a physical location, has helped various industries improve performance through spatial analysis [6, 7]. However, the accuracy and reliability of geocoded results have always been a matter of concern among the geospatial analytics community [4, 7, 8, 9, 10]. Senaratne et al. (2017) provide a detailed review of the various methods applied in assessment of the quality of locational analytics. The authors report that accuracy measurement is the most frequent and reliable technique currently in practice. They define accuracy as the degree of closeness between measured and actual values, noting that it may vary with use of various geocoding tools. Geocoding is highly prone to error for various reasons, including lack of coverage (local vs. global); lack of complete, correct, consistent, and updated reference databases; and the making of inappropriate assumptions [7, 9, 10, 12, 13]. All these may affect match rate and positional accuracy. Additionally, incorrect geocoding may bias the results of spatial analysis, resulting in misclassification of actual physical locations that may adversely affect research outcomes or location-based business decisions [14, 15]. Accordingly, understanding and addressing these geocoding challenges is vital. Yet geocoding processes and error handling have been largely ignored in some studies [14, 15].
Various subscription-based and freely available geocoding tools can be used for batch geocoding of physical locations (Table 1). All these service providers use different reference databases, geocoding algorithms, address-parsing approaches, and inaccuracy-reporting methods [8, 9]. Consequently, the likelihood of differences in geocoded results is high [8, 9].
Several studies have offered a comparative analysis of various free or subscription-based geocoding services. For example, Karimi et al. evaluated the matching rate of geocoded addresses using web-based geocoding services, including Virtual Earth, Google Maps, Geocoder.us, MapQuest, and Yahoo Maps. Swift and his team members, in turn, assessed how accurately addresses were geocoded by seven commercial geocoding services (Centrus, Geolytics, ESRI Address Locator, Geocoder.us, Google Earth, Google Maps API, and the Yahoo API) and one open-source geocoding service (USC Geocoding Platforms). The authors selected 50 addresses for this purpose and found that only 42% of samples matched their reference data, 54% of addresses matched parcel centroids, and only 4% of addresses matched USPS ZIP code centroids. All the geocoding tools tested produced varying results, indicating that analysts should take care both when selecting a geocoding tool and when geocoding physical locations, especially for purposes of location-based analysis.
This study compares the use of two commonly used free geocoding tools for research and business purposes: Google Sheets, which is a Google offering, and ggmap, which is an R package. No comparison of these tools has yet appeared in mainstream journals.
ggmap is one of the most widely used geospatial R packages in a variety of domains. For example, it was used to geocode helminth (nematodes popularly known as roundworms) host–parasite interactions that helped establish the London Natural History Museum’s Host–Parasite Database; was used in a big-data environment to geocode customer movements from homes to shopping centers; and was used to site locations for implementation of a U.S. federal program offering families and children healthful foods during the summer months, administered by the U.S. Department of Agriculture. Google Sheets, or Google Spreadsheets, by contrast, has gained little attention among geospatial analytics communities, even though it has been applied to an array of domains, such as in the geocoding of socioeconomic historical data for visualization of urban geographies and in the public health domain [23, 24]. Google Sheets provides advantages not seen in ggmap, because it does not require coding and is a web-based application. By contrast, ggmap runs in R (here, through the RStudio software) and requires a sequence of queries; even so, it is widely used and accepted by researchers and professionals the world over.
For testing purposes, a publicly available list of Center of Excellence–awarded restaurants in California was downloaded from the website of the Public Health Care Agency of Orange County. In 2016, 3631 restaurants were recognized as Centers of Excellence for their performance in 2015. The list contains each restaurant’s name, address, city, and ZIP code. Because use of both Google Sheets and R is subject to a maximum geocoding limit per day, only 200 of these 3631 restaurants were selected for geocoding purposes in this study. Moreover, this study seeks to compare the accuracy of geocoded results produced by two popular geocoding tools while providing a stepwise method of resolving geocoding challenges: because visual verification of individual addresses is a tedious task, a small sample size, though larger than that used by Swift et al. (2008), was selected. Because the selection was purely for research purposes, no priority was given to any specific restaurant chain. The 200 addresses were stored as address.csv for further analysis.
Geocoding using Google Sheets and the RStudio ggmap package
Google Sheets is a free web-based application, developed by Google for real-time online document editing while collaborating with other users. Several blog articles and tutorial videos are available that instruct users in the steps used for geocoding physical locations through Google Sheets, such as one available through GitHub, which was adapted for this study.
R is one of the most widely used statistical and visualization open-source tools [28, 29], with more than 6000 packages contributed by thousands of authors across the world. ggmap, a bundle of 34 functions, is a spatial data modeling and visualization R package. This package uses Google and Stamen Maps as reference sources for geocoding and mapping. The code used in this study is adapted from Shane Lynn. Most of it is kept intact for reproducibility, and the code used is available in geocode_2016.R and geocode_2016.txt, accessible through this article.
Distance calculation in RStudio using the geosphere package
After geocoding all addresses using Google Sheets and ggmap, the distance between the two sets of coordinates obtained for each address was calculated using the geosphere package to validate the geocoding results. Geosphere, a recently developed spatial analytics R package, combines 40 functions developed for calculation of various aspects of distance, direction, and area when dealing with geographic coordinates. The distHaversine function of the geosphere package was used for distance calculation. This function measures the shortest distance between two geographic coordinates, also known as the “great-circle distance” or distance measured “as the crow flies”. The method is simple because it assumes a spherical earth, ignoring ellipsoidal effects. It accepts data in a specific format only: coordinates, with the first column of the input file corresponding to longitude and the second to latitude. This method produces distance in meters, taking Earth’s radius to be 6,378,137 m. The original code was modified to produce results in miles instead of meters. The modified output data for Google Sheets and ggmap is stored as gsheets.csv and ggplot.csv, respectively. A stepwise method of assessing geographical inconsistencies of geocoding errors is presented through a flow diagram (Fig. 1).
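The great-circle calculation described above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the code used in the study; the function name haversine_miles, the mile conversion constant, and the sample coordinates are assumptions introduced for illustration only. The Earth radius and the longitude-first argument order follow the description in the text.

```python
import math

EARTH_RADIUS_M = 6_378_137.0   # Earth radius in meters, as stated in the text
METERS_PER_MILE = 1609.344     # assumption: international mile used for conversion


def haversine_miles(lon1, lat1, lon2, lat2, radius_m=EARTH_RADIUS_M):
    """Great-circle distance in miles between two points given as
    (longitude, latitude) in decimal degrees, assuming a spherical earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the valid asin domain.
    central_angle = 2 * math.asin(min(1.0, math.sqrt(a)))
    return radius_m * central_angle / METERS_PER_MILE


# Hypothetical coordinates: one point in southern California, one in Florida.
print(round(haversine_miles(-117.8, 33.7, -81.4, 28.5), 1))
```

A distance of zero indicates that both tools returned identical coordinates for an address, whereas a distance of hundreds or thousands of miles signals a result that needs manual checking.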
Descriptive statistics in RStudio using the pastecs package
Descriptive statistical analysis of the distance calculated between the geocoded locations produced by Google Sheets and those produced by ggmap was performed using the pastecs R package. The stat.desc function of pastecs quantifies various descriptive statistics, including number of values, null values, NAs, minimum, maximum, range, sum, median, mean, standard error of the mean, confidence interval of the mean, variance, standard deviation, and coefficient of variation. In the results, only a few required outputs are presented and discussed.
The Arc Geographical Information System (ArcGIS) has seen much use in spatial analytics and modeling in different perspectives and is one of the most advanced and reliable geospatial analytical tools available [35, 36, 37]. However, QGIS, an open-source GIS tool, has become very popular in the field of geospatial analytics. In this study, geocoded locations were plotted on a map using QGIS version 2.18.2 for the Windows environment. Within the QGIS environment, the OpenLayers plugin provides options for selecting Google Maps and OSM as base maps on which to plot geocoded locations. These locations were plotted on Google Maps and OSM for visualization, comparison, and validation. Google Earth, a freeware virtual globe, map, and geographical information program that offers various mapping facilities and that is one of the most reliable geocoding tools available, was also used to investigate locational accuracy. Google Earth has a street view option, which provides a 360° horizontal and 290° vertical panoramic view at the street level from a height of about 2.5 m. This helps users verify actual locations by zooming to the street level.
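The kind of summary listed above can be sketched as follows. This is a Python illustration, not the pastecs code used in the study; the function name describe and the dictionary keys are hypothetical, "null values" is assumed here to mean zeros, and the confidence interval of the mean is omitted to keep the sketch free of external dependencies.

```python
import math
import statistics


def describe(values):
    """Summarize a list of distances; None entries are counted as missing (NA)."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n < 2:
        raise ValueError("at least two non-missing values are required")
    mean = statistics.fmean(present)
    sd = statistics.stdev(present)  # sample standard deviation (n - 1 denominator)
    return {
        "n_values": n,
        "n_zero": sum(1 for v in present if v == 0),  # assumed meaning of "null values"
        "n_missing": len(values) - n,
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "sum": math.fsum(present),
        "median": statistics.median(present),
        "mean": mean,
        "se_mean": sd / math.sqrt(n),
        "variance": statistics.variance(present),
        "std_dev": sd,
        "coef_var": sd / mean if mean != 0 else float("nan"),
    }


# Hypothetical distances in miles between paired geocodes.
summary = describe([0.0, 2.0, 4.0, None])
print(summary["mean"], summary["max"])
```

For distances between paired geocodes, the maximum and the count of zeros are usually the most informative entries, because they show the worst disagreement and the number of exact matches.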
Results and discussion
Geocoded outputs from Google Sheets and ggmap
Google Sheets and ggmap were able to geocode all 200 addresses without error. The number of geocoded addresses was within the limit of 2500 instances per day for both tools; see google-sheets_ggmap_geocoded.csv for their outputs. Unlike Google Sheets, however, ggmap provides additional information on the accuracy level of the geocoded locations (google-sheets_ggmap_geocoded.csv). Most (82%) addresses were found to be geocoded with street address–level accuracy (Table 2). Twenty-two of 200 addresses were found to be accurate at the sub-premise level, with seven pointing to the nearest bus station, two to the locality, and one to the campground level (Table 2).
The geocoded addresses produced by Google Sheets and ggmap were compared by calculating the differences between the latitudes and the differences between the longitudes produced by the two tools. The geocoded addresses matched in only 53% (107 of 200) of cases. The minimum difference between the results produced by Google Sheets and those produced by ggmap was −0.00011 degrees of latitude and −35.65284 degrees of longitude. The maximum difference was 4.29319 degrees of latitude and 0.00016 degrees of longitude, with a standard deviation of 0.76 degrees of latitude and 6.60 degrees of longitude. Furthermore, the geocoded addresses produced by ggmap exactly matched Google Sheets outputs in only 56% (92 of 165) of instances involving street address–level accuracy, 45% (10 of 22) of instances involving sub-premise-level accuracy, and 100% of instances involving either campground-level (1 of 1) or bus station–level (7 of 7) accuracy.
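The coordinate comparison described above can be sketched in a few lines. The Python function below is an illustration under stated assumptions, not the study's code: the name compare_geocodes, the (latitude, longitude) tuple layout, the direction of subtraction (first tool minus second), and the tol parameter are all hypothetical.

```python
def compare_geocodes(coords_a, coords_b, tol=0.0):
    """Compare two equal-length lists of (lat, lon) tuples for the same addresses,
    listed in the same order. Differences are computed as coords_a minus coords_b."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("inputs must be non-empty and list the same addresses")
    lat_diffs = [a[0] - b[0] for a, b in zip(coords_a, coords_b)]
    lon_diffs = [a[1] - b[1] for a, b in zip(coords_a, coords_b)]
    # An address counts as matched when both differences are within the tolerance;
    # tol=0.0 means an exact match of the reported coordinates.
    matched = sum(
        1 for dlat, dlon in zip(lat_diffs, lon_diffs)
        if abs(dlat) <= tol and abs(dlon) <= tol
    )
    return {
        "lat_min": min(lat_diffs), "lat_max": max(lat_diffs),
        "lon_min": min(lon_diffs), "lon_max": max(lon_diffs),
        "matched": matched,
        "total": len(lat_diffs),
        "match_rate": matched / len(lat_diffs),
    }


# Hypothetical outputs from two tools for the same three addresses.
tool_a = [(33.70, -117.80), (33.65, -117.90), (33.80, -117.85)]
tool_b = [(33.70, -117.80), (33.65, -117.90), (28.50, -81.40)]
print(compare_geocodes(tool_a, tool_b)["match_rate"])
```

A small non-zero tol may be preferable in practice, because two tools can report the same location with coordinates that differ only in the last decimal places.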
All 200 geolocations were visualized by means of a QGIS map (Fig. 2). The purple symbols indicate ggmap-geocoded locations and the orange symbols Google Sheets–geocoded locations (Fig. 2-a). Overlapping symbols indicate an exact match in geocoded location. Non-matched locations are distinctive on the map (Fig. 2-b). The biggest failure of ggmap geocoding was in producing similar coordinates for seven addresses, all at the bus station level (google-sheets_ggmap_geocoded.csv, Table 2). Because of their similar coordinates, however, only one is visible on the map, in Florida (Fig. 2-c).
These eight incorrectly geocoded addresses were further individually geocoded using ggmap. Surprisingly, ggmap produced similar results for seven addresses belonging to the same restaurant chain. These eight addresses were plotted using Google Maps and OSM to visualize their physical locations (Fig. 3). Both maps showed their locations in California, in their actual ZIP codes.
These eight locations were further individually plotted on Google Earth and zoomed to the street level to verify the degree of correspondence of geocoded location with actual location. Errors 1–7 were geocoded to the exact premises of the restaurant (Fig. 4). However, error 8 was located outside the premises of a hotel in which this restaurant likely operates.
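Part of this visual verification can be pre-screened automatically. The sketch below, in Python, flags geocoded points that fall outside a rough bounding box for California so that they can be inspected on a map first. It is an illustrative aid rather than a step of the study; the function name outside_bounds and the box limits are assumptions, and the limits are deliberately approximate.

```python
# Approximate bounding box for California in decimal degrees (assumed, rounded values).
CA_BOUNDS = {"lat_min": 32.5, "lat_max": 42.1, "lon_min": -124.6, "lon_max": -114.0}


def outside_bounds(points, bounds=CA_BOUNDS):
    """Return the indices of (lat, lon) points that fall outside the bounding box."""
    flagged = []
    for i, (lat, lon) in enumerate(points):
        inside = (bounds["lat_min"] <= lat <= bounds["lat_max"]
                  and bounds["lon_min"] <= lon <= bounds["lon_max"])
        if not inside:
            flagged.append(i)
    return flagged


# Hypothetical geocodes: two in southern California, one in Florida.
geocodes = [(33.70, -117.80), (34.05, -118.24), (28.50, -81.40)]
print(outside_bounds(geocodes))
```

Because a rectangular box also covers parts of neighboring states, passing this check does not prove that a location is correct; it only narrows down which points most urgently need street-level verification.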
Furthermore, the distance between the coordinates produced by Google Sheets and ggmap was calculated (in miles) in RStudio using the geosphere package, with a descriptive statistical summary produced (Table 3).
The maximum distance between coordinates of the same address was 2107.2 miles, the distance from California to Florida (Fig. 2a and c). Although both Google Sheets and ggmap were able to successfully geocode all physical locations of these recipients of California’s Center of Excellence award, ggmap did not produce correct coordinates in every case. Because both tools use Google Maps as a reference, it was unexpected that they would produce different results. Accordingly, the addresses that had been incorrectly geocoded, more than 2000 miles from their real locations, were revisited with an eye to identifying the cause of error. In doing so, it was found that all these addresses had one thing in common: their name included the ampersand character: &. (The names of these restaurants have here been replaced with “Restaurants” to protect the actual restaurants’ privacy.) The ampersand character was replaced with the word and in these addresses, the corrected addresses were re-geocoded using ggmap, and the distance between the incorrect and the corrected coordinates was then calculated; the results are presented in Table 4.
After correction of the addresses, all re-geocoded results matched Google Sheets outputs. Evidently, the use of even a single problematic character, the ampersand, can cause ggmap to produce incorrect outputs, assigning coordinates as far as 2000 miles from their real location.
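The correction applied here amounts to a simple text substitution performed before geocoding. A minimal Python sketch is given below for illustration; the function name replace_ampersand and the sample address are hypothetical, and the study itself performed the replacement on the R input data.

```python
import re


def replace_ampersand(address):
    """Replace every ampersand with the word 'and' and tidy the surrounding spaces."""
    cleaned = re.sub(r"\s*&\s*", " and ", address)
    # Collapse any repeated whitespace and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical address containing an ampersand in the restaurant name.
print(replace_ampersand("Restaurants Bar & Grill, 1 Main St, Irvine, CA 92602"))
```

Applying such a cleaning step to the whole address list before batch geocoding avoids having to trace the problem back from wrongly placed points afterward.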
Although the geocoding tools Google Sheets and ggmap use a common map reference, they produce varying results. In addition, specific formatting, free of problematic characters such as the ampersand, is required for correct geocoding by ggmap. Google Sheets, by contrast, features a user-friendly environment that does more to aid production of reliable geocoding results. Regardless, users of geocoding tools should not wholly rely on whichever tool they use but rather should always verify their results by the methods outlined in this study or by any other established approach. Visualizing geocoded results on a map using QGIS, ArcGIS, Google Earth, OSM, or R can help in identifying and resolving potential challenges to accuracy. Other factors not covered in this study could also produce erroneous geocoded results, so analysts should carefully evaluate their results and report them in detail, taking particular care when geocoding physical locations in bulk. This study seeks merely to compare the respective geocoding potentials of two freely available geocoding tools for research purposes, not to promote or undermine either of them. Reporting positional accuracy challenges and methods of resolving them can help users of geospatial analytics conduct efficient and accurate spatial analysis.
Schmenner RW. Look beyond the obvious in plant location. Harv Bus Rev. 1979;57(1):126–32.
Singh SK. Geospatial analysis of census data for targeting new businesses using Geoeconomics. Journal of Intelligence Studies in Business. 2016;6(12):5–12.
Waewsak J, Landry M, Gagnon Y. Offshore wind power potential of the Gulf of Thailand. Renew Energy. 2015;81:609–26.
Rushton G, et al. Geocoding in cancer research: a review. Am J Prev Med. 2006;30(2):S16–24.
Mohamad MY, Al Katheeri F, Salam A. A GIS application for location selection and Customers' preferences for shopping malls in al Ain City; UAE. American Journal of Geographic Information System. 2015;4(2):76–86.
Goldberg DW, et al. An evaluation framework for comparing geocoding systems. Int J Health Geogr. 2013;12(1):1.
Zandbergen PA. Geocoding quality and implications for spatial analysis. Geography Compass. 2009;3(2):647–80.
Goldberg DW, Wilson JP, Knoblock CA. From text to geographic coordinates: the current state of geocoding. URISA Journal. 2007;19(1):33.
Karimi HA, Durcik M, Rasdorf W. Evaluation of uncertainties associated with geocoding techniques. Computer-Aided Civil and Infrastructure Engineering. 2004;19(3):170–85.
Zhang J, Goodchild MF. Uncertainty in geographical information. CRC Press; 2002.
Senaratne H, et al. A review of volunteered geographic information quality assessment methods. Int J Geogr Inf Sci. 2017;31(1):139–67.
Roongpiboonsopit D, Karimi HA. Comparative evaluation and analysis of online geocoding services. Int J Geogr Inf Sci. 2010;24(7):1081–100.
Karimi HA, Sharker MH, Roongpiboonsopit D. Geocoding recommender: an algorithm to recommend optimal online geocoding services for applications. Trans GIS. 2011;15(6):869–86.
McLafferty S, et al. Spatial error in geocoding physician location data from the AMA physician Masterfile: implications for spatial accessibility analysis. Spatial and spatio-temporal epidemiology. 2012;3(1):31–8.
Hay G, et al. Potential biases due to geocoding error in spatial analyses of official data. Health & place. 2009;15(2):562–7.
Goldberg DW, Wilson JP, Cockburn MG. Toward quantitative geocode accuracy metrics. In: Ninth international symposium on spatial accuracy assessment in natural resources and environmental sciences; 2010.
TAMG. Available Geocoding software. 2016. Available from: https://geoservices.tamu.edu/Services/Geocode/OtherGeocoders/. [cited 2016 October 05].
Swift J, Goldberg D, Wilson J. Geocoding best practices: review of eight commonly used geocoding systems. Los Angeles: University of Southern California GIS Research Laboratory; 2008.
Dallas T. helminthR: an R interface to the London Natural History Museum's host–parasite database. Ecography. 2016;39(4):391–3.
Lovelace R, et al. From big noise to big data: toward the verification of large data sets for understanding regional retail flows. Geogr Anal. 2016;48(1):59–81.
Wilkerson RL, Khalfe D, Krey K. Associations between neighborhoods and summer meals sites: measuring access to Federal Summer Meals Programs. Journal of Applied Research on Children: Informing Policy for Children at Risk. 2016;6(2):9.
Rodger R, Fleet C, Nicol S. Visualising urban geographies. e-Perimetron. 2010;5(3):118–31.
Cinnamon J, Schuurman N. GeoWeb and web 2.0: new tools for public health. PositionIT; 2010. p. 47–51.
Cinnamon J, Schuurman N. Web technologies for public health surveillance in low and middle-income countries, in sixth international conference on geographic information science. Zurich: GIScience; 2010.
HCA. Food facility award of excellence food inspection program. 2016. Available from: http://ocfoodinfo.com/retail/award.
Wikipedia. Google docs, sheets and slides. 2016. Available from: https://en.wikipedia.org/wiki/Google_Docs,_Sheets_and_Slides.
Nuket. Google-sheets-geocoding-macro. 2010. Available from: https://github.com/nuket/google-sheets-geocoding-macro.
Rossiter D. Introduction to the R project for statistical computing for use at ITC. International Institute for geo-Information Science & earth observation (ITC), Enschede (NL), vol. 3; 2012. p. 3–6.
Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comput Graph Stat. 1996;5(3):299–314.
de Vries A. How many packages are there really on CRAN? 2015. Available from: http://blog.revolutionanalytics.com/2015/06/how-many-packages-are-there-really-on-cran.html. [cited 2016 October 6].
Kahle D, Wickham H. Ggmap: spatial visualization with ggplot2. R Journal. 2013;5(1):144–61.
Lynn S. Batch geocoding with R and Google maps. 2013. Available from: https://www.r-bloggers.com/batch-geocoding-with-r-and-google-maps/. [cited 2016 September 20].
Hijmans RJ, et al. Package ‘geosphere’. Vienna: R Foundation; 2015. Available from: https://cran.rproject.org/web/packages/geosphere/geosphere.pdf. [cited 02-01-2016].
Grosjean P, Ibanez F. Pastecs: package for analysis of space-time ecological series. R package version 1.3–18. 2014. http://CRAN.R-project.org/package=pastecs
Singh SK. Assessing and mapping vulnerability and risk perceptions to groundwater arsenic contamination: towards developing sustainable arsenic mitigation models (order no. 3701365), Available from ProQuest Dissertations & Theses Full Text. (1681668682). In earth and environmental studies. USA: Montclair State University; 2015. p. 392.
Singh SK, Brachfeld SA, Taylor RW. Evaluating hydrogeological and topographic controls on groundwater arsenic contamination in the mid-Gangetic plain in India: towards developing sustainable arsenic mitigation models. In: Emerging issues in groundwater resources, advances in water security, A. Fares, editor. Switzerland: Springer International Publishing; 2016.
Singh SK, Vedwan N. Mapping composite vulnerability to groundwater arsenic contamination: an analytical framework and a case study in India. Nat Hazards. 2015;75(2):1883–908.
QGIS Development Team. QGIS geographic information system. Open Source Geospatial Foundation project. 2015.
Wikipedia. Google earth. 2016. Available from: https://en.wikipedia.org/wiki/Google_Earth. [cited 2016 October 9].
This work was conducted purely for research purposes and does not necessarily represent the official views of the organization with which the author is associated. RStudio and QGIS are open-source tools, freely available for research work.
The author did not receive any funding for this study.
The author declares that he has no competing interests.
SKS conceptualized the study, analyzed the data, and wrote the manuscript.