Data collection
For testing purposes, a publicly available list of Center of Excellence–awarded restaurants in California was downloaded from the website of the Public Health Care Agency of Orange County [25]. In 2016, a total of 3631 restaurants were recognized as a Center of Excellence for their performance in 2015. The list contains each restaurant’s name, address, city, and ZIP code. Because both Google Sheets and R are subject to a daily geocoding limit, only 200 of these 3631 restaurants were selected for geocoding in this study. Moreover, this study seeks to compare the accuracy of geocoded results produced by two popular geocoding tools while providing a stepwise method of resolving geocoding challenges. Because visual verification of individual addresses is a tedious task, a small sample size, though larger than that used by Swift et al. (2008), was selected. Because the selection was purely for research purposes, no priority was given to any specific restaurant chain. The 200 addresses were stored as address.csv for further analysis.
Geocoding using Google Sheets and the ggmap package in RStudio
Google Sheets is a free web-based application developed by Google for real-time online document editing and collaboration with other users [26]. Several blog articles and tutorial videos explain the steps for geocoding physical locations through Google Sheets; one such tutorial, available through GitHub, was adapted for this study [27].
R is one of the most widely used open-source tools for statistics and visualization [28, 29], with more than 6000 packages [30] contributed by thousands of authors across the world. The ggmap package, a bundle of 34 functions, is an R package for spatial data modeling and visualization [31]. It uses Google and Stamen Maps as reference sources for geocoding and mapping. The code used in this study is adapted from Shane Lynn [32]. Most of it is kept intact for reproducibility, and it is available in geocode_2016.R and geocode_2016.txt, accessible through this article.
Distance calculation in RStudio using the geosphere package
After all addresses had been geocoded using Google Sheets and ggmap, the distances between the coordinates obtained for the same location were calculated with the geosphere package to validate the geocoding results. The geosphere package, a recently developed spatial analytics R package, combines 40 functions for calculating various aspects of distance, direction, and area when dealing with geographic coordinates [33]. Its distHaversine function was used for distance calculation. This function measures the shortest distance between two geographic coordinates, also known as the “great-circle distance” or distance measured “as the crow flies” [33]. The method is simple because it assumes a spherical earth, ignoring ellipsoidal effects [33]. It accepts coordinates in a specific format only: the first column of the input corresponds to longitude and the second to latitude [33]. It returns distance in meters, taking Earth’s radius to be 6,378,137 m [33]. The original code was modified to produce results in miles instead of meters. The modified output data for Google Sheets and ggmap are stored as gsheets.csv and ggplot.csv, respectively. A stepwise method of assessing geographical inconsistencies caused by geocoding errors is presented in a flow diagram (Fig. 1).
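The haversine calculation described above can be illustrated with a short, self-contained sketch. It is written in Python using only the standard library and is not the R code used in this study. The function names, the conversion constant for the international mile, and the sample coordinates are assumptions introduced for illustration. Points follow the longitude-first convention noted above, and the radius is the 6,378,137 m value stated in the text.

```python
import math

EARTH_RADIUS_M = 6_378_137.0   # Earth's radius in meters, as stated in the text
METERS_PER_MILE = 1609.344     # international mile (assumed conversion factor)


def haversine_m(p1, p2, r=EARTH_RADIUS_M):
    """Great-circle distance in meters between two points.

    Each point is a (longitude, latitude) pair in decimal degrees.
    A spherical earth of radius r is assumed.
    """
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def haversine_miles(p1, p2, r=EARTH_RADIUS_M):
    """Same distance, converted from meters to miles."""
    return haversine_m(p1, p2, r) / METERS_PER_MILE


# Hypothetical coordinates for one address as returned by two geocoders
point_a = (-117.8678, 33.7455)
point_b = (-117.8681, 33.7459)
print(round(haversine_miles(point_a, point_b), 4))
```

A distance near zero indicates that the two tools placed the address at almost the same point, whereas a large distance flags the address for visual verification.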
Descriptive statistics in RStudio using the pastecs package
Descriptive statistical analysis of the distances between the locations geocoded by Google Sheets and those geocoded by ggmap was performed using the pastecs R package [34]. The stat.desc function of pastecs computes various descriptive statistics, including the number of values, null values, and NAs, as well as the minimum, maximum, range, sum, median, mean, standard error of the mean, confidence interval of the mean, variance, standard deviation, and coefficient of variation [34]. Only the outputs relevant to this study are presented and discussed in the results.
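To make the listed statistics concrete, the following Python sketch computes the same kinds of quantities for a list of distances. It is a standard-library illustration, not the pastecs implementation. The function name describe, the dictionary keys, and the sample distances are hypothetical. Two choices are assumptions of this sketch: "null values" are counted as values equal to zero, and the confidence-interval half-width uses a normal approximation, which is narrower than a t-based interval for small samples.

```python
import math
import statistics


def describe(values, confidence=0.95):
    """Descriptive statistics for a list of numbers; None marks a missing value.

    Requires at least two non-missing values and a non-zero mean.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    var = statistics.variance(present)   # sample variance (n - 1 denominator)
    sd = math.sqrt(var)
    se = sd / math.sqrt(n)               # standard error of the mean
    # Normal-approximation quantile (assumption; a t quantile is more exact)
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)
    return {
        "n_values": n,
        "n_zero": sum(1 for v in present if v == 0),
        "n_missing": len(values) - n,
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "sum": math.fsum(present),
        "median": statistics.median(present),
        "mean": mean,
        "se_mean": se,
        "ci_mean": z * se,               # half-width of the interval around the mean
        "variance": var,
        "std_dev": sd,
        "coef_var": sd / mean,           # coefficient of variation
    }


# Hypothetical distances in miles between paired geocoded locations
distances = [0.00, 0.01, 0.02, 0.15, 1.30, None]
for name, value in describe(distances).items():
    print(name, value)
```

Because a few badly geocoded addresses can produce very large distances, comparing the median with the mean and the maximum is a quick way to see whether errors are concentrated in a small number of outliers.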
GIS mapping
The Arc Geographical Information System (ArcGIS) has been widely used for spatial analytics and modeling and is one of the most advanced and reliable geospatial analytical tools available [35, 36, 37]. However, QGIS, an open-source GIS tool, has become very popular in the field of geospatial analytics [38]. In this study, QGIS version 2.18.2 for the Windows environment was used to plot geocoded locations on a map [38]. Within the QGIS environment, the OpenLayers plugin provides options for selecting Google Maps and OpenStreetMap (OSM) as base maps on which to plot geocoded locations. The locations were plotted on Google Maps and OSM for visualization, comparison, and validation. Google Earth, a freeware virtual globe, map, and geographical information program that offers various mapping facilities and is one of the most reliable geocoding tools available, was also used to investigate locational accuracy [39]. Google Earth has a street view option, which provides a 360° horizontal and 290° vertical panoramic view at the street level from a height of about 2.5 m [39]. This view helps users verify actual locations by zooming to the street level.