The spatial resolution of geographic data strongly influences the results of any spatial analysis. Eidlin [10], for example, showed that New York City is the densest place in the U.S. if city boundaries are chosen as the unit of analysis, whereas using metropolitan areas would assign this title to Los Angeles. Openshaw [20] labelled this issue the Modifiable Areal Unit Problem (MAUP): the results of spatial analyses are influenced by the chosen zone size. Similarly, changing the zone system in transport modeling requires complete recalibration of the model. Zones should be small enough to reduce the number of intrazonal trips [7, 9], but large enough to minimise the number of zones and keep model runtimes short. Spiekermann and Wegener [23] note that the selection of the appropriate zone size is contextual, while others describe it as more art than technique [13, 16]. More often than not, zone systems were defined by local authorities decades ago and are adopted for any spatial analysis [14, 21]. Changing a zone system or creating a new one is very labor intensive and hence rarely done [8].
Whenever a zone system is created from scratch, it traditionally requires manual labor to decide which neighborhoods shall form one zone, which streets, rivers or other geographical features should act as boundaries between zones, and how large zones shall be. Because the process is manual, the relative spatial resolution is extremely unlikely to be consistent across different parts of the study area. Other approaches have used uniform raster cells to cover the entire study area [5, 6]. While regular raster cells are very efficient for certain spatial analyses, including cellular automata, they are inefficient for transport modeling and other computing-intensive analyses dealing with complicated zonal interactions [4, 5]. Instead, zones should be larger where there is less activity and smaller where there is more activity. This way, most resources are allocated to the areas that deserve the most attention from the analyst.
In some cases, zones have been defined by individual land parcels [22, 24, 26]. While parcel-level data are very useful for many visualizations, no parcel-level transport or land use model for a larger metropolitan study area is known to be fully operational. The sheer number of parcels makes it impossible to run complex simulations efficiently.
In an attempt to overcome these issues, Moeckel and Donnelly [18] created a tool that automatically generates a zone system for the Georgia statewide model. The tool applied the quadtree algorithm and repeatedly subdivided larger raster cells into four smaller raster cells until each raster cell contained no more than 5000 units of population and employment. The tool created smaller raster cells in urban areas and larger raster cells in rural areas, and it was used successfully for a transport model. However, raster cells ignored jurisdictional boundaries, such as city, county and state boundaries. This shortcoming made it impossible to correctly allocate socio-demographic data, which were always given by jurisdiction, to raster cells. Furthermore, the algorithm produced some raster cells that were dominated by large bodies of water or had no population at all, adding unnecessary computational requirements to further analyses. While the quadtree algorithm is still applied at the core of the research presented here, new features have been developed to address these shortcomings. Land use is now considered when disaggregating population and employment, and municipal boundaries are respected to aid in data disaggregation and the development of hierarchical models. Finally, an automated approach is implemented to identify the required threshold parameter for the algorithm.
Approaches to zoning are often proprietary and vary widely, with each project using different manual methods or techniques that have been developed in house. This work is a step towards a more open approach to zoning analysis. The algorithm is simple to understand and highly automated, yet it is configurable and can be applied using only Open Data and no commercial libraries or software such as ArcGIS. Our methods are designed to be used with whatever data the analyst has available. As a general requirement, Open Data need to respect the privacy of individuals and are therefore often aggregated to the municipal level or higher. As such, zoning systems that adhere to administrative boundaries will aid the analyst in working with Open Data. While the data used in this analysis are not Open Data, it is worth noting that the openness of municipal data such as population, employment, and land use varies around the world, and that such data are, at a minimum, often available without charge from relevant statistical authorities for non-commercial use. The code is open source and available at https://github.com/msmobility/silo_zoneSystem.
The paper proceeds as follows. The “Literature review” section covers the MAUP and its implications for zone system design, and reviews the previously developed method. The “Methods” section describes the algorithm and its features. The “Application” section presents the creation of a zone system for the metropolitan area of Munich using our approach. The “Discussion” section discusses the benefits and drawbacks of the method, and the “Conclusion” section concludes.
Literature review
The significant variety of approaches to zone system design in the literature is a testament to the complexity and importance of the process. Moeckel and Donnelly [18] note that the zone systems used for spatial analyses and spatial modelling have different requirements. While analysis requires only that zones accurately represent statistical data in a spatial sense, zone systems for modeling also need to avoid zones of odd shapes, such as donuts and horseshoes. Openshaw [19] proposed the automatic zoning procedure (AZP), a rule-based approach that iteratively aggregates smaller zones to best fit certain statistical measures. Eventually, this approach was computerized using GIS software, extending its applicability to thousands of zones [21].
Another automated approach to updating existing zone systems was constructed by Cockings et al. [8], which split zones where the population was increasing and merged those where it was declining. Batty [3] developed a procedure that defines a zone system to maximize spatial entropy. Based on the concept of entropy from thermodynamics, spatial entropy is defined as the distribution of spatial data over an area in such a way that the information content cannot be increased.
Such automated spatial analysis zone systems are not suitable for spatial modeling because of the irregularly shaped zones they produce. The zone shape is particularly important in transportation models, as trip origins and destinations are calculated using the zone’s centroid. In some cases, such as donut- or horseshoe-shaped zones, the centroid may lie outside the zone area. Hence, zone systems need to be specifically designed for spatial modeling.
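The centroid problem can be illustrated with a small example. The sketch below is only an illustration under stated assumptions: the horseshoe coordinates are made up, and the helper names `polygon_centroid` and `contains` are hypothetical. It computes the area-weighted centroid with the shoelace formula and tests it with a standard ray-casting point-in-polygon check.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def contains(vertices, point):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical horseshoe-shaped zone: a 3 x 3 square with a notch cut
# into the top edge (illustrative coordinates, not from any real zone).
horseshoe = [(0, 0), (3, 0), (3, 3), (2, 3), (2, 1), (1, 1), (1, 3), (0, 3)]

centroid = polygon_centroid(horseshoe)
centroid_inside = contains(horseshoe, centroid)  # False: it falls in the notch
print(centroid, centroid_inside)
```

For this shape the centroid falls inside the notch, so trips assigned to it would start and end at a location that is not part of the zone.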
In a similar vein to the identification of the MAUP in spatial analyses [20], multiple studies have shown the impact of zone system design on spatial modeling results [7, 15, 25]. Viegas, Martinez, and Silva [25] investigated the MAUP in spatial modelling by analyzing the impact of various zone system resolutions on intrazonal trips and zero-trip zones. Lovelace, Ballas and Watson [15] investigated commute trips and confirmed that smaller zone sizes improved the fit of the model to observed data. These studies suggest that zone systems should be tailored to specific use cases in spatial modeling. However, this is typically not the case, primarily due to the time and cost required to revise existing zone systems and repopulate them with socioeconomic data.
A particularly interesting approach, called adaptive zoning, was presented by Hagen-Zanker and Jin [12]. For every origin zone, destination zones are aggregated based on their distance from the origin. Hence, a separate map is created for each origin, with nearby destination zones being small and more distant ones larger. They tested the method on a commuting model in England and found the results were equivalent to those of the conventional model, despite a reduction in the number of zone pairs by 96% and in computation time by 70%.
The introduction of computer systems made the use of raster cells attractive in spatial modelling. Raster cells are homogeneously shaped, easy to process geometrically, and have simple relationships to their adjacent cells. Approaches using cellular automata to model land use [5] and urban growth [1] have represented locations using raster cells and their interactions with adjacent neighbours. Moeckel [17] also used raster cells to create and compare land use models using firms versus those using employees.
Approaches using raster cells present some key challenges. Firstly, socio-economic data need to be accurately disaggregated to these raster cells. Spiekermann and Wegener [23] presented one solution. As part of a methodology for disaggregating zone systems, they generated probabilities of population and employment for each raster cell based on land-use data available at the size of the smallest raster cell. Monte Carlo sampling with these probabilities was then used to allocate socio-economic data to these raster cells.
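The following is a minimal sketch of this kind of Monte Carlo allocation, not the cited authors' implementation. It assumes that each raster cell already has a non-negative land-use weight; the function name `disaggregate`, the cell labels and the weights are all hypothetical.

```python
import random
from collections import Counter


def disaggregate(total, cell_weights, seed=0):
    """Allocate `total` units (e.g. residents) of one zone to its raster cells.

    cell_weights maps a cell id to a non-negative land-use weight
    (hypothetical values); each unit is drawn independently with
    probability proportional to the weight of the cell.
    """
    cells = list(cell_weights)
    weights = [cell_weights[c] for c in cells]
    if sum(weights) <= 0:
        raise ValueError("at least one cell needs a positive weight")
    rng = random.Random(seed)  # fixed seed for reproducible draws
    draws = rng.choices(cells, weights=weights, k=total)
    counts = Counter(draws)
    return {c: counts.get(c, 0) for c in cells}


# Hypothetical zone with three raster cells and assumed land-use weights.
allocation = disaggregate(
    1000, {"residential": 0.7, "water": 0.0, "commercial": 0.3}
)
print(allocation)
```

Cells with a weight of zero, such as water, receive no units, while the zone total is preserved exactly because every unit is assigned to exactly one cell.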
In the creation of an origin-destination matrix for transport modeling, each cell needs to interact not only with its adjacent neighbours but with all other zones as well. If the number of cells needed to cover a study area at the necessary resolution in a raster cell zone system is very large, the number of interactions between non-adjacent cells makes the model computationally infeasible. Moeckel and Donnelly [18] proposed a gradual rasterization method to retain the benefits of raster cells while reducing the number of zones. Smaller cells are created in dense metropolitan areas, and larger cells in rural areas. In doing so, they were able to programmatically define a raster cell zone system suitable for transport modeling.
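The scaling argument can be made concrete with a little arithmetic. The sketch below assumes a full origin-destination matrix with one entry per ordered zone pair, and it uses the rounded figures reported for the Georgia application in this paper (almost 5000 gradual zones versus about 4 million uniform cells). The function name `od_pairs` is illustrative.

```python
def od_pairs(n_zones):
    """Entries in a full origin-destination matrix (all ordered pairs)."""
    return n_zones ** 2


gradual_pairs = od_pairs(5_000)        # gradual rasterization, rounded
uniform_pairs = od_pairs(4_000_000)    # uniform smallest cells, rounded
ratio = uniform_pairs // gradual_pairs

print(gradual_pairs)   # 25000000
print(uniform_pairs)   # 16000000000000
print(ratio)           # 640000
```

Because the matrix grows with the square of the number of zones, an 800-fold increase in zones produces a 640,000-fold increase in matrix entries.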
Previously developed method
The gradual rasterization method for creating a zone system was first proposed by Moeckel and Donnelly [18] to model traffic along the I-75 corridor in Georgia. The GDOT (Georgia Department of Transportation) statewide model [2] was used to analyze transportation improvements along I-75 from Atlanta, Georgia to Chattanooga, Tennessee. It was found that along the section of I-75 within the Atlanta metropolitan region, travel demand was substantially overestimated by the model. Further investigation showed that an increase in geographical resolution within Atlanta improved results, suggesting that a higher spatial resolution in urban areas was needed. To achieve this, the authors proposed their gradual rasterization method, which improves spatial detail in denser areas while avoiding the rapid growth of the trip table, whose size increases with the square of the number of zones.
The study area was first rasterized into the smallest raster cells to be considered. A square covering Georgia was rasterized into 4096 x 4096 raster cells. The number of cells along each side must be a power of two for the quadtree algorithm to work. Population and employment data were then disaggregated to this raster: they were allocated proportionally to each cell based on the area percentage of the various intersecting zones.
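Area-proportional allocation can be sketched as follows. This is a simplified illustration, not the original tool: zones and cells are assumed to be axis-aligned rectangles given as `(xmin, ymin, xmax, ymax)` so that overlaps can be computed without a GIS library, and all names and numbers are hypothetical.

```python
def rect_area(r):
    """Area of an axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def allocate_by_area(zones, cells):
    """Distribute each zone's total to cells by share of the zone's area.

    zones: list of (rectangle, total) pairs; cells: list of rectangles.
    Assumes values are spread evenly within each zone.
    """
    result = [0.0] * len(cells)
    for zone_rect, total in zones:
        zone_area = rect_area(zone_rect)
        for i, cell in enumerate(cells):
            result[i] += total * overlap_area(zone_rect, cell) / zone_area
    return result


# Two hypothetical zones covering a 4 x 2 study area, and two raster cells.
zones = [((0, 0, 3, 2), 600), ((3, 0, 4, 2), 100)]
cells = [(0, 0, 2, 2), (2, 0, 4, 2)]

cell_totals = allocate_by_area(zones, cells)
print(cell_totals)  # first cell gets 2/3 of zone 1; second gets the rest
```

The second cell straddles both zones, so it receives one third of the first zone's total plus all of the second zone's total.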
The quadtree algorithm created the gradual raster cells. The algorithm started with one large cell covering the entire study area. If the summed population and employment of this cell exceeded the specified threshold, the cell was divided into four cells of equal size. This was repeated recursively for the new cells until the summed population and employment of each cell was below the threshold, or the cell had reached the minimum raster cell size. In this way, the number of zones was reduced by having many smaller cells in areas of higher population, and fewer, larger cells elsewhere. Moeckel and Donnelly based this decision on a rule proposed by Flowerdew, Feng and Manley [11], that zones across a study area should have a similar number of households. The threshold had to be specified manually.
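The recursive subdivision described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the input is a square list of lists whose side length is a power of two, each entry holds the summed population and employment of one smallest raster cell, and the names `quadtree_zones`, `threshold` and `min_size` are hypothetical.

```python
def quadtree_zones(grid, threshold, min_size=1):
    """Return zones as (row, col, size, total) tuples.

    A cell is split into four equal quadrants while its total exceeds
    `threshold` and it is larger than `min_size` grid cells per side.
    """
    n = len(grid)
    if n == 0 or n & (n - 1) or any(len(row) != n for row in grid):
        raise ValueError("grid must be square with a power-of-two side")

    zones = []

    def total(row, col, size):
        return sum(sum(grid[r][col:col + size])
                   for r in range(row, row + size))

    def split(row, col, size):
        cell_total = total(row, col, size)
        if cell_total <= threshold or size <= min_size:
            zones.append((row, col, size, cell_total))
            return
        half = size // 2
        for d_row in (0, half):
            for d_col in (0, half):
                split(row + d_row, col + d_col, half)

    split(0, 0, n)
    return zones


# Hypothetical 4 x 4 raster with activity concentrated in one quadrant.
grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 9, 0, 0],
    [8, 7, 0, 1],
]

zones = quadtree_zones(grid, threshold=10)
print(len(zones))  # 7 zones instead of 16 uniform cells
for zone in zones:
    print(zone)
```

In this example only the dense lower-left quadrant is subdivided to the smallest cell size, while the other three quadrants each remain a single zone. One of them has a total of zero, which mirrors the empty-cell shortcoming noted for the original method.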
Moeckel and Donnelly’s approach notably improved the model results. Through trial and error, it was found that a threshold of 5000 units of population and employment resulted in a suitable zone system consisting of almost 5000 zones. They found it remarkable that the overall model validation was improved only through changes to the spatial resolution of the assignment step, without modifications to the model design. The gradual rasterization kept roughly the same number of zones in rural areas, where the GDOT model performed well, but added zones in urban areas, where the GDOT model underperformed. They noted that while this process could have been performed manually, it would have risked introducing inconsistencies in the spatial resolution. A straightforward, rather than gradual, rasterization to the grid of the smallest cell size would have resulted in 4 million cells. With so many raster cells, the creation of trip tables and their assignment would have become infeasible.
Objectives
Moeckel and Donnelly’s algorithm has the limitations listed below. The following sections of this paper propose solutions to these shortcomings.
1. Raster cells can overlap multiple jurisdictions, resulting in a ‘secondary’ zone system that is not nested within the original set of zones or municipalities. This lack of hierarchy introduces added complexity and errors when assigning socioeconomic data or trip ends to raster cells.
2. Population and employment are distributed to raster cells by the area percentage of the overlapping zones. This unrealistically assumes that socioeconomic data such as population and employment are evenly distributed throughout the zones or municipalities.
3. The process of identifying a population and employment threshold that results in the desired spatial resolution and number of zones was a manual process of trial and error.
4. Every zone that exceeds a threshold value is split into four cells of equal size. If population was only present in one corner of this zone, three out of four newly created raster cells would have no population. Thereby, resources are allocated inefficiently to some degree.