A sustainable process and toolbox for geographical linked data generation and publication: a case study with BTN100
Open Geospatial Data, Software and Standards, volume 4, Article number: 2 (2019)
We describe the process and tools that we have used to generate and publish the BTN100 Linked Dataset, based on the original data from the Spanish Topographic Base (1:100.000 scale) from the Spanish Instituto Geográfico Nacional. We have taken into account the limitations and lessons learned from our initial experience on the generation and publication of Linked Data from a range of geographical sources in Spain, in 2010, and we have now refined the process in order to facilitate: declarative mappings for the transformations from existing open data (shapefiles), automation of transformations whenever there are changes in the original data sources, version control, and alignment with INSPIRE URIs. As a result of this transformation and publication process we have also updated the reference ontology for geographical features and aligned with general ontologies such as GeoSPARQL.
One of the activities of the Spanish Instituto Geográfico Nacional (IGN) is to produce geographical information for all the territorial entities in Spain. IGN is responsible for maintaining and making accessible cartographic and topographic databases for the representation of the Spanish territory. Its catalogs publish data related to transport networks, geodetic information, administrative units, etc., and anyone can download them from its data portal under an open data license compatible with CC BY 4.0.
Governments, via their many agencies and organizations, are constantly producing data that may be highly interrelated but that in practice become isolated due to a lack of interoperability. Cartographic and topographic information from IGN could easily enrich data from other government entities, e.g. data from the National Institute of Statistics, Institute of Cultural Heritage, General Direction of Cadastre, Geological and Mining Institute, etc.
However, the general lack of semantic standards in the descriptions of the data elements within these data sources makes them difficult to reuse. Despite progress in data availability, there are still plenty of issues related to semantic interoperability, that is, the ability of information systems to exchange data with unambiguous, shared meaning.
Several initiatives around the world have focused on generating and publishing Linked Data from a range of geospatial data sources, and a W3C/OGC Working Group titled “Spatial Data on the Web” ran between 2015 and 2017, producing recommendations on how to publish different types of geospatial, sensor, and temporal data on the web in a principled manner. The LinkedGeoData initiative aimed to make the information collected by OpenStreetMap available as RDF and to interlink this data with other knowledge bases. Ordnance Survey Linked Data publishes a number of products from Great Britain’s national mapping agency as Linked Data and provides access to them through a SPARQL endpoint. The Swiss Linked Data Service publishes the geospatial datasets from the Swiss Federal Spatial Data Infrastructure via a Linked Data Frontend, which provides a search, querying and visualization interface.
In Spain, there was also some pioneering work on producing geospatial Linked Data from a range of data sources (many of them from IGN), as described in [1, 2]. Additionally, an ontology of administrative units has been created and published at http://vocab.linkeddata.es/datosabiertos/def/sector-publico/territorio and some regions have produced Linked Data about their administrative units, such as Aragón [3]. In 2010, we worked on the GeoLinked Data initiative in order to enrich the web of data with Spanish Linked Data. We used as input several relational databases and Excel spreadsheets about administrative units, hydrography and statistical domains. We then modeled an ontology to represent this data and generated the RDF, using the Geometry2RDF plugin to deal with the geometrical transformations. The generated RDF was compliant with the WGS84 vocabulary and the GML ontology. We added some links between terms from each data source, published the resulting RDF in a triplestore and made it available via a visualization tool (Map4RDF). Nevertheless, our process had some limitations. Geometries did not make use of GeoSPARQL, since it was an emerging standard with little tool support. Our transformations relied on special access to Oracle Spatial databases instead of already published open data. No automation was included to deal with the evolution of the data sources, which made the Linked Data stale quickly, and manual intervention was needed for every update.
In this paper we describe the process of transforming the Spanish Topographic Base at 1:100.000 scale (BTN100) catalog into Linked Data. We used GitHub for version control and as an archive to store all the RDF transformations. The process includes defining the semantic model for this geospatial data, generating the data transformations, publishing them through a SPARQL endpoint, and maintaining the Linked Dataset.
Our work represents a step forward in improving semantic interoperability in the geospatial domain. We make use of the open BTN100 dataset and define semantics for its data. Through the semantic model we represent complex geometrical shapes, e.g. multi-lines, polygons, multi-polygons, etc. This makes it possible to visualize these geometric entities and to infer assertions such as one geographical entity being contained within another. In addition, we provide an automatic way to deal with changes in the data source in order to keep the Linked Dataset up to date.
The paper is organized as follows: the “Methodology and results” section describes all the steps followed to generate and maintain the Linked Dataset. The “Conclusions and future directions” section presents the conclusions and future work.
Methodology and results
Generating Linked Data is a process that involves several activities and decisions in order to obtain a high-quality Linked Dataset. We followed the Methodological Guidelines for Publishing Government Linked Data [5]. These guidelines cover all the steps and details that are necessary for the activities involved. The activities described by the guidelines are: specification, modelling, generation, publication and exploitation. Each activity involves one or more tasks and some techniques for carrying them out.
The first task of the specification activity focuses on identifying the data sources, formats, information within the datasets and general requirements for the resulting Linked Dataset. In our case, we used the open BTN100 catalog as our data source. This data source is available from the Spanish National Center for Geographic Information (CNIG) as shapefiles in the ETRS89 and REGCAN95 Coordinate Reference Systems (CRS). The BTN100 catalog contains topographic and thematic geographic information; it was designed following the INSPIRE Directive. It groups the data into the following themes: administrative units, protected zones, buildings and population entities, transport networks, energy and conduction, geodetic vertices, altimetry and hydrography.
A URI design task is also part of this activity. In our case, we defined persistent URIs for our features according to the Spatial Data on the Web best practices described in [6] and the Technical Interoperability Standard [7]. The base URI for all elements is https://datos.ign.es. We followed an upper camel case strategy to name classes and a lower camel case strategy for object and data properties and resources. In Table 1 we present our URI design.
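The naming strategy can be illustrated with a minimal sketch. Only the base URI comes from the text; the `def/btn100#` path and the helper names are assumptions made for illustration and are not taken from Table 1.

```python
import re

BASE_URI = "https://datos.ign.es"  # base URI stated in the text


def _words(label):
    """Split a label into words made of letters and digits (underscores excluded)."""
    return re.findall(r"[^\W_]+", label)


def upper_camel(label):
    """Upper camel case, the strategy used for class names."""
    return "".join(word[:1].upper() + word[1:].lower() for word in _words(label))


def lower_camel(label):
    """Lower camel case, the strategy used for properties and resources."""
    name = upper_camel(label)
    return name[:1].lower() + name[1:]


def class_uri(label, path="def/btn100#"):
    # Assumption: the path segment is a hypothetical pattern, not the published design.
    return f"{BASE_URI}/{path}{upper_camel(label)}"
```

For instance, `upper_camel("objeto geografico")` yields `ObjetoGeografico`, while `lower_camel("tipo de acceso")` yields `tipoDeAcceso`.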
The final task of this activity is the definition of the license of the Linked Dataset. We decided to reuse the IGN license for the BTN100 Linked Dataset. It is a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
In the modelling activity, in order to represent all themes of the dataset, we generated an ontology, which replaces the former http://geo.linkeddata.es ontology. The ontology was developed following the LOT Methodology described on-line (originally proposed in [8]) and used, for example, in [9]. Reusing ontologies was important throughout our development process. We focused our analysis on the common spatial vocabularies recommended by the W3C Working Group Note [6]. We decided to reuse the GeoSPARQL vocabulary to represent geospatial data, since it makes it possible to use specialized functions for geometries.
The GeoSPARQL vocabulary does not allow representing elements such as identifiers for resources, labels for geographical objects, altitude, etc. In order to address these shortcomings, we developed the btn100 ontology. This model represents all the geographical objects from our dataset. The btn100 ontology has links to SKOS thesauri that were developed to represent some categories of elements in our dataset, for instance type of highway access, type of roadway, etc. These thesauri will be linked in the future to those maintained in the INSPIRE registry. All files generated during the ontology development, including the requirements, ontologies, thesauri and documentation, are available in a GitHub repository at https://github.com/oeg-upm/ontology-BTN100.
In Fig. 1 we show the general ontology model. The esam and escj ontologies are included in our model because they are reused to represent Administrative Units and Streets, respectively.
In Fig. 2a we show an extract of the classes of the btn100 ontology. As depicted, all classes are subclasses of ObjetoGeografico, which is equivalent to the Feature class from the GeoSPARQL vocabulary. This equivalence is defined in the geo_core ontology, which reuses the GeoSPARQL vocabulary and defines several data properties that are common to all concepts of the BTN100 catalog. Finally, on the right side of Fig. 2a we present the main metrics of the btn100 ontology, including the total number of axioms, classes, properties, etc.
The model depicted in Fig. 2b presents the classes defined to represent all concepts from the administrative units and protected zones themes, two of the themes listed above for the specification activity. The btn100 documentation in HTML format, including further details of the representation of the other themes and their diagrams, is available at https://datos.ign.es/def/btn100.
In Fig. 3 we present an example of the URI definition for btn100. The illustration represents the statement: La Autopista “AP2-E50” tiene acceso de tipo peaje (The “AP2-E50” highway has toll access).
In the generation activity we followed the process depicted in Fig. 4 in order to generate the Linked Dataset. First, we extracted all the files from the BTN100 data source and unzipped them in order to obtain the shapefiles for each theme. Then we applied some data transformations and obtained an RDF file as a result. Finally, we linked the resulting dataset to other resources and obtained the Linked Dataset. As mentioned before, we stored all files involved in this process in a GitHub repository available at https://github.com/oeg-upm/btn100.
In order to deal with the transformation tasks we used GeoKettle. With this tool we created a transformation file for each shapefile and configured a workflow to perform the following activities. First, we cleaned the data, e.g. correcting malformed or incompatible datatypes. Then, we mapped the data to their corresponding SKOS concepts. Next, we converted ETRS89 and REGCAN95 coordinates into the WGS84 CRS in order to represent data according to the GeoSPARQL standard. Last, we transformed the data, via the TripleGeo plugin, into triples according to the model defined in the btn100 ontology.
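The first two workflow steps (cleaning and mapping to SKOS concepts) were configured in GeoKettle, not written as code. The following is only a rough Python sketch of what such steps do; the field names, the decimal-comma handling and the `example.org` concept URIs are hypothetical.

```python
# Hypothetical lookup from source attribute values to SKOS concept URIs.
ACCESS_TYPE_CONCEPTS = {
    "peaje": "https://example.org/kos/access-type/toll",
    "libre": "https://example.org/kos/access-type/free",
}


def clean_record(raw):
    """Trim strings, turn blank strings into None and coerce a numeric field."""
    cleaned = {}
    for key, value in raw.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    # Assumption: 'altitude' may arrive as text with a decimal comma.
    if cleaned.get("altitude") is not None:
        cleaned["altitude"] = float(str(cleaned["altitude"]).replace(",", "."))
    return cleaned


def map_to_concept(code, concepts):
    """Return the SKOS concept URI for a source value, or None if it is unmapped."""
    if code is None:
        return None
    return concepts.get(code.strip().lower())
```

Returning `None` for unmapped values, instead of raising, lets a later step decide whether to skip the attribute or report the record.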
TripleGeo converts the geospatial features into an RDF serialization, in our case Turtle files. We enabled TripleGeo as a GeoKettle plugin in order to provide an accurate generation of the semantic information, for instance a correct URI definition. In Fig. 5 we depict an example of the TripleGeo structure. At the top of the TripleGeo window we can set the type and URI for a resource. At the bottom we can set prefixes and URIs for the fields available in the shapefile. Further details about TripleGeo configuration parameters are available in its wiki.
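To give an idea of the kind of Turtle that a feature with a GeoSPARQL geometry corresponds to, here is a small hand-written sketch. It is not TripleGeo output: the resource and class URIs and the `/geometry` suffix are hypothetical, and only the GeoSPARQL terms (`geo:hasGeometry`, `geo:asWKT`, `geo:wktLiteral`) and the CRS84 URI come from the standard.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
CRS84 = "http://www.opengis.net/def/crs/OGC/1.3/CRS84"  # WGS84, longitude/latitude order


def wkt_polygon(ring):
    """Format a ring of (lon, lat) pairs as a WKT POLYGON, closing it if needed."""
    ring = list(ring)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # WKT rings must end where they start
    coords = ", ".join(f"{lon} {lat}" for lon, lat in ring)
    return f"POLYGON(({coords}))"


def feature_to_turtle(feature_uri, class_uri, label, ring):
    """Serialize one feature and its geometry as Turtle with full IRIs."""
    geometry_uri = feature_uri + "/geometry"  # hypothetical geometry URI pattern
    safe_label = label.replace("\\", "\\\\").replace('"', '\\"')
    literal = f"<{CRS84}> {wkt_polygon(ring)}"
    return "\n".join([
        f"<{feature_uri}> a <{class_uri}> ;",
        f'    <{RDFS_LABEL}> "{safe_label}"@es ;',
        f"    <{GEO}hasGeometry> <{geometry_uri}> .",
        f"<{geometry_uri}> a <{GEO}Geometry> ;",
        f'    <{GEO}asWKT> "{literal}"^^<{GEO}wktLiteral> .',
    ])
```

Full IRIs are written out instead of prefixed names so that each generated fragment is valid on its own, without a shared prefix header.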
Finally, for the linking task we used the owl:sameAs relationship to align our resources with DBpedia and other resources from the Spanish government open data portal. All files generated during the linking task are also available in the GitHub repository.
The publication step aims to provide access to the resulting dataset. We stored the RDF files in a Virtuoso triplestore. Virtuoso provides a SPARQL endpoint, available at https://datos.ign.es/sparql. Some use cases with their SPARQL queries are available at https://datos.ign.es/casos-de-uso.html.
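As an illustration of how a client could address the endpoint, the sketch below builds a query for features and their WKT geometries and encodes it as a SPARQL Protocol GET request. The class URI passed to `build_query` is an assumption of the caller; the query is not one of the published use cases, and the code only constructs the URL without sending it.

```python
from urllib.parse import urlencode

ENDPOINT = "https://datos.ign.es/sparql"  # endpoint given in the text


def build_query(class_uri, limit=10):
    """Select features of one class together with their WKT geometry."""
    return (
        "PREFIX geo: <http://www.opengis.net/ont/geosparql#>\n"
        "SELECT ?feature ?wkt WHERE {\n"
        f"  ?feature a <{class_uri}> ;\n"
        "           geo:hasGeometry ?geometry .\n"
        "  ?geometry geo:asWKT ?wkt .\n"
        f"}} LIMIT {int(limit)}"
    )


def request_url(query):
    """Encode the query in the 'query' parameter of a GET request."""
    return f"{ENDPOINT}?{urlencode({'query': query})}"
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request`, with an `Accept: application/sparql-results+json` header to request JSON results.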
We also provided a web interface to the SPARQL endpoint via Pubby. A web portal about this work is available at https://datos.ign.es; it delivers a single entry point to all the resources (e.g. queries, ontologies, SKOS thesauri, etc.).
For exploitation, we display the dataset on a map using Map4RDF. This tool allows end users to visualize and interact with our Linked Dataset. Map4RDF connects to our endpoint in order to provide a faceted browser interface for each BTN100 theme and all their concepts. When a user selects a facet, Map4RDF queries our triplestore and visualizes the instances of the selected facet, including their respective GeoSPARQL geometries. The visualization instance is available at http://certidatos.ign.es/map/. In Fig. 6 we show a visualization example with the Spanish provinces drawn on the map.
Although a maintenance activity is not included in the guidelines we followed, we consider it important to ensure that the dataset always reflects the most current version of the data source. In order to automate the generation and updating of the Linked Dataset we developed a Python script.
The script, which will be executed periodically, starts by downloading the BTN100 data source and then compares the downloaded source with the previous one. If a change is detected, the script identifies the elements that need to be updated. Then it generates, via GeoKettle, the new RDF files and publishes the updates in the SPARQL endpoint. Finally, the script also updates the thesauri and sameAs files in the triplestore.
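The text does not detail how the downloaded source is compared with the previous one. One plausible approach, shown here purely as an assumption and not as the published script, is to compare content hashes of the files in both downloads.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large shapefiles fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each file's relative path to its digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(previous, current):
    """Names that are new or whose content differs from the previous snapshot."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )
```

Only the names returned by `changed_files` would then need to be regenerated and republished; an empty list means the Linked Dataset is already current.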
As mentioned, GitHub is our environment for file versioning and storage. However, GitHub does not allow pushing files larger than 100 MB. For this reason, the script performs another check to determine whether any of the updated files is larger than 90 MB. If so, the script splits that file into files of up to 90 MB and then uploads the resulting files to GitHub.
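The splitting step can be sketched as follows. This is an assumption about one way to do it, not the published script: it cuts only at line boundaries, which suits line-oriented serializations such as N-Triples; a Turtle file with a prefix header would additionally need that header repeated in every part. The part naming scheme is hypothetical.

```python
from pathlib import Path

MAX_PART_BYTES = 90_000_000  # 90 MB, the threshold mentioned in the text


def split_by_lines(path, max_bytes=MAX_PART_BYTES):
    """Split a file into numbered parts of at most max_bytes, cutting at line ends.

    A single line longer than max_bytes is kept whole in its own part.
    """
    source = Path(path)
    if source.stat().st_size <= max_bytes:
        return [source]
    parts, handle, written = [], None, 0
    with open(source, "rb") as infile:
        for line in infile:
            # Start a new part when the current one would overflow.
            if handle is None or (written and written + len(line) > max_bytes):
                if handle is not None:
                    handle.close()
                name = f"{source.stem}.part{len(parts) + 1:03d}{source.suffix}"
                part = source.with_name(name)
                handle = open(part, "wb")
                parts.append(part)
                written = 0
            handle.write(line)
            written += len(line)
    if handle is not None:
        handle.close()
    return parts
```

Concatenating the parts in order reproduces the original file, so the split is lossless.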
Conclusions and future directions
In this paper we have described our updated approach to the previous Spanish GeoLinked Data work, specifically for the representation of the BTN100 catalog. We have presented the process to generate and publish BTN100 as Linked Data. The dataset has been generated using the btn100 ontology, which reuses the GeoSPARQL vocabulary. This ontology provides complex geospatial representations and makes data more interoperable with other similar datasets. Our dataset has been tested against competency questions posed by domain experts; it is modular and therefore easily extensible.
In this work the whole process has been supported by GitHub, which provides a collaborative, distributed and version-controlled development environment. We also provide a script that performs the whole process automatically, from data extraction to publication. This script updates the dataset whenever a change is detected in the data source.
Our model represents all the BTN100 themes; however, we are currently generating Linked Data only for territorial units. We use Virtuoso as the technology behind our endpoint because of our previous experience with it. However, Virtuoso does not fully support GeoSPARQL functions, so it would be important to complement our work with another triplestore in this domain. Our work addresses a specific need of the IGN and is available in Spanish. However, the approach is applicable to other scenarios in this domain and possibly to other languages.
References
1. de León A, Saquicela V, Vilches LM, Villazón-Terrazas B, Priyatna F, Corcho O. Geographical linked data: a Spanish use case. In: Proceedings of I-SEMANTICS ’10, the 6th International Conference on Semantic Systems. New York: ACM: 2010.
2. Atemezing G, Corcho O, Garijo D, Mora J, Poveda-Villalón M, Rozas P, Vila-Suero D, Villazón-Terrazas B. Transforming meteorological data into linked data. Semant Web. 2013; 4(3):285–90.
3. Corcho O, Pérez IS, Lafuente H, Portolés D, Cano C, Peris A, Subero JM. Publishing linked statistical data: Aragon, a case study. In: Joint Proceedings of the International Workshops on Hybrid Statistical Semantic Understanding and Emerging Semantics, and Semantic Statistics (HybridSemStats). Aachen: 2017.
4. de León A, Wisniewki F, Villazón-Terrazas B, Corcho O. Map4rdf - faceted browser for geospatial datasets. In: Using Open Data: policy modeling, citizen empowerment, data journalism. 2012.
5. Villazón-Terrazas B, Vilches-Blázquez LM, Corcho O, Gómez-Pérez A. Methodological guidelines for publishing government linked data. In: Linking Government Data. New York: Springer: 2011. p. 27–49.
6. Spatial Data on the Web Best Practices. 2017. https://www.w3.org/TR/sdw-bp/.
7. Technical Interoperability Standard for the Reuse of Information Resources. 2013. https://administracionelectronica.gob.es/pae_Home/dam/jcr:a8d2c143-ce9a-4fc7-afe7-ef5d9ba7c4a1/ENGLISH_Interoperability_Agreement_for%20the%20Reuse%20%20of%20Information%20Resources.pdf.
8. Poveda-Villalón M. A reuse-based lightweight method for developing linked data ontologies and vocabularies. In: 9th Extended Semantic Web Conference (ESWC). Berlin: Springer: 2012. p. 833–7.
9. Radulovic F, Poveda-Villalón M, Vila-Suero D, Rodríguez-Doncel V, García-Castro R, Gómez-Pérez A. Guidelines for linked data generation and publication: An example in building energy consumption. Autom Constr. 2015; 57:178–87.
We acknowledge the work done by developers who have contributed parts of the software used: Lissete Moscoso, Francisco Siles, Victor Saquicela and Luis Vilches.
This work has been funded by Centro Nacional de Información Geográfica and DATOS 4.0: RETOS Y SOLUCIONES - UPM Spanish national project (TIN2016-78011-C4-4-R).
Availability of data and materials
All files generated during this work are available in our GitHub repositories. Ontologies and SKOS thesauri files are available at https://github.com/oeg-upm/ontology-BTN100. Original and transformed files, plugin and scripts are available at https://github.com/oeg-upm/btn100.
The authors declare that they have no competing interests.