Canonical Analysis Workflows – reproducibility and reusability on air quality data

Scientific research suffers from a reproducibility crisis: a Nature report from 2016 revealed that more than 70% of researchers had tried and failed to reproduce another scientist's experiments.

Since then, considerable effort has gone into making workflows more reusable and thereby making results reproducible. Our TOAR-II (Tropospheric Ozone Assessment Report, phase II) database infrastructure supports this approach. It has been lifted to a new level of FAIRness by redesigning the database and related services to implement more of the FAIR principles. In addition, new concepts were developed to achieve reproducibility and reusability via standardized workflows and objects.

Canonical workflows consist of automated workflows or workflow fragments which allow for reusability of these snippets in different contexts. The development of reusable workflows and software for scientific data analysis depends on reusable data, which must be described appropriately and standardized to ensure reliable and meaningful analysis results.

We therefore developed a concept that focuses on two indispensable and inseparable prerequisites for workflow sharing: data harmonization and documentation.

In our concept paper, we show that the data harmonization needed to establish online data analysis services goes much deeper than the obvious issues of common data formats, variable names, and measurement units. We also explore how the generation of FAIR Digital Objects (FDO) and Research Objects (RO), together with automatically generated documentation, may support Canonical Analysis Workflows for air quality and related data. We are convinced that our experiences with the TOAR database will show that data harmonization together with documentation constitutes a big step towards realizing the potential of canonical workflows.
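The "obvious" layer of harmonization mentioned above (variable names and measurement units) can be illustrated with a minimal Python sketch. It is not the TOAR implementation: the alias tables, the record layout, and the function names are hypothetical, and the conversion assumes reference conditions of 20 °C and 1013.25 hPa, which a real service would have to document or take from metadata.

```python
# Minimal sketch of name and unit harmonization for ozone records.
# All names and tables below are hypothetical, chosen for illustration only.

R = 8.314462618   # universal gas constant, J/(mol K)
M_O3 = 47.998     # molar mass of ozone, g/mol

# Hypothetical alias tables mapping provider spellings to one canonical form.
VARIABLE_ALIASES = {"o3": "o3", "ozone": "o3"}
UNIT_ALIASES = {"ppb": "ppb", "nmol mol-1": "ppb", "ug m-3": "ug m-3"}


def ugm3_to_ppb(value, temperature_k=293.15, pressure_pa=101325.0):
    """Convert an ozone mass concentration (ug m-3) to a mixing ratio (ppb).

    Assumption: the reference temperature and pressure default to
    20 degC and 1013.25 hPa; other conventions give other factors.
    """
    molar_volume_l = R * temperature_k / pressure_pa * 1000.0  # L/mol
    return value * molar_volume_l / M_O3


def harmonize(record):
    """Return a copy of the record with canonical variable name and units.

    Unknown variable names or units raise KeyError instead of being
    silently passed through.
    """
    variable = VARIABLE_ALIASES[record["variable"].strip().lower()]
    units = UNIT_ALIASES[record["units"].strip()]
    value = record["value"]
    if units == "ug m-3":
        value = ugm3_to_ppb(value)
    return {"variable": variable, "value": value, "units": "ppb"}


if __name__ == "__main__":
    print(harmonize({"variable": "Ozone", "value": 100.0, "units": "ug m-3"}))
```

Even this small example shows why documentation is inseparable from harmonization: the converted value is only meaningful if the assumed reference conditions are recorded alongside it.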

Integrating FDOs/ROs into the TOAR data ingestion workflow. For the TOAR database, an RO-Crate is used. Each new data ingestion is registered as an FDO package within the RO. As new FDOs are created over time, the RO is updated with a pointer to the latest FDO and its associated data, while earlier FDOs remain accessible. Each FDO thus acts as a kind of snapshot that makes the database states traceable.
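The snapshot idea described above can be sketched in a few lines of Python. This is an illustration only: the class, its fields, and the identifiers are hypothetical and do not reproduce the RO-Crate metadata schema or the actual TOAR ingestion code.

```python
# Minimal sketch of a Research Object that tracks FDO snapshots.
# Class name, fields, and identifiers are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResearchObject:
    fdo_ids: List[str] = field(default_factory=list)  # all registered FDOs, oldest first
    latest: Optional[str] = None                      # pointer to the newest FDO

    def register(self, fdo_id: str) -> None:
        """Register a new ingestion: keep earlier FDOs, move the pointer."""
        self.fdo_ids.append(fdo_id)
        self.latest = fdo_id


if __name__ == "__main__":
    ro = ResearchObject()
    ro.register("fdo-ingestion-0001")  # hypothetical identifiers
    ro.register("fdo-ingestion-0002")
    print("latest:", ro.latest)
    print("history:", ro.fdo_ids)
```

Because earlier identifiers are never removed, any past database state referenced in a publication can still be resolved after later ingestions.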

Schröder et al., Enabling Canonical Analysis Workflows – Documented data harmonization on global air quality data, Data Intelligence Journal. 2022; in print

The TOAR database and its interfaces

The foundation of the IntelliAQ project is the TOAR database, the world’s largest collection of surface observation data of ozone, ozone precursor gases, meteorological variables, selected tracers for pollution source attribution, and selected results from numerical models of atmospheric dynamics and chemical composition. Most of these data are time series of measurements at specific point locations, the “stations”. Various researchers submit their data directly to the TOAR data centre, where they are reformatted, quality controlled, and inserted into the TOAR database. If requested by the data submitters, the reformatted and augmented files from these direct submissions are also published in a FAIR data service, including a DOI for reference in journal publications, presentations, and elsewhere. However, the majority of data in the TOAR database is not “primary data” but a copy of data from other databases and repositories.

Particular strengths of this globally harmonized database are unified access to the data and the application of consistent statistical methods everywhere, which makes the results comparable. The data and metrics are made available via a graphical web interface, a REST service, and as aggregated products on PANGAEA. A collection of software tools to facilitate access to and processing of the data can be found in the GitLab repository.

Figure 2: The JOIN web interface to TOAR data showing ozone data from one station over a time span of 23 years
Figure 3: REST API, available parameters for surfacedata
Figure 4: REST API, queries for detailed information


Figure 1: Europe’s nighttime light brightness as an exemplary illustration of the high-resolution data underlying GeoDataServices (own graph). Image and data were processed by NOAA’s National Geophysical Data Center; the data were collected by the US Air Force Weather Agency.

With GeoDataServices (Schultz et al., 2018), we enable an automated and flexible characterisation of an arbitrary point location using high-resolution data. These services are accessible through a standardised REST API and can therefore easily be used by both humans and machines. In the current version, GeoDataServices includes geographical information on topography and dominant land surface cover, anthropogenic data on urbanisation (human settlements, built-up areas, nighttime light brightness, population density, and streets) and agricultural yields (rice and wheat), as well as climatological and environmental data such as NOx emissions and climatic zones. By combining these data, GeoDataServices can characterise any point location and thereby enables users to compare locations in a personalised, use-case-driven way. For this personalisation, each query must specify a radius around the point location, within which a given statistical aggregation function is applied. Besides this personalised preparation of information, GeoDataServices is also used in the creation of TOAR database metadata. GeoDataServices is currently under development and not yet publicly accessible. However, anyone interested is encouraged to contact us for more insight into GeoDataServices.
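The principle of a radius-based query with a user-chosen aggregation function can be sketched as follows. This is not the GeoDataServices implementation or its API: the function names, the data layout (latitude, longitude, value per grid cell), and the sample values are assumptions made up for illustration, and distances are computed with the haversine formula on a spherical Earth.

```python
# Minimal sketch: aggregate gridded values within a radius around a point.
# Function names, data layout, and sample values are hypothetical.
import math
from statistics import mean

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def aggregate_around(point, cells, radius_km, agg):
    """Apply `agg` to all cell values within `radius_km` of `point`.

    `cells` is an iterable of (lat, lon, value) tuples. Returns None
    when no cell falls inside the radius.
    """
    lat, lon = point
    values = [v for (clat, clon, v) in cells
              if haversine_km(lat, lon, clat, clon) <= radius_km]
    return agg(values) if values else None


# Made-up population density values on three grid cells.
cells = [
    (50.90, 6.40, 120.0),
    (50.92, 6.41, 80.0),
    (52.50, 13.40, 4000.0),  # far away, excluded by a small radius
]

if __name__ == "__main__":
    station = (50.91, 6.41)
    print("mean within 5 km:", aggregate_around(station, cells, 5.0, mean))
    print("max within 5 km:", aggregate_around(station, cells, 5.0, max))
```

Passing the aggregation function as an argument mirrors the use-case-driven design described above: the same location can be characterised by a mean, a maximum, or any other statistic without changing the query logic.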

Reference: Schultz, Martin G., et al. “A web service architecture for objective station classification purposes.” 2018 IEEE 14th International Conference on e-Science (e-Science). IEEE, 2018.