Canonical Analysis Workflows – reproducibility and reusability for air quality data

Scientific research suffers from a reproducibility crisis: a Nature report (https://www.nature.com/articles/d41586-019-00067-3) found that more than 70% of scientists have tried and failed to reproduce another researcher's experiments.

Since then, considerable effort has gone into making workflows more reusable and thereby making results reproducible. With our TOAR-II (Tropospheric Ozone Assessment Report, phase II) database infrastructure we support this approach. By redesigning the database and its related services, the infrastructure has been lifted to a new level of FAIRness (https://www.go-fair.org/fair-principles/), integrating more of the FAIR principles. In addition, new concepts were developed to achieve reproducibility and reusability via standardized workflows and objects.

Canonical workflows are automated workflows or workflow fragments that can be reused in different contexts. The development of reusable workflows and software for scientific data analysis depends on reusable data, which must be appropriately described and standardized to ensure reliable and meaningful analysis results.
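To make the notion of a reusable workflow fragment concrete, the following minimal Python sketch shows a fragment that relies on a fixed data contract (a harmonized variable name, unit, and time index) and can therefore be chained into different analyses. The function name, column name, and unit convention are illustrative assumptions, not part of the TOAR services.

```python
import pandas as pd

def monthly_mean_ozone(df: pd.DataFrame) -> pd.DataFrame:
    """Reusable workflow fragment: monthly mean of hourly ozone.

    Assumes harmonized input: a DataFrame with a UTC DatetimeIndex and a
    column named 'o3' holding ozone mole fractions in ppb (an illustrative
    convention, not the actual TOAR schema).
    """
    if "o3" not in df.columns:
        raise ValueError("expected harmonized column 'o3' (ppb)")
    return df[["o3"]].resample("MS").mean()

# Usage: the same fragment works for any station, because the data contract
# (variable name, unit, time index) is fixed by harmonization.
hourly = pd.DataFrame(
    {"o3": [30.0, 35.5, 42.1]},
    index=pd.to_datetime(
        ["2021-07-01 00:00", "2021-07-01 01:00", "2021-07-01 02:00"], utc=True
    ),
)
print(monthly_mean_ozone(hourly))
```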

We therefore developed a concept that focuses on two important, indispensable, and inseparable prerequisites for workflow sharing: data harmonization and documentation.

In our concept paper, we show that the data harmonization necessary for establishing online data analysis services goes much deeper than the obvious issues of common data formats, variable names, and measurement units, and we explore how the generation of FAIR Digital Objects (FDOs) and Research Objects (ROs), together with automatically generated documentation, may support Canonical Analysis Workflows for air quality and related data. We are convinced that our experience with the TOAR database shows that data harmonization together with documentation constitutes a major step towards realizing the potential of canonical workflows.
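As a sketch of the "obvious" surface of harmonization (common names and units) that the paper argues is only the starting point, consider the following minimal example. The name mapping, the record layout, and the ppb-to-µg/m³ conversion factor for ozone (about 1.96 at 25 °C and 1013 hPa) are illustrative assumptions, not the TOAR conventions; the deeper issues such as sampling procedures, aggregation rules, and station metadata are deliberately not captured by such a table.

```python
# Hypothetical harmonization step: map provider-specific variable names to a
# controlled vocabulary and convert values to a single target unit.
NAME_MAP = {"ozone": "o3", "O3": "o3", "surface_ozone": "o3"}

# Approximate conversion for ozone at 25 degC and 1013 hPa (illustrative).
PPB_TO_UGM3_O3 = 1.96

def harmonize(record: dict) -> dict:
    """Return a record with a standard variable name and unit (ug/m3)."""
    name = NAME_MAP.get(record["variable"], record["variable"])
    value, unit = record["value"], record["unit"]
    if name == "o3" and unit == "ppb":
        value, unit = value * PPB_TO_UGM3_O3, "ug/m3"
    return {"variable": name, "value": value, "unit": unit}

# An ozone value reported in ppb comes out under the standard name 'o3'
# and in ug/m3.
print(harmonize({"variable": "ozone", "value": 40.0, "unit": "ppb"}))
```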

Integrating FDOs/ROs into the TOAR data ingestion workflow: for the TOAR database, an RO-Crate is used. All new data ingestions are registered as FDO packages within the RO. As new FDOs are created, the RO is updated with a pointer to the latest FDO and its associated data, while earlier FDOs remain accessible. Each registered FDO thus acts as a snapshot, making earlier states of the database traceable.
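The following sketch illustrates how such a registration step could look in terms of the RO-Crate 1.1 JSON-LD structure, using only the Python standard library. The handle-style PID, the "latest" property used as the pointer to the newest FDO, and the entry fields are assumptions for illustration and do not describe the actual TOAR implementation.

```python
import json
from datetime import datetime, timezone

def register_fdo(crate: dict, fdo_pid: str, description: str) -> dict:
    """Append a new FDO entry to an RO-Crate graph and point the root
    dataset at it, keeping earlier FDO entries in place.

    The 'latest' property used for the pointer is an illustrative choice,
    not a term defined by the RO-Crate specification.
    """
    graph = crate["@graph"]
    root = next(e for e in graph if e["@id"] == "./")
    entry = {
        "@id": fdo_pid,
        "@type": "Dataset",
        "description": description,
        "dateCreated": datetime.now(timezone.utc).isoformat(),
    }
    graph.append(entry)
    root.setdefault("hasPart", []).append({"@id": fdo_pid})
    root["latest"] = {"@id": fdo_pid}  # earlier FDOs stay listed in hasPart
    return crate

# Minimal RO-Crate 1.1 skeleton (metadata descriptor + root dataset).
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {"@id": "./", "@type": "Dataset", "name": "TOAR data ingestions"},
    ],
}

# Hypothetical PID for a newly ingested data package.
register_fdo(
    crate,
    "https://hdl.handle.net/21.T12345/example-fdo-001",
    "Ingestion 2021-07: hourly surface ozone, station XYZ",
)
print(json.dumps(crate, indent=2))
```

Keeping earlier FDO identifiers in the graph and only moving the pointer preserves the append-only history described above, so previous database states remain traceable.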

Schröder, S., et al.: Enabling Canonical Analysis Workflows – Documented data harmonization on global air quality data. Data Intelligence, 2022, in press.