# Pipeline Overview

The MSE-KG is constructed, validated, and published through an automated pipeline orchestrated by Apache Airflow. The pipeline comprises eleven Directed Acyclic Graphs (DAGs) organised into three architectural tiers: a core processing chain, a parallel harvester tier, and a release tier.

Ontology transformations throughout the pipeline rely on ROBOT, an open-source command-line tool for automating OWL ontology development tasks including template processing, merging, reasoning, module extraction, and quality reporting.

## Pipeline Architecture

```mermaid
graph TD
    subgraph "📋 Template Tier"
        SHEETS["27 Google Sheets<br/>(TSV templates)"]
    end

    subgraph "⚙️ Core Pipeline"
        PS["<b>process_spreadsheets</b><br/>ROBOT template → 27 OWL modules"]
        MG["<b>merge</b><br/>ROBOT merge → spreadsheets_asserted.ttl"]
        RS["<b>reason_openllet_new</b><br/>Openllet → inferences.ttl"]
        VC["<b>validation_checks</b><br/>HermiT + SHACL + SPARQL verify"]
        PV["<b>publish_to_virtuoso</b><br/>Upload 5 named graphs"]
    end

    subgraph "🌐 Harvester Tier (weekly)"
        HZ["harvester_zenodo"]
        HE["harvester_endpoints"]
        HP["harvester_pmd"]
    end

    subgraph "📦 Release Tier"
        DB["<b>dashboard</b><br/>Daily stats → SQLite → Superset"]
        DA["<b>dump_and_archive</b><br/>Versioned dumps → Zenodo + DOI"]
        CQ["<b>cq_tester</b><br/>Competency question checks"]
    end

    SHEETS --> PS --> MG --> RS --> VC --> PV
    HZ --> RS
    HE --> RS
    HP --> RS
    PV --> DB
    PV --> DA
    PV --> CQ

    style PS fill:#e3f2fd,stroke:#1565c0
    style MG fill:#e3f2fd,stroke:#1565c0
    style RS fill:#e3f2fd,stroke:#1565c0
    style VC fill:#e3f2fd,stroke:#1565c0
    style PV fill:#e3f2fd,stroke:#1565c0
    style HZ fill:#fff3e0,stroke:#e65100
    style HE fill:#fff3e0,stroke:#e65100
    style HP fill:#fff3e0,stroke:#e65100
    style DB fill:#e8f5e9,stroke:#2e7d32
    style DA fill:#e8f5e9,stroke:#2e7d32
    style CQ fill:#e8f5e9,stroke:#2e7d32
```

## Core Pipeline

The core pipeline is a linear chain that transforms curated spreadsheet data into a published, validated knowledge graph:

| Step | DAG | Tool | Output |
|------|-----|------|--------|
| 1 | `process_spreadsheets` | ROBOT template + merge | 27 individual OWL modules |
| 2 | `merge` | ROBOT merge + HermiT consistency check | `spreadsheets_asserted.ttl` |
| 3 | `reason_openllet_new` | Openllet extract | `spreadsheets_inferences.ttl` |
| 4 | `validation_checks` | HermiT + SHACL + SPARQL verify | Validated, merged graph |
| 5 | `publish_to_virtuoso` | Virtuoso CRUD API | 5 named graphs live |
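The dependency chain in this table can be sketched with Python's standard-library `graphlib`. The DAG names below are taken from the table, but the code is an illustration of the ordering constraints, not the production Airflow definitions:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each core-pipeline DAG lists its upstream dependencies.
DEPENDENCIES = {
    "process_spreadsheets": set(),
    "merge": {"process_spreadsheets"},
    "reason_openllet_new": {"merge"},
    "validation_checks": {"reason_openllet_new"},
    "publish_to_virtuoso": {"validation_checks"},
}

# For a strictly linear chain, the topological order is unique.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

Because the chain is linear, any scheduler respecting these edges must run the DAGs in exactly the order shown in the table.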

### ROBOT

ROBOT (ROBOT is an OBO Tool) is used throughout the pipeline for template processing, ontology merging, reasoning pre-filtering, format conversion, and quality checks. It provides a consistent, scriptable interface to OWL operations that integrates naturally with Airflow task orchestration.
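These operations map onto ROBOT commands of the following shape. The file names are placeholders rather than the pipeline's actual paths, and note that ROBOT's bundled reasoners (e.g. HermiT, ELK) do not include Openllet, which the pipeline invokes separately:

```shell
# Expand a TSV template into OWL axioms (step 1).
robot template --template unit.tsv --output unit.owl

# Merge module files into a single ontology (step 2).
robot merge --input unit.owl --input process.owl \
  --output spreadsheets_asserted.ttl

# Materialise inferences with a bundled reasoner; the pipeline uses
# HermiT for the consistency check in steps 2 and 4.
robot reason --reasoner hermit \
  --input spreadsheets_asserted.ttl --output reasoned.ttl

# Generate a quality report as part of the checks in step 4.
robot report --input reasoned.ttl --output report.tsv
```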

## Harvester Tier

Three independent harvesters run weekly and feed into the core pipeline at the reasoning stage:

```mermaid
graph LR
    HZ["harvester_zenodo<br/>━━━━━━━━━━━<br/>Zenodo REST API<br/>→ RDF conversion"] --> RS["reason_openllet_new"]
    HE["harvester_endpoints<br/>━━━━━━━━━━━<br/>SPARQL endpoints<br/>→ schema extraction"] --> RS
    HP["harvester_pmd<br/>━━━━━━━━━━━<br/>PMD platform<br/>→ 12 OWL modules"] --> RS
    RS --> VC["validation_checks"]

    style HZ fill:#fff3e0,stroke:#e65100
    style HE fill:#e3f2fd,stroke:#1565c0
    style HP fill:#e8f5e9,stroke:#2e7d32
    style RS fill:#f3e5f5,stroke:#6a1b9a
    style VC fill:#fce4ec,stroke:#b71c1c
```
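The first stage of the Zenodo harvester — listing a community's records through the Zenodo REST API — can be sketched as follows. The community slug `nfdi-matwerk` and the helper function are illustrative assumptions, not taken from the harvester's code:

```python
from urllib.parse import urlencode

# Base endpoint of the public Zenodo REST API.
BASE = "https://zenodo.org/api/records"

def records_url(community: str, page: int = 1, size: int = 100) -> str:
    """Build the URL for one page of a Zenodo community's records."""
    params = {"communities": community, "page": page, "size": size}
    return f"{BASE}?{urlencode(params)}"

# Hypothetical community slug for illustration only.
url = records_url("nfdi-matwerk")
print(url)
```

The harvester would then fetch each page, convert the returned JSON records to RDF, and hand the result to `reason_openllet_new`.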

### Harvester triggering

On completion, each harvester automatically triggers `reason_openllet_new` followed by `validation_checks`. Once all harvester-triggered runs have succeeded, `publish_to_virtuoso` must be triggered manually.
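With the Airflow 2.x CLI, the manual publish step amounts to triggering the DAG by its id (shown here without run-specific configuration):

```shell
# Trigger the publish DAG once all harvester runs have validated.
airflow dags trigger publish_to_virtuoso
```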

## Release Tier

| DAG | Schedule | Purpose |
|-----|----------|---------|
| `dashboard` | Daily | Aggregates SPARQL statistics into SQLite for Apache Superset |
| `dump_and_archive` | Manual | Creates versioned RDF dumps, uploads to Zenodo with DOI |
| `cq_tester` | Manual | Validates competency questions against the live endpoint |
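The storage step of the `dashboard` DAG — one row of aggregated statistics per graph per day, written to SQLite for Superset to chart — can be sketched with the standard-library `sqlite3` module. The table name, columns, and triple count below are illustrative, not the production schema:

```python
import sqlite3
from datetime import date

# The real DAG would write to a file-backed database that Superset reads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS kg_stats ("
    "  day TEXT, graph TEXT, triples INTEGER,"
    "  PRIMARY KEY (day, graph))"
)

# Insert one day's (hypothetical) statistic for a named graph.
conn.execute(
    "INSERT INTO kg_stats VALUES (?, ?, ?)",
    (date.today().isoformat(), "matwerk/spreadsheets_validated", 123456),
)
conn.commit()

count = conn.execute(
    "SELECT triples FROM kg_stats WHERE graph = ?",
    ("matwerk/spreadsheets_validated",),
).fetchone()[0]
print(count)
```

The composite primary key makes the daily run idempotent per graph: re-running the DAG for the same day either fails the insert or can be switched to `INSERT OR REPLACE`.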

## Named Graphs Published

The pipeline publishes five named graphs to the Virtuoso triplestore:

| Named Graph | Source | Content |
|-------------|--------|---------|
| `matwerk/spreadsheets_assertions` | Core pipeline | Merged OWL modules from 27 templates |
| `matwerk/spreadsheets_inferences` | Core pipeline | Openllet-derived inferences |
| `matwerk/spreadsheets_validated` | Core pipeline | Merged assertions + inferences, validated |
| `matwerk/zenodo_validated` | Zenodo harvester | Zenodo community records as RDF |
| `matwerk/endpoints_validated` | Endpoint harvester | SPARQL endpoint metadata and statistics |

All graph IRIs are prefixed with `https://nfdi.fiz-karlsruhe.de/matwerk/`.
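A simple inventory check against the published triplestore is a SPARQL query that counts triples per named graph under this prefix. The sketch below only builds the query string; sending it to the Virtuoso endpoint (whose URL is not given here) is left out:

```python
# IRI prefix shared by all published named graphs.
PREFIX = "https://nfdi.fiz-karlsruhe.de/matwerk/"

# Count triples in every graph whose IRI starts with the matwerk prefix.
query = f"""
SELECT ?g (COUNT(*) AS ?triples)
WHERE {{
  GRAPH ?g {{ ?s ?p ?o }}
  FILTER(STRSTARTS(STR(?g), "{PREFIX}"))
}}
GROUP BY ?g
""".strip()

print(query)
```

Running this against the live endpoint should return one row per graph in the table above, which makes it a convenient smoke test after `publish_to_virtuoso` completes.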