Harvesters

Three harvester DAGs collect data from external sources, transform it into ontology-aligned RDF, and automatically trigger reasoning and validation. Each harvester produces deterministic, provenance-aware named graphs that are integrated into the MSE-KG.

Scheduling

All three harvesters run on a @weekly schedule and automatically trigger reason_openllet_new followed by validation_checks on completion.

Architecture Overview

graph TB
    subgraph External Sources
        Z[("Zenodo<br/>REST API")]
        S[("SPARQL<br/>Endpoints")]
        P[("PMD<br/>Platform")]
    end

    subgraph Harvester Tier
        HZ["harvester_zenodo<br/>━━━━━━━━━━━━━━<br/>• Export records as RDF<br/>• Download RDF files<br/>• Generate snapshots"]
        HE["harvester_endpoints<br/>━━━━━━━━━━━━━━<br/>• Extract class catalogues<br/>• Build T-Box snapshots<br/>• Compute VoID statistics"]
        HP["harvester_pmd<br/>━━━━━━━━━━━━━━<br/>• Build 12 OWL modules<br/>• Merge into single graph"]
    end

    subgraph Core Pipeline
        R[reason_openllet_new]
        V[validation_checks]
    end

    Z --> HZ
    S --> HE
    P --> HP
    HZ --> R
    HE --> R
    HP --> R
    R --> V

    style Z fill:#f5c542,stroke:#333
    style S fill:#42a5f5,stroke:#333
    style P fill:#66bb6a,stroke:#333
    style HZ fill:#fff3e0,stroke:#e65100
    style HE fill:#e3f2fd,stroke:#1565c0
    style HP fill:#e8f5e9,stroke:#2e7d32
    style R fill:#f3e5f5,stroke:#6a1b9a
    style V fill:#fce4ec,stroke:#b71c1c

harvester_zenodo

DAG ID: harvester_zenodo · Schedule: @weekly · File: dags/harvester_zenodo.py

What It Does

The Zenodo harvester converts records from the NFDI-MatWerk Zenodo community into ontology-aligned RDF. It operates in two phases: first exporting community records as RDF triples, then harvesting individual datasets referenced in the MSE-KG that point to Zenodo DOIs.

Task Chain

graph LR
    A["init_data_dir"] --> B["run_harvester"]
    B --> C["mark_success"]
    C --> D["trigger_reason"]
    C --> E["trigger_validation"]

    style A fill:#e8eaf6,stroke:#283593
    style B fill:#fff3e0,stroke:#e65100
    style C fill:#e8f5e9,stroke:#2e7d32
    style D fill:#f3e5f5,stroke:#6a1b9a
    style E fill:#fce4ec,stroke:#b71c1c

The run_harvester task performs two operations:

export_zenodo.run() — Fetches all records from the Zenodo community via REST API and converts them to RDF
fetch_zenodo.run() — Queries the merged MSE-KG for datasets with Zenodo URLs, then harvests each individually

RDF Modelling

Each Zenodo record is converted to a network of ontology-aligned individuals:

graph TD
    REC["<b>nfdicore:Dataset</b><br/>(or Publication, Software)<br/>━━━━━━━━━━━━━━<br/>IRI: msekg:zenodo/concept/{id}"]

    TITLE["<b>Value Node</b><br/>index 0<br/>━━━━━━━━━━━━━━<br/>Title literal"]

    DOI["<b>DOI Identifier</b><br/>━━━━━━━━━━━━━━<br/>doi:10.5281/..."]

    FILE1["<b>File Identifier</b><br/>━━━━━━━━━━━━━━<br/>msekg:zenodo/concept/{id}/file/{key}"]

    GRAPH1["<b>Named Graph</b><br/>━━━━━━━━━━━━━━<br/>Imported RDF content"]

    REC -->|"iao:denoted_by<br/>(IAO_0000235)"| TITLE
    REC -->|"nfdicore:has_identifier<br/>(NFDI_0001006)"| DOI
    REC -->|"bfo:has_part<br/>(BFO_0000051)"| FILE1
    FILE1 -.->|"linked to"| GRAPH1
    REC -->|"rdfs:label"| LABEL["Zenodo URL"]

    style REC fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style TITLE fill:#fff3e0,stroke:#e65100
    style DOI fill:#f3e5f5,stroke:#6a1b9a
    style FILE1 fill:#e8f5e9,stroke:#2e7d32
    style GRAPH1 fill:#fce4ec,stroke:#b71c1c

Type Mapping

Zenodo resource_type values are mapped to MWO/NFDIcore classes:

Zenodo Type	OWL Class	IRI
`dataset`, `image`	`nfdicore:Dataset`	`NFDI_0000009`
`publication`, `article`, `book`, `thesis`, `report`, `preprint`	`nfdicore:Publication`	`NFDI_0000190`
`software`	`nfdicore:Software`	`NFDI_0000198`
`lesson`, `presentation`	`nfdicore:Lecture`	`NFDI_0010022`
`image` (also typed as)	`iao:Figure`	`IAO_0000308`

Deterministic IRI Minting

All IRIs are deterministic, ensuring idempotent re-runs:

Instance:       msekg:zenodo/concept/{conceptrecid}
Value node:     msekg:zenodo/concept/{conceptrecid}/node/{index}
File ID:        msekg:zenodo/concept/{conceptrecid}/file/{file_key}
File graph:     msekg:zenodo/concept/{conceptrecid}/graph/{file_key}

Base IRI

msekg: expands to https://nfdi.fiz-karlsruhe.de/matwerk/msekg/

File Handling

When a Zenodo record contains RDF files (.ttl, .owl, .rdf, .jsonld, .nq, .trig) or ZIP archives containing RDF:

Files are downloaded (max 500 MB per file)
RDF is parsed and imported into named graphs with deterministic IRIs
ZIP archives are extracted with zip-slip protection
For each imported graph, three snapshot views are generated:

Snapshot	SPARQL Pattern	Purpose
Classes	`?c a owl:Class`	All declared classes with labels
Class Hierarchy	`?c rdfs:subClassOf ?parent`	Subsumption structure
T-Box	Classes + properties + domain/range	Schema-level axioms only

Dataset URL Discovery

The harvester also queries the merged MSE-KG to find datasets referencing Zenodo:

SELECT DISTINCT ?dataset ?url WHERE {
  VALUES ?class {
    <https://nfdi.fiz-karlsruhe.de/ontology/NFDI_0000009>
    <https://nfdi.fiz-karlsruhe.de/ontology/MWO_0001058>
    <https://nfdi.fiz-karlsruhe.de/ontology/MWO_0001056>
    <https://nfdi.fiz-karlsruhe.de/ontology/MWO_0001057>
  }
  ?dataset a ?class .
  ?dataset <http://purl.obolibrary.org/obo/IAO_0000235> ?urlNode .
  ?urlNode <https://nfdi.fiz-karlsruhe.de/ontology/NFDI_0001008> ?u .
  BIND(STR(?u) AS ?url)
  FILTER(isIRI(?dataset) && CONTAINS(LCASE(?url), "zenodo"))
}

Input / Output

InputOutput

Source	Description
`matwerk_sharedfs`	Shared filesystem path
`matwerk_last_successful_merge_run`	Path to merged asserted TTL
Zenodo REST API	`https://zenodo.org/api/records`

File	Description
`zenodo.ttl`	Default graph — all record metadata as RDF
`datasets_urls.csv`	Discovered (dataset, URL) pairs
`harvested/`	Downloaded and processed RDF files
`*.nq`	Named graph snapshots (N-Quads)

Triggers Downstream

reason_openllet_new → artifact=zenodo, in_ttl=zenodo.ttl
validation_checks → artifact=zenodo, inferences_ttl=zenodo_inferences.ttl

harvester_endpoints

DAG ID: harvester_endpoints · Schedule: @weekly · File: dags/harvester_endpoints.py

What It Does

The SPARQL endpoint harvester discovers endpoints registered in the MSE-KG, extracts their schema-level structure, computes VoID statistics, and measures ontology reuse against MWO. This enables the MSE-KG to serve as a federated catalogue of distributed MSE data sources.

Task Chain

graph LR
    A["init_data_dir"] --> B["run_harvester"]
    B --> C["mark_success"]
    C --> D["trigger_reason"]
    C --> E["trigger_validation"]

    style A fill:#e8eaf6,stroke:#283593
    style B fill:#e3f2fd,stroke:#1565c0
    style C fill:#e8f5e9,stroke:#2e7d32
    style D fill:#f3e5f5,stroke:#6a1b9a
    style E fill:#fce4ec,stroke:#b71c1c

Endpoint Discovery

Endpoints are discovered from the merged MSE-KG by querying for instances of NFDI_0001095 (SPARQL Endpoint):

graph LR
    MSE["MSE-KG<br/>(merged TTL)"] -->|"query for<br/>NFDI_0001095"| DISC["Discover<br/>endpoint URLs"]
    DISC --> EP1["Endpoint 1"]
    DISC --> EP2["Endpoint 2"]
    DISC --> EP3["Endpoint N"]

    style MSE fill:#e3f2fd,stroke:#1565c0
    style DISC fill:#fff3e0,stroke:#e65100
    style EP1 fill:#e8f5e9,stroke:#2e7d32
    style EP2 fill:#e8f5e9,stroke:#2e7d32
    style EP3 fill:#e8f5e9,stroke:#2e7d32

Each endpoint URL is extracted via the path:

?endpoint a NFDI_0001095 .
?endpoint IAO_0000235 ?node .
?node NFDI_0001008 ?url .

Schema Extraction

For each discovered endpoint, three CONSTRUCT queries extract schema-level information:

graph TD
    EP["SPARQL Endpoint"]

    subgraph "Snapshot Graphs (per endpoint)"
        CL["<b>Classes</b><br/>━━━━━━━━━━━<br/>CONSTRUCT {<br/>  ?c a owl:Class ;<br/>     rdfs:label ?l .<br/>}"]
        CH["<b>Class Hierarchy</b><br/>━━━━━━━━━━━<br/>CONSTRUCT {<br/>  ?c a owl:Class ;<br/>     rdfs:subClassOf ?p .<br/>}"]
        TB["<b>T-Box</b><br/>━━━━━━━━━━━<br/>CONSTRUCT {<br/>  classes + properties<br/>  + domain/range<br/>}"]
    end

    EP --> CL
    EP --> CH
    EP --> TB

    style EP fill:#42a5f5,stroke:#1565c0,stroke-width:2px
    style CL fill:#e8f5e9,stroke:#2e7d32
    style CH fill:#fff3e0,stroke:#e65100
    style TB fill:#f3e5f5,stroke:#6a1b9a

Each snapshot is stored in a deterministic named graph:

IRI: {BASE}snapshot/sparql/{func_key}/{sha256_hash[:24]}

Where func_key ∈ {classes, classHierarchy, tbox} and the hash is computed from {func_key}|{normalized_url}.

VoID Statistics

For each endpoint, the harvester computes comprehensive statistics:

Metric	Method
Class count	`SELECT (COUNT(DISTINCT ?c) ...)`
Object property count	`SELECT (COUNT(DISTINCT ?p) ...)`
Data property count	`SELECT (COUNT(DISTINCT ?p) ...)`
Instance count	`SELECT (COUNT(DISTINCT ?s) ...)`
Class partitions	Per-class instance counts via `void:classPartition`
Vocabularies	All namespaces observed via `void:vocabulary`

MWO Reuse Analysis

A distinguishing feature of the endpoint harvester is its ontology reuse measurement. For each endpoint, the harvester checks which MWO classes and properties appear as types or predicates:

graph LR
    MWO["MWO Ontology<br/>━━━━━━━━━━━<br/>Classes &<br/>Properties"] -->|"batch query<br/>against endpoint"| CHECK{"MWO terms<br/>found?"}
    CHECK -->|"Yes"| REUSE["Record reuse<br/>metrics"]
    CHECK -->|"No"| SKIP["Skip"]

    style MWO fill:#f3e5f5,stroke:#6a1b9a
    style CHECK fill:#fff3e0,stroke:#e65100
    style REUSE fill:#e8f5e9,stroke:#2e7d32

The queries check for MWO terms in batches:

SELECT DISTINCT ?c WHERE {
  VALUES ?c { <mwo_class_1> <mwo_class_2> ... }
  { ?c a owl:Class . } UNION { ?x a ?c . }
}

Why Reuse Analysis Matters

By quantifying how much each external endpoint aligns with MWO, the MSE-KG can identify which endpoints are most suitable for federated queries, guide ontology mapping efforts, and track adoption of shared vocabularies across the NFDI-MatWerk ecosystem.

Statistics Annotation

All statistics are stored as structured rdfs:comment annotations on the endpoint individual:

SPARQL endpoint: https://example.org/sparql
Counts:
- classes: 42
- objectProperties: 18
- dataProperties: 7
- instances: 1,234

Vocabularies (namespaces observed):
- total: 5

Reused from MWO (heuristic):
- classes: 12 (cmso:AtomicScaleSample, ...)
- objectProperties: 3
- dataProperties: 2

Input / Output

InputOutput

Source	Description
`matwerk_sharedfs`	Shared filesystem path
`matwerk_last_successful_merge_run`	Path to merged asserted TTL
`matwerk_ontology`	MWO ontology URL (for reuse analysis)
External SPARQL endpoints	Discovered from asserted TTL

File	Description
`dataset_stats.ttl`	Unified statistics graph
`named_graphs/*.nq`	Per-endpoint snapshot named graphs (N-Quads)
`named_graphs/*.ttl`	Per-endpoint snapshots (Turtle)
`sparql_sources.json`	State: endpoint → graph IRI mapping
`sparql_sources_list.json`	Summary with counts and metadata

Triggers Downstream

reason_openllet_new → artifact=endpoints, in_ttl=dataset_stats.ttl
validation_checks → artifact=endpoints, inferences_ttl=endpoints_inferences.ttl

harvester_pmd

DAG ID: harvester_pmd · Schedule: @weekly · File: dags/harvester_pmd.py

What It Does

Harvests data from the Materials Platform for Data (PMD). Builds ROBOT templates from harvested CSV, generates 12 OWL modules, and merges them into a single asserted graph.

Task Chain

graph LR
    A["init_data_dir"] --> B["run_harvester"]
    B --> C["mark_success"]
    C --> D["trigger_reason"]
    C --> E["trigger_validation"]

    style A fill:#e8eaf6,stroke:#283593
    style B fill:#e8f5e9,stroke:#2e7d32
    style C fill:#e8f5e9,stroke:#2e7d32
    style D fill:#f3e5f5,stroke:#6a1b9a
    style E fill:#fce4ec,stroke:#b71c1c

Input / Output

InputOutput

Source	Description
`matwerk_sharedfs`	Shared filesystem
`matwerk_ontology`	Base ontology URL
`robotcmd`	ROBOT command
PMD API	External data source

File	Description
`pmd_asserted.ttl`	Merged asserted output
`modules/`	12 individual OWL modules

Triggers Downstream

reason_openllet_new → artifact=pmd, in_ttl=pmd_asserted.ttl
validation_checks → artifact=pmd, inferences_ttl=pmd_inferences.ttl

Named Graph Strategy

All harvesters follow a consistent named graph strategy that ensures provenance tracking and selective access:

graph TD
    subgraph "Published Named Graphs in Virtuoso"
        NG1["<b>spreadsheets_assertions</b><br/>Curated template data"]
        NG2["<b>spreadsheets_inferences</b><br/>Reasoned inferences"]
        NG3["<b>spreadsheets_validated</b><br/>Merged + validated"]
        NG4["<b>zenodo_validated</b><br/>Zenodo harvest"]
        NG5["<b>endpoints_validated</b><br/>SPARQL endpoints harvest"]
    end

    HZ["harvester_zenodo"] -->|"after reasoning<br/>+ validation"| NG4
    HE["harvester_endpoints"] -->|"after reasoning<br/>+ validation"| NG5
    HP["harvester_pmd"] -->|"feeds into<br/>spreadsheets pipeline"| NG1

    style NG1 fill:#e3f2fd,stroke:#1565c0
    style NG2 fill:#e3f2fd,stroke:#1565c0
    style NG3 fill:#e3f2fd,stroke:#1565c0
    style NG4 fill:#fff3e0,stroke:#e65100
    style NG5 fill:#e8f5e9,stroke:#2e7d32

Quality Assurance

All harvested content passes through the same reasoning and SHACL validation pipeline as manually curated template data. This ensures that externally sourced RDF is held to the same quality standard as curated content — no data enters the published graph without validation.