Harvesters
Three harvester DAGs collect data from external sources, transform it into ontology-aligned RDF, and automatically trigger reasoning and validation. Each harvester produces deterministic, provenance-aware named graphs that are integrated into the MSE-KG.
Scheduling
All three harvesters run on a @weekly schedule and automatically trigger reason_openllet_new followed by validation_checks on completion.
Architecture Overview
graph TB
subgraph External Sources
Z[("Zenodo<br/>REST API")]
S[("SPARQL<br/>Endpoints")]
P[("PMD<br/>Platform")]
end
subgraph Harvester Tier
HZ["harvester_zenodo<br/>━━━━━━━━━━━━━━<br/>• Export records as RDF<br/>• Download RDF files<br/>• Generate snapshots"]
HE["harvester_endpoints<br/>━━━━━━━━━━━━━━<br/>• Extract class catalogues<br/>• Build T-Box snapshots<br/>• Compute VoID statistics"]
HP["harvester_pmd<br/>━━━━━━━━━━━━━━<br/>• Build 12 OWL modules<br/>• Merge into single graph"]
end
subgraph Core Pipeline
R[reason_openllet_new]
V[validation_checks]
end
Z --> HZ
S --> HE
P --> HP
HZ --> R
HE --> R
HP --> R
R --> V
style Z fill:#f5c542,stroke:#333
style S fill:#42a5f5,stroke:#333
style P fill:#66bb6a,stroke:#333
style HZ fill:#fff3e0,stroke:#e65100
style HE fill:#e3f2fd,stroke:#1565c0
style HP fill:#e8f5e9,stroke:#2e7d32
style R fill:#f3e5f5,stroke:#6a1b9a
style V fill:#fce4ec,stroke:#b71c1c
harvester_zenodo
DAG ID: harvester_zenodo · Schedule: @weekly · File: dags/harvester_zenodo.py
What It Does
The Zenodo harvester converts records from the NFDI-MatWerk Zenodo community into ontology-aligned RDF. It operates in two phases: first exporting community records as RDF triples, then harvesting individual datasets referenced in the MSE-KG that point to Zenodo DOIs.
Task Chain
graph LR
A["init_data_dir"] --> B["run_harvester"]
B --> C["mark_success"]
C --> D["trigger_reason"]
C --> E["trigger_validation"]
style A fill:#e8eaf6,stroke:#283593
style B fill:#fff3e0,stroke:#e65100
style C fill:#e8f5e9,stroke:#2e7d32
style D fill:#f3e5f5,stroke:#6a1b9a
style E fill:#fce4ec,stroke:#b71c1c
The run_harvester task performs two operations:
- export_zenodo.run(): Fetches all records from the Zenodo community via the REST API and converts them to RDF
- fetch_zenodo.run(): Queries the merged MSE-KG for datasets with Zenodo URLs, then harvests each individually
RDF Modelling
Each Zenodo record is converted to a network of ontology-aligned individuals:
graph TD
REC["<b>nfdicore:Dataset</b><br/>(or Publication, Software)<br/>━━━━━━━━━━━━━━<br/>IRI: msekg:zenodo/concept/{id}"]
TITLE["<b>Value Node</b><br/>index 0<br/>━━━━━━━━━━━━━━<br/>Title literal"]
DOI["<b>DOI Identifier</b><br/>━━━━━━━━━━━━━━<br/>doi:10.5281/..."]
FILE1["<b>File Identifier</b><br/>━━━━━━━━━━━━━━<br/>msekg:zenodo/concept/{id}/file/{key}"]
GRAPH1["<b>Named Graph</b><br/>━━━━━━━━━━━━━━<br/>Imported RDF content"]
REC -->|"iao:denoted_by<br/>(IAO_0000235)"| TITLE
REC -->|"nfdicore:has_identifier<br/>(NFDI_0001006)"| DOI
REC -->|"bfo:has_part<br/>(BFO_0000051)"| FILE1
FILE1 -.->|"linked to"| GRAPH1
REC -->|"rdfs:label"| LABEL["Zenodo URL"]
style REC fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style TITLE fill:#fff3e0,stroke:#e65100
style DOI fill:#f3e5f5,stroke:#6a1b9a
style FILE1 fill:#e8f5e9,stroke:#2e7d32
style GRAPH1 fill:#fce4ec,stroke:#b71c1c
Type Mapping
Zenodo resource_type values are mapped to MWO/NFDIcore classes:
| Zenodo Type | OWL Class | IRI |
|---|---|---|
| dataset, image | nfdicore:Dataset | NFDI_0000009 |
| publication, article, book, thesis, report, preprint | nfdicore:Publication | NFDI_0000190 |
| software | nfdicore:Software | NFDI_0000198 |
| lesson, presentation | nfdicore:Lecture | NFDI_0010022 |
| image (also typed as) | iao:Figure | IAO_0000308 |
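The table above can be read as a simple lookup. The following is a minimal sketch, not the harvester's actual code: `TYPE_TO_CLASSES` and `classes_for` are hypothetical names, the namespace bases are taken from the IRIs used in the SPARQL query on this page, and returning an empty list for unmapped types is an assumption.

```python
# Namespace bases as they appear in the IRIs used elsewhere on this page.
NFDI = "https://nfdi.fiz-karlsruhe.de/ontology/"
OBO = "http://purl.obolibrary.org/obo/"

# Hypothetical lookup mirroring the Type Mapping table.
# "image" carries two classes because it is also typed as iao:Figure.
TYPE_TO_CLASSES = {
    "dataset": [NFDI + "NFDI_0000009"],
    "image": [NFDI + "NFDI_0000009", OBO + "IAO_0000308"],
    "publication": [NFDI + "NFDI_0000190"],
    "article": [NFDI + "NFDI_0000190"],
    "book": [NFDI + "NFDI_0000190"],
    "thesis": [NFDI + "NFDI_0000190"],
    "report": [NFDI + "NFDI_0000190"],
    "preprint": [NFDI + "NFDI_0000190"],
    "software": [NFDI + "NFDI_0000198"],
    "lesson": [NFDI + "NFDI_0010022"],
    "presentation": [NFDI + "NFDI_0010022"],
}


def classes_for(resource_type: str) -> list:
    """Return the class IRIs for a Zenodo resource_type value.

    Assumption: unmapped types yield an empty list; the real fallback
    behaviour is not documented on this page.
    """
    return TYPE_TO_CLASSES.get(resource_type.strip().lower(), [])
```

For example, `classes_for("image")` yields both the Dataset and the Figure class IRI, matching the last row of the table.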
Deterministic IRI Minting
All IRIs are deterministic, ensuring idempotent re-runs:
Instance: msekg:zenodo/concept/{conceptrecid}
Value node: msekg:zenodo/concept/{conceptrecid}/node/{index}
File ID: msekg:zenodo/concept/{conceptrecid}/file/{file_key}
File graph: msekg:zenodo/concept/{conceptrecid}/graph/{file_key}
Base IRI
msekg: expands to https://nfdi.fiz-karlsruhe.de/matwerk/msekg/
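The four IRI patterns can be sketched as plain string templates. This is an illustration under assumptions: the function names are hypothetical, and percent-encoding the file key is a guess at how unsafe characters might be handled, not something this page confirms.

```python
from urllib.parse import quote

# Base IRI that the msekg: prefix expands to (stated above).
MSEKG = "https://nfdi.fiz-karlsruhe.de/matwerk/msekg/"


def instance_iri(conceptrecid: str) -> str:
    """IRI of the record individual."""
    return f"{MSEKG}zenodo/concept/{conceptrecid}"


def value_node_iri(conceptrecid: str, index: int) -> str:
    """IRI of the value node at the given index."""
    return f"{instance_iri(conceptrecid)}/node/{index}"


def file_iri(conceptrecid: str, file_key: str) -> str:
    """IRI of a file identifier (percent-encoding is an assumption)."""
    return f"{instance_iri(conceptrecid)}/file/{quote(file_key, safe='')}"


def file_graph_iri(conceptrecid: str, file_key: str) -> str:
    """IRI of the named graph holding the file's imported RDF."""
    return f"{instance_iri(conceptrecid)}/graph/{quote(file_key, safe='')}"
```

Because every IRI depends only on the record's `conceptrecid`, the index, and the file key, re-running the harvester on the same record yields the same IRIs, which is what makes re-runs idempotent.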
File Handling
When a Zenodo record contains RDF files (.ttl, .owl, .rdf, .jsonld, .nq, .trig) or ZIP archives containing RDF:
- Files are downloaded (max 500 MB per file)
- RDF is parsed and imported into named graphs with deterministic IRIs
- ZIP archives are extracted with zip-slip protection
- For each imported graph, three snapshot views are generated:
| Snapshot | SPARQL Pattern | Purpose |
|---|---|---|
| Classes | ?c a owl:Class | All declared classes with labels |
| Class Hierarchy | ?c rdfs:subClassOf ?parent | Subsumption structure |
| T-Box | Classes + properties + domain/range | Schema-level axioms only |
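Zip-slip protection means refusing archive members whose resolved path would land outside the extraction directory. The sketch below shows one common way to do this with the standard library. `safe_extract_rdf` is a hypothetical helper, and applying the 500 MB limit to individual archive members is an assumption (the page states the limit only for downloaded files).

```python
import shutil
import zipfile
from pathlib import Path

# File extensions listed in the File Handling section.
RDF_SUFFIXES = {".ttl", ".owl", ".rdf", ".jsonld", ".nq", ".trig"}
MAX_BYTES = 500 * 1024 * 1024  # 500 MB


def safe_extract_rdf(zip_path: Path, dest: Path) -> list:
    """Extract only RDF members that stay inside dest; return their paths."""
    dest = dest.resolve()
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = (dest / info.filename).resolve()
            # Zip-slip check: skip members such as "../evil.ttl" or
            # absolute paths that would escape the destination.
            if not target.is_relative_to(dest):
                continue
            if target.suffix.lower() not in RDF_SUFFIXES:
                continue
            if info.file_size > MAX_BYTES:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target)
    return extracted
```

Resolving the target path before comparing it with the destination is the important step: a plain string prefix check on the member name would miss `..` segments.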
Dataset URL Discovery
The harvester also queries the merged MSE-KG to find datasets referencing Zenodo:
SELECT DISTINCT ?dataset ?url WHERE {
VALUES ?class {
<https://nfdi.fiz-karlsruhe.de/ontology/NFDI_0000009>
<https://nfdi.fiz-karlsruhe.de/ontology/MWO_0001058>
<https://nfdi.fiz-karlsruhe.de/ontology/MWO_0001056>
<https://nfdi.fiz-karlsruhe.de/ontology/MWO_0001057>
}
?dataset a ?class .
?dataset <http://purl.obolibrary.org/obo/IAO_0000235> ?urlNode .
?urlNode <https://nfdi.fiz-karlsruhe.de/ontology/NFDI_0001008> ?u .
BIND(STR(?u) AS ?url)
FILTER(isIRI(?dataset) && CONTAINS(LCASE(?url), "zenodo"))
}
Input / Output
| Source | Description |
|---|---|
| matwerk_sharedfs | Shared filesystem path |
| matwerk_last_successful_merge_run | Path to merged asserted TTL |
| Zenodo REST API | https://zenodo.org/api/records |

| File | Description |
|---|---|
| zenodo.ttl | Default graph: all record metadata as RDF |
| datasets_urls.csv | Discovered (dataset, URL) pairs |
| harvested/ | Downloaded and processed RDF files |
| *.nq | Named graph snapshots (N-Quads) |
Triggers Downstream
- reason_openllet_new → artifact=zenodo, in_ttl=zenodo.ttl
- validation_checks → artifact=zenodo, inferences_ttl=zenodo_inferences.ttl
harvester_endpoints
DAG ID: harvester_endpoints · Schedule: @weekly · File: dags/harvester_endpoints.py
What It Does
The SPARQL endpoint harvester discovers endpoints registered in the MSE-KG, extracts their schema-level structure, computes VoID statistics, and measures ontology reuse against MWO. This enables the MSE-KG to serve as a federated catalogue of distributed MSE data sources.
Task Chain
graph LR
A["init_data_dir"] --> B["run_harvester"]
B --> C["mark_success"]
C --> D["trigger_reason"]
C --> E["trigger_validation"]
style A fill:#e8eaf6,stroke:#283593
style B fill:#e3f2fd,stroke:#1565c0
style C fill:#e8f5e9,stroke:#2e7d32
style D fill:#f3e5f5,stroke:#6a1b9a
style E fill:#fce4ec,stroke:#b71c1c
Endpoint Discovery
Endpoints are discovered from the merged MSE-KG by querying for instances of NFDI_0001095 (SPARQL Endpoint):
graph LR
MSE["MSE-KG<br/>(merged TTL)"] -->|"query for<br/>NFDI_0001095"| DISC["Discover<br/>endpoint URLs"]
DISC --> EP1["Endpoint 1"]
DISC --> EP2["Endpoint 2"]
DISC --> EP3["Endpoint N"]
style MSE fill:#e3f2fd,stroke:#1565c0
style DISC fill:#fff3e0,stroke:#e65100
style EP1 fill:#e8f5e9,stroke:#2e7d32
style EP2 fill:#e8f5e9,stroke:#2e7d32
style EP3 fill:#e8f5e9,stroke:#2e7d32
Each endpoint URL is extracted via the path:
Schema Extraction
For each discovered endpoint, three CONSTRUCT queries extract schema-level information:
graph TD
EP["SPARQL Endpoint"]
subgraph "Snapshot Graphs (per endpoint)"
CL["<b>Classes</b><br/>━━━━━━━━━━━<br/>CONSTRUCT {<br/> ?c a owl:Class ;<br/> rdfs:label ?l .<br/>}"]
CH["<b>Class Hierarchy</b><br/>━━━━━━━━━━━<br/>CONSTRUCT {<br/> ?c a owl:Class ;<br/> rdfs:subClassOf ?p .<br/>}"]
TB["<b>T-Box</b><br/>━━━━━━━━━━━<br/>CONSTRUCT {<br/> classes + properties<br/> + domain/range<br/>}"]
end
EP --> CL
EP --> CH
EP --> TB
style EP fill:#42a5f5,stroke:#1565c0,stroke-width:2px
style CL fill:#e8f5e9,stroke:#2e7d32
style CH fill:#fff3e0,stroke:#e65100
style TB fill:#f3e5f5,stroke:#6a1b9a
Each snapshot is stored in a deterministic named graph:
Where func_key ∈ {classes, classHierarchy, tbox} and the hash is computed from {func_key}|{normalized_url}.
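The hash input `{func_key}|{normalized_url}` can be sketched as follows. Neither the hash algorithm nor the URL normalization rules are given on this page, so SHA-256 and the normalization below (lowercased scheme and host, no trailing slash, no fragment) are assumptions chosen for illustration. The function names are hypothetical, and the sketch stops at the hash because the named graph IRI pattern is not reproduced above.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

FUNC_KEYS = ("classes", "classHierarchy", "tbox")


def normalize_url(url: str) -> str:
    """Assumed normalization: lowercase scheme/host, drop trailing slash and fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def snapshot_hash(func_key: str, endpoint_url: str) -> str:
    """Hash of '{func_key}|{normalized_url}' (SHA-256 is an assumption)."""
    if func_key not in FUNC_KEYS:
        raise ValueError(f"unknown func_key: {func_key!r}")
    payload = f"{func_key}|{normalize_url(endpoint_url)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Whatever the concrete algorithm, the property that matters is that two spellings of the same endpoint URL map to the same graph, while the three snapshot kinds of one endpoint map to three different graphs.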
VoID Statistics
For each endpoint, the harvester computes the following statistics:
| Metric | Method |
|---|---|
| Class count | SELECT (COUNT(DISTINCT ?c) ...) |
| Object property count | SELECT (COUNT(DISTINCT ?p) ...) |
| Data property count | SELECT (COUNT(DISTINCT ?p) ...) |
| Instance count | SELECT (COUNT(DISTINCT ?s) ...) |
| Class partitions | Per-class instance counts via void:classPartition |
| Vocabularies | All namespaces observed via void:vocabulary |
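The "Vocabularies" metric requires turning observed term IRIs into namespaces. One common heuristic, shown here as an assumption rather than the harvester's documented method, is to cut each IRI after its last `#` or `/`. Both function names are hypothetical.

```python
def namespace_of(iri: str) -> str:
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def observed_vocabularies(iris) -> list:
    """Sorted, de-duplicated namespaces for a collection of term IRIs."""
    return sorted({namespace_of(iri) for iri in iris})
```

Each resulting namespace would then be a candidate value for `void:vocabulary` on the endpoint's dataset description.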
MWO Reuse Analysis
A distinguishing feature of the endpoint harvester is its ontology reuse measurement. For each endpoint, the harvester checks which MWO classes and properties appear as types or predicates:
graph LR
MWO["MWO Ontology<br/>━━━━━━━━━━━<br/>Classes &<br/>Properties"] -->|"batch query<br/>against endpoint"| CHECK{"MWO terms<br/>found?"}
CHECK -->|"Yes"| REUSE["Record reuse<br/>metrics"]
CHECK -->|"No"| SKIP["Skip"]
style MWO fill:#f3e5f5,stroke:#6a1b9a
style CHECK fill:#fff3e0,stroke:#e65100
style REUSE fill:#e8f5e9,stroke:#2e7d32
The queries check for MWO terms in batches:
SELECT DISTINCT ?c WHERE {
VALUES ?c { <mwo_class_1> <mwo_class_2> ... }
{ ?c a owl:Class . } UNION { ?x a ?c . }
}
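Batching keeps each `VALUES` clause small enough for endpoints that limit query length. The sketch below builds one query per batch from the template above. The batch size of 50 and the helper names are assumptions, and sending the queries to the endpoint is left out.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 50  # assumption: the real batch size is not documented here


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_class_reuse_query(iris: Sequence[str]) -> str:
    """Fill the VALUES clause of the reuse query with one batch of class IRIs."""
    values = " ".join(f"<{iri}>" for iri in iris)
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT DISTINCT ?c WHERE {\n"
        f"  VALUES ?c {{ {values} }}\n"
        "  { ?c a owl:Class . } UNION { ?x a ?c . }\n"
        "}"
    )


def class_reuse_queries(mwo_classes: Sequence[str], batch_size: int = BATCH_SIZE) -> List[str]:
    """One query string per batch of MWO class IRIs."""
    return [build_class_reuse_query(batch) for batch in chunks(mwo_classes, batch_size)]
```

The union of the `?c` bindings returned across all batches is the set of MWO classes the endpoint reuses, either as declared classes or as instance types.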
Why Reuse Analysis Matters
By quantifying how much each external endpoint aligns with MWO, the MSE-KG can identify which endpoints are most suitable for federated queries, guide ontology mapping efforts, and track adoption of shared vocabularies across the NFDI-MatWerk ecosystem.
Statistics Annotation
All statistics are stored as structured rdfs:comment annotations on the endpoint individual:
SPARQL endpoint: https://example.org/sparql
Counts:
- classes: 42
- objectProperties: 18
- dataProperties: 7
- instances: 1,234
Vocabularies (namespaces observed):
- total: 5
Reused from MWO (heuristic):
- classes: 12 (cmso:AtomicScaleSample, ...)
- objectProperties: 3
- dataProperties: 2
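The annotation text can be assembled from the computed counts. The sketch below reproduces the layout of the example above, without the parenthesised sample terms. `render_stats_comment` and its parameters are hypothetical, and attaching the result as an `rdfs:comment` on the endpoint individual is not shown.

```python
def render_stats_comment(endpoint_url: str, counts: dict, vocab_total: int, reused: dict) -> str:
    """Render the statistics comment in the layout shown above.

    counts and reused map metric names (classes, objectProperties, ...)
    to integers; numbers are formatted with thousands separators.
    """
    lines = [
        f"SPARQL endpoint: {endpoint_url}",
        "Counts:",
        *(f"- {name}: {value:,}" for name, value in counts.items()),
        "Vocabularies (namespaces observed):",
        f"- total: {vocab_total:,}",
        "Reused from MWO (heuristic):",
        *(f"- {name}: {value:,}" for name, value in reused.items()),
    ]
    return "\n".join(lines)
```

Passing the counts from the example (`classes=42`, `objectProperties=18`, `dataProperties=7`, `instances=1234`) renders the instance line as `- instances: 1,234`.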
Input / Output
| Source | Description |
|---|---|
| matwerk_sharedfs | Shared filesystem path |
| matwerk_last_successful_merge_run | Path to merged asserted TTL |
| matwerk_ontology | MWO ontology URL (for reuse analysis) |
| External SPARQL endpoints | Discovered from asserted TTL |

| File | Description |
|---|---|
| dataset_stats.ttl | Unified statistics graph |
| named_graphs/*.nq | Per-endpoint snapshot named graphs (N-Quads) |
| named_graphs/*.ttl | Per-endpoint snapshots (Turtle) |
| sparql_sources.json | State: endpoint → graph IRI mapping |
| sparql_sources_list.json | Summary with counts and metadata |
Triggers Downstream
- reason_openllet_new → artifact=endpoints, in_ttl=dataset_stats.ttl
- validation_checks → artifact=endpoints, inferences_ttl=endpoints_inferences.ttl
harvester_pmd
DAG ID: harvester_pmd · Schedule: @weekly · File: dags/harvester_pmd.py
What It Does
The PMD harvester collects data from Platform MaterialDigital (PMD), builds ROBOT templates from the harvested CSV, generates 12 OWL modules, and merges them into a single asserted graph.
Task Chain
graph LR
A["init_data_dir"] --> B["run_harvester"]
B --> C["mark_success"]
C --> D["trigger_reason"]
C --> E["trigger_validation"]
style A fill:#e8eaf6,stroke:#283593
style B fill:#e8f5e9,stroke:#2e7d32
style C fill:#e8f5e9,stroke:#2e7d32
style D fill:#f3e5f5,stroke:#6a1b9a
style E fill:#fce4ec,stroke:#b71c1c
Input / Output
| Source | Description |
|---|---|
| matwerk_sharedfs | Shared filesystem |
| matwerk_ontology | Base ontology URL |
| robotcmd | ROBOT command |
| PMD API | External data source |

| File | Description |
|---|---|
| pmd_asserted.ttl | Merged asserted output |
| modules/ | 12 individual OWL modules |
Triggers Downstream
- reason_openllet_new → artifact=pmd, in_ttl=pmd_asserted.ttl
- validation_checks → artifact=pmd, inferences_ttl=pmd_inferences.ttl
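The downstream triggers of all three harvesters follow one pattern: the reasoner receives the asserted TTL, and validation receives `{artifact}_inferences.ttl`. A minimal sketch of assembling both parameter sets follows. `trigger_confs` is a hypothetical helper, and how the DAGs actually pass these values to the triggered runs is not described on this page.

```python
def trigger_confs(artifact: str, in_ttl: str) -> tuple:
    """Build the parameter sets for reason_openllet_new and validation_checks.

    The key names (artifact, in_ttl, inferences_ttl) are the ones listed
    under "Triggers Downstream"; the inferences file name follows the
    '{artifact}_inferences.ttl' pattern seen for all three harvesters.
    """
    reason_conf = {"artifact": artifact, "in_ttl": in_ttl}
    validation_conf = {
        "artifact": artifact,
        "inferences_ttl": f"{artifact}_inferences.ttl",
    }
    return reason_conf, validation_conf
```

For example, `trigger_confs("pmd", "pmd_asserted.ttl")` produces the two parameter sets listed for harvester_pmd.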
Named Graph Strategy
All harvesters follow a consistent named graph strategy that ensures provenance tracking and selective access:
graph TD
subgraph "Published Named Graphs in Virtuoso"
NG1["<b>spreadsheets_assertions</b><br/>Curated template data"]
NG2["<b>spreadsheets_inferences</b><br/>Reasoned inferences"]
NG3["<b>spreadsheets_validated</b><br/>Merged + validated"]
NG4["<b>zenodo_validated</b><br/>Zenodo harvest"]
NG5["<b>endpoints_validated</b><br/>SPARQL endpoints harvest"]
end
HZ["harvester_zenodo"] -->|"after reasoning<br/>+ validation"| NG4
HE["harvester_endpoints"] -->|"after reasoning<br/>+ validation"| NG5
HP["harvester_pmd"] -->|"feeds into<br/>spreadsheets pipeline"| NG1
style NG1 fill:#e3f2fd,stroke:#1565c0
style NG2 fill:#e3f2fd,stroke:#1565c0
style NG3 fill:#e3f2fd,stroke:#1565c0
style NG4 fill:#fff3e0,stroke:#e65100
style NG5 fill:#e8f5e9,stroke:#2e7d32
Quality Assurance
All harvested content passes through the same reasoning and SHACL validation pipeline as manually curated template data. This ensures that externally sourced RDF is held to the same quality standard as curated content — no data enters the published graph without validation.