EGWG 2025-05-01 Agri-food Data Canada - Carly Huitema

Watch full recording on YouTube.

Status: Verified by Presenter

Please note that ToIP used Google NotebookLM to generate the following content, which the presenter has verified.

Google NotebookLM Podcast

Summary

This briefing document summarizes a presentation about Agri-food Data Canada’s Semantic Engine, a suite of tools designed to enhance research data management in the agri-food sector by making data Findable, Accessible, Interoperable, and Reusable (FAIR). A central focus is the use of machine-readable data schemas authored with the Overlays Capture Architecture (OCA) standard, which is highlighted for its use of derived identifiers (digests) over traditional assigned identifiers for improved reproducibility and authenticity. The document also details the Semantic Engine’s practical tools and ongoing efforts to integrate these standards into existing research infrastructure, addressing challenges like data context and decentralized ecosystems.

Briefing Document: Agri-food Data Canada and the Semantic Engine

Date: May 1, 2025

Subject: Review of Agri-food Data Canada (ADC) project and its Semantic Engine, with discussion on data schemas, identifiers, and integration into research infrastructure.

Sources:

Excerpts from “2025.05.01-ToIP_EGWG.pdf” (Slides)
Excerpts from “GMT20250501-145814_Recording.transcript.txt” (Transcript)
Excerpts from “GMT20250501-145814_Recording_1920x1080.mp4” (Video – used for verification of content and speakers)
Excerpts from “GMT20250501-145814_RecordingnewChat.txt” (Chat Log)

Attendees/Speakers: Carly Huitema (University of Guelph, ADC), Michelle Edwards (ADC, mentioned), Eric Drury (Forth Consulting), Scott Perry (Digital Governance Institute), Neil Thomson (QueryVision), Steven Milstein (Collab.Ventures), Donald Sheppard.

Overview

This briefing summarizes a presentation by Carly Huitema from Agri-food Data Canada (ADC) to the Trust over IP (ToIP) Ecosystem and Governance Working Group. The presentation focuses on ADC’s efforts to improve research data management in the agri-food sector, specifically through the development of the “Semantic Engine” suite of tools. A central theme is the importance of “FAIR” data (Findable, Accessible, Interoperable, Reusable) and how machine-readable data schemas, particularly using the Overlays Capture Architecture (OCA) standard, contribute to achieving this goal. The discussion also highlights the advantages of derived identifiers (digests) over assigned identifiers for reproducibility, authenticity, and decentralization, and ADC’s ongoing work to integrate their tools and schemas into existing research infrastructure.

Key Themes and Important Ideas

1. Improving Research Data Management in a Decentralized Ecosystem: The research data ecosystem is described as highly decentralized, made up of independent research groups. Although the ecosystem is guided by best practices, mandates for standardized approaches are often slow to take hold. Incentives can conflict, particularly the “publish or perish” culture versus the time needed for thorough data documentation. Long-term planning (e.g., 50-year repository funding) is a crucial consideration. ADC, a project at the University of Guelph funded by multiple Canadian sources (CFREF, Genome Canada, UoG, OMAFA, Compute Ontario, etc.), aims to address these challenges by working directly with researchers.
2. Making Agri-food Data FAIR: A core objective of ADC is to make agri-food data FAIR:
  1. Findable: Ability to identify and locate data resources and their context.
  2. Accessible: Ability to access (with permission) and use data once found, often requiring open protocols.
  3. Interoperable: Using standards for data to ensure compatibility, including standard vocabularies.
  4. Reusable: Data with sufficient context (licenses, provenance, descriptions) can be reused by others for replication or new research. Carly Huitema states: “Findable is the ability to identify and find resources as well as their context. If it’s accessible that once found you can access it with permission and use it interoperable. Certainly lots of our work at Trust Over IP is about how to ensure interoperability of standards including vocabularies and reusable that that there are licenses and provenance and other things that help make this data reusable.”
3. Data Requires Context: Data alone is insufficient; it needs context to be useful. This context includes details like sample source, analysis methods, data schemas, catalogue information, data licenses, data governance agreements, associated publications, methodologies, scripts, and contributors.
4. The Semantic Engine: ADC is developing the Semantic Engine, a suite of tools designed to help researchers create “rich contextual and machine-readable data schemas.”
  1. The Semantic Engine aims to make the process of documenting data less daunting for researchers.
  2. It functions as a self-teaching web app, providing guidance and tutorials.
  3. The engine uses the Overlays Capture Architecture (OCA) standard for writing schemas.
5. Data Schemas and Overlays Capture Architecture (OCA):
  1. A data schema describes the attributes of a dataset (e.g., columns in a table) and provides detailed information about them (type, units, description, format, etc.).
  2. OCA is highlighted as an international and open standard for documenting schemas, developed by the Human Colossus Foundation. Two key advantages of OCA:
    1. Embeds Digests: OCA schemas can embed derived identifiers (digests/fingerprints) for the schema itself and for its constituent parts. This is crucial for reproducibility and authenticity of digital artifacts.
    2. Organized by Features: OCA structures schemas by features (e.g., all descriptions, all units) rather than attribute by attribute, as is typical of formats such as JSON-LD and XML Schema. This organization offers advantages:
      1. Task-based Governance: Allows for governance at the feature level (e.g., assigning responsibility for translation features).
      2. Optimized for Feature Management: Facilitates adding or removing features (like languages or units) without altering the identifiers of other features.
      3. Mix-and-Match: Enables easier combination and reuse of different schema components.
  3. ADC has developed an “OCA Package” that wraps the core OCA standard with extensions for community-specific features and developing standards, so these features can migrate gradually into the core standard as they become accepted.
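The feature-based organization described in item 5 can be illustrated with a small sketch. This is not the actual OCA file format or its digest encoding: the field names (`capture_base`, `units`, `labels`), the overlay names, and the use of SHA-256 over canonical JSON are all assumptions made for illustration. The point it shows is that when each feature is a separate object with its own digest, adding a new feature (here, French labels) leaves the digests of the existing features unchanged.

```python
import hashlib
import json


def digest(obj: dict) -> str:
    """Hash a canonical JSON form so equal content always gives the same digest."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical base object: it lists the attributes only.
capture_base = {"attributes": {"animal_id": "Text", "weight": "Numeric"}}
base_id = digest(capture_base)

# Each feature lives in its own overlay that points at the base by digest.
overlays = {
    "unit": {"capture_base": base_id, "units": {"weight": "kg"}},
    "label_en": {
        "capture_base": base_id,
        "language": "en",
        "labels": {"animal_id": "Animal ID", "weight": "Weight"},
    },
}
digests_before = {name: digest(body) for name, body in overlays.items()}

# Add a French label overlay as a new feature.
overlays["label_fr"] = {
    "capture_base": base_id,
    "language": "fr",
    "labels": {"animal_id": "Identifiant de l'animal", "weight": "Poids"},
}
digests_after = {name: digest(body) for name, body in overlays.items()}

# The existing overlays keep the digests they had before the addition.
unchanged = all(
    digests_after[name] == digests_before[name] for name in digests_before
)
print(unchanged)
```

In an attribute-by-attribute layout, the same French labels would be written inside each attribute's entry, so the single schema object (and therefore its digest) would change.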
6. Assigned vs. Derived Identifiers:
  1. Assigned Identifiers (Names): Created by a governance body, linked to an object via a lookup table. Resolution requires trusting the authoritative governance body and their lookup table. “If you find an object, you cannot figure out the identifier – you must go to the authoritative body and look it up in their table.” Resolution services can only be hosted or delegated by the governance body.
  2. Derived Identifiers (Digests/Fingerprints): Calculated directly from a digital object using a hashing function. They are unique fingerprints for a specific version of the object. Key for reproducibility and authenticity: “You can identify the resource originally used. You can verify the resource is the same one that was originally used.” Anyone can calculate a derived identifier, build a resolution service, and verify the resolution service is pointing to the correct object. Derived identifiers enable objects to be hosted in multiple locations. Carly Huitema humorously quotes, “If you liked it, then you should have put a digest on it.” Derived identifiers are excellent for snapshots but do not handle dynamic content or versioning directly. Versioning requires a governing authority or a decentralized identifier (DID) system where subsequent versions are linked and controlled.
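The properties of derived identifiers described in item 6 can be sketched in a few lines. The example assumes SHA-256 as the hashing function and uses made-up schema content; OCA defines its own digest encoding, which this sketch does not attempt to reproduce. It shows that anyone holding the object can calculate the identifier, that a copy hosted elsewhere verifies against the same identifier, and that any change to the content yields a different identifier.

```python
import hashlib


def derive_identifier(content: bytes) -> str:
    """Calculate a digest directly from the object (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


def verify(content: bytes, expected_digest: str) -> bool:
    """Check that the retrieved object is the one the digest was derived from."""
    return derive_identifier(content) == expected_digest


# Hypothetical schema content, used only for illustration.
schema_v1 = b'{"attributes": {"farm_id": "Text", "milk_yield": "Numeric"}}'
digest_v1 = derive_identifier(schema_v1)

# A byte-identical copy hosted at another location verifies against the same digest.
mirror_copy = bytes(schema_v1)
print(verify(mirror_copy, digest_v1))

# An edited version is a different object and no longer matches the old digest.
schema_v2 = schema_v1.replace(b"Numeric", b"Text")
print(verify(schema_v2, digest_v1))
```

The second check also illustrates the versioning limitation noted above: the digest alone says that `schema_v2` is a different object, not that it is a successor of `schema_v1`.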
7. Tools Provided by the Semantic Engine:
  1. Schema Authoring Web App: Guides researchers through creating machine-readable schemas.
  2. Data Entry Excel Generator: Creates an Excel spreadsheet with headers and schema descriptions based on the authored schema, helping standardize data collection. Includes the schema’s derived identifier.
  3. Data Entry on the Web / Data Verification Engine: A tool to verify datasets against the rules defined in a schema. Allows researchers to quickly check for inconsistencies before combining data from multiple sources.
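To make the idea of verifying data against a schema concrete, the following is a minimal sketch. It is not the Semantic Engine's implementation: the rule format (`type`, `required`, `allowed`), the column names, and the `check_row` helper are all hypothetical. It checks each row for missing required values, non-numeric entries in numeric columns, entries outside an allowed list, and columns that the schema does not describe.

```python
# Hypothetical schema rules: one entry per attribute (column).
schema = {
    "animal_id": {"type": "text", "required": True},
    "weight_kg": {"type": "numeric", "required": True},
    "breed": {"type": "text", "required": False, "allowed": {"Holstein", "Jersey"}},
}


def check_row(row: dict, rules: dict) -> list:
    """Return a list of problems found in one row; an empty list means it passed."""
    problems = []
    for name, rule in rules.items():
        value = row.get(name, "")
        if value == "":
            if rule["required"]:
                problems.append(f"{name}: missing required value")
            continue
        if rule["type"] == "numeric":
            try:
                float(value)
            except ValueError:
                problems.append(f"{name}: '{value}' is not numeric")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: '{value}' is not an allowed entry")
    for name in row:
        if name not in rules:
            problems.append(f"{name}: column is not described in the schema")
    return problems


# Values are strings, as they would be when read from a CSV file.
rows = [
    {"animal_id": "A1", "weight_kg": "612.5", "breed": "Holstein"},
    {"animal_id": "", "weight_kg": "heavy", "breed": "Angus"},
]
report = [check_row(row, schema) for row in rows]
for number, problems in enumerate(report, start=1):
    print(number, problems)
```

Running the same check over files from several sources before merging them is one way inconsistencies could be caught early.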
8. Integration into Research Infrastructure: ADC is working to integrate their schemas and tools into existing Canadian and international research infrastructure.
  1. Schemas can be deposited into long-term research data repositories (e.g., Borealis in Canada), often receiving assigned identifiers like DOIs.
  2. These schemas, with their embedded derived identifiers, can then be found through federated search engines that index multiple repositories (e.g., Lunaris in Canada, OpenAIRE in Europe).
  3. This allows researchers to publish papers referencing schemas by their identifiers, enabling others to find and verify the schema used.
9. Addressing IP and Sensitive Data:
  1. The Semantic Engine itself does not store user data or schemas, reducing IP concerns related to the platform.
  2. Schemas are generally less sensitive than the actual data, allowing them to be more openly shared. This enables discovery of datasets and potential collaborations without exposing proprietary information.
  3. Schemas can include flags for sensitive data attributes (e.g., farm location). While ADC’s tools don’t currently enforce access controls based on these flags, this information in the machine-readable schema can be used by internal pipelines or other systems to manage sensitive data appropriately (e.g., triggering anonymization).
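As an illustration of how a downstream pipeline might act on sensitive-attribute flags, here is a minimal sketch. The flag format, the attribute names, the sample values, and the `anonymize` helper are hypothetical, and simple masking stands in for whatever anonymization a real pipeline would apply. ADC's tools do not enforce this behaviour.

```python
def anonymize(record: dict, sensitive: dict) -> dict:
    """Return a copy of the record with flagged attributes masked."""
    return {
        name: "REDACTED" if sensitive.get(name, False) else value
        for name, value in record.items()
    }


# Hypothetical flags, as they might be read from a machine-readable schema.
sensitive_flags = {"farm_location": True, "milk_yield_l": False}

# Made-up record for illustration.
record = {"farm_location": "Lot 12, Concession 3", "milk_yield_l": 28.4}

shared = anonymize(record, sensitive_flags)
print(shared)
```

Because the flags travel with the schema rather than with the data, the same rule can be applied wherever the schema is used.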
10. Future Directions: ADC plans to continue integrating with research infrastructure, add digests as identifiers to more objects, and develop tools for other machine-readable standards (e.g., cataloging metadata, policy rules). They also aim to increase the number of features supported in the schema description process (e.g., range rules, ontology framing).

Key Takeaways for ToIP

The FAIR data principles are highly relevant to decentralized ecosystems and align well with ToIP goals.
Derived identifiers (digests) offer significant advantages for reproducibility, authenticity, and decentralized resolution, making them a powerful tool for digital objects within a trust framework.
The architecture of data schemas (feature-by-feature vs. attribute-by-attribute) has implications for governance, versioning, and the application of derived identifiers to schema components.
Integrating decentralized identity and verifiable credentials concepts (like OCA) into existing research infrastructure can enhance discoverability, interoperability, and trust in scientific data.
The Semantic Engine provides a practical example of building user-friendly tools to generate machine-readable metadata, addressing the challenge of widespread adoption of such standards.

For more details, including the meeting transcript, please see our wiki page: 2025-05-01 Agri-food Data Canada – Carly Huitema.
