- YAML Frontmatter: Carries document metadata, provenance information, and processing instructions.
- Typed Fenced Blocks: Contain the data payloads, such as Turtle or JSON-LD graph data, SPARQL queries, or even AI prompts and manifests.
- Prose Sections: Provide the human-readable documentation and explanation for the data.
- Local Ground Truth: They let semantic content travel between systems without losing meaning, acting as a "holon": the boundary conditions and context travel with the artifact itself.
- LLM Integration: In AI workflows, DataBooks invert the standard model. Instead of treating data as ephemeral context for an LLM, the DataBook becomes the persistent, auditable artifact, while the LLM acts as one of several "transformation engines" used to enrich or process it.
- Auditable Pipelines: They include "process stamps" that record which transformer (AI or human) operated on what inputs at what time, creating a forensic trail for auditing and re-running pipeline stages.
- RDF (Resource Description Framework): Used to represent the data and dependency graphs.
- SPARQL: Allows the contents and dependency manifests of DataBooks to be queried as first-class semantic artifacts.
- Encryption Profiles: Designed to support sensitive data through encrypted fenced blocks that parsers can either decrypt or gracefully skip.
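Putting the pieces above together, a DataBook might look like the following sketch. The frontmatter keys, process-stamp fields, and RDF prefixes are illustrative assumptions, not a published DataBook vocabulary:

````markdown
---
databook: "0.1"                      # hypothetical version key
id: urn:example:databook/42
process-stamps:                      # illustrative stamp fields
  - transformer: llm:enrich-labels
    inputs: [urn:example:databook/41]
    at: 2024-05-02T10:15:00Z
---

## Sensor observations

Human-readable explanation of the graph data below.

```turtle
@prefix ex: <http://example.org/> .
ex:sensor1 ex:reading 42 ;
    ex:unit ex:Celsius .
```

```sparql
# The embedded graph can be queried as a first-class artifact
PREFIX ex: <http://example.org/>
SELECT ?sensor ?value WHERE { ?sensor ex:reading ?value }
```
````

The prose, the payload, and the provenance all live in one file, so the artifact can be read by a human, parsed by a machine, and audited after the fact.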
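The typed-fenced-block and encryption-profile ideas can be sketched as a minimal reader. This is a stdlib-only illustration, assuming the fence info string names the block type and that encrypted payloads are typed `encrypted`; it is not a reference implementation:

````python
import re

# Match a fenced block: ```<type> ... ``` (type on the opening fence).
FENCE = re.compile(r"^```(\w+)\n(.*?)^```$", re.M | re.S)

def read_databook(text):
    """Split YAML frontmatter from the body, then collect typed fenced
    blocks. Blocks typed 'encrypted' are skipped gracefully rather than
    failing, as the encryption-profile idea suggests."""
    frontmatter = None
    if text.startswith("---\n"):
        end = text.find("\n---\n", 4)
        if end != -1:
            frontmatter = text[4:end]          # raw YAML, left unparsed here
            text = text[end + len("\n---\n"):]
    blocks = []
    for match in FENCE.finditer(text):
        btype, payload = match.group(1), match.group(2)
        if btype == "encrypted":               # no key available: skip it
            continue
        blocks.append((btype, payload.strip()))
    return frontmatter, blocks

doc = """---
id: urn:example:databook/42
---
Prose explaining the data.

```turtle
@prefix ex: <http://example.org/> .
ex:a ex:b ex:c .
```

```encrypted
AAAA-opaque-ciphertext
```
"""

fm, blocks = read_databook(doc)
print(fm)      # the raw frontmatter
print(blocks)  # only the turtle block; the encrypted one was skipped
````

A real parser would hand each payload to a handler keyed on the block type (a Turtle parser, a SPARQL engine, a decryption profile), but the skip-what-you-cannot-read behavior is the point illustrated here.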
The Ontologist

# DataBooks: Markdown as Semantic Infrastructure
"Something has been missing from the semantic web stack for a long time, and it's been hiding in plain sight. The RDF ecosystem has always known how to handle large, persistent, well-indexed knowledge graphs. Triple stores, SPARQL endpoints, federated query — these are mature, well-understood tools for managing graph data at scale. What the ecosystem has never handled well is everything else: the small, contextual, task-specific, ephemeral, or pipeline-stage graph content that makes up the majority of actual knowledge work. The data that doesn't need a database. The graph that lives for the duration of a process and then needs to be archived, referenced, or passed downstream. The semantic content that a human needs to read and a machine needs to process."


