Story: Introduce ores.seeder component for database test-data generation
This page documents a story in Sprint 19. It captures the goal, current status, acceptance criteria, and the tasks that compose it.
Goal
Introduce a new tooling component ores.seeder that owns
database-level synthetic test-data generation — distinct from
ores.synthetic which generates synthetic data inside the C++
process at runtime. The seeder emits SQL INSERT populate
scripts at codegen time; its first user is the existing slovaris
"imaginary world" reference-data set.
The story supersedes the slovaris-to-org migration task under the codegen org-mode migration story. The brainstorm concluded that bulk data (100 country/currency rows etc.) does not benefit from literate org-mode the way source-code descriptors do; the right move is to keep the JSON representation but extract the machinery into a dedicated component with proper documentation, compass plumbing, and a roadmap toward synthesised data.
Why now
- Slovaris is the last unresolved item in the codegen migration story; resolving it via this story closes that story out.
- Promoting test-data generation to a first-class component surfaces a sibling concern that was hiding inside `ores.codegen`'s `models/` dir.
- The codegen entity migrations (refdata, junctions, etc.) have been preserving `generator_expr` column hints. Those hints are the seed of the documented-but-not-built future: a synthesised flavour that reads an entity model + the Faker hints and emits a Python-Faker script that produces SQL inserts. No use case yet; the future is a roadmap, not scope.
Status
| Field | Value |
|---|---|
| State | DONE |
| Parent sprint | Sprint 19 |
| Now | Nothing. |
| Waiting on | Nothing. |
| Next | Nothing. |
| Last touched | 2026-06-05 |
Acceptance
- New component `projects/ores.seeder/` exists with the standard ORE Studio component shape: `modeling/component_overview.org`, `seeder.sh` CLI, `src/seeder.py` (thin shim importing `codegen.generator`), `datasets/` dir.
- Slovaris is the first inhabitant of `datasets/`. Six JSONs (`manifest`, `model`, `catalogs`, `country_currency`, `datasets`, `tags`) plus `methodology.txt` move from `projects/ores.codegen/models/slovaris/` to `projects/ores.seeder/datasets/slovaris/`. `models/slovaris/` becomes empty and is removed.
- Each dataset has its own `model.json` (batch driver); slovaris keeps its existing shape and new datasets follow the same convention.
- The `generate_solvaris_refdata.sh` script (or its successor) reads the new path and produces the same SQL artefacts byte-identically.
- New compass subtype `dataset_overview` scaffolds `projects/ores.seeder/datasets/<name>/dataset_overview.org` with frontmatter (`name`, `version`, `dataset_type`, `source_methodology`) + sections (`Summary`, `Contents`, `Why this dataset`, `See also`). New mustache template `doc_dataset.org.mustache` under `projects/ores.codegen/library/templates/`.
- `ores.seeder` is added to System Model: Tooling Layer.
- Templates (`sql_catalog_populate.mustache`, `sql_country_populate.mustache`, etc.) stay in `ores.codegen/library/templates/`. `ores.seeder` is thin: it delegates rendering to `ores.codegen.generator`.
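The byte-identical criterion above could be checked mechanically. The following is a minimal sketch, not the project's actual verification step: it assumes the SQL artefacts generated before and after the move are available in two separate directories, and the function names `digest_tree` and `changed_files` are hypothetical.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each .sql file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*.sql"))
    }


def changed_files(before: Path, after: Path) -> list[str]:
    """Return relative paths that are missing on one side or differ in content."""
    old, new = digest_tree(before), digest_tree(after)
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))
```

An empty list from `changed_files` means every artefact matched byte for byte. Hashing raw bytes rather than decoded text means line-ending or encoding drift also counts as a difference.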
Tasks
| Task | State | Start | End | Description |
|---|---|---|---|---|
| Scaffold ores.seeder component + relocate slovaris dataset | DONE | 2026-06-02 | 2026-06-05 | Component scaffold + slovaris move + tooling-page wire-in + sprint table wire-in. |
Decisions
Database-level test data is a distinct concern
ores.synthetic owns C++/in-process synthetic data; ores.seeder
owns database-level data emitted as files (SQL inserts). The split
lets the two concerns evolve independently: runtime synthetic
data ships in the binary; seeder output ships as `.sql` populate
scripts loaded at DB setup.
Crafted now, synthesised later
The MVP supports only crafted JSON-driven datasets (the existing
slovaris pattern). The future synthesised flavour — read a codegen
entity model, use the column generator_expr hints, emit a
Python+Faker script that produces SQL inserts — is documented in
the component overview as a roadmap. Not built; no use case yet.
The codegen entity migrations have already been preserving the
generator_expr column on junctions, lookup entities, etc. Those
hints are the substrate the synthesised flavour will consume.
Thin component delegating to ores.codegen
ores.seeder owns the CLI, the dataset descriptors, and the
per-dataset model.json. It does not own the renderer, the
mustache templates, or load_data. Those stay in ores.codegen
and ores.seeder's seeder.py is a thin shim that imports
them. Reasoning: avoid duplicating the renderer; let
ores.codegen keep evolving in one place.
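To illustrate what "thin" means here, the sketch below shows a shim that only parses arguments, locates the dataset's model.json, and hands off to a renderer. It is an illustration under stated assumptions, not the actual seeder.py: the story does not show the entry point exposed by ores.codegen, so the renderer is passed in as a plain callable, and the `--output-dir` flag and the `main` signature are hypothetical.

```python
import argparse
from pathlib import Path
from typing import Callable, Sequence

# Stand-in for whatever ores.codegen's generator exposes; the real entry point
# and its signature are not shown in this story, so this type is an assumption.
Renderer = Callable[[Path, Path], None]


def main(argv: Sequence[str], render: Renderer, datasets_root: Path) -> int:
    """Resolve a dataset's model.json and hand it to the renderer."""
    parser = argparse.ArgumentParser(prog="seeder")
    parser.add_argument("dataset", help="directory name under datasets/")
    parser.add_argument("--output-dir", type=Path, default=Path("out"))  # hypothetical flag
    args = parser.parse_args(argv)

    model = datasets_root / args.dataset / "model.json"
    if not model.is_file():
        print(f"no model.json for dataset {args.dataset!r}")
        return 1
    render(model, args.output_dir)  # all rendering logic lives behind this call
    return 0
```

Nothing in the shim knows about templates or SQL, so any change to rendering happens in one place on the ores.codegen side.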
No literate org for the data files
The brainstorm explicitly considered migrating slovaris JSONs to
org (the original task scope) and rejected it: bulk data does not
benefit from literate org the way source-code descriptors do
(generator expressions, FK relationships, domain rationale).
A 100-row org table is just an uglier JSON list. The dataset-level
documentation lives in dataset_overview.org; the rows
themselves stay JSON.
Per-dataset model.json keeps the existing shape
Each dataset has its own model.json declaring its
(template, output, source) tuples. Slovaris keeps its existing
file unchanged. Two alternatives were rejected: a top-level registry
would weaken per-dataset self-containment, and deriving the tuples
from dataset_overview.org would require the org route we declined
for the data files.
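As a rough picture of a per-dataset batch driver, the sketch below reads a model.json and turns it into (template, output, source) tuples. The actual slovaris file layout is not reproduced in this story, so the `entries` key, the field names, and the `RenderJob` and `load_jobs` names are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import NamedTuple


class RenderJob(NamedTuple):
    template: str  # mustache template name, resolved by the renderer
    output: str    # SQL file to write
    source: Path   # JSON data file inside the dataset directory


def load_jobs(dataset_dir: Path) -> list[RenderJob]:
    """Read <dataset_dir>/model.json and resolve each source against the dataset."""
    model = json.loads((dataset_dir / "model.json").read_text(encoding="utf-8"))
    return [
        RenderJob(entry["template"], entry["output"], dataset_dir / entry["source"])
        for entry in model["entries"]  # "entries" is an assumed key, not the real schema
    ]
```

Because sources resolve relative to the dataset directory, a dataset can be moved or copied as a unit without touching any shared registry.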
Out of scope
- Synthesised data flavour (Python+Faker scripts derived from entity models). Roadmap only.
- Self-contained `ores.seeder` templates. Templates stay in `ores.codegen` for now; relocate if/when seeder evolves a distinct templating model.
- Test data for entities other than the existing slovaris taxonomy. Future datasets follow the same convention but aren't in this story's scope.