Story: Data Quality subsystem and Data Librarian
This page documents a story in Sprint 09. It captures the goal, current status, acceptance criteria, and the tasks that compose it.
Goal
Stand up Data Quality as a first-class subsystem of ORE Studio: concept model, domain types + FpML reference data, ER refactor, messaging subsystem with breaking protocol bump, Data Librarian UI, subject-area surface, circular-dependency resolution, publication workflow, and a fake-world validation dataset.
Status
| Field | Value |
|---|---|
| State | DONE |
| Parent sprint | Sprint 09 |
| Now | Completed 2026-01-20. |
| Waiting on | None. |
| Next | None. |
| Last touched | 2026-01-20 |
Continued in: Librarian polish (sprint 10) — PR-review feedback on the librarian landed here.
Acceptance
- ores.dq component present with 10 DQ domain types.
- FpML non-ISO currencies / business centers / business processes integrated.
- ER diagram standardised on <component>_<entity>_tbl.
- Binary protocol carries DQ subsystem 0x6000-0x6FFF (protocol 22.0).
- Data Librarian three-panel MDI window exposes the subsystem.
- Subject areas have dedicated controller + detail + history dialogs.
- DQ-IAM circular dependency resolved by relocating change-reason constants to ores.database.
- Publication wizard promotes staged datasets with dependency resolution.
- Fake-world dataset round-trips through the system.
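
The <component>_<entity>_tbl convention in the acceptance list can be checked mechanically. The following is a minimal sketch, not code from the repository: `make_table_name` and `follows_table_convention` are hypothetical helper names, and the check assumes the only rule is a non-empty component, an underscore, a non-empty entity, and the `_tbl` suffix.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: builds "<component>_<entity>_tbl" from its two parts.
std::string make_table_name(std::string_view component, std::string_view entity) {
    assert(!component.empty() && !entity.empty());
    std::string name;
    name.reserve(component.size() + entity.size() + 5);
    name.append(component).append("_").append(entity).append("_tbl");
    return name;
}

// Hypothetical check: true when the name ends in "_tbl" and the part before
// the suffix contains an underscore with text on both sides of it.
bool follows_table_convention(std::string_view name) {
    constexpr std::string_view suffix = "_tbl";
    if (name.size() <= suffix.size()) return false;
    if (name.substr(name.size() - suffix.size()) != suffix) return false;
    const std::string_view stem = name.substr(0, name.size() - suffix.size());
    const auto sep = stem.find('_');
    return sep != std::string_view::npos && sep > 0 && sep + 1 < stem.size();
}
```

Under these assumptions, `make_table_name("dq", "catalog")` yields the dq_catalog_tbl name used in the task table, while a bare name such as `catalog` is rejected.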
Tasks
| Task | State | Start | End | Description |
|---|---|---|---|---|
| Create Data Quality infrastructure | DONE | 2026-05-19 | 2026-01-15 | Stand up ores.dq with domain types backing the user-driven sample-data workflow: dataset, lineage, provenance, classification, temporal context, data passport, granularity options. |
| DQ domain types and FpML reference data | DONE | 2026-05-19 | 2026-01-15 | 10 DQ domain types (data_domain, subject_area, catalog, coding_scheme_authority_type, coding_scheme, origin_dimension, nature_dimension, treatment_dimension, methodology, dataset) with JSON + table I/O; FpML non-ISO currencies / business centers / business processes integrated; faker-cxx generators; year_month_day reflector. |
| DQ catalog and ER refactor | DONE | 2026-05-19 | 2026-01-15 | dq_catalog_tbl as top-level grouping; ER diagram (ores_schema.puml) standardised to <component>_<entity>_tbl naming and re-packaged by component prefix; missing DQ + telemetry + geo entities integrated. |
| Data Librarian UI | DONE | 2026-05-19 | 2026-01-19 | Three-panel Data Librarian MDI window: dataset browser, accession card details, methodology panel; IP geolocation catalog + iptoasn staging; Visual Assets catalog; TCP_NODELAY for parallel-request latency; menu restructuring (System / Data); Origin 'Source' renamed to 'Primary'. |
| DQ subject areas management | DONE | 2026-05-19 | 2026-01-19 | SubjectAreaController + Detail + History dialogs; menu integration; version history with revert; read-only versioned view. |
| DQ messaging subsystem | DONE | 2026-05-19 | 2026-01-19 | New 0x6000-0x6FFF subsystem in the binary protocol; CRUD + history for every DQ entity; change-management messages migrated from IAM (0x2050-0x2061) to DQ (0x6070-0x6081); protocol bump 21.3 -> 22.0. |
| Resolve DQ-IAM circular dependency | DONE | 2026-05-19 | 2026-01-19 | Relocate change_reason_constants.hpp to the more foundational ores.database module; system models + CMake deps synchronised; deterministic time values via make_timepoint helper for DQ tests; skill docs updated. |
| Data Librarian publication workflow | DONE | 2026-05-19 | 2026-01-20 | Multi-page publication wizard (upsert / insert_only / replace_all) with dependency resolution via Boost.Graph topological sort; stable dataset codes; publication history dialog; UI rename to 'methodology' and 'data passport'. |
| Add fake-world dataset | DONE | 2026-05-19 | 2026-01-20 | Use the previously spec'd 'fake world' dataset to validate the dataset infrastructure end-to-end. |
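
The message-id ranges quoted for the DQ messaging subsystem can be illustrated with a few constants and range checks. This is a sketch under stated assumptions: all identifiers are hypothetical, and the one-to-one, order-preserving mapping from the old IAM ids to the new DQ ids is assumed from the fact that both ranges have the same size, not confirmed by the source.

```cpp
#include <cassert>
#include <cstdint>

// Ranges quoted in the task table; the constant names are hypothetical.
constexpr std::uint16_t dq_first = 0x6000;
constexpr std::uint16_t dq_last = 0x6FFF;
constexpr std::uint16_t iam_cm_first = 0x2050;  // old change-management range
constexpr std::uint16_t iam_cm_last = 0x2061;
constexpr std::uint16_t dq_cm_first = 0x6070;   // new change-management range
constexpr std::uint16_t dq_cm_last = 0x6081;

// Both change-management ranges hold the same number of ids.
static_assert(iam_cm_last - iam_cm_first == dq_cm_last - dq_cm_first,
              "old and new change-management ranges differ in size");

// True when the id falls inside the DQ subsystem range.
constexpr bool is_dq_message(std::uint16_t id) {
    return id >= dq_first && id <= dq_last;
}

// True when the id falls inside the old IAM change-management range.
constexpr bool is_legacy_change_management(std::uint16_t id) {
    return id >= iam_cm_first && id <= iam_cm_last;
}

// Assumption: ids keep their relative position when moved from IAM to DQ.
constexpr std::uint16_t migrate_change_management_id(std::uint16_t legacy) {
    assert(is_legacy_change_management(legacy));
    return static_cast<std::uint16_t>(legacy - iam_cm_first + dq_cm_first);
}
```

If the assumed mapping holds, 0x2050 maps to 0x6070 and 0x2061 maps to 0x6081, and every migrated id lands inside the 0x6000-0x6FFF subsystem range.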
Decisions
- FpML as the reference-data anchor: industry vocabulary by default; avoids inventing our own currencies/business-centers terminology.
- Lean into the librarian metaphor: Catalog, Volume, Formula, Accession Card, Stacks; the naming pays back in the mental model.
- Change-management migrates to DQ subsystem: it's data-quality machinery, not identity machinery; the IAM home was wrong.
- Boost.Graph for dependency resolution: a well-tested topo-sort is cheaper than rolling our own; we'll consume more graph algorithms as the system grows.
- Fake-world dataset as smoke test: deliberately synthetic so we can validate the infrastructure without licensing concerns.
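
To make the dependency-resolution decision concrete, here is a small sketch of ordering datasets so that prerequisites are published first. The story uses Boost.Graph's topological sort; this example deliberately swaps in a standard-library implementation of Kahn's algorithm so that it stays self-contained. The function name `publication_order`, the `dependency` alias, and the dataset names in the note below are all hypothetical.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Each pair is (dataset, prerequisite): the prerequisite is published first.
using dependency = std::pair<std::string, std::string>;

// Returns a valid publication order, or std::nullopt if the dependencies
// contain a cycle. Kahn's algorithm, used here as a stand-in for Boost.Graph.
std::optional<std::vector<std::string>> publication_order(
    const std::vector<std::string>& datasets,
    const std::vector<dependency>& dependencies) {
    std::map<std::string, std::vector<std::string>> dependents;
    std::map<std::string, int> pending;  // unpublished prerequisites per dataset
    for (const auto& name : datasets) pending[name] = 0;
    for (const auto& [dataset, prerequisite] : dependencies) {
        pending.try_emplace(prerequisite, 0);
        dependents[prerequisite].push_back(dataset);
        ++pending[dataset];
    }

    // Start with every dataset that has no prerequisites.
    std::queue<std::string> ready;
    for (const auto& [name, count] : pending)
        if (count == 0) ready.push(name);

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::string current = ready.front();
        ready.pop();
        order.push_back(current);
        for (const auto& next : dependents[current]) {
            assert(pending[next] > 0);  // each dependency edge is consumed once
            if (--pending[next] == 0) ready.push(next);
        }
    }

    // Any dataset left unpublished is part of a dependency cycle.
    if (order.size() != pending.size()) return std::nullopt;
    return order;
}
```

With two hypothetical datasets where `currencies` depends on `countries`, the function places `countries` before `currencies`; if each depended on the other, it would return `std::nullopt` instead of an order.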
Out of scope
- Real-world data acquisition (vendor integrations) — separate workstream.
- DQ scoring / quality gates — covered conceptually but not implemented.
See also
- Database naming refactor — the ER refactor consumes its naming convention.
- Change management infrastructure (sprint 08) — change-management messages migrate from there into DQ here.