Story: Data Quality subsystem and Data Librarian

This page documents a story in Sprint 09. It captures the goal, current status, acceptance criteria, and the tasks that compose it.

Goal

Stand up Data Quality (DQ) as a first-class subsystem of ORE Studio. The story covers the concept model, DQ domain types and FpML reference data, the ER refactor, a messaging subsystem with a breaking protocol bump, the Data Librarian UI, subject-area management, resolution of the DQ-IAM circular dependency, the publication workflow, and a fake-world validation dataset.

Status

Field	Value
State	DONE
Parent sprint	Sprint 09
Now	Completed 2026-01-20.
Waiting on	None.
Next	None.
Last touched	2026-01-20

Continued in: Librarian polish (Sprint 10), which picks up the PR-review feedback on the librarian.

Acceptance

ores.dq component present with 10 DQ domain types.
FpML non-ISO currencies / business centers / business processes integrated.
ER diagram standardised on <component>_<entity>_tbl table naming.
Binary protocol carries the DQ subsystem in message range 0x6000-0x6FFF (protocol 22.0).
Data Librarian three-panel MDI window exposes the subsystem.
Subject areas have dedicated controller + detail + history dialogs.
DQ-IAM circular dependency resolved by relocating change-reason constants to ores.database.
Publication wizard promotes staged datasets with dependency resolution.
Fake-world dataset round-trips through the system.
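
The <component>_<entity>_tbl naming rule above can be expressed as a small check. The following is a minimal C++17 sketch, not code from the ORE Studio sources: parse_table_name and table_name_parts are hypothetical names, and the sketch assumes the component prefix contains no underscore (as in dq_catalog_tbl), so the first underscore separates component from entity.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: the two parts of <component>_<entity>_tbl.
struct table_name_parts {
    std::string component;
    std::string entity;
};

// Hypothetical helper: returns the parts if the name follows the
// convention, std::nullopt otherwise.
inline std::optional<table_name_parts> parse_table_name(std::string_view name) {
    constexpr std::string_view suffix = "_tbl";
    // The name must be longer than the suffix and must end with it.
    if (name.size() <= suffix.size() ||
        name.substr(name.size() - suffix.size()) != suffix) {
        return std::nullopt;
    }
    name.remove_suffix(suffix.size());

    // Assumption: the component prefix has no underscore, so the first
    // underscore is the component/entity separator.
    const auto sep = name.find('_');
    if (sep == std::string_view::npos || sep == 0 || sep + 1 == name.size()) {
        return std::nullopt;
    }
    return table_name_parts{std::string(name.substr(0, sep)),
                            std::string(name.substr(sep + 1))};
}

// Usage example with the one table name quoted on this page.
inline void naming_examples() {
    const auto parts = parse_table_name("dq_catalog_tbl");
    assert(parts.has_value());
    assert(parts->component == "dq");
    assert(parts->entity == "catalog");
    assert(!parse_table_name("catalog_tbl").has_value());  // no component
}
```

A check like this could run over the table names in the schema to flag any that drift from the convention; whether the project does so is not stated here.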

Tasks

Task	State	Start	End	Description
Create Data Quality infrastructure	DONE	2026-05-19	2026-01-15	Stand up ores.dq with domain types backing the user-driven sample-data workflow: dataset, lineage, provenance, classification, temporal context, data passport, granularity options.
DQ domain types and FpML reference data	DONE	2026-05-19	2026-01-15	10 DQ domain types (data_domain, subject_area, catalog, coding_scheme_authority_type, coding_scheme, origin_dimension, nature_dimension, treatment_dimension, methodology, dataset) with JSON + table I/O; FpML non-ISO currencies / business centers / business processes integrated; faker-cxx generators; year_month_day reflector.
DQ catalog and ER refactor	DONE	2026-05-19	2026-01-15	dq_catalog_tbl as top-level grouping; ER diagram (ores_schema.puml) standardised to <component>_<entity>_tbl naming and re-packaged by component prefix; missing DQ + telemetry + geo entities integrated.
Data Librarian UI	DONE	2026-05-19	2026-01-19	Three-panel Data Librarian MDI window: dataset browser, accession card details, methodology panel; IP geolocation catalog + iptoasn staging; Visual Assets catalog; TCP_NODELAY for parallel-request latency; menu restructuring (System / Data); Origin 'Source' renamed to 'Primary'.
DQ subject areas management	DONE	2026-05-19	2026-01-19	SubjectAreaController + Detail + History dialogs; menu integration; version history with revert; read-only versioned view.
DQ messaging subsystem	DONE	2026-05-19	2026-01-19	New 0x6000-0x6FFF subsystem in the binary protocol; CRUD + history for every DQ entity; change-management messages migrated from IAM (0x2050-0x2061) to DQ (0x6070-0x6081); protocol bump 21.3 -> 22.0.
Resolve DQ-IAM circular dependency	DONE	2026-05-19	2026-01-19	Relocate change_reason_constants.hpp to the more foundational ores.database module; system models + CMake deps synchronised; deterministic time values via make_timepoint helper for DQ tests; skill docs updated.
Data Librarian publication workflow	DONE	2026-05-19	2026-01-20	Multi-page publication wizard (upsert / insert_only / replace_all) with dependency resolution via Boost.Graph topological sort; stable dataset codes; publication history dialog; UI rename to 'methodology' and 'data passport'.
Add fake-world dataset	DONE	2026-05-19	2026-01-20	Use the previously spec'd 'fake world' dataset to validate the dataset infrastructure end-to-end.
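
The message ranges quoted in the messaging task can be illustrated with a short C++17 sketch. The numeric ranges come from the table above; everything else is an assumption made for illustration: the constant and function names are hypothetical, message ids are assumed to fit in 16 bits, and the change-management ids are assumed to keep their relative order when moving from the IAM range to the DQ range (the two ranges do have the same length).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Ranges quoted in the task table; all names here are hypothetical.
constexpr std::uint16_t dq_subsystem_first = 0x6000;
constexpr std::uint16_t dq_subsystem_last  = 0x6FFF;
constexpr std::uint16_t iam_change_first   = 0x2050;
constexpr std::uint16_t iam_change_last    = 0x2061;
constexpr std::uint16_t dq_change_first    = 0x6070;
constexpr std::uint16_t dq_change_last     = 0x6081;

// Both change-management ranges span the same number of ids (18).
static_assert(iam_change_last - iam_change_first ==
              dq_change_last - dq_change_first);

// True if the id falls inside the DQ subsystem range.
constexpr bool is_dq_message(std::uint16_t id) {
    return id >= dq_subsystem_first && id <= dq_subsystem_last;
}

// Assumption: ids keep their offset within the range when migrated.
// Returns std::nullopt for ids outside the old IAM change range.
constexpr std::optional<std::uint16_t>
migrate_change_message(std::uint16_t old_id) {
    if (old_id < iam_change_first || old_id > iam_change_last) {
        return std::nullopt;
    }
    return static_cast<std::uint16_t>(dq_change_first +
                                      (old_id - iam_change_first));
}

// Usage example: the endpoints of the old range map to the new endpoints.
inline void migration_examples() {
    assert(*migrate_change_message(0x2050) == 0x6070);
    assert(*migrate_change_message(0x2061) == 0x6081);
    assert(!migrate_change_message(0x2062).has_value());
    assert(is_dq_message(0x6070));
    assert(!is_dq_message(0x2050));
}
```

Because the ids change, a 21.3 peer and a 22.0 peer would disagree on where the change-management messages live, which is consistent with the bump being described as breaking.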

Decisions

FpML as the reference-data anchor: industry vocabulary by default; avoids inventing our own currencies/business-centers terminology.
Lean into the librarian metaphor: Catalog, Volume, Formula, Accession Card, Stacks; consistent naming pays off as a shared mental model.
Change-management migrates to DQ subsystem: it's data-quality machinery, not identity machinery; the IAM home was wrong.
Boost.Graph for dependency resolution: well-tested topo-sort is cheaper than rolling our own; we'll consume more graph algorithms as the system grows.
Fake-world dataset as smoke test: deliberately synthetic so we can validate the infrastructure without licensing concerns.
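
To make the dependency-resolution decision concrete, here is a minimal sketch of ordering datasets so that each one is published after the datasets it depends on. It is a stand-in, not the project's code: the story uses the Boost.Graph topological sort, whereas this sketch uses Kahn's algorithm with the standard library only, so that it stays self-contained. The names dependency_map and publication_order and the dataset codes in the usage example are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <queue>
#include <string>
#include <vector>

// Maps each dataset code to the codes it depends on (hypothetical type).
using dependency_map = std::map<std::string, std::vector<std::string>>;

// Returns dataset codes with dependencies first, or std::nullopt if the
// dependencies contain a cycle. Kahn's algorithm, standard library only.
inline std::optional<std::vector<std::string>>
publication_order(const dependency_map& deps) {
    std::map<std::string, std::size_t> pending;  // unmet dependency count
    std::map<std::string, std::vector<std::string>> dependents;  // reverse edges

    for (const auto& [dataset, needs] : deps) {
        pending.try_emplace(dataset, 0);
        for (const auto& dep : needs) {
            pending.try_emplace(dep, 0);  // dependencies are nodes too
            ++pending[dataset];
            dependents[dep].push_back(dataset);
        }
    }

    // Start with every dataset that has no unmet dependencies.
    std::queue<std::string> ready;
    for (const auto& [dataset, count] : pending) {
        if (count == 0) ready.push(dataset);
    }

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::string current = ready.front();
        ready.pop();
        order.push_back(current);
        // Publishing 'current' satisfies one dependency of each dependent.
        for (const auto& next : dependents[current]) {
            if (--pending[next] == 0) ready.push(next);
        }
    }

    // Any dataset left unordered is part of, or blocked by, a cycle.
    if (order.size() != pending.size()) return std::nullopt;
    return order;
}

// Usage example with hypothetical dataset codes.
inline void publication_examples() {
    const dependency_map deps{{"business_centers", {"countries"}},
                              {"countries", {"currencies"}},
                              {"currencies", {}}};
    const auto order = publication_order(deps);
    assert(order.has_value());
    assert((*order == std::vector<std::string>{"currencies", "countries",
                                               "business_centers"}));
}
```

With Boost.Graph the hand-written loop is replaced by a library call, which is the trade-off the decision describes: less code to maintain and test in exchange for a dependency on the library.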

Out of scope

Real-world data acquisition (vendor integrations) — separate workstream.
DQ scoring / quality gates — covered conceptually but not implemented.