Story: Data Quality subsystem and Data Librarian

This page documents a story in Sprint 09. It captures the goal, current status, acceptance criteria, and the tasks that compose it.

Goal

Stand up Data Quality as a first-class subsystem of ORE Studio: concept model, domain types + FpML reference data, ER refactor, messaging subsystem with breaking protocol bump, Data Librarian UI, subject-area surface, circular-dependency resolution, publication workflow, and a fake-world validation dataset.

Status

Field          Value
State          DONE
Parent sprint  Sprint 09
Now            Completed 2026-01-20.
Waiting on     None.
Next           None.
Last touched   2026-01-20

Continued in: Librarian polish (sprint 10) — PR-review feedback on the librarian landed here.

Acceptance

  • ores.dq component present with 10 DQ domain types.
  • FpML non-ISO currencies / business centers / business processes integrated.
  • ER diagram standardised on <component>_<entity>_tbl.
  • Binary protocol carries DQ subsystem 0x6000-0x6FFF (protocol 22.0).
  • Data Librarian three-panel MDI window exposes the subsystem.
  • Subject areas have dedicated controller + detail + history dialogs.
  • DQ-IAM circular dependency resolved by relocating change-reason constants to ores.database.
  • Publication wizard promotes staged datasets with dependency resolution.
  • Fake-world dataset round-trips through the system.
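The <component>_<entity>_tbl naming rule from the acceptance list can be illustrated with a small sketch. The helpers make_table_name and follows_table_convention are hypothetical names introduced here for illustration and are not taken from the ORE Studio code base. The check also assumes that the component prefix itself contains no underscore, which the page does not state.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: builds a table name that follows the
// <component>_<entity>_tbl convention, e.g. ("dq", "catalog").
std::string make_table_name(std::string_view component, std::string_view entity) {
    std::string name;
    name.reserve(component.size() + entity.size() + 5);
    name.append(component).append("_").append(entity).append("_tbl");
    return name;
}

// Hypothetical check: true when the name ends in _tbl and has a non-empty
// component and a non-empty entity separated by the first underscore.
bool follows_table_convention(std::string_view name) {
    constexpr std::string_view suffix = "_tbl";
    if (name.size() <= suffix.size()) return false;
    if (name.substr(name.size() - suffix.size()) != suffix) return false;
    const std::string_view stem = name.substr(0, name.size() - suffix.size());
    const auto sep = stem.find('_');
    // Require at least one character on each side of the first underscore.
    return sep != std::string_view::npos && sep > 0 && sep + 1 < stem.size();
}
```

Under these assumptions, make_table_name("dq", "catalog") yields dq_catalog_tbl, the table named in the task list, while a bare name such as catalog fails the check.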

Tasks

Task                                     State  Start       End         Description
Create Data Quality infrastructure       DONE   2026-05-19  2026-01-15  Stand up ores.dq with domain types backing the user-driven sample-data workflow: dataset, lineage, provenance, classification, temporal context, data passport, granularity options.
DQ domain types and FpML reference data  DONE   2026-05-19  2026-01-15  10 DQ domain types (data_domain, subject_area, catalog, coding_scheme_authority_type, coding_scheme, origin_dimension, nature_dimension, treatment_dimension, methodology, dataset) with JSON + table I/O; FpML non-ISO currencies / business centers / business processes integrated; faker-cxx generators; year_month_day reflector.
DQ catalog and ER refactor               DONE   2026-05-19  2026-01-15  dq_catalog_tbl as top-level grouping; ER diagram (ores_schema.puml) standardised to <component>_<entity>_tbl naming and re-packaged by component prefix; missing DQ + telemetry + geo entities integrated.
Data Librarian UI                        DONE   2026-05-19  2026-01-19  Three-panel Data Librarian MDI window: dataset browser, accession card details, methodology panel; IP geolocation catalog + iptoasn staging; Visual Assets catalog; TCP_NODELAY for parallel-request latency; menu restructuring (System / Data); Origin 'Source' renamed to 'Primary'.
DQ subject areas management              DONE   2026-05-19  2026-01-19  SubjectAreaController + Detail + History dialogs; menu integration; version history with revert; read-only versioned view.
DQ messaging subsystem                   DONE   2026-05-19  2026-01-19  New 0x6000-0x6FFF subsystem in the binary protocol; CRUD + history for every DQ entity; change-management messages migrated from IAM (0x2050-0x2061) to DQ (0x6070-0x6081); protocol bump 21.3 -> 22.0.
Resolve DQ-IAM circular dependency       DONE   2026-05-19  2026-01-19  Relocate change_reason_constants.hpp to the more foundational ores.database module; system models + CMake deps synchronised; deterministic time values via make_timepoint helper for DQ tests; skill docs updated.
Data Librarian publication workflow      DONE   2026-05-19  2026-01-20  Multi-page publication wizard (upsert / insert_only / replace_all) with dependency resolution via Boost.Graph topological sort; stable dataset codes; publication history dialog; UI rename to 'methodology' and 'data passport'.
Add fake-world dataset                   DONE   2026-05-19  2026-01-20  Use the previously spec'd 'fake world' dataset to validate the dataset infrastructure end-to-end.
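The message-range move described in the DQ messaging task can be sketched as follows. The numeric ranges come from the task table. Everything else is an assumption made for illustration: the identifier names, the 16-bit width of a message id, and the idea that each change-management message keeps its relative position when it moves from the IAM range to the DQ range.

```cpp
#include <cstdint>
#include <optional>

// Ranges quoted in the task table.
constexpr std::uint16_t dq_subsystem_first = 0x6000;
constexpr std::uint16_t dq_subsystem_last  = 0x6FFF;
constexpr std::uint16_t iam_change_first   = 0x2050;  // old change-management range
constexpr std::uint16_t iam_change_last    = 0x2061;
constexpr std::uint16_t dq_change_first    = 0x6070;  // new change-management range

// True when a message id falls inside the DQ subsystem range.
constexpr bool is_dq_message(std::uint16_t id) {
    return id >= dq_subsystem_first && id <= dq_subsystem_last;
}

// Assumption: messages keep their relative order when moved, so the new id
// is the old offset applied to the new base. Ids outside the old range
// yield std::nullopt.
constexpr std::optional<std::uint16_t> remap_change_message(std::uint16_t old_id) {
    if (old_id < iam_change_first || old_id > iam_change_last) return std::nullopt;
    return static_cast<std::uint16_t>(dq_change_first + (old_id - iam_change_first));
}

// Both ranges span the same number of ids, so the last old id maps onto
// the last new id quoted in the table.
static_assert(remap_change_message(0x2061) == std::uint16_t{0x6081});
```

The static_assert only confirms that the two quoted ranges have the same width. Whether the real protocol preserves message order across the move is not stated on this page.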

Decisions

  • FpML as the reference-data anchor: industry vocabulary by default; avoids inventing our own terminology for currencies and business centers.
  • Lean into the librarian metaphor: Catalog, Volume, Formula, Accession Card, Stacks; the naming pays back in the mental model.
  • Change-management migrates to the DQ subsystem: it is data-quality machinery, not identity machinery; the IAM home was wrong.
  • Boost.Graph for dependency resolution: a well-tested topological sort is cheaper than rolling our own; we will consume more graph algorithms as the system grows.
  • Fake-world dataset as smoke test: deliberately synthetic so we can validate the infrastructure without licensing concerns.
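The dependency-resolution idea behind the publication wizard can be sketched as a topological sort. The story uses Boost.Graph for this; to stay self-contained, the sketch below substitutes Kahn's algorithm written against the standard library only. The function name publication_order, the dependency_map type, and the dataset codes in the usage note are hypothetical.

```cpp
#include <map>
#include <optional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical input: dataset code -> codes of the datasets it depends on.
using dependency_map = std::map<std::string, std::vector<std::string>>;

// Returns an order in which every dataset appears after its dependencies,
// or std::nullopt when the dependencies contain a cycle.
std::optional<std::vector<std::string>> publication_order(const dependency_map& deps) {
    std::map<std::string, int> pending;  // unresolved dependency count per dataset
    std::map<std::string, std::vector<std::string>> dependents;
    for (const auto& [code, needs] : deps) {
        pending[code];  // make sure the dataset is registered
        for (const auto& dep : needs) {
            pending[dep];  // a dependency is a dataset too, even if not listed as a key
            ++pending[code];
            dependents[dep].push_back(code);
        }
    }

    // Start with every dataset that has no unresolved dependency.
    std::queue<std::string> ready;
    for (const auto& [code, count] : pending)
        if (count == 0) ready.push(code);

    std::vector<std::string> order;
    while (!ready.empty()) {
        const std::string code = ready.front();
        ready.pop();
        order.push_back(code);
        // Publishing this dataset resolves one dependency of each dependent.
        for (const auto& dependent : dependents[code])
            if (--pending[dependent] == 0) ready.push(dependent);
    }

    // Datasets left unpublished are part of, or blocked by, a cycle.
    if (order.size() != pending.size()) return std::nullopt;
    return order;
}
```

With hypothetical codes where currencies depends on countries and accounts depends on both, the function places countries first and accounts last. A cycle yields std::nullopt, which a wizard could surface as an error before writing anything.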

Out of scope

  • Real-world data acquisition (vendor integrations) — separate workstream.
  • DQ scoring / quality gates — covered conceptually but not implemented.
