Service Account RBAC Design
Table of Contents
- Context
- Phase 1 — PostgreSQL GRANT-Level Permissioning
- Phase 2 — NATS RBAC
- Files Changed
- Phase 3 — Mutual TLS for NATS (per-service client certificates)
- Phase 4 — Rename comms_user to shell_user
- Open Questions
Context
The service-to-service authentication PR (#574) established the JWT plumbing: each backend service now authenticates with IAM at startup, receives a JWT, and attaches it as a Bearer token to every outbound NATS request.
However, the receiving side performs only cryptographic validation (RS256 signature + expiry). Any service holding a valid JWT can call any NATS subject. This plan adds fine-grained RBAC at two orthogonal levels:
- PostgreSQL GRANTs — each service runs as a dedicated DB user; PostgreSQL enforces which tables that user may read or write directly.
- NATS RBAC — each service account carries a role in its JWT; handlers enforce which subjects (cross-service calls) the caller is authorised to invoke.
These two layers are complementary, not duplicated. DB GRANTs protect direct database access (a compromised service cannot `UPDATE` a table it only reads). NATS RBAC protects the service mesh (a compromised service cannot trigger operations in other services it has no business calling).
Guiding Principles
- One role per service. Each service account gets a dedicated role (role_iam_service, role_reporting_service, etc.) containing exactly the permissions that service needs. No shared wildcard roles.
- Principle of least privilege. A service gets read on tables it reads, write on tables it owns, delete only where it genuinely deletes records. Cross-service calls grant only the specific subjects called, not a broad component wildcard.
- No backwards compatibility. The existing permission seed data is replaced wholesale. All permission codes, roles, and assignments are re-seeded from scratch.
- Same validation path for humans and services. No special-casing in handlers. A check_permission() call in the handler body works identically for both.
- Roles travel in the JWT. The service_login response already returns a JWT; we add the account's roles to claims.roles so handlers can validate permissions without an extra database round-trip per request.
- Short-lived service tokens. Service account JWTs use the same short TTL as user tokens and the same reactive refresh mechanism. There is no special long-lived service token.
Phase 1 — PostgreSQL GRANT-Level Permissioning
Model
Each service has a dedicated PostgreSQL role/user (e.g. ores_reporting,
ores_scheduler). PostgreSQL GRANTs are the enforcement mechanism: the DB
kernel rejects any DML that exceeds the GRANT, independent of the application.
The DB user is the service's database.user config value (already used for
connection pooling).
GRANT matrix
The table below lists the GRANTs each service holds. "Own component"
means full SELECT/INSERT/UPDATE/DELETE on all ores_<S>_* tables.
"Cross-component" entries are tracked violations with a phase target in
the isolation plan.
Cross-component tenant and change-reason validation no longer requires
SELECT grants: ores_iam_validate_tenant_fn and
ores_dq_validate_change_reason_fn are SECURITY DEFINER functions that
run as the DDL owner. PostgreSQL FK enforcement similarly requires no
SELECT on the referenced table at runtime.
| Service DB user | Own component | Cross-component (reason / phase) |
|---|---|---|
| ores_iam | ores_iam_* | SELECT variability.system_settings (IAM settings / Ph 4.1); DML refdata.parties (provisioning / Ph 4.2–4.3) |
| ores_refdata | ores_refdata_* | — |
| ores_dq | ores_dq_* | — |
| ores_variability | ores_variability_* | — |
| ores_assets | ores_assets_* | — |
| ores_scheduler | ores_scheduler_* | — |
| ores_reporting | ores_reporting_* | — |
| ores_trading | ores_trading_* | SELECT ores_refdata_* (trigger validation / Ph 5.1) |
| ores_compute | ores_compute_* | — |
| ores_telemetry | ores_telemetry_* | — |
| ores_workflow | ores_workflow_* | DML ores_iam_*, DML ores_refdata_parties (onboarding / Ph 5.2) |
| ores_ore | ores_ore_* | DML ores_workflow_* (job queue / Ph 5.3) |
| ores_marketdata | ores_marketdata_* | — |
| ores_controller | ores_controller_* | — |
| ores_analytics | ores_analytics_* | — |
| ores_synthetic | tooling role | SELECT all schemas (developer tool — permanent exception) |
Implementation Steps
Step 1a — Create per-service DB users
New script: ores.sql/create/iam/iam_service_db_users_create.sql
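Creates one PostgreSQL role per service (ores_iam, ores_refdata, …).
Idempotent: PostgreSQL has no CREATE ROLE IF NOT EXISTS clause, so each
CREATE ROLE is wrapped in a DO block that first checks pg_roles for the role
name.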
Step 1b — GRANT scripts
New script: ores.sql/create/iam/iam_service_db_grants_create.sql
Encodes the GRANT matrix above. Uses GRANT SELECT, INSERT, UPDATE, DELETE ON
ALL TABLES IN SCHEMA <schema> TO <role> for owned schemas, and explicit
GRANT SELECT ON <schema>.<table> TO <role> for shared reads.
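As an illustration of what Steps 1a and 1b might emit, the sketch below builds the SQL text for one service. It is a minimal sketch under stated assumptions, not the project's script: the function names are hypothetical, the LOGIN attribute is assumed, and the role and schema names are treated as trusted constants rather than user input. Role creation is guarded by a pg_roles lookup because CREATE ROLE has no IF NOT EXISTS clause.

```cpp
#include <string>

// Hypothetical helper: idempotent role creation. The existence check against
// pg_roles stands in for the missing "IF NOT EXISTS" clause on CREATE ROLE.
// LOGIN is an assumption of this sketch.
std::string create_role_sql(const std::string& role) {
    return "DO $$\n"
           "BEGIN\n"
           "    IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = '" + role + "') THEN\n"
           "        CREATE ROLE " + role + " LOGIN;\n"
           "    END IF;\n"
           "END\n"
           "$$;\n";
}

// Hypothetical helper: full DML on everything in a schema the service owns.
std::string grant_owned_sql(const std::string& schema, const std::string& role) {
    return "GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA "
           + schema + " TO " + role + ";\n";
}

// Hypothetical helper: read-only access to a single shared table.
std::string grant_read_sql(const std::string& table, const std::string& role) {
    return "GRANT SELECT ON " + table + " TO " + role + ";\n";
}
```

For example, `grant_read_sql("variability.system_settings", "ores_iam")` produces the statement for the IAM cross-component read listed in the matrix. Note that GRANT ... ON ALL TABLES IN SCHEMA only affects tables that exist when the statement runs.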
Step 1c — Update service configs
Each service's database config points database.user to its dedicated DB user.
This is an env/config change, not a code change.
Phase 2 — NATS RBAC
Permission Scheme
Permissions follow the pattern component:resource:action where action is one
of read, write, delete. A wildcard component:* grants all actions on
all resources within a component (used only for the component's own service
account).
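The matching rule implied by this scheme can be sketched as a segment-wise comparison. This is only a sketch: the function names are hypothetical, and the wildcard semantics (a `*` segment matches one segment, a trailing `*` matches everything that follows) are an assumption, not a description of the existing authorization_service.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Split a permission code such as "iam:accounts:read" on ':'.
std::vector<std::string_view> split_permission(std::string_view code) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = code.find(':', start);
        if (pos == std::string_view::npos) {
            parts.push_back(code.substr(start));
            return parts;
        }
        parts.push_back(code.substr(start, pos - start));
        start = pos + 1;
    }
}

// Returns true if `granted` covers `required`.
// Assumed rules (hypothetical, not confirmed by the existing code):
//   - a '*' segment matches exactly one segment ("refdata:*:read");
//   - a trailing '*' matches all remaining segments ("refdata:*").
bool permission_matches(std::string_view granted, std::string_view required) {
    const auto g = split_permission(granted);
    const auto r = split_permission(required);
    for (std::size_t i = 0; i < g.size(); ++i) {
        const bool last = (i + 1 == g.size());
        if (g[i] == "*" && last) return i < r.size();   // trailing wildcard
        if (i >= r.size()) return false;                // granted is longer
        if (g[i] != "*" && g[i] != r[i]) return false;  // literal mismatch
    }
    return g.size() == r.size();
}
```

Under these assumed rules, `refdata:*` covers `refdata:currencies:read`, while `refdata:*:read` covers reads on any refdata resource but not writes.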
IAM
| Permission code | Meaning |
|---|---|
| iam:accounts:read | List / fetch accounts |
| iam:accounts:write | Create / update accounts |
| iam:accounts:delete | Delete accounts |
| iam:roles:read | List / fetch roles |
| iam:roles:write | Create / update roles |
| iam:roles:delete | Delete roles |
| iam:permissions:read | List / fetch permissions |
| iam:permissions:write | Create / update permissions |
| iam:permissions:delete | Delete permissions |
| iam:tenants:read | List / fetch tenants (used by all svcs) |
| iam:tenants:write | Create / update tenants |
| iam:tenants:delete | Delete tenants |
| iam:sessions:read | List / fetch sessions |
| iam:sessions:write | Create sessions (login / service-login) |
Reference Data
| Permission code |
|---|
| refdata:countries:read |
| refdata:countries:write |
| refdata:countries:delete |
| refdata:currencies:read |
| refdata:currencies:write |
| refdata:currencies:delete |
| refdata:parties:read |
| refdata:parties:write |
| refdata:parties:delete |
| refdata:counterparties:read |
| refdata:counterparties:write |
| refdata:counterparties:delete |
| refdata:books:read |
| refdata:books:write |
| refdata:books:delete |
| refdata:portfolios:read |
| refdata:portfolios:write |
| refdata:portfolios:delete |
| refdata:business-units:read |
| refdata:business-units:write |
| refdata:business-units:delete |
| (all other refdata resources follow the same pattern) |
Scheduler
| Permission code | Meaning |
|---|---|
| scheduler:job-definitions:read | List / fetch job definitions |
| scheduler:job-definitions:write | Schedule / update jobs |
| scheduler:job-definitions:delete | Unschedule jobs |
Reporting
| Read | Write | Delete |
|---|---|---|
| reporting:report-definitions:read | :write | :delete |
| reporting:report-instances:read | :write | :delete |
| reporting:report-types:read | :write | :delete |
| reporting:concurrency-policies:read | :write | :delete |
Data Quality
| Read | Write | Delete |
|---|---|---|
| dq:change-reasons:read | :write | :delete |
| dq:datasets:read | :write | :delete |
| (all other dq resources follow the same pattern) | | |
Other Components
| Component | Permission pattern |
|---|---|
| Variability | variability:settings:read/write/delete |
| Assets | assets:images:read/write/delete |
| Trading | trading:instruments:read/write/delete, trading:trades:read/write/delete |
| Compute | compute:apps:read/write/delete, compute:batches:read/write/delete, compute:hosts:read/write |
| Telemetry | telemetry:logs:write, telemetry:samples:write |
| Synthetic | synthetic:* (wildcard; generates data) |
Role → Permission Assignment
Each service gets one role. The role contains:
- A component:* wildcard for the component the service owns
- Individual read permissions for shared/cross-cutting data it reads
- Individual permissions for specific outbound NATS subjects it calls
| Service account | Role | Own component | Shared reads | Outbound call permissions |
|---|---|---|---|---|
| iam_service | role_iam_service | iam:* | — | — |
| refdata_service | role_refdata_service | refdata:* | iam:tenants:read | — |
| dq_service | role_dq_service | dq:* | iam:tenants:read | — |
| variability_service | role_variability_service | variability:* | iam:tenants:read | — |
| assets_service | role_assets_service | assets:* | iam:tenants:read | — |
| scheduler_service | role_scheduler_service | scheduler:* | iam:tenants:read, dq:change-reasons:read | — |
| reporting_service | role_reporting_service | reporting:* | iam:tenants:read, dq:change-reasons:read | iam:tenants:read, scheduler:job-definitions:write, scheduler:job-definitions:delete |
| trading_service | role_trading_service | trading:* | iam:tenants:read, refdata:*:read, dq:change-reasons:read | — |
| compute_service | role_compute_service | compute:* | iam:tenants:read, refdata:parties:read | — |
| telemetry_service | role_telemetry_service | telemetry:* | iam:tenants:read | — |
| synthetic_service | role_synthetic_service | synthetic:* | iam:tenants:read, most components :read | most services :read |
Implementation Steps
Step 2a — SQL: permissions, roles, assignments
New populate scripts (replacing existing permission seed data):
- iam_permissions_populate.sql: Drop all existing permission rows and re-seed using the full component:resource:action scheme above.
- iam_roles_populate.sql: Drop all existing roles and re-seed one role per service (role_iam_service, role_refdata_service, …).
- iam_role_permissions_populate.sql: Assign permissions to roles per the matrix above.
- iam_service_account_roles_populate.sql: Assign each service account its role. This script runs after iam_service_accounts_populate.sql.
All scripts are idempotent (upsert / on conflict do nothing).
Step 2b — service_login JWT includes roles
File: projects/ores.iam.core/include/ores.iam.core/messaging/auth_handler.hpp
After start_service_session() succeeds, fetch the account's roles:

    auto roles = authz_svc.get_account_roles(sess->account_id);
    std::vector<std::string> role_names;
    for (const auto& r : roles)
        role_names.push_back(r.name);
    claims.roles = std::move(role_names);

The interactive login handler already does this in the HTTP layer; mirror the
pattern here. No schema change required — jwt_claims.roles already exists.
Step 2c — request_context exposes roles
File: projects/ores.service/src/service/request_context.cpp
The context currently extracts tenant_id, username, party_id from the
JWT. Extend it to also extract claims.roles and store them so handlers can
read them without re-parsing the token.
File: projects/ores.service/include/ores.service/service/request_context.hpp
Add a roles field (std::vector<std::string>) to the context struct.
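A minimal sketch of the extended context is shown below. The `jwt_claims` and `request_context` types here are simplified stand-ins written for illustration, and `make_request_context` and `has_role` are hypothetical names; the real types in ores.service carry more state.

```cpp
#include <algorithm>
#include <string>
#include <string_view>
#include <vector>

// Simplified stand-in for the decoded JWT payload (assumption).
struct jwt_claims {
    std::string tenant_id;
    std::string username;
    std::string party_id;
    std::vector<std::string> roles;
};

// Simplified stand-in for request_context with the new roles field.
struct request_context {
    std::string tenant_id;
    std::string username;
    std::string party_id;
    std::vector<std::string> roles;  // copied once from claims.roles

    // Hypothetical convenience accessor for handlers.
    bool has_role(std::string_view role) const {
        return std::find(roles.begin(), roles.end(), role) != roles.end();
    }
};

// Build the context once per request so handlers never re-parse the token.
request_context make_request_context(const jwt_claims& claims) {
    return request_context{claims.tenant_id, claims.username,
                           claims.party_id, claims.roles};
}
```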
Step 2d — require_permission() helper in handler_helpers
File: projects/ores.service/include/ores.service/messaging/handler_helpers.hpp
Add a helper that takes the request_context and a required permission code,
calls authorization_service::check_permission(), and returns an error reply if
the check fails. Signature sketch:

    [[nodiscard]] bool require_permission(
        const request_context& ctx,
        std::string_view permission,
        ores::nats::service::client& nats,
        const ores::nats::message& msg);

Returns true if the caller has the permission; otherwise sends an
error_reply with error_code::forbidden and returns false, so the handler
can early-return cleanly.
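The early-return pattern can be sketched as follows. To keep the example self-contained, the NATS client and message parameters are replaced by two injected callbacks, so the signature differs from the one proposed above; `permission_checker`, `error_sender`, and `handle_save_job` are hypothetical names, and `request_context` is a simplified stand-in.

```cpp
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Simplified stand-in for the real request_context (assumption).
struct request_context {
    std::vector<std::string> roles;
};

// Stand-in for authorization_service::check_permission().
using permission_checker =
    std::function<bool(const std::vector<std::string>& roles,
                       std::string_view permission)>;

// Stand-in for sending an error_reply over NATS.
using error_sender = std::function<void(std::string_view error)>;

[[nodiscard]] bool require_permission(const request_context& ctx,
                                      std::string_view permission,
                                      const permission_checker& check,
                                      const error_sender& send_error) {
    if (check(ctx.roles, permission)) return true;
    send_error("forbidden");  // corresponds to error_code::forbidden
    return false;
}

// Hypothetical write handler: the check runs before any business logic.
bool handle_save_job(const request_context& ctx,
                     const permission_checker& check,
                     const error_sender& send_error) {
    if (!require_permission(ctx, "scheduler:job-definitions:write",
                            check, send_error))
        return false;  // the error reply has already been sent
    // ... business logic would run here ...
    return true;
}
```

Because the helper both sends the reply and reports the outcome, a handler needs only a single `if` at the top of its body.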
Step 2e — Add permission checks to NATS handlers
For each handler that writes (save, delete, trigger, etc.), add a
require_permission() call at the top of the handler body, before any business
logic. Read handlers can be gated too but are lower priority.
Start with the handlers that service accounts actually call:
- scheduler write handlers (called by reporting_service)
- iam tenants.list (called by reporting_service)
- reporting handlers (called by the scheduler trigger subject)
Roll out to remaining handlers in a follow-up.
Step 2f — Short-lived service tokens with reactive refresh
Service account tokens must use the same short TTL as user tokens (not a
long-lived 13-hour token). The make_service_token_provider in
ores.iam.client already supports proactive refresh via refresh_if_needed();
this step ensures:
- IAM issues service tokens with the standard user token TTL (configurable, default ~15 minutes).
- service_token_provider detects expiry (via X-Error: token_expired on the NATS response) and re-authenticates reactively, mirroring the interactive path in nats_client.
- The proactive background refresh margin is tuned to refresh well before expiry (e.g. at 80% of TTL).
This ensures a compromised service token has the same limited validity window as a user token.
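The proactive refresh margin can be illustrated with a small timing calculation. This is a sketch only: `token_lifetime`, `refresh_at`, `needs_refresh`, and `is_expired` are hypothetical names and do not describe the existing refresh_if_needed() implementation.

```cpp
#include <chrono>

using clock_type = std::chrono::system_clock;

// Hypothetical record of when a token was issued and how long it lives.
struct token_lifetime {
    clock_type::time_point issued_at;
    std::chrono::seconds ttl;
};

// Point in time at which a proactive refresh should start, expressed as a
// fraction of the TTL (0.8 means "refresh once 80% of the TTL has elapsed").
clock_type::time_point refresh_at(const token_lifetime& t, double fraction = 0.8) {
    const auto delay =
        std::chrono::duration_cast<std::chrono::seconds>(t.ttl * fraction);
    return t.issued_at + delay;
}

bool needs_refresh(const token_lifetime& t, clock_type::time_point now,
                   double fraction = 0.8) {
    return now >= refresh_at(t, fraction);
}

// Past this point only the reactive path (token_expired) can recover.
bool is_expired(const token_lifetime& t, clock_type::time_point now) {
    return now >= t.issued_at + t.ttl;
}
```

With a 15-minute TTL and the 80% margin, the proactive refresh would start 12 minutes after issuance, leaving 3 minutes of slack before the reactive path is needed.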
Files Changed
| File | Change |
|---|---|
| projects/ores.sql/create/iam/iam_service_db_users_create.sql | New — one PostgreSQL role per service |
| projects/ores.sql/create/iam/iam_service_db_grants_create.sql | New — GRANT matrix per Phase 1 |
| projects/ores.sql/populate/iam/iam_permissions_populate.sql | Replace with full permission scheme |
| projects/ores.sql/populate/iam/iam_roles_populate.sql | New — one role per service |
| projects/ores.sql/populate/iam/iam_role_permissions_populate.sql | New — role → permission assignments |
| projects/ores.sql/populate/iam/iam_service_account_roles_populate.sql | New — service account → role assignments |
| projects/ores.iam.core/include/ores.iam.core/messaging/auth_handler.hpp | Add roles to service_login JWT claims |
| projects/ores.service/include/ores.service/service/request_context.hpp | Add roles field |
| projects/ores.service/src/service/request_context.cpp | Extract roles from JWT claims |
| projects/ores.service/include/ores.service/messaging/handler_helpers.hpp | Add require_permission() helper |
| projects/ores.iam.client/src/client/service_token_provider.cpp | Reactive re-auth on token_expired; short TTL |
| projects/ores.*.core/include/.../messaging/*_handler.hpp (write handlers) | Add require_permission() calls |
Phase 3 — Mutual TLS for NATS (per-service client certificates)
Goal
Encrypt all NATS traffic and cryptographically authenticate every service at the transport layer. After this phase a service cannot connect to the NATS broker at all unless it presents a valid client certificate issued by the project CA — independent of JWT-level authentication. This gives defence in depth: even if a JWT is leaked, it cannot be used from a host that does not hold the corresponding private key.
Model
A single internal CA (ores-ca) issues:
- One server certificate for the NATS broker.
- One client certificate per service (ores.iam.service, ores.reporting.service, …).
The CA certificate is the only trust anchor distributed to all parties. No public CA is involved; everything is self-contained within the deployment.
Certificates use 4096-bit RSA or P-256 ECDSA (preferred — smaller, faster). Validity: 1 year for the CA, 90 days for leaf certificates, rotated by the certificate generation script (see Step 3a).
Why per-service keys (not one shared client cert)?
- Revocation is surgical: compromising one service does not require rotating every other service's certificate.
- Audit logs can attribute connections to a specific service identity at the TLS layer, independently of the JWT claim.
- Aligns with the principle of least privilege already established for DB users and NATS RBAC roles.
Implementation Steps
Step 3a — CA and certificate generation script
File: build/scripts/generate_nats_certs.sh
A script (following the same bash-wrapper-over-python pattern) that:
- Creates the build/keys/nats/ directory (git-ignored once the existing build/keys/*.pem / build/keys/*.key rules are extended to cover it).
- Generates the internal CA (ca.key, ca.crt) if not already present.
- Generates a server keypair (nats-server.key, nats-server.crt) signed by the CA, with SAN localhost and the deployment hostname.
- For each service name in a hardcoded list, generates a client keypair (<service>.key, <service>.crt) signed by the CA, with CN set to the service name (e.g. ores.reporting.service).
The script is idempotent: existing files are not overwritten unless
--force is passed. This allows certs to be regenerated on rotation without
accidentally overwriting a key that is still in use.
Certificates and keys are written under build/keys/nats/ and are never
committed. The existing build/keys/*.pem gitignore rule matches only files
directly under build/keys/, so it must be extended to cover build/keys/nats/.
Step 3b — NATS server configuration
File: build/config/nats.conf (new)

    port: 4222
    tls {
      cert_file: "build/keys/nats/nats-server.crt"
      key_file:  "build/keys/nats/nats-server.key"
      ca_file:   "build/keys/nats/ca.crt"
      verify:    true   # require client certificates (mTLS)
      timeout:   5
    }

The verify: true field enables mutual TLS: the broker rejects any connection
that does not present a certificate signed by ca.crt.
Update build/scripts/start-services.sh to pass --config build/config/nats.conf
when launching nats-server. The URL passed to services changes from
nats://localhost:4222 to tls://localhost:4222.
Step 3c — nats_options gains TLS fields
File: projects/ores.nats/include/ores.nats/config/nats_options.hpp

    struct nats_options final {
        std::string url = "nats://localhost:4222";
        std::string subject_prefix;
        // mTLS — all three must be set together or all left empty.
        std::string tls_ca_cert;      // path to CA certificate (ca.crt)
        std::string tls_client_cert;  // path to client certificate (<service>.crt)
        std::string tls_client_key;   // path to client private key (<service>.key)
    };

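The "all three set together or all left empty" rule stated in the struct comment could be checked at startup. The sketch below repeats the struct so that it compiles on its own; `tls_mode` and `classify_tls` are hypothetical names, not part of ores.nats.

```cpp
#include <string>

struct nats_options final {
    std::string url = "nats://localhost:4222";
    std::string subject_prefix;
    std::string tls_ca_cert;      // path to CA certificate
    std::string tls_client_cert;  // path to client certificate
    std::string tls_client_key;   // path to client private key
};

// Hypothetical classification of the TLS-related fields.
enum class tls_mode { disabled, mutual, invalid };

// Counts how many of the three TLS paths are set: none means plain NATS,
// all three means mTLS, anything in between is a configuration error.
tls_mode classify_tls(const nats_options& o) {
    const int set = !o.tls_ca_cert.empty()
                  + !o.tls_client_cert.empty()
                  + !o.tls_client_key.empty();
    if (set == 0) return tls_mode::disabled;
    if (set == 3) return tls_mode::mutual;
    return tls_mode::invalid;
}
```

Rejecting the `invalid` case early would avoid a service silently connecting with a CA configured but no client certificate, which the broker would refuse anyway under verify: true.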
Step 3d — client.cpp applies TLS options
File: projects/ores.nats/src/service/client.cpp
After natsOptions_SetURL, add:

    if (!impl_->opts.tls_ca_cert.empty()) {
        natsOptions_SetSecure(opts, true);
        natsOptions_LoadCATrustedCertificates(opts,
            impl_->opts.tls_ca_cert.c_str());
        natsOptions_LoadCertificatesChain(opts,
            impl_->opts.tls_client_cert.c_str(),
            impl_->opts.tls_client_key.c_str());
    }

The nats.c library's TLS API (natsOptions_SetSecure,
natsOptions_LoadCATrustedCertificates, natsOptions_LoadCertificatesChain)
maps directly to these fields. No new library dependency is required.
Step 3e — nats_configuration reads TLS fields from CLI / env
File: projects/ores.nats/src/config/nats_configuration.cpp
Add three new CLI options (--nats-tls-ca, --nats-tls-cert,
--nats-tls-key) and corresponding environment variable fallbacks
(ORES_NATS_TLS_CA, ORES_NATS_TLS_CERT, ORES_NATS_TLS_KEY) following the
same pattern as --nats-url.
The init-environment.sh script populates these variables per-service,
pointing each service at its own keypair under build/keys/nats/.
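The CLI-then-environment precedence might look like the sketch below. `resolve_option` is a hypothetical helper written for illustration; the actual parsing in nats_configuration.cpp may be structured differently.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: an explicit CLI value wins; otherwise fall back to the
// named environment variable; otherwise return an empty string (TLS off).
std::string resolve_option(const std::optional<std::string>& cli_value,
                           const char* env_name) {
    if (cli_value && !cli_value->empty()) return *cli_value;
    if (const char* env = std::getenv(env_name)) return env;
    return {};
}
```

For example, `resolve_option(std::nullopt, "ORES_NATS_TLS_CA")` would pick up the path exported by init-environment.sh when --nats-tls-ca is not given on the command line.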
Step 3f — CI key generation
Add a step to the CI workflow (and init-environment.sh) that calls
generate_nats_certs.sh before starting the NATS server. In CI, the
--force flag regenerates keys on every run (ephemeral). In developer
environments, keys are generated once and reused.
Files Changed
| File | Change |
|---|---|
| build/scripts/generate_nats_certs.sh | New — CA + per-service cert generation |
| build/config/nats.conf | New — NATS server config with mTLS |
| build/scripts/start-services.sh | Pass --config to nats-server; use tls:// URL |
| build/scripts/init-environment.sh | Add ORES_NATS_TLS_* env vars per service |
| .gitignore | Extend to cover build/keys/nats/ |
| projects/ores.nats/include/.../nats_options.hpp | Add tls_ca_cert, tls_client_cert, tls_client_key |
| projects/ores.nats/src/service/client.cpp | Apply TLS options via natsOptions_* API |
| projects/ores.nats/src/config/nats_configuration.cpp | Parse TLS CLI flags / env vars |
Open Questions
- Certificate rotation automation. 90-day leaf certs require rotation before expiry. For developer environments a manual generate_nats_certs.sh re-run suffices. In production, consider certbot or a Vault PKI backend. Out of scope for this phase; document the manual rotation procedure.
- NATS subject-level authorisation via NKey. NATS also supports NKey-based identity and accounts blocks in the server config for subject-level access control. This is an alternative to JWT-based RBAC at the NATS layer. Evaluate whether NKey accounts would replace or complement Phase 2 in a follow-up.
Phase 4 — Rename comms_user to shell_user
Context
The database user ores_<env>_comms_user (and the corresponding
ORES_DB_COMMS_USER / ORES_DB_COMMS_PASSWORD env vars) is used exclusively
by the interactive shell (ores.shell), which reads credentials via
make_mapper("COMMS_SHELL"). The name "comms" is misleading — it refers to the
binary comms protocol the shell uses internally, not to any communications
service. Renaming it to shell_user makes the purpose immediately obvious.
Scope
This is a pure rename — no schema changes, no privilege changes. The new user
gets the same rw_role membership that comms_user has today.
Implementation Steps
Step 4a — SQL: rename variables
In every SQL and shell script that references comms_user / comms_password:
- projects/ores.sql/setup_user.sql: rename variable and validation block
- projects/ores.sql/recreate_database.sql: rename comms_user blocks
- projects/ores.sql/recreate_database.sh: rename variable names and help text
- projects/ores.sql/setup_database.sh: rename -v comms_user flag
- projects/ores.sql/setup_schema.sql: rename if referenced (currently not)
- projects/ores.sql/drop_roles.sql: rename comms_user entry in user array
- projects/ores.sql/populate/iam/iam_service_accounts_populate.sql: rename the upsert call from :'comms_user' to :'shell_user'
Step 4b — Init script and .env
In build/scripts/init-environment.sh:
- Rename ORES_DB_COMMS_USER → ORES_DB_SHELL_USER
- Rename ORES_DB_COMMS_PASSWORD → ORES_DB_SHELL_PASSWORD
- Rename the emitted section from ORES_COMMS_SHELL_DB_* → ORES_SHELL_DB_USER, ORES_SHELL_DB_PASSWORD, ORES_SHELL_DB_DATABASE
Step 4c — C++ config
In projects/ores.shell/src/config/parser.cpp:
Change make_mapper("COMMS_SHELL") to make_mapper("SHELL") so the
binary reads ORES_SHELL_DB_* from the environment.
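The effect of the prefix change can be modelled with a small name-building function. This is a hypothetical model of the mapping, inferred from the variable names in this plan; `env_var_name` is not the real make_mapper API.

```cpp
#include <string>
#include <string_view>

// Hypothetical model: a mapper prefix plus a key yields ORES_<PREFIX>_<KEY>.
std::string env_var_name(std::string_view mapper_prefix, std::string_view key) {
    std::string name = "ORES_";
    name += mapper_prefix;
    name += '_';
    name += key;
    return name;
}
```

Under this model, switching the prefix from "COMMS_SHELL" to "SHELL" changes the lookup for the DB user from ORES_COMMS_SHELL_DB_USER to ORES_SHELL_DB_USER, which is why Step 4b must rename the emitted variables in the same change.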
Step 4d — Documentation and recipes
Update all references in:
- projects/ores.sql/modeling/database_lifecycle.org
- doc/recipes/shell_recipes.org (ORES_COMMS_SHELL_LOGIN_PASSWORD → ORES_SHELL_LOGIN_PASSWORD; ores_comms_user → ores_shell_user)
- Any .claude/skills or doc/llm/skills files referencing COMMS_SHELL
Files Changed
| File | Change |
|---|---|
| projects/ores.sql/setup_user.sql | comms_user / comms_password → shell_user / shell_password |
| projects/ores.sql/recreate_database.sql | Same rename |
| projects/ores.sql/recreate_database.sh | Same rename + help text |
| projects/ores.sql/setup_database.sh | Same rename |
| projects/ores.sql/drop_roles.sql | Same rename |
| projects/ores.sql/populate/iam/iam_service_accounts_populate.sql | comms_user → shell_user upsert parameter |
| build/scripts/init-environment.sh | ORES_DB_COMMS_* → ORES_DB_SHELL_* |
| projects/ores.shell/src/config/parser.cpp | make_mapper("COMMS_SHELL") → make_mapper("SHELL") |
| doc/recipes/shell_recipes.org | Update all COMMS_SHELL / comms_user refs |
| projects/ores.sql/modeling/database_lifecycle.org | Update user table |
Open Questions
- Wildcard matching in check_permission(). The existing implementation may support only a literal * permission (grants everything) but not prefix wildcards like refdata:* matching refdata:currencies:read. This needs investigation before component:* roles are useful at the NATS layer. If prefix wildcards are not supported, the authorization_service must be extended.
- HTTP server single-account blast radius. The HTTP server currently runs as a single service account that fronts all user-facing API calls. Under the new permission scheme, this account would need to hold every permission that any human user could exercise — effectively a super-account with a very large blast radius. Several options exist:
a. Keep a single HTTP account with all permissions. Simple to implement but a compromised HTTP layer has full access to every service. Acceptable only if the HTTP layer is fully trusted (e.g., in a private network behind a gateway).
b. Pass-through caller identity. The HTTP server forwards the human user's JWT (or a derived token carrying the user's roles) into the NATS request rather than substituting its own identity. Handlers then check the forwarded identity. Cleaner blast-radius model but requires protocol changes.
c. Dedicated per-endpoint HTTP sub-accounts. Heavy operational overhead; unlikely to be worth it.
The recommended approach is (b), but it requires more design work. For the initial implementation, option (a) is acceptable as a stepping stone provided it is explicitly documented and the HTTP account is treated as high-privilege.
- Roles in refresh JWT. Roles are embedded at login time and persist until the token expires. If a service account's role is changed mid-session, the in-flight JWT still carries the old roles until the next reactive refresh. With short-lived tokens this window is small (~15 minutes). Acceptable; document the assumption explicitly.