Service-to-Service Authentication Design
Table of Contents
- Context
- Architecture Overview
- Infrastructure Already in Place
- What Needs to Be Built
- Configuration Changes
- Security Considerations
- Roles for Service Accounts
- Rollout Order
- Affected Files
Context
All NATS write handlers in the system require a valid RS256 JWT in the
Authorization: Bearer <token> header. The make_request_context() helper
validates the token and stamps domain objects with tenant_id / modified_by
from the claims, providing an auditable chain of custody.
Until now, only the Qt shell client authenticates before making NATS calls.
When one backend service calls another (e.g. the reporting service scheduling a
job via the scheduler service) no JWT is attached, so the receiving handler
returns an error_reply with an empty body instead of a typed response. The
caller cannot parse an empty body as the expected response type and logs a
"parse error".
The temporary workaround on branch feature/compute-reporting removed JWT
validation from the scheduler's write handlers. This plan replaces that
workaround with a proper service-to-service authentication mechanism.
Guiding Principles
- Reuse existing infrastructure. Service accounts are already seeded in ores_iam_accounts_tbl. The RS256 JWT signer already lives in the IAM service. The service_session_service already creates sessions for non-user accounts. We build on top of this, not beside it.
- No privileged short-circuit paths. Every NATS write goes through the same make_request_context() / JWT validation path, regardless of whether the caller is a human or a service. The scheduler does not know or care that its caller is the reporting service.
- Credentials from the environment. Each service already has a dedicated Postgres DB user and password in .env. We use the DB password as the shared secret that proves a service's identity to IAM. This avoids a separate credential management system.
- Token refresh, not re-login. Services acquire a JWT at startup and refresh it before it expires using the existing iam.v1.auth.refresh endpoint, exactly as the shell client does.
Architecture Overview
    Service (e.g. reporting)                    IAM Service
    ──────────────────────────                  ────────────
    startup:
      send iam.v1.auth.service-login ─────────► validate DB password hash
      (username, db_password)                   start_service_session()
      ◄── JWT token ─────────────────────────── sign + return JWT

    every outbound NATS call:
      request_sync(subject, body,
                   {"Authorization": "Bearer <jwt>"})

    before token expires:
      send iam.v1.auth.refresh ───────────────► validate (allow-expired)
      ◄── new JWT ───────────────────────────── re-sign
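The header contract in the diagram can be sketched as two small helpers. One builds the Authorization header on the calling side. The other extracts the raw token on the receiving side, before signature validation. This is a minimal sketch that assumes headers travel as a string-to-string map. The names header_map, make_auth_headers() and extract_bearer_token() are hypothetical and are not the real ores::nats or make_request_context() API.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical header container; the real ores::nats client may use another type.
using header_map = std::map<std::string, std::string>;

// Caller side: the header attached to every outbound inter-service request.
header_map make_auth_headers(const std::string& jwt) {
    assert(!jwt.empty() && "acquire a JWT before making outbound calls");
    return {{"Authorization", "Bearer " + jwt}};
}

// Callee side: return the raw token, or std::nullopt when the header is
// missing, uses another scheme, or carries an empty token.
std::optional<std::string> extract_bearer_token(const header_map& headers) {
    constexpr std::string_view prefix = "Bearer ";
    const auto it = headers.find("Authorization");
    if (it == headers.end()) {
        return std::nullopt;
    }
    const std::string_view value = it->second;
    if (value.size() <= prefix.size() || value.substr(0, prefix.size()) != prefix) {
        return std::nullopt;
    }
    return std::string(value.substr(prefix.size()));
}
```

The receiver treats a missing header, a non-Bearer scheme and an empty token identically. This matches the no-short-circuit principle: there is no anonymous path into a write handler.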
Why DB password as credential?
Each service already connects to Postgres with a dedicated user and password
(ORES_REPORTING_SERVICE_DB_USER / ORES_REPORTING_SERVICE_DB_PASSWORD,
etc.). The password is already rotatable, stored in .env, and different per
environment. Rather than inventing a new secret store, we store a bcrypt hash
of the DB password in the corresponding IAM service account row during database
population. The service presents the plaintext password at startup; IAM checks
the bcrypt hash.
Infrastructure Already in Place
| Component | Location | Status |
|---|---|---|
| Service accounts seeded | ores.sql/populate/iam/iam_service_accounts_populate.sql | ✅ 18 accounts |
| service_session_service | ores.iam.core/service/service_session_service | ✅ Implemented |
| RS256 JWT signer | ores.security/jwt/jwt_authenticator | ✅ Implemented |
| JWT refresh endpoint | iam.v1.auth.refresh | ✅ Implemented |
| authenticated_request() helper | ores.shell/src/service/nats_session.cpp | ✅ Used by Qt |
| Service DB passwords | .env | ✅ Defined |
What Needs to Be Built
Phase 1 — IAM: service-login endpoint
1.1 Add service_password_hash column to service accounts
Add a nullable service_password_hash text column to
ores_iam_accounts_tbl. Only populated for account_type = 'service'.
The population script currently calls ores_iam_service_accounts_upsert_fn()
without a password. Extend the function to accept an optional password, bcrypt
it, and store the hash:
    create or replace function ores_iam_service_accounts_upsert_fn(
        p_db_user     text,
        p_username    text,
        p_description text,
        p_password    text default null  -- ← new
    ) returns void ...
1.2 Populate hashes from .env
Update iam_service_accounts_populate.sql to pass the DB password from a psql
variable:
    \set reporting_pw `echo $ORES_REPORTING_SERVICE_DB_PASSWORD`

    select ores_iam_service_accounts_upsert_fn(
        :'reporting_service_user',
        'reporting_service@system.ores',
        'System service account for Reporting NATS domain service',
        :'reporting_pw'
    );
The psql `echo $VAR` back-tick syntax expands the shell variable at execution
time, so the password never appears in a committed SQL file.
1.3 Add iam.v1.auth.service-login NATS subject
Add to ores.iam.api/messaging/login_protocol.hpp:
    #include <string>
    #include <string_view>

    struct service_login_request {
        static constexpr std::string_view nats_subject = "iam.v1.auth.service-login";
        std::string username;  // e.g. "reporting_service@system.ores"
        std::string password;  // plaintext DB password (loopback or TLS-protected NATS only)
    };

    struct service_login_response {
        bool success = false;
        std::string token;    // JWT on success
        std::string message;  // error description on failure
    };
1.4 Implement handler method auth_handler::service_login()
In ores.iam.core/messaging/auth_handler.hpp:
- Decode service_login_request.
- Look up the account by username; verify account_type != 'user'.
- Check bcrypt_verify(req.password, account.service_password_hash). Reject the request if the hash is null.
- Call service_session_service::start_service_session(account.id, "ores.service.binary").
- Build jwt_claims from the session (same fields as human login, but roles reflect the service account's assigned roles).
- Call signer_.create_token(claims) and return service_login_response{.success = true, .token = jwt}. On any failure, return a response with success = false and a message describing the failure.
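The service_login() steps above can be exercised without a database or a real signer by injecting each dependency as a callable. In this sketch, account, service_login_deps and start_session_and_sign are hypothetical stand-ins. start_session_and_sign folds the session, claims and signing steps into one call. bcrypt itself is not implemented here.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Simplified stand-ins for the real IAM types (all hypothetical).
struct account {
    std::string id;
    std::string account_type;                          // "user" or "service"
    std::optional<std::string> service_password_hash;  // null for human accounts
};

struct service_login_response {
    bool success = false;
    std::string token;    // JWT on success
    std::string message;  // error description on failure
};

struct service_login_deps {
    std::function<std::optional<account>(const std::string& username)> find_account;
    std::function<bool(const std::string& plain, const std::string& hash)> bcrypt_verify;
    // Folds start_service_session(), building jwt_claims and create_token().
    std::function<std::string(const std::string& account_id)> start_session_and_sign;
};

service_login_response service_login(const service_login_deps& deps,
                                     const std::string& username,
                                     const std::string& password) {
    assert(deps.find_account && deps.bcrypt_verify && deps.start_session_and_sign);
    service_login_response reply;  // success == false by default
    reply.message = "invalid service credentials";

    const std::optional<account> acct = deps.find_account(username);
    // Reject unknown names, human accounts, and service accounts without a hash.
    if (!acct || acct->account_type == "user" || !acct->service_password_hash) {
        return reply;
    }
    if (!deps.bcrypt_verify(password, *acct->service_password_hash)) {
        return reply;
    }
    reply.success = true;
    reply.token = deps.start_session_and_sign(acct->id);
    reply.message.clear();
    return reply;
}
```

This sketch returns the same message for unknown usernames, human accounts and wrong passwords, so callers cannot probe which service accounts exist. The design above does not prescribe that wording.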
Register in registrar.cpp:
    nats_.subscribe("iam.v1.auth.service-login",
        [this](auto msg) { handler_.service_login(std::move(msg)); });
Phase 2 — Shared helper: service_nats_client
Create ores.service/include/ores.service/messaging/service_nats_client.hpp (a
thin wrapper that every backend service can use):
    #include <chrono>
    #include <cstddef>
    #include <span>
    #include <string>
    #include <string_view>

    class service_nats_client {
    public:
        service_nats_client(
            ores::nats::service::client& nats,
            std::string username,  // e.g. "reporting_service@system.ores"
            std::string password,  // DB password from env
            std::chrono::seconds refresh_margin = std::chrono::seconds(60));

        // Blocking call at startup; throws on failure.
        void authenticate();

        // Refresh the JWT if needed, attach the Bearer header and call request_sync.
        ores::nats::message authenticated_request(
            std::string_view subject,
            std::span<const std::byte> body,
            std::chrono::milliseconds timeout = std::chrono::seconds(5));

    private:
        // Calls iam.v1.auth.refresh when the token expires within refresh_margin_.
        void refresh_if_needed();

        ores::nats::service::client& nats_;
        std::string username_;
        std::string password_;
        std::string jwt_;
        std::chrono::system_clock::time_point expires_at_;
        std::chrono::seconds refresh_margin_;
    };
authenticate() sends iam.v1.auth.service-login and stores the returned JWT.
authenticated_request() calls refresh_if_needed() (which fires
iam.v1.auth.refresh if within the margin window) then calls
nats_.request_sync(subject, body, {{"Authorization", "Bearer " + jwt_}}).
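The refresh_if_needed() decision reduces to one comparison: refresh when the current token expires within refresh_margin of now. The sketch below isolates that logic. It makes two assumptions so that the timing can be tested deterministically. First, now is passed in explicitly rather than read from system_clock internally. Second, the iam.v1.auth.refresh round-trip is modelled as a callable. token_cache is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <utility>

class token_cache {
public:
    using clock = std::chrono::system_clock;
    // Assumed shape: sends iam.v1.auth.refresh with the current JWT and
    // returns the new JWT together with its expiry time.
    using refresh_fn =
        std::function<std::pair<std::string, clock::time_point>(const std::string&)>;

    token_cache(std::string jwt, clock::time_point expires_at,
                std::chrono::seconds refresh_margin, refresh_fn refresh)
        : jwt_(std::move(jwt)),
          expires_at_(expires_at),
          refresh_margin_(refresh_margin),
          refresh_(std::move(refresh)) {
        assert(refresh_margin_ > std::chrono::seconds::zero());
    }

    // Returns the current JWT, refreshing it first if it expires within
    // refresh_margin of `now`.
    const std::string& token(clock::time_point now) {
        if (now + refresh_margin_ >= expires_at_) {
            auto [jwt, expires_at] = refresh_(jwt_);
            jwt_ = std::move(jwt);
            expires_at_ = expires_at;
        }
        return jwt_;
    }

private:
    std::string jwt_;
    clock::time_point expires_at_;
    std::chrono::seconds refresh_margin_;
    refresh_fn refresh_;
};
```

With the default 1800 s token lifetime and the 60 s margin, the first refresh fires once 1740 s have elapsed since the token was issued. If one service_nats_client is shared across threads, the check-and-replace in token() would need a mutex; the sketch ignores concurrency.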
Phase 3 — Wire into outbound-calling services
Services that make authenticated NATS calls to other services hold a
service_nats_client instead of a raw ores::nats::service::client& for
inter-service calls.
Example: reporting service
In application.cpp:
    auto svc_client = ores::service::messaging::service_nats_client(
        nats,
        cfg.service_username,   // "reporting_service@system.ores"
        cfg.service_password);  // ORES_REPORTING_SERVICE_DB_PASSWORD
    svc_client.authenticate();
Pass svc_client to report_scheduling_service instead of the raw
nats::service::client. Inside schedule_one() replace:
    // before
    const auto reply_msg = nats_.request_sync(
        schedule_job_request::nats_subject, body);

    // after
    const auto reply_msg = svc_nats_.authenticated_request(
        schedule_job_request::nats_subject, body);
Services with outbound calls to update
| Calling Service | Callee | Handler to update |
|---|---|---|
| ores.reporting | ores.scheduler | report_scheduling_service |
| (future) Any service calling another | | |
Phase 4 — Restore JWT validation in scheduler write handlers
Revert the temporary workaround in
ores.scheduler.core/messaging/job_definition_handler.hpp: re-add
make_request_context() calls to schedule(), schedule_batch(), and
unschedule().
The reporting service will now supply a valid JWT, so those handlers will
authenticate successfully and stamp() will set tenant_id / modified_by
from the claims.
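The following simplified types show what stamp() gains from a real service JWT. This is a hypothetical sketch: jwt_claims, job_definition and their fields are stand-ins. Whether modified_by records the username or an account id is also an assumption here.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified shapes; the real jwt_claims and domain types differ.
struct jwt_claims {
    std::string subject;    // account username, human or service
    std::string tenant_id;
};

struct job_definition {
    std::string name;
    std::string tenant_id;
    std::string modified_by;
};

// Copy the audit fields from validated claims onto the domain object.
// With a service JWT, modified_by names the calling service account
// instead of being left empty by the removed workaround.
void stamp(job_definition& job, const jwt_claims& claims) {
    assert(!claims.subject.empty());
    job.tenant_id = claims.tenant_id;
    job.modified_by = claims.subject;
}
```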
Configuration Changes
New environment variables
No new variables needed. Re-use existing:
- ORES_REPORTING_SERVICE_DB_USER + ORES_REPORTING_SERVICE_DB_PASSWORD
- (future services follow the same pattern)
New config fields in service configuration structs
Add to each outbound-calling service's config:
    std::string service_username;  // e.g. "reporting_service@system.ores"
    std::string service_password;  // DB password
Read from environment in application.cpp alongside the existing DB options.
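Reading the credential can follow the usual fail-fast pattern, so a misconfigured service refuses to start instead of failing its first outbound call. In this sketch, require_env(), service_credentials and load_reporting_credentials() are hypothetical. The IAM username is assumed to be a fixed constant per service, consistent with no new environment variables being introduced.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a required variable or fail at startup.
std::string require_env(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        throw std::runtime_error(std::string("missing environment variable: ") + name);
    }
    return value;
}

struct service_credentials {
    std::string service_username;
    std::string service_password;
};

// The username is fixed per service (assumption); the password is the same
// value the service already uses for its Postgres connection.
service_credentials load_reporting_credentials() {
    return {"reporting_service@system.ores",
            require_env("ORES_REPORTING_SERVICE_DB_PASSWORD")};
}
```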
Security Considerations
- The DB password is sent as plaintext inside a NATS message. NATS operates over TLS in production (nats:// → tls://) and over loopback-only connections in development. This is acceptable for service-to-service calls on the same host or over encrypted transport.
- The hash stored in IAM is bcrypt with cost ≥ 12. A compromised IAM DB row does not immediately expose the credential.
- JWTs are short-lived (default 1800 s, same as user tokens). Compromised tokens expire quickly.
- service_password_hash is only set for account_type != 'user' accounts. The human login path is unchanged.
- A service account JWT carries the same claims structure as a human JWT. Downstream handlers apply the same RBAC rules. Service accounts must be assigned appropriate roles during seeding.
Roles for Service Accounts
Service accounts need roles to pass RBAC checks in downstream handlers. For the initial implementation:
| Service Account | Minimum Role |
|---|---|
| reporting_service@system.ores | system_service (read + write own domain) |
| scheduler_service@system.ores | system_service |
| (others) | system_service |
Add role assignments to iam_service_accounts_populate.sql.
Rollout Order
- IAM first: Implement Phase 1 (service-login endpoint + password hash). Deploy IAM. Verify with a manual NATS call.
- Shared helper: Implement Phase 2 (service_nats_client) with unit tests.
- Outbound callers: Implement Phase 3 service by service, starting with ores.reporting.
- Restore guards: Once all callers supply JWTs, implement Phase 4 (revert the temporary workaround in the scheduler's write handlers).
Affected Files
New files
- projects/ores.service/include/ores.service/messaging/service_nats_client.hpp
- projects/ores.service/src/messaging/service_nats_client.cpp
- projects/ores.service/tests/service_nats_client_tests.cpp
Modified files
- projects/ores.iam.api/include/ores.iam.api/messaging/login_protocol.hpp: add service_login_request / service_login_response
- projects/ores.iam.core/include/ores.iam.core/messaging/auth_handler.hpp: add service_login() method
- projects/ores.iam.core/src/messaging/registrar.cpp: register iam.v1.auth.service-login
- projects/ores.sql/create/iam/iam_functions_create.sql: extend ores_iam_service_accounts_upsert_fn()
- projects/ores.sql/populate/iam/iam_service_accounts_populate.sql: pass passwords
- projects/ores.reporting.service/src/app/application.cpp: create service_nats_client, pass to scheduling service
- projects/ores.reporting.core/src/service/report_scheduling_service.cpp: use authenticated_request()
- projects/ores.scheduler.core/include/ores.scheduler.core/messaging/job_definition_handler.hpp: restore make_request_context() in write handlers