Service-to-Service Authentication Design

Context
- Guiding Principles
Architecture Overview
- Why DB password as credential?
Infrastructure Already in Place
What Needs to Be Built
Configuration Changes
- New environment variables
- New config fields in service configuration structs
Security Considerations
Roles for Service Accounts
Rollout Order
Affected Files
- New files
- Modified files

Context

All NATS write handlers in the system require a valid RS256 JWT in the Authorization: Bearer <token> header. The make_request_context() helper validates the token and stamps domain objects with tenant_id / modified_by from the claims, providing an auditable chain of custody.

Until now, only the Qt shell client authenticates before making NATS calls. When one backend service calls another (e.g. the reporting service scheduling a job via the scheduler service) no JWT is attached, so the receiving handler returns an error_reply with an empty body instead of a typed response. The caller cannot parse an empty body as the expected response type and logs a "parse error".

The temporary workaround on branch feature/compute-reporting removed JWT validation from the scheduler's write handlers. This plan replaces that workaround with a proper service-to-service authentication mechanism.

Guiding Principles

Reuse existing infrastructure. Service accounts are already seeded in ores_iam_accounts_tbl. The RS256 JWT signer already lives in the IAM service. The service_session_service already creates sessions for non-user accounts. We build on top of this, not beside it.
No privileged short-circuit paths. Every NATS write goes through the same make_request_context() / JWT validation path, regardless of whether the caller is a human or a service. The scheduler does not know or care that its caller is the reporting service.
Credentials from the environment. Each service already has a dedicated Postgres DB user and password in .env. We use the DB password as the shared secret that proves a service's identity to IAM. This avoids a separate credential management system.
Token refresh, not re-login. Services acquire a JWT at startup and refresh it before it expires using the existing iam.v1.auth.refresh endpoint, exactly as the shell client does.

Architecture Overview

Service (e.g. reporting)           IAM Service
──────────────────────────         ────────────
startup:
  send iam.v1.auth.service-login ──► validate DB password hash
  (username, db_password)            start_service_session()
  ◄── JWT token ──────────────────── sign + return JWT

every outbound NATS call:
  request_sync(subject, body,
    {"Authorization": "Bearer <jwt>"})

before token expires:
  send iam.v1.auth.refresh ────────► validate (allow-expired)
  ◄── new JWT ─────────────────────── re-sign

Why DB password as credential?

Each service already connects to Postgres with a dedicated user and password (ORES_REPORTING_SERVICE_DB_USER / ORES_REPORTING_SERVICE_DB_PASSWORD, etc.). The password is already rotatable, stored in .env, and different per environment. Rather than inventing a new secret store, we store a bcrypt hash of the DB password in the corresponding IAM service account row during database population. The service presents the plaintext password at startup; IAM checks the bcrypt hash.

Infrastructure Already in Place

Component	Location	Status
Service accounts seeded	`ores.sql/populate/iam/iam_service_accounts_populate.sql`	✅ 18 accounts
`service_session_service`	`ores.iam.core/service/service_session_service`	✅ Implemented
RS256 JWT signer	`ores.security/jwt/jwt_authenticator`	✅ Implemented
JWT refresh endpoint	`iam.v1.auth.refresh`	✅ Implemented
`authenticated_request()` helper	`ores.shell/src/service/nats_session.cpp`	✅ Used by Qt
Service DB passwords	`.env`	✅ Defined

What Needs to Be Built

Phase 1 — IAM: service-login endpoint

1.1 Add `service_password_hash` column to service accounts

Add a nullable service_password_hash text column to ores_iam_accounts_tbl. Only populated for account_type = 'service'.

The population script currently calls ores_iam_service_accounts_upsert_fn() without a password. Extend the function to accept an optional password, bcrypt it, and store the hash:

create or replace function ores_iam_service_accounts_upsert_fn(
    p_db_user     text,
    p_username    text,
    p_description text,
    p_password    text default null  -- ← new
) returns void ...

1.2 Populate hashes from .env

Update iam_service_accounts_populate.sql to pass the DB password from a psql variable:

\set reporting_pw `echo "$ORES_REPORTING_SERVICE_DB_PASSWORD"`

select ores_iam_service_accounts_upsert_fn(
    :'reporting_service_user',
    'reporting_service@system.ores',
    'System service account for Reporting NATS domain service',
    :'reporting_pw'
);

psql passes the back-quoted `echo $VAR` command to the shell when the script runs and substitutes its output, so the password is read from the environment at execution time and never appears in a committed SQL file.

1.3 Add `iam.v1.auth.service-login` NATS subject

Add to ores.iam.api/messaging/login_protocol.hpp:

struct service_login_request {
    static constexpr std::string_view nats_subject =
        "iam.v1.auth.service-login";
    std::string username;   // e.g. "reporting_service@system.ores"
    std::string password;   // plaintext DB password (sent only over loopback or TLS-protected NATS)
};

struct service_login_response {
    bool success = false;
    std::string token;      // JWT on success
    std::string message;    // error description on failure
};

1.4 Implement handler method `auth_handler::service_login()`

In ores.iam.core/messaging/auth_handler.hpp:

Decode service_login_request.
Look up account by username; verify account_type != 'user'.
Check bcrypt_verify(req.password, account.service_password_hash).
Call service_session_service::start_service_session(account.id, "ores.service.binary").
Build jwt_claims from the session (same fields as human login but roles reflect the service account's assigned roles).
Call signer_.create_token(claims) and return service_login_response{.token = jwt}.

In ores.iam.core/src/messaging/registrar.cpp, register the subject:

nats_.subscribe("iam.v1.auth.service-login",
    [this](auto msg) { handler_.service_login(std::move(msg)); });
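
The decision logic of the handler steps above can be sketched in isolation. This is a minimal, self-contained sketch and not the actual handler: the structs are simplified copies of the ones in this plan, and `account`, `service_login_deps`, `find_account`, `verify_password`, and `sign` are hypothetical stand-ins for the IAM repository, the bcrypt check, and session creation plus JWT signing. Returning one generic message for every failure is a choice made in this sketch, not something the plan prescribes.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for an IAM account row.
struct account {
    std::string id;
    std::string account_type;           // "user" or "service"
    std::string service_password_hash;  // empty when no hash has been seeded
};

struct service_login_request {
    std::string username;
    std::string password;
};

struct service_login_response {
    bool success = false;
    std::string token;
    std::string message;
};

// Dependencies are injected as callables so the sketch needs no IAM code.
struct service_login_deps {
    std::function<std::optional<account>(const std::string&)> find_account;
    std::function<bool(const std::string&, const std::string&)> verify_password;
    std::function<std::string(const account&)> sign;  // session + JWT creation
};

service_login_response service_login(const service_login_request& req,
                                     const service_login_deps& deps) {
    // One generic failure reply, so callers cannot probe which accounts exist.
    const service_login_response denied{false, "", "invalid credentials"};

    const auto acc = deps.find_account(req.username);
    if (!acc) return denied;                                // unknown account
    if (acc->account_type == "user") return denied;         // humans use the normal login
    if (acc->service_password_hash.empty()) return denied;  // no hash stored
    if (!deps.verify_password(req.password, acc->service_password_hash))
        return denied;                                      // wrong password

    return service_login_response{true, deps.sign(*acc), ""};
}
```

Injecting the dependencies keeps the accept/reject rules unit-testable without a database or a signing key.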

Phase 2 — Shared helper: `service_nats_client`

Create ores.service/include/ores.service/messaging/service_nats_client.hpp (a thin RAII wrapper that every backend service can use):

#include <chrono>
#include <cstddef>
#include <span>
#include <string>
#include <string_view>

class service_nats_client {
public:
    service_nats_client(
        ores::nats::service::client& nats,
        std::string username,       // e.g. "reporting_service@system.ores"
        std::string password,       // DB password from env
        std::chrono::seconds refresh_margin = std::chrono::seconds(60));

    // Blocking call at startup; throws on failure.
    void authenticate();

    // Attach Bearer header and call request_sync.
    ores::nats::message authenticated_request(
        std::string_view subject,
        std::span<const std::byte> body,
        std::chrono::milliseconds timeout = std::chrono::seconds(5));

private:
    void refresh_if_needed();

    ores::nats::service::client& nats_;
    std::string username_;
    std::string password_;
    std::string jwt_;
    std::chrono::system_clock::time_point expires_at_;
    std::chrono::seconds refresh_margin_;
};

authenticate() sends iam.v1.auth.service-login and stores the returned JWT.

authenticated_request() first calls refresh_if_needed(), which sends iam.v1.auth.refresh when the current token expires within refresh_margin_, and then calls nats_.request_sync(subject, body, {{"Authorization", "Bearer " + jwt_}}).
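
The refresh decision can be isolated into a pure function, which makes the margin logic easy to unit test. The sketch below is illustrative only: `needs_refresh` and `bearer_header` are hypothetical helper names, and it assumes `expires_at_` is set from the expiry of the token returned by IAM.

```cpp
#include <chrono>
#include <string>

using clock_type = std::chrono::system_clock;

// True once `now` has entered the refresh window, i.e. the token expires
// within `margin` or has already expired.
inline bool needs_refresh(clock_type::time_point now,
                          clock_type::time_point expires_at,
                          std::chrono::seconds margin) {
    return now + margin >= expires_at;
}

// Builds the value of the Authorization header for an outbound call.
inline std::string bearer_header(const std::string& jwt) {
    return "Bearer " + jwt;
}
```

With a token lifetime of 1800 s and the default 60 s margin, any outbound call made 1740 s or more after the token was issued triggers a refresh before the request is sent.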

Phase 3 — Wire into outbound-calling services

Each service that calls another service over NATS holds a service_nats_client and uses it, rather than a raw ores::nats::service::client&, for those inter-service calls.

Example: reporting service

In application.cpp:

auto svc_client = ores::service::messaging::service_nats_client(
    nats,
    cfg.service_username,   // "reporting_service@system.ores"
    cfg.service_password);  // ORES_REPORTING_SERVICE_DB_PASSWORD
svc_client.authenticate();

Pass svc_client to report_scheduling_service instead of the raw nats::service::client. Inside schedule_one() replace:

// before
const auto reply_msg = nats_.request_sync(
    schedule_job_request::nats_subject, body);

// after
const auto reply_msg = svc_nats_.authenticated_request(
    schedule_job_request::nats_subject, body);

Services with outbound calls to update

Calling Service	Callee	Handler to update
`ores.reporting`	`ores.scheduler`	`report_scheduling_service`
(future) Any service calling another

Phase 4 — Restore JWT validation in scheduler write handlers

Revert the temporary workaround in ores.scheduler.core/messaging/job_definition_handler.hpp: re-add make_request_context() calls to schedule(), schedule_batch(), and unschedule().

The reporting service will now supply a valid JWT, so those handlers will authenticate successfully and stamp() will set tenant_id / modified_by from the claims.
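
On the receiving side, validation starts by pulling the token out of the Authorization header. The following is a minimal sketch of that first step only; `extract_bearer_token` is a hypothetical helper, and how make_request_context() actually parses the header is not described in this plan.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Returns the token part of a "Bearer <token>" header value, or std::nullopt
// when the scheme is not "Bearer " or the token is empty.
inline std::optional<std::string> extract_bearer_token(std::string_view header_value) {
    constexpr std::string_view prefix = "Bearer ";
    if (header_value.substr(0, prefix.size()) != prefix) return std::nullopt;
    header_value.remove_prefix(prefix.size());
    if (header_value.empty()) return std::nullopt;
    return std::string(header_value);
}
```

A missing or malformed header yields std::nullopt, which is the case the scheduler hit before this plan: the caller sent no header at all.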

Configuration Changes

New environment variables

No new variables needed. Re-use existing:

ORES_REPORTING_SERVICE_DB_USER + ORES_REPORTING_SERVICE_DB_PASSWORD
(future services follow the same pattern)

New config fields in service configuration structs

Add to each outbound-calling service's config:

std::string service_username;  // e.g. "reporting_service@system.ores"
std::string service_password;  // DB password

Read from environment in application.cpp alongside the existing DB options.
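
Reading the credentials could look roughly like the sketch below. It assumes a fail-fast policy when the variable is missing; `service_credentials`, `require_env`, and `load_reporting_credentials` are hypothetical names. Only the environment variable name comes from this plan, and the username is passed in because the plan does not say where it is configured.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical config fragment; field names follow the text above.
struct service_credentials {
    std::string service_username;
    std::string service_password;
};

// Reads a required environment variable and throws when it is unset or empty,
// so a misconfigured service fails at startup rather than on its first call.
inline std::string require_env(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0')
        throw std::runtime_error(std::string("missing environment variable: ") + name);
    return value;
}

// Assumption: the caller supplies the IAM username; the password is read
// from the existing DB password variable.
inline service_credentials load_reporting_credentials(std::string username) {
    return {std::move(username), require_env("ORES_REPORTING_SERVICE_DB_PASSWORD")};
}
```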

Security Considerations

The DB password is sent as plaintext inside a NATS message. NATS operates over TLS in production (nats:// → tls://) and over loopback-only connections in development. Sending the password this way is acceptable only when the caller and IAM run on the same host or communicate over an encrypted transport.
The hash stored in IAM is bcrypt with cost ≥ 12. A compromised IAM DB row does not immediately expose the credential.
JWTs are short-lived (default 1800 s, same as user tokens). Compromised tokens expire quickly.
service_password_hash is only set for account_type != 'user' accounts. The human login path is unchanged.
A service account JWT carries the same claims structure as a human JWT. Downstream handlers apply the same RBAC rules. Service accounts must be assigned appropriate roles during seeding.

Roles for Service Accounts

Service accounts need roles to pass RBAC checks in downstream handlers. For the initial implementation:

Service Account	Minimum Role
`reporting_service@system.ores`	`system_service` (read + write own domain)
`scheduler_service@system.ores`	`system_service`
(others)	`system_service`

Add role assignments to iam_service_accounts_populate.sql.

Rollout Order

IAM first: Implement Phase 1 (service-login endpoint + password hash). Deploy IAM. Verify with a manual NATS call.
Shared helper: Implement Phase 2 (service_nats_client) with unit tests.
Outbound callers: Implement Phase 3 service by service, starting with ores.reporting.
Restore guards: Once all callers supply JWTs, apply Phase 4 (revert the temporary workaround and restore JWT validation in the scheduler write handlers).

Affected Files

New files

projects/ores.service/include/ores.service/messaging/service_nats_client.hpp
projects/ores.service/src/messaging/service_nats_client.cpp
projects/ores.service/tests/service_nats_client_tests.cpp

Modified files

projects/ores.iam.api/include/ores.iam.api/messaging/login_protocol.hpp — add service_login_request / service_login_response
projects/ores.iam.core/include/ores.iam.core/messaging/auth_handler.hpp — add service_login() method
projects/ores.iam.core/src/messaging/registrar.cpp — register iam.v1.auth.service-login
projects/ores.sql/create/iam/iam_functions_create.sql — extend ores_iam_service_accounts_upsert_fn()
projects/ores.sql/populate/iam/iam_service_accounts_populate.sql — pass passwords
projects/ores.reporting.service/src/app/application.cpp — create service_nats_client, pass to scheduling service
projects/ores.reporting.core/src/service/report_scheduling_service.cpp — use authenticated_request()
projects/ores.scheduler.core/include/ores.scheduler.core/messaging/job_definition_handler.hpp — restore make_request_context() in write handlers