Controller Startup and Shutdown Latency Fix
Move blocking DB writes off the io_context thread in process_supervisor
Table of Contents
Problem Statement
Shutdown: controller must be killed
Running ./build/scripts/stop-services.sh takes roughly 30 seconds and the
controller process must be killed rather than exiting gracefully:
Waiting for controller to exit............................................................... done killed PID 428643 (did not exit within grace period)
The 30-second grace period in stop-services.sh is not long enough for the
controller to drain all its DB writes.
Startup: slow time-to-ready for all services
start-services.sh exits as soon as the controller process is launched — it
does not wait for the controller to finish starting all supervised services.
The controller runs start_all() asynchronously after start-services.sh
returns, so from the operator's perspective services appear unavailable for a
significant period after the script completes.
Additionally, start_all() performs the same synchronous blocking DB writes
per service (in do_launch()) that cause the shutdown problem. Each service
launch blocks the io_context while writing a "started" event to the DB,
serialising the startup of services that have no dependency relationship and
could otherwise be launched in parallel.
Root Cause
Blocking DB writes inside ASIO coroutines
monitor_process() in process_supervisor.cpp is a co_spawn-ed ASIO
coroutine that runs on the controller's single-threaded io_context. When a
supervised service exits, monitor_process() performs several synchronous,
blocking libpqxx calls:
// process_supervisor.cpp:559-588 try { repository::service_instance_repository inst_repo; inst_repo.update_phase(db_ctx_, *existing); // blocking repository::service_event_repository ev_repo; ev_repo.insert(db_ctx_, ev); // blocking — up to 7 s } catch (...) { ... }
These calls block the io_context thread. While one coroutine is blocked
inside a DB write, every other pending coroutine is frozen: async_wait
completions for other exiting processes queue up and cannot fire. The result
is that all service-exit notifications are processed serially, gated by each
blocking DB write.
Evidence from controller log
The pattern repeats for every service during shutdown:
21:08:42.317 Shutdown signal received. Draining... 21:08:42.330 Stopped service ores.assets.service[0]. ← exits immediately after SIGTERM 21:08:42.393 Inserting service event.. ← DB write starts 21:08:42.393 Pool connection dead → rebuilding ← dead connection overhead 21:08:49.830 Finished Inserting service event.. ← 7.4 s blocked! 21:08:49.830 Stopped service ores.variability.service[0] ← only NOW can this fire
ores.variability.service likely exited within a second of receiving SIGTERM,
but its async_wait completion could not fire until the io_context was
unblocked by the end of the assets DB write.
Individual DB write durations observed:
| Service | DB write duration |
|---|---|
| ores.assets.service | 7.4 s |
| ores.marketdata | 7.0 s |
| ores.telemetry | 5.3 s |
| ores.dq.service | 1.7 s |
| ores.compute.service | 1.2 s |
| ores.ore.service | 3.0 s |
With N services stopping, worst-case shutdown time ≈ sum of all DB write
durations, serialised.
Secondary: dead pool connection on every operation
Every DB operation during shutdown triggers "Pool connection dead (ROLLBACK failed: Query execution failed: no connection to the server)". The pool rebuilds in ~50 ms and the operation succeeds, so this is not the primary cause of the long shutdown — but it does add overhead and noise.
The likely cause is PostgreSQL's server-side idle connection reaping
(tcp_keepalives_idle or idle_in_transaction_session_timeout) killing
connections that have been idle during steady-state operation. The controller
only writes to the DB when services start or stop, so its pool connections sit
idle for extended periods between events.
Startup: do_launch() DB writes
do_launch() writes a "started" event to the DB synchronously inside the
coroutine before returning. Because start_all() calls co_await do_launch()
for each service sequentially, each blocking write delays the launch of the
next service. For services with no dependency relationship, this is pure
wasted time.
Startup: start-services.sh does not wait for readiness
start-services.sh currently exits immediately after starting the controller.
The operator has no way to know when the full service stack is ready from the
terminal, and any elapsed-time measurement via the new Time : line only
reflects the time to launch the controller process itself (~2–3 s), not the
total time until services are usable.
Proposed Solution
Primary fix: DB thread pool for blocking writes
Add a boost::asio::thread_pool db_pool_{1} to process_supervisor. Before
each blocking DB section in monitor_process() and do_launch(), switch the
coroutine to the thread pool executor; switch back to the io_context afterwards
if further shared-state access is needed.
// In monitor_process() — after all processes_ map mutations are done: // Switch to DB thread pool so the io_context is not blocked. co_await boost::asio::post( boost::asio::bind_executor(db_pool_.get_executor(), boost::asio::use_awaitable)); try { // All blocking DB calls go here. inst_repo.update_phase(db_ctx_, *existing); ev_repo.insert(db_ctx_, ev); } catch (const std::exception& e) { BOOST_LOG_SEV(lg(), warn) << "DB update failed: " << e.what(); } if (entry->stop_requested) co_return; // no more shared state — staying on db_pool thread is fine // Switch back to io_context before accessing processes_ for restart. co_await boost::asio::post( boost::asio::bind_executor(ioc_.get_executor(), boost::asio::use_awaitable)); // Restart logic (accesses processes_ map) continues here on io_context thread.
A single-thread pool (thread_pool{1}) is sufficient and avoids concurrent DB
connection pressure. All processes_ map access remains on the io_context
thread, so no locks are needed on that map.
Secondary fix: idle connection keepalive
Configure the database pool to issue a lightweight keepalive ping (SELECT 1)
before returning a connection to a caller, or enable TCP keepalives in the
libpqxx connection string. This removes the "Pool connection dead" log
noise and the ~50 ms pool-rebuild overhead on every event write. It is not
required for the primary fix but is easy to add.
Startup fix: log "all services started" with timestamp in the controller
process_supervisor::start_all() already logs "All services started." at
INFO when it completes. That timestamp in the controller log gives the exact
time-to-ready. No code change is required here — the operator can inspect the
log. However, adding a dedicated structured log line with elapsed time would
make it searchable:
// At the end of start_all(): const auto elapsed = std::chrono::duration_cast<std::chrono::seconds>( std::chrono::steady_clock::now() - start_ts).count(); BOOST_LOG_SEV(lg(), info) << "All services started in " << elapsed << "s.";
Startup fix: wait for controller readiness in start-services.sh
Extend start-services.sh to wait until the controller log shows
"All services started" before printing the final timing line. This gives
the operator an accurate wall-clock cost for the full startup and a clear
signal that the stack is ready. Use the existing wait_for_ready shell
function pattern (tail the log file, poll for the string, timeout after 120 s):
wait_for_ready_string "ores.controller.service" "All services started"
The Time : line added to start-services.sh will then reflect the true
time from launch to full service availability.
Files to Change
| File | Change |
|---|---|
projects/ores.controller.core/include/ores.controller.core/service/process_supervisor.hpp |
Add #include <boost/asio/thread_pool.hpp>; add db_pool_ member |
projects/ores.controller.core/src/service/process_supervisor.cpp |
monitor_process(): switch to db_pool_ before DB writes, back to ioc_ before restart |
projects/ores.controller.core/src/service/process_supervisor.cpp |
do_launch(): switch to db_pool_ for post-launch DB writes |
projects/ores.controller.core/src/service/process_supervisor.cpp |
start_all(): log elapsed startup time at completion |
build/scripts/start-services.sh |
Add wait_for_ready_string call for controller "All services started"; Time: line reflects true total startup |
No schema changes. No other services are affected.
Implementation Steps
Step 1: Add DB thread pool to process_supervisor
Add boost::asio::thread_pool db_pool_{1} as a member of process_supervisor
(inline initialiser; destroyed after ioc_ so any in-flight DB work completes
before the destructor returns). One thread is sufficient — avoids concurrent
DB connections and matches the single-connection pool pattern already in use.
Step 2: Offload monitor_process() DB writes
In monitor_process(), all processes_ map accesses complete by line 534
(remove_pid_file). After that line, switch to the DB pool executor:
co_await boost::asio::post( boost::asio::bind_executor(db_pool_.get_executor(), boost::asio::use_awaitable)); // blocking DB writes go here (unchanged) try { ... inst_repo.update_phase(...); ev_repo.insert(...); } catch (...) { ... } if (entry->stop_requested) co_return; // no more shared state — ok to stay on db_pool thread // Switch back for restart logic (accesses processes_ map, calls do_launch) co_await boost::asio::post( boost::asio::bind_executor(ioc_.get_executor(), boost::asio::use_awaitable)); // restart logic continues here on io_context thread
Step 3: Offload do_launch() DB writes
In do_launch(), the DB try-catch block (lines 443–480) runs after
processes_[key] = entry and co_spawn monitor_process. Switch to
db_pool_ immediately before the try-catch. Since do_launch ends after
those writes, no switch-back is needed.
Step 4: Elapsed time in start_all()
Record std::chrono::steady_clock::now() at the top of start_all() and
log the elapsed duration at the end.
Step 5: wait_for_ready_string in start-services.sh
Add a shell helper wait_for_ready_string(name, pattern) analogous to the
existing wait_for_ready function but matching a fixed string instead of
"Service ready". Call it after launching the controller:
wait_for_ready_string "ores.controller.service" "All services started"
Update the final Time : line so it prints after this wait completes.
Step 6 (optional): Idle connection keepalive
In ores.database.domain.tenant_aware_pool, add a SELECT 1 validation
in acquire() before returning a connection. On failure, rebuild the pool
and retry once. This removes the rebuild noise from normal shutdown and from
any future long idle periods.
Expected Outcome
| Metric | Before | After |
|---|---|---|
| Shutdown time | ~30 s, killed | ~5–8 s, clean exit |
| Startup script exit time | ~2–3 s (controller launched only) | Reflects true time-to-ready |
| io_context blocked during shutdown | Yes, serially per service | No — DB writes on thread pool |
| "Pool connection dead" log noise | Every operation | Eliminated (Step 6) |