|
Is this an overkill? I don't know.
The thing is, correctly intercepting SIGTERM (also SIGINT, etc.) is
incredibly tricky. For example, before this commit, my I/O loops in
server.c and worker.c were inherently racy.
This was immediately obvious if you tried to run the tests. The tests
(especially the Valgrind flavour) would run a worker, wait until it
prints a "Waiting for a new command" line, and try to kill it using
SIGTERM. The problem is, the global_stop_flag check could have already
been executed by the worker, and it would hang forever in recv().
The solution seems to be to use signalfd and select()/poll(). I've never
used either before, but it seems to work well enough - at least the very
same tests pass and don't hang now.
|
|
OK, this is a major rework.
* tcp_server: connection threads are not detached anymore, the caller has
to clean them up. This was done so that the server can clean up the
threads cleanly.
* run_queue: simple refactoring, run_queue_entry is called just run now.
* server: worker threads are now killed when a run is assigned to a
worker.
* worker: the connection to server is no longer persistent. A worker
sends "new-worker", waits for a task, closes the connection, and when
it's done, sends the "complete" message and waits for a new task.
This is supposed to improve resilience, since the worker-server
connections don't have to be maintained while the worker is doing a CI
run.
|