Sunday, September 25, 2011

linux threads and forking (and zeroc ice)

so we have this nice icegrid setup with python nodes bound through sip to the underlying c++ code. on each new rpc call, ice creates a new thread to handle it, but as the underlying c++ code is not threadsafe we fork from this thread and continue execution in a new process.

everything worked perfectly, but one day we decided to remove the python layer and go with c++ all the way

and then something weird started happening: forked processes started hanging, crashing, a total mess... after some research we found several articles mentioning that it is a bad idea to mix linux threads and forks. fork copies only the calling thread into the child, so any mutex that another thread held at that moment is copied in its locked state and nobody in the child is left to unlock it. this is exactly what we had observed.
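to make the copied locked mutex concrete, here is a minimal sketch in python (not our real code, all names are made up for illustration). one thread holds a lock while the main thread forks, and the child only does a non-blocking acquire so the demo reports the problem instead of hanging. the same thing happens to pthread mutexes in c++.

```python
import os
import threading


def child_sees_locked_lock() -> bool:
    """Fork while another thread holds a lock; report whether the child finds it locked."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()      # tell the main thread the lock is now taken
            release.wait()  # keep holding it until the fork has happened

    t = threading.Thread(target=holder)
    t.start()
    held.wait()

    pid = os.fork()  # only the calling thread is copied into the child
    if pid == 0:
        # child: the holder thread does not exist here, yet the lock it held
        # is still marked as locked, so a blocking acquire() would wait forever
        got_it = lock.acquire(blocking=False)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    t.join()
    return os.WEXITSTATUS(status) == 1
```

in the real setup the lock is hidden inside a library, so the child just hangs with no obvious culprit. recent python versions also print a warning when you fork from a multi-threaded process, for this very reason.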

so how do we solve this? and why did the python+cpp solution work fine?

as a second solution we put a thin python wrapper back over the main c++ functionality, trying to copy the behavior of the original solution, but again the forked child processes had copied locked mutexes that caused them to hang

so, how come? the underlying code in the original solution (call it A) is pretty much the same as in the pure c++ version (B) and the thin-wrapper version (C). what appears to happen is that some ice threads are calling localtime_r at the moment we fork, and when the forked code tries to execute localtime_r it blocks forever.
but why does it not happen in the original python/cpp solution? why does it happen in the new python/cpp solution? are there some python flags that help avoid this? one difference is that the original code would spend some time jumping back and forth between python and cpp, while the new code only enters cpp code once and returns the results.
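this does not answer the questions above, but one commonly suggested way around the problem is to never run real code in a forked copy of a threaded process, and to start a fresh process image instead (fork followed immediately by exec). a rough sketch, assuming the non-threadsafe work can live in its own program. the function names and the tiny localtime script are made up for illustration, this is not what we ended up doing.

```python
import subprocess
import sys
import threading


def year_from_fresh_process() -> int:
    """Call localtime in a brand-new process image instead of a forked copy."""
    # hypothetical stand-in for the non-threadsafe work: print the local year
    code = "import time; print(time.localtime().tm_year)"
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


def handle_calls(n: int) -> list:
    """Simulate n rpc handler threads, each delegating to a fresh process."""
    results = []

    def handler():
        results.append(year_from_fresh_process())

    threads = [threading.Thread(target=handler) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

because exec replaces the whole process image, whatever locks were copied by the fork are thrown away before any library code runs. the price is that the worker starts from scratch and has to get its input and output through pipes or files.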

in any case it seems a mess so we went for another completely different solution, but the headache was/is huge