so we have this nice icegrid setup with python nodes tied with sip to underlying c++ code. on each new rpc call, ice creates a new thread to handle it, but as the underlying c++ code is not thread-safe, we fork from this thread and continue execution in a new process.
everything worked perfectly, but one day we decided to remove the python layer and go with c++ all the way.
and then something weird started happening - forked processes started hanging, crashing, a total mess... after some research we found several articles mentioning that it is a bad idea to mix linux threads and forks: the child process gets a copy of every mutex as it was at the moment of the fork, but only the forking thread exists in the child, so a mutex that some other thread had locked stays locked forever. this is exactly what we had observed.
so how do we solve this? and why did the python+cpp solution work fine?
as a second solution we modified our code to use a thin python wrapper again over the main c++ functionality, in an attempt to copy the behavior of the original solution, but again the forked child processes had copied locked mutexes that caused them to hang.
so, how come? the underlying code in the original solution (call it A) is pretty much the same as in the pure c++ version (B) and the thin python wrapper version (C). what appears to happen is that some ice threads are calling localtime_r at the moment we fork, and when the forked code tries to execute localtime_r it blocks on the lock it inherited.
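the "copied locked mutex" effect is easy to reproduce in isolation. the sketch below is not our ice code: `child_inherits_held_lock` is a made-up helper, and a plain `threading.Lock` stands in for whatever lock localtime_r takes internally (an assumption, for illustration only). one thread holds the lock at the moment of the fork, and the child, which only contains a copy of the forking thread, can never get it.

```python
import os
import threading
import warnings


def child_inherits_held_lock(timeout=0.5):
    """Fork while another thread holds a lock.

    Returns True if the forked child found the lock still locked.
    (Hypothetical helper; the lock is a stand-in for a library-internal mutex.)
    """
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        # Plays the role of the thread that is inside the locked call.
        with lock:
            held.set()
            release.wait()

    worker = threading.Thread(target=holder)
    worker.start()
    held.wait()  # from here on the lock is owned by the other thread

    with warnings.catch_warnings():
        # Newer Python versions warn about exactly this pattern.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: the holder thread does not exist here, so nobody will ever
        # release the copied lock. Use a timeout instead of hanging forever.
        got_it = lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    # Parent: collect the child's verdict, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    worker.join()
    return os.WEXITSTATUS(status) == 1


if __name__ == "__main__":
    print(child_inherits_held_lock())
```

the only difference from what our forked processes did is the timeout: the child here gives up waiting, while code that takes the lock without a timeout just hangs.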
but why does it not happen in the original python/cpp solution? why does it happen in the new python/cpp solution? are there some python flags that help avoid this? the original code would spend some time jumping back and forth between python and cpp, while the new code only enters the cpp code once and returns the results.
in any case it seemed a mess, so we went for another, completely different solution, but the headache was/is huge
Sunday, September 25, 2011
Saturday, April 30, 2011
using apache qpid with persistence
this is just a quick post about a two-week struggle with a problem which was eventually solved in half an hour.
as at my company we (i) decided to use apache QPID as a message queue framework, we certainly needed persistence for queues and messages. building qpid itself was really easy and straightforward, and for building the persistence module msgstore.so i followed the directions posted on Lahiru Gunathilake's blog. i built the whole setup on my local machine and everything worked just perfectly - messages were sent persistently, queues were durable etc.
then i started deploying on the dev servers
the qpid broker started crashing on startup.
the qpid broker started crashing on startup.
the qpid broker started crashing on startup.
and on and on and on
it was easy to detect that the problem appeared when the broker was started with --load-module=msgstore.so and that it segfaulted when attempting to create a berkeley db database. but why??
well...on my local machine i am running fedora, and the dev server runs CentOS.
my fedora has berkeley db 4.8 and the server has 4.5. this was the first 'ding'.
now, how to trick the qpid configure into using db4.8 instead of db4.5? we couldn't just install the new libs as there might have been incompatibilities with the product, so what to do? we tried modifying configure scripts, playing with symlinks etc - i myself don't have much experience with linux so i was relying mostly on the admin guy, but he was stuck on this too - segfault after segfault, while on my machine the broker hummed along quietly, transferring persistent messages and maintaining durable queues.
then after a loooot of reading we came to an obvious solution - the qpid configure itself gave it to us, and had i not been so shortsighted, this would have been done days ago -
just build berkeley db 4.8 with --prefix to place it in a separate directory (say $BDB48), away from the db4.5 libs the product needs, and then, before running qpid configure, run these:
export CPP_PATH=$BDB48/include
export LIB_PATH=$BDB48/libs
(i'm writing these from memory, check qpid's configure script with --help to see the correct ones)
after that run configure and it will find and use the berkeley libs you wanted it to!
awesome!
now let's do the same in production where we have Red Hat 5. BOOM! here comes the all too familiar segfault that we had managed to escape with the clever export trick. why? what is so different on red hat???
another week of research, testing, breaking, hair-tearing followed, this time with no result.
i was even contemplating running a virtual fedora machine on the red hat server, just to have the familiar setup and start the broker with persistence there.
luckily i didn't have to -
one day i sat down with the it manager to explain the situation to him, and while we were scratching our heads we looked at the persistence module's readme.txt, where it said that it was tested with berkeley db 4.3.
so what? we have db4.8 and it should be fine, right?
what if we gave it another try, not with 4.8 but with 4.3, set up the same way?
this took about 10 minutes to set up, and when i hit 'enter' for qpidd --load-module=msgstore.so, before my eyes was the beautiful log dump, saying that the module was loaded and so on and so on ....
aaah....rtfm? the thing is we had actually tested db4.3 when facing the initial problem of setting up the dev servers. this failed somehow, and so we didn't consider it when fighting with the production setup
anyways - this post grew as long as the first two episodes of "game of thrones" that i watched today :D
have a good night and don't give up the fight!