[GE issues] [Issue 2837] New - execd job start might produce communication errors because of not posix compliant fork() system call
crei at sun.com
Wed Dec 17 08:20:23 GMT 2008
Summary|execd job start might produce communication errors because of not posix compliant fork() system call
------- Additional comments from crei at sunsource.net Wed Dec 17 00:20:20 -0800 2008 -------
At compile time the gridengine source code for solaris is linked against the
-lthreads library. The fork() implementation of this threading library also
duplicates the running communication threads of the process, which is not posix
compliant. On all other architectures the -lpthread library is used, which
copies only the thread that calls fork() into the new process.
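
The posix behaviour described above (only the thread calling fork() exists in the child) can be observed with a small sketch. This is only an illustration in Python, not gridengine code: the function name threads_seen_after_fork and the idle worker threads standing in for the execd communication threads are assumptions made for this example. It needs a system that provides fork(), and it counts the threads known to the Python runtime, which on a posix compliant system matches what the child actually contains.

```python
import os
import threading


def threads_seen_after_fork(n_workers=2):
    """Return (parent_count, child_count): threads known to Python around fork()."""
    stop = threading.Event()
    # Hypothetical stand-ins for the execd communication threads: they only idle.
    workers = [threading.Thread(target=stop.wait, daemon=True)
               for _ in range(n_workers)]
    for worker in workers:
        worker.start()

    read_fd, write_fd = os.pipe()
    parent_count = threading.active_count()

    pid = os.fork()
    if pid == 0:
        # Child: report the thread count seen here, then exit without cleanup.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: read the child's report until EOF, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        child_count = int(reader.read())
    os.waitpid(pid, 0)

    # Let the idle workers finish.
    stop.set()
    for worker in workers:
        worker.join()
    return parent_count, child_count


if __name__ == "__main__":
    parent, child = threads_seen_after_fork()
    print("threads in parent:", parent)
    print("threads in child after fork():", child)
```

With fork() semantics that duplicate every thread, as reported here for solaris 8 and 9, the child would instead carry live copies of the communication threads until it calls exec().
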
On solaris 8 and 9 this can cause communication errors in the gridengine
execution daemon. The daemon calls fork() when it starts a job. There is a
small time window before the newly started process calls exec(), which
terminates all threads. During this time window the duplicated execd
communication threads might shut down the connection of a connected client.
This problem occurs especially with large tightly integrated job submissions,
because qrsh -inherit connects directly to the execution daemon to start
a slave task.
For standard jobs, which are delivered via qmaster to the execd, this is not a
problem: the execd simply reconnects to qmaster if the connection is lost, and
the job is resent if the execd does not report the running job.