[GE issues] [Issue 2837] New - execd job start might produce communication errors because of not posix compliant fork() system call

crei crei at sun.com
Wed Dec 17 08:20:23 GMT 2008

                 Issue #|2837
                 Summary|execd job start might produce communication errors
                        |because of not posix compliant fork() system call
       Status whiteboard|
              Issue type|DEFECT
             Assigned to|crei
             Reported by|crei

------- Additional comments from crei at sunsource.net Wed Dec 17 00:20:20 -0800 2008 -------
At compile time the gridengine source code for Solaris is linked against the
Solaris threads library (-lthread). The fork() implementation of this library
also duplicates the running communication threads of a process, which is not
POSIX compliant. On all other architectures the -lpthread library is used,
whose fork() copies only the thread that calls fork() into the new process.
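
The difference between the two fork() behaviours can be checked with a small
test. The following is a minimal sketch in C and is not taken from the
gridengine source; the names worker, ticks, stop_flag and
worker_survives_fork are hypothetical. A background thread stands in for a
communication thread and increments a counter. After fork() the child checks
whether its copy of the counter still changes. With a POSIX compliant fork()
the function should return 0. With a fork() that duplicates all threads, as
described above for the Solaris threads library, it would be expected to
return 1.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static atomic_ulong ticks;   /* incremented only by the worker thread */
static atomic_int stop_flag; /* set by the parent to end the worker */

/* Stand-in for a communication thread: bumps a counter about once per ms. */
static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_flag)) {
        atomic_fetch_add(&ticks, 1);
        poll(NULL, 0, 1); /* sleep for roughly 1 ms */
    }
    return NULL;
}

/*
 * Returns 1 if the worker thread is still running in the child after fork(),
 * 0 if only the calling thread was copied, and -1 on error.
 */
int worker_survives_fork(void)
{
    pthread_t tid;
    int status = 0;
    int result = -1;
    pid_t pid;

    atomic_store(&ticks, 0);
    atomic_store(&stop_flag, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    /* Wait until the worker has demonstrably started. */
    while (atomic_load(&ticks) == 0)
        poll(NULL, 0, 1);

    pid = fork();
    if (pid == 0) {
        /* Child: the counter only moves if the worker was duplicated too. */
        unsigned long before = atomic_load(&ticks);
        poll(NULL, 0, 200); /* observation window of about 200 ms */
        _exit(atomic_load(&ticks) != before ? 1 : 0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Parent: stop and reap the worker thread. */
    atomic_store(&stop_flag, 1);
    pthread_join(tid, NULL);
    return result;
}
```

Calling worker_survives_fork() from a small main() and printing the result is
enough to compare platforms. The file has to be built with the thread flags of
the platform, for example cc -pthread with gcc or clang.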

On Solaris 8 and 9 this can produce communication errors in the gridengine
execution daemon. The daemon calls fork() when starting a job. There is a
small time window before the newly started process calls exec(), which
terminates all threads. During this window the duplicated execd communication
threads in the child process might shut down a connected client.

This problem occurs especially with large tightly integrated jobs, because
qrsh -inherit connects directly to the execution daemon to start a slave task.

For standard jobs, which are delivered to the execd via qmaster, this is not a
problem: the execd simply reconnects to qmaster if the connection is lost, and
the job is resent if the execd does not report it as running.

