[GE issues] [Issue 2837] New - execd job start might produce communication errors because of not posix compliant fork() system call

crei crei at sun.com
Wed Dec 17 08:20:23 GMT 2008


http://gridengine.sunsource.net/issues/show_bug.cgi?id=2837
                 Issue #|2837
                 Summary|execd job start might produce communication errors
                        |because of not posix compliant fork() system call
               Component|gridengine
                 Version|6.2
                Platform|Sun
                     URL|
              OS/Version|All
                  Status|NEW
       Status whiteboard|
                Keywords|
              Resolution|
              Issue type|DEFECT
                Priority|P3
            Subcomponent|communication
             Assigned to|crei
             Reported by|crei






------- Additional comments from crei at sunsource.net Wed Dec 17 00:20:20 -0800 2008 -------
At compile time, the gridengine source code for Solaris is linked with the
-lthreads library. The fork() implementation of this threading library also
duplicates the running communication threads of the process in the child,
which is not POSIX compliant. On all other architectures the -lpthread library
is used, whose fork() copies only the thread that calls fork() into the new
process.

On Solaris 8 and 9 this can produce communication errors in the gridengine
execution daemon. The daemon calls fork() when starting a job. There is a
small time window before the newly started process calls exec(), which
terminates all threads. During this window the duplicated execd communication
threads might shut down the connection of a connected client.
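
The difference between the two fork() behaviours can be checked with a small
test program. The following is only a minimal sketch, not gridengine code: the
names worker(), sleep_ms() and worker_survives_fork() are made up for
illustration, and the background thread merely stands in for an execd
communication thread. The child plays the role of the process in the window
between fork() and exec() and watches whether the counter still advances.

```c
#include <poll.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for an execd communication thread: it only
 * bumps a counter so that its activity can be observed. */
static atomic_int ticks;
static atomic_int stop;

/* Sleep for ms milliseconds; poll() with no descriptors is just a timeout. */
static void sleep_ms(int ms)
{
    poll(NULL, 0, ms);
}

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        atomic_fetch_add(&ticks, 1);
        sleep_ms(1);
    }
    return NULL;
}

/* Returns 0 if the worker thread is absent in the child (POSIX fork()),
 * 1 if it keeps running in the child (fork-all behaviour), -1 on error. */
int worker_survives_fork(void)
{
    pthread_t tid;
    pid_t pid;
    int status = 0;
    int result = -1;

    atomic_store(&stop, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    sleep_ms(20);                       /* let the worker get going */

    pid = fork();
    if (pid == 0) {
        /* Child: the window between fork() and exec(). */
        int before = atomic_load(&ticks);
        sleep_ms(50);
        _exit(atomic_load(&ticks) != before ? 1 : 0);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Stop and reap the worker in the parent. */
    atomic_store(&stop, 1);
    pthread_join(tid, NULL);
    return result;
}
```

Called from a main() and built with a C11 compiler and the platform's thread
option (for example -pthread), a return value of 0 means only the thread that
called fork() exists in the child, and 1 means the worker thread was duplicated
into the child as described above.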

This problem occurs especially with large, tightly integrated job submissions,
because qrsh -inherit connects directly to the execution daemon to start
a slave task.

For standard jobs, which are sent via qmaster to the execd, this is not a
problem: the execd simply reconnects to qmaster if the connection is lost, and
the job is resent if the execd does not report it as running.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=92916
