[GE issues] [Issue 3219] Parallel jobs failing randomly on solaris machines

juanjo juanjo.gutierrez at jeppesen.com
Thu Jan 7 17:10:58 GMT 2010


http://gridengine.sunsource.net/issues/show_bug.cgi?id=3219

------- Additional comments from juanjo at sunsource.net Thu Jan  7 09:10:57 -0800 2010 -------
By the way, this is what we get in the helper job's trace file:

12/02/2009 12:46:09 [517:2906]: now running with uid=517, euid=517
12/02/2009 12:46:09 [517:2906]: args[0] = "/usr/local/share/sge6.2/utilbin/sol-amd64/qrsh_starter"
12/02/2009 12:46:09 [517:2906]: args[1] = "/usr/local/share/sge6.2/default/spool/sunvalley/active_jobs/59570.1/1.sunvalley"
12/02/2009 12:46:09 [517:2906]: execvp(/usr/local/share/sge6.2/utilbin/sol-amd64/qrsh_starter, ...);
12/02/2009 12:46:40 [211:2891]: commlib_to_pty: was connected and still have selectors, but lost connection -> exiting
12/02/2009 12:46:40 [0:2891]: found pid of qrsh client command: -2928
12/02/2009 12:46:40 [211:2891]: now sending signal KILL to pid -2928
12/02/2009 12:46:40 [211:2891]: pty_to_commlib: closing pipe to child
12/02/2009 12:46:40 [211:2891]: wait3 returned 2906 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
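
For anyone lining up the timestamps in a trace like this one, here is a minimal Python sketch that pulls out the gap between the execvp of qrsh_starter and the lost connection. It is not part of Grid Engine. The names parse_trace and seconds_between are made up for illustration, and it assumes the timestamps are MM/DD/YYYY HH:MM:SS and that the bracketed pair is uid:pid.

```python
import re
from datetime import datetime

# Assumed line layout: "MM/DD/YYYY HH:MM:SS [uid:pid]: message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)


def parse_trace(text):
    """Return (timestamp, uid, pid, message) tuples for every matching line."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
            entries.append((ts, int(m.group(2)), int(m.group(3)), m.group(4)))
    return entries


def seconds_between(entries, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the
    first line containing second_marker."""
    start = next(ts for ts, _, _, msg in entries if first_marker in msg)
    end = next(ts for ts, _, _, msg in entries if second_marker in msg)
    return (end - start).total_seconds()


# Two lines copied from the trace above.
SAMPLE = """\
12/02/2009 12:46:09 [517:2906]: execvp(/usr/local/share/sge6.2/utilbin/sol-amd64/qrsh_starter, ...);
12/02/2009 12:46:40 [211:2891]: commlib_to_pty: was connected and still have selectors, but lost connection -> exiting
"""

entries = parse_trace(SAMPLE)
gap = seconds_between(entries, "execvp(", "lost connection")
print(gap)
```

On the two lines shown it prints 31.0, i.e. the connection is reported lost 31 seconds after the exec.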

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=36&dsMessageId=237122
