Opened 12 years ago

Last modified 11 years ago

#766 new defect

IZ3219: Parallel jobs failing randomly on solaris machines

Reported by: juanjo
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u3
Severity:
Keywords: Solaris execution


[Imported from gridengine issuezilla]

        Issue #:           3219
        Platform:          All
        Reporter:          juanjo (juanjo)
        Component:         gridengine
        OS:                Solaris
        Subcomponent:      execution
        Version:           6.2u3
        CC:                uddeborg
        Status:            NEW
        Priority:          P3
        Resolution:
        Issue type:        DEFECT
        Target milestone:  ---
        Assigned to:       pollinger (pollinger)
        QA Contact:        pollinger
        * Summary:         Parallel jobs failing randomly on solaris machines
        Status whiteboard:
        Attachments:       Thu Jan 7 10:08:00 -0700 2010: testcase.tbz2 (application/octet-stream), "test case", submitted by juanjo

     Issue 3219 blocks:
   Votes for issue 3219:

   Opened: Thu Jan 7 10:06:00 -0700 2010 

We have recently started to notice failures in parallel jobs when they get scheduled on Solaris machines. This happens on both sparc64 and
amd64 machines, but for some reason it seems to affect the amd64 machines more. We have created a small test case that reproduces the
error in a cluster containing Solaris machines. The output should look something like this:

juanjo@jouf ~/slask $ ./make
running 50 working jobs
running 50 failing jobs

Test          Run      Failed
Working test  50        1
Failing test  50        33
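For reference, the counting logic of such a harness might look like the sketch below. This is an assumption, not the attached test case: job submission is stubbed out with a local function so the sketch runs without a cluster, whereas the real harness would submit each run as a parallel job through qsub/qrsh.

```shell
#!/bin/sh
# Hypothetical sketch of a harness that runs N jobs and counts failures.
# run_job is a stub standing in for "submit one parallel job and wait
# for it"; it returns nonzero when the job fails (here: every odd
# iteration, purely to exercise the counter).
run_job() {
    return $(( $1 % 2 ))
}

runs=0
failed=0
for i in 1 2 3 4 5 6 7 8 9 10; do
    runs=$((runs + 1))
    run_job "$i" || failed=$((failed + 1))
done
echo "Run: $runs  Failed: $failed"
```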

The "working test" is exactly the same script as the failing test, with the small difference of an empty "echo" that prints a blank line at
the beginning of the code intended for the slave job. This "workaround" is not 100% effective, though, and from time to time we still get
failing jobs. The failures that occur while the workaround is active _might_ be completely unrelated.
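As a sketch of that workaround (the actual scripts are in the attached testcase.tbz2 and are not reproduced here, so the contents below are assumptions), the slave-side code would differ from the failing variant only by the leading empty echo:

```shell
#!/bin/sh
# Hypothetical slave-side script, as started on each node via
# "qrsh -inherit". The leading bare "echo" is the workaround described
# above; the "failing" variant is identical minus that line.
slave_job() {
    echo                 # workaround: print a blank line before any real work
    hostname             # placeholder for the real slave workload
}

out=$(slave_job)
```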

Note that the errors are quite random and don't seem to have any connection with a specific machine, parallel environment, script
interpreter, etc.

We are running 6.2u3, but looking through our history we have discovered that the error is not unique to that version, although we can't say
exactly when it started happening. We have also tested on a fresh, empty 6.2u4 cluster, and the error still occurs.

   ------- Additional comments from juanjo Thu Jan 7 10:08:37 -0700 2010 -------
Created an attachment (id=196)
test case

   ------- Additional comments from juanjo Thu Jan 7 10:10:57 -0700 2010 -------
btw, what we get in the helper job's trace file is this:

12/02/2009 12:46:09 [517:2906]: now running with uid=517, euid=517
12/02/2009 12:46:09 [517:2906]: args[0] = "/usr/local/share/sge6.2/utilbin/sol-amd64/qrsh_starter"
12/02/2009 12:46:09 [517:2906]: args[1] = "/usr/local/share/sge6.2/default/spool/sunvalley/active_jobs/59570.1/1.sunvalley"
12/02/2009 12:46:09 [517:2906]: execvp(/usr/local/share/sge6.2/utilbin/sol-amd64/qrsh_starter, ...);
12/02/2009 12:46:40 [211:2891]: commlib_to_pty: was connected and still have selectors, but lost connection -> exiting
12/02/2009 12:46:40 [0:2891]: found pid of qrsh client command: -2928
12/02/2009 12:46:40 [211:2891]: now sending signal KILL to pid -2928
12/02/2009 12:46:40 [211:2891]: pty_to_commlib: closing pipe to child
12/02/2009 12:46:40 [211:2891]: wait3 returned 2906 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)

   ------- Additional comments from uddeborg Thu Jan 7 10:17:29 -0700 2010 -------

Attachments (1)

196 (940 bytes) - added by dlove 11 years ago.


Change History (1)

Changed 11 years ago by dlove
