[GE issues] [Issue 3219] New - Parallel jobs failing randomly on solaris machines

juanjo juanjo.gutierrez at jeppesen.com
Thu Jan 7 17:06:49 GMT 2010

                 Issue #|3219
                 Summary|Parallel jobs failing randomly on solaris machines
       Status whiteboard|
              Issue type|DEFECT
             Assigned to|pollinger
             Reported by|juanjo

------- Additional comments from juanjo at sunsource.net Thu Jan  7 09:06:48 -0800 2010 -------
We have recently started to notice failures in parallel jobs when they are scheduled on Solaris machines. This happens on both sparc64
and amd64 machines, but the amd64 machines seem to be affected more often, for reasons we have not identified. We have created a small
test case that reproduces the error in a cluster containing Solaris machines. The output should look something like this:

juanjo at jouf ~/slask $ ./test.sh make
running 50 working jobs
running 50 failing jobs

Test          Run  Failed
Working test   50       1
Failing test   50      33

The "working test" is the same script as the failing test, with one small difference: an empty "echo" that prints a blank line at the
beginning of the code intended for the slave job. This "workaround" is not 100% effective, though, and from time to time we still get
failing jobs. The failures that occur while the workaround is active _might_ be completely unrelated.

Note that the errors are quite random and don't seem to be tied to any specific machine, parallel environment, script interpreter, etc.

We are running 6.2u3, but looking at our history we have discovered that the error is not new to that version, although we can't say
exactly when it started happening. We have also tested it on a new, empty 6.2u4 cluster and the error still happens.

