[GE users] Jobs getting killed randomly on the Grid.

Sreenath Nampally sreenath at tigr.ORG
Tue May 1 15:32:03 BST 2007


All,

We have been seeing odd behavior on the Grid: jobs are being killed by
a KILL signal, randomly and intermittently.

The entries in the qmaster messages file look like this:

05/01/2007 08:24:28|qmaster|apollo|W|job 4953721.1 failed on host
dell-2-0-9.tigr.org assumedly after job because: job 4953721.1 died
through signal KILL (9)
05/01/2007 08:24:28|qmaster|apollo|W|job 4953722.1 failed on host
dell-2-0-9.tigr.org assumedly after job because: job 4953722.1 died
through signal KILL (9)
05/01/2007 08:24:28|qmaster|apollo|W|job 4953723.1 failed on host
dell-2-0-9.tigr.org assumedly after job because: job 4953723.1 died
through signal KILL (9)
05/01/2007 08:24:28|qmaster|apollo|W|job 4953724.1 failed on host
dell-2-0-9.tigr.org assumedly after job because: job 4953724.1 died
through signal KILL (9)
05/01/2007 08:24:28|qmaster|apollo|W|job 4953725.1 failed on host
dell-2-0-9.tigr.org assumedly after job because: job 4953725.1 died
through signal KILL (9)
05/01/2007 08:24:28|qmaster|apollo|W|job 4953726.1 failed on host
dell-2-0-9.tigr.org assumedly after job because: job 4953726.1 died
through signal KILL (9)

These jobs were not qdel'ed. This happens on different exec nodes, but
whenever it happens, a bunch of jobs on the same node get killed at the
same time. Other jobs do finish successfully around the same time.
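
One way to check that clustering is to group the KILL entries by
timestamp and host. Below is a minimal Python sketch, not a tested tool.
It assumes each entry sits on a single line in the real messages file
(the wrapping above presumably comes from the mail client) and that the
fields follow the layout of the entries quoted above. The names KILL_RE
and group_kills are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modelled on the qmaster entries quoted above (an assumption,
# not an official description of the messages file format).
KILL_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|qmaster\|[^|]*\|W\|"
    r"job (?P<job>\d+\.\d+) failed on host (?P<host>\S+) .*"
    r"died through signal (?P<sig>\w+) \((?P<num>\d+)\)"
)


def group_kills(lines):
    """Return {(timestamp, host): [job ids]} for 'died through signal' entries."""
    groups = defaultdict(list)
    for line in lines:
        match = KILL_RE.match(line.strip())
        if match:
            key = (match.group("ts"), match.group("host"))
            groups[key].append(match.group("job"))
    return dict(groups)


def report(lines):
    """Print one summary line per (timestamp, host) group."""
    for (ts, host), jobs in sorted(group_kills(lines).items()):
        print(f"{ts} {host}: {len(jobs)} job(s) killed: {', '.join(jobs)}")
```

Calling report() on the lines of the messages file (for example via
open(path) with whatever path your installation uses) prints one line
per burst, which makes it easier to compare the timestamps against
other activity on that exec node.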

We are having trouble tracking down the problem because no other error
messages are written anywhere, even though we capture stderr at every
possible step.

We recently upgraded to N1GE 6.0u10. Could this be an issue related to
the u10 patch?


Any help or pointers would be appreciated. Let me know if you need more
info.

Thanks
Sree





