[GE users] Jobs getting killed randomly on the Grid.

Reuti reuti at staff.uni-marburg.de
Tue May 1 17:41:46 BST 2007


Hi,

On 01.05.2007 at 16:32, Sreenath Nampally wrote:

> We have been seeing weird behavior on the Grid where jobs are getting
> killed by the KILL signal randomly and intermittently.
>
> The entries in the messages file look like this:
>
> 05/01/2007 08:24:28|qmaster|apollo|W|job 4953721.1 failed on host
> dell-2-0-9.tigr.org assumedly after job because: job 4953721.1 died
> through signal KILL (9)
> 05/01/2007 08:24:28|qmaster|apollo|W|job 4953722.1 failed on host
> dell-2-0-9.tigr.org assumedly after job because: job 4953722.1 died
> through signal KILL (9)
> 05/01/2007 08:24:28|qmaster|apollo|W|job 4953723.1 failed on host
> dell-2-0-9.tigr.org assumedly after job because: job 4953723.1 died
> through signal KILL (9)
> 05/01/2007 08:24:28|qmaster|apollo|W|job 4953724.1 failed on host
> dell-2-0-9.tigr.org assumedly after job because: job 4953724.1 died
> through signal KILL (9)
> 05/01/2007 08:24:28|qmaster|apollo|W|job 4953725.1 failed on host
> dell-2-0-9.tigr.org assumedly after job because: job 4953725.1 died
> through signal KILL (9)
> 05/01/2007 08:24:28|qmaster|apollo|W|job 4953726.1 failed on host
> dell-2-0-9.tigr.org assumedly after job because: job 4953726.1 died
> through signal KILL (9)
>
> These were not 'qdel'ed. This seems to happen on different exec nodes,
> but whenever it happens, a bunch of jobs get killed at the same time.
> There are other jobs that do finish successfully around the same time.

- Can the users log in there and kill the jobs on their own?
- Anything in the "messages" file of each node (e.g. because some
  limit was exceeded)?
- Any process cleaner running there?
- Anything in /var/log/messages?
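
To see whether the kills really cluster per host and timestamp, something like the following Python sketch could group the qmaster entries. It assumes one entry per line (unwrapped) in exactly the wording quoted above; the names KILL_RE and group_kills are made up for illustration and are not part of Grid Engine.

```python
import re
from collections import defaultdict

# Pattern for qmaster "messages" entries in the shape quoted in this thread.
# Assumption: one entry per line, worded as in the quoted log; other
# Grid Engine versions may phrase the message differently.
KILL_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|qmaster\|[^|]*\|W\|"
    r"job (?P<job>\d+\.\d+) failed on host (?P<host>\S+) .*"
    r"died through signal (?P<sig>\w+) \((?P<num>\d+)\)"
)


def group_kills(lines):
    """Return {(timestamp, host): [job ids]} for jobs that died through a signal."""
    groups = defaultdict(list)
    for line in lines:
        m = KILL_RE.match(line.strip())
        if m:
            groups[(m.group("ts"), m.group("host"))].append(m.group("job"))
    return dict(groups)


# Two entries copied from the quoted log, joined back onto single lines.
sample = [
    "05/01/2007 08:24:28|qmaster|apollo|W|job 4953721.1 failed on host "
    "dell-2-0-9.tigr.org assumedly after job because: job 4953721.1 died "
    "through signal KILL (9)",
    "05/01/2007 08:24:28|qmaster|apollo|W|job 4953722.1 failed on host "
    "dell-2-0-9.tigr.org assumedly after job because: job 4953722.1 died "
    "through signal KILL (9)",
]

for (ts, host), jobs in group_kills(sample).items():
    print(ts, host, len(jobs), "job(s) killed:", ", ".join(jobs))
```

If many jobs share one host and one second, that would point at something acting on the whole node (a limit, a cleaner script, the kernel) rather than at the individual jobs.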

-- Reuti

> We are having trouble tracking down the problem as there were no other
> error messages written anywhere. We are catching stderr at every
> possible step.
>
> We recently upgraded to N1GE 6.0u10. Could this be an issue related to
> the u10 patch?
>
>
> Any help / pointers will be appreciated. Let me know if you need more
> info.
>
> Thanks
> Sree
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
