[GE users] Job deletion problems

John Hearns john.hearns at streamline-computing.com
Wed Jan 16 16:47:40 GMT 2008


On a cluster yesterday I saw two instances of a job being stopped and
showing some strange behaviour. I'm asking if anyone has seen this
before.

The jobs are mpich parallel jobs, running over Myrinet-MX in a loose
integration; however, I think the Myrinet-specific part is irrelevant.

The jobs are being stopped when they reach an H_RT of 72:00:00.

(One curious aside: the queue had been configured with S_RT at exactly
the same time, i.e.
s_rt  72:00:00
h_rt  72:00:00

this should not make any difference. Am I right?)
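
To make the aside concrete, here is a minimal sketch of working out the
gap between the two limits, assuming they are given in the hh:mm:ss form
shown above. The function names are made up for illustration and are not
part of Grid Engine.

```python
def limit_to_seconds(limit):
    """Convert an hh:mm:ss limit string (as in the queue config above) to seconds."""
    hours, minutes, seconds = (int(part) for part in limit.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def warning_window(s_rt, h_rt):
    """Seconds between the soft and hard wallclock limits (0 if they coincide)."""
    return limit_to_seconds(h_rt) - limit_to_seconds(s_rt)


if __name__ == "__main__":
    # Values taken from the queue configuration quoted above.
    print(warning_window("72:00:00", "72:00:00"))
```

With both limits at 72:00:00 the window comes out as zero, so the soft
and hard limits are reached at the same moment.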

In the qmaster logs you get:

execd at comp66 reports running job (1381.1/master) in queue
"parallel.q at comp66" that was not supposed to be there - killing

The messages log on comp66 repeats this message endlessly:

comp66|W|job 1381.1 exceeded hard wallclock time - initiate terminal
method

