[GE users] Job deletion problems

Ravi Chandra Nallan Ravichandra.Nallan at Sun.COM
Thu Jan 17 06:03:54 GMT 2008


I'm not sure I understood your problem.
But if you mean that the jobs are killed when their run time reaches
72:00:00, and the queue is configured with a limit of h_rt=72:00:00, then
what you are seeing is the expected behaviour.

From queue_conf(5):

RESOURCE LIMITS
     The first two resource limit parameters, s_rt and h_rt,  are
     implemented  by  Grid Engine. They define the "real time" or
     also called "elapsed" or "wall clock" time which has  passed
     since  the  start  of  the job. If h_rt is exceeded by a job
     running in the queue, it is aborted via the  SIGKILL  signal
     (see  kill(1)).   If  s_rt  is  exceeded,  the  job is first
     "warned" via the SIGUSR1 signal (which can be caught by  the
     job) and finally aborted after the notification time defined
     in the queue configuration parameter notify (see above)  has
     passed.
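
To illustrate the s_rt part of that passage, here is a minimal sketch of a job that catches the SIGUSR1 warning and saves its state before the hard kill arrives. It assumes the job's main process is this Python script and that the signal actually reaches it (a wrapping shell script that does not trap SIGUSR1 may behave differently). The names run, do_step and checkpoint are hypothetical, not part of Grid Engine.

```python
import signal

# Set by the handler once the soft-limit warning (SIGUSR1) has been received.
warned = False


def on_soft_limit(signum, frame):
    """Only record the warning; the main loop decides how to react."""
    global warned
    warned = True


def run(steps, do_step, checkpoint):
    """Run up to `steps` work units, stopping early after a SIGUSR1.

    do_step(i) performs one unit of work and checkpoint(n) saves state
    after n completed units. Both are hypothetical callables supplied by
    the job. Returns the number of units completed.
    """
    signal.signal(signal.SIGUSR1, on_soft_limit)
    done = 0
    for i in range(steps):
        if warned:
            # Soft limit reached: save state and stop before SIGKILL arrives.
            checkpoint(done)
            break
        do_step(i)
        done += 1
    return done
```

Note that this only helps if s_rt is set lower than h_rt. SIGKILL cannot be caught, so with both limits at 72:00:00 the job gets no usable time between the warning and the kill.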

Sorry if I got you wrong.
regards,
~Ravi

John Hearns wrote:
> On a cluster yesterday I saw two instances of a job being stopped and
> showing some strange behaviour. I'm asking if anyone has seen this
> before.
>
> The jobs are mpich parallel jobs, running over Myrinet-MX in a loose
> integration, however the Myrinet specific part is I think irrelevant.
>
> The jobs are being stopped when they reach an h_rt of 72:00:00.
>
> (One curious aside: the queue had been configured with s_rt at exactly
> the same time, i.e.
> s_rt  72:00:00
> h_rt  72:00:00
>
> this should not make any difference. Am I right?)
>
> On the qmaster logs you get
>
> execd@comp66 reports running job (1381.1/master) in queue
> "parallel.q@comp66" that was not supposed to be there - killing
>
> The messages log on comp66 repeats this message endlessly:
>
> comp66|W|job 1381.1 exceeded hard wallclock time - initiate terminal
> method
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>   
