[GE users] Decoding "failed" messages

Reuti reuti at staff.uni-marburg.de
Wed Apr 13 20:18:04 BST 2005



Hi,

you can set, via "qconf -mconf":

loglevel                     log_info

and you will then get:

"job 2707.1 exceeded hard wallclock time - initiate terminate method"

in the messages file of the node.
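A minimal sketch of the two steps above (assumes a live SGE installation, the default cell name "default", and a typical execution-host spool path — adjust to your site's layout):

```shell
# Step 1: raise the logging level in the global configuration.
# "qconf -mconf" opens the configuration in an editor; change the line
#     loglevel    log_info
qconf -mconf

# Step 2: after the next wallclock kill, look for the termination
# notice in the messages file of the execution host that ran the job.
# The spool path below is only the common default; yours may differ.
grep "exceeded hard wallclock" $SGE_ROOT/default/spool/<exec_host>/messages
```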

Cheers - Reuti


Quoting Iwona Sakrejda <isakrejda at lbl.gov>:

> Hi,
> 
> I owe everybody an explanation, as we got to the bottom of the
> "failed 100: assumedly after job" mystery.
> 
> The user in question set a wallclock limit for his jobs after
> timing them. He ran several different sets, so the limits were
> different for each set. Those limits were much shorter than
> the queue limits. He timed his jobs so tightly that the limit was
> really close to how much time a job needed. Sometimes a job
> would slow down when the NFS-mounted file system he was reading from
> was under heavier load, and it would be killed. The same job, re-run,
> would be just fine if the diskvault was responding faster.
> 
> The user did not pay attention to his own limits, I could not see
> from qacct that the job had these extra requirements, and since the
> user had several sets of limits I did not see any systematic pattern
> in the run times...
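> For reference, a self-imposed wallclock limit of this kind is requested
> at submission time; a minimal sketch (the time and script name are
> made up):
>
> ```shell
> # Request a hard wallclock limit of 2 hours. If the job is still
> # running at that point, SGE terminates it (ultimately via SIGKILL),
> # which qacct later reports as "failed 100 : assumedly after job".
> qsub -l h_rt=02:00:00 myjob.sh
>
> # Leaving some headroom above the measured runtime (say 20-30%)
> # avoids kills caused by transient NFS slowdowns like those above.
> ```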
> 
> So in a way, yes, the user was killing his own job.
> Couldn't there be a message saying the job exceeded self-imposed limits,
> or something similar?
> 
> Is there a way to figure out, after the job completes, what requirements
> the user set for his job? qstat shows all of that while the job is
> running, but qacct does not.
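> One place the submit-time requests may survive is the raw accounting
> file; depending on the SGE version, a "category" field near the end of
> each record stores the hard resource requests (e.g. "-l h_rt=7200").
> A sketch, assuming the default cell layout:
>
> ```shell
> # The accounting file holds one colon-separated record per finished
> # job; field 6 is the job number. Print the full record for job 404796
> # and inspect it for the requested limits.
> awk -F: '$6 == 404796' $SGE_ROOT/default/common/accounting
> ```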
> 
> 
> Iwona Sakrejda wrote:
> 
> > Hi,
> > 
> > My reply is inserted after your question below to maintain the context.
> > But the short answer is no. There is actually not a single entry
> > on the worker node that ran the job for the whole day while
> > the master was reporting that "assumedly after job" failure...
> > 
> > Iwona
> > 
> > Reuti wrote:
> > 
> >>>>>> Was it killed by the user directly (outside of SGE)?
> >>>>>>
> >>>>> The user thinks he did not do it and he has no direct
> >>>>> access to the batch node, however under some circumstances he
> >>>>> might be able to run two jobs on the same host, so one job
> >>>>> could kill the other in principle. However he swears that
> >>>>
> >>>>
> >>>> Why is it a problem for your application to run twice on the
> >>>> same node - can this be adjusted? Did the job exceed any requested 
> >>>> limits? - Reuti
> >>>>
> >>> There is no problem with the application running twice on the same
> >>> node, and the job did not exceed any limits as far as I can tell. I
> >>> mentioned the two-job scenario because that would be the only way a
> >>> user could kill a process belonging to a job from outside of that
> >>> job. That was a reply to the question of whether the user could kill
> >>> a process directly (and not by killing the job).
> >>>
> >>> Actually, I see this message for different users. I have about 44k
> >>> entries in the messages file, and 1.5k of them are about killing
> >>> "assumedly after job".
> >>
> >>
> >>
> >> could you have a look at the messages file of the node where the job
> >> was running, to see whether anything is stated there about the job abort?
> >>
> > I checked, and there is not a single message for the whole day (or the
> > day before) on that node (as written above)...
> > 
> > Iwona
> > 
> > 
> >> CU - Reuti
> >>
> >>
> >>>
> >>> Iwona
> >>>
> >>>>
> >>>>> in none of his jobs any kill is issued.
> >>>>>
> >>>>> Iwona
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -Ron
> >>>>>>
> >>>>>>
> >>>>>> --- Iwona Sakrejda <isakrejda at lbl.gov> wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Where can I look up the error codes shown after the
> >>>>>>> "failed" entry from qacct -j?
> >>>>>>> Some of the jobs are failing with a message:
> >>>>>>> "failed       100 : assumedly after job"
> >>>>>>>
> >>>>>>> I was trying to find some more info about them and
> >>>>>>> the qmaster/messages file
> >>>>>>> has the following entry for this job:
> >>>>>>>
> >>>>>>> Tue Apr  5 12:29:51 2005|qmaster|pdsfcore03|W|job
> >>>>>>> 404796.1 failed on host pc2515.nersc.gov assumedly after job 
> >>>>>>> because: job 404796.1 died
> >>>>>>> through signal KILL (9)
> >>>>>>>
> >>>>>>> That did not explain much either. The job runs just
> >>>>>>> fine if resubmitted.
> >>>>>>> How can I figure out why those jobs are dying?
> >>>>>>>
> >>>>>>> Any assistance in sorting this out would be
> >>>>>>> appreciated.....
> >>>>>>>
> >>>>>>> Iwona Sakrejda



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list