[GE users] Decoding "failed" messages

Iwona Sakrejda isakrejda at lbl.gov
Wed Apr 13 19:52:31 BST 2005


Hi,

I owe everybody an explanation, as we got to the bottom of the
"failed 100: assumedly after job" mystery.

The user in question set a wall-clock limit for his jobs after
timing them. He ran several different sets, so the limits were
different for each set, and all of them were much shorter than the
queue limits. He timed his jobs too well, so each limit was really
close to how much time the job needed. Sometimes a job would slow
down when the NFS-mounted file system it was reading from was under
heavier load, and it would be killed. The same job, re-run when the
diskvault was responding faster, would finish just fine.
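
To make the failure mode concrete, here is a hypothetical sketch of
how such a job might have been submitted (the script name and the
exact limit are made up for illustration):

   # Job timed at roughly 55 minutes, so the wall-clock request was
   # set only slightly above that:
   qsub -l h_rt=0:56:00 analysis_job.sh

   # If a slow NFS server pushes the run time past h_rt, the
   # execution daemon kills the job with SIGKILL (9), and qacct
   # later shows "failed 100 : assumedly after job".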

The user did not pay attention to his own limits, I could not see
from qacct that the job had these extra requirements, and since the
user had several sets of limits I did not see any systematics in the
run times...

So in a way, yes, the user was killing his own jobs.
Couldn't there be a message saying the job exceeded self-imposed
limits, or something similar?

Is there a way to figure out, after the job has completed, what
requirements were set by the user for his job? qstat shows all of
that, but qacct does not.
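
One place that might preserve this is the raw accounting file: if I
read accounting(5) correctly, each record carries a "category" field
listing the job's hard resource requests (e.g. "-l h_rt=3360").
Assuming the default cell name, something like this should pull out
the full record for a finished job:

   # Rough heuristic: match on ":jobid:" in the colon-separated file.
   grep ":404796:" $SGE_ROOT/default/common/accounting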


Iwona Sakrejda wrote:

> Hi,
> 
> My reply is inserted after your question below to maintain the context.
> But the short answer is no. There is actually not a single entry
> in the messages file on the worker node that ran the job for the
> whole day during which the master was reporting that "assumedly
> after job" failure...
> 
> Iwona
> 
> Reuti wrote:
> 
>>>>>> Was it killed by the user directly (outside of SGE)?
>>>>>>
>>>>> The user thinks he did not do it, and he has no direct
>>>>> access to the batch node; however, under some circumstances he
>>>>> might be able to run two jobs on the same host, so in principle
>>>>> one job could kill the other. However, he swears that
>>>>
>>>>
>>>> Why is it a problem for your application to run twice on the
>>>> same node - can this be adjusted? Did the job exceed any
>>>> requested limits? - Reuti
>>>>
>>> There is no problem for the application to run twice on the same
>>> node, and the job did not exceed any limits as far as I can tell.
>>> I mentioned the two-job scenario because that would be the only
>>> way a user could kill a process belonging to a job from outside
>>> of that job. That was a reply to the question of whether the user
>>> could kill a process directly (and not by killing the job).
>>>
>>> Actually, I see this message for different users. I have about
>>> 44k entries in the messages file, and 1.5k of them are about
>>> killing "assumedly after job".
>>
>>
>>
>> Can you have a look at the messages file on the node where the
>> job was running, to see whether there is also something stated
>> about the job abort?
>>
> I checked, and there is not a single message for the whole day (or
> the day before that) on that node (as written above)...
> 
> Iwona
> 
> 
>> CU - Reuti
>>
>>
>>>
>>> Iwona
>>>
>>>>
>>>>> no kill is issued in any of his jobs.
>>>>>
>>>>> Iwona
>>>>>
>>>>>
>>>>>
>>>>>> -Ron
>>>>>>
>>>>>>
>>>>>> --- Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>>>>
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Where can I look up the error codes shown after the
>>>>>>> "failed" entry from qacct -j?
>>>>>>>
>>>>>>> Some of the jobs are failing with a message:
>>>>>>> "failed       100 : assumedly after job"
>>>>>>>
>>>>>>> I was trying to find some more info about them, and the
>>>>>>> qmaster/messages file has the following entry for this job:
>>>>>>>
>>>>>>> Tue Apr  5 12:29:51 2005|qmaster|pdsfcore03|W|job
>>>>>>> 404796.1 failed on host pc2515.nersc.gov assumedly after job 
>>>>>>> because: job 404796.1 died
>>>>>>> through signal KILL (9)
>>>>>>>
>>>>>>> That did not explain much either. The job runs just
>>>>>>> fine if resubmitted.
>>>>>>> How can I figure out why those jobs are dying?
>>>>>>>
>>>>>>> Any assistance in sorting this out would be
>>>>>>> appreciated.....
>>>>>>>
>>>>>>> Iwona Sakrejda
>>>>>>>
>>>>>>>
>>>>>>>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



