[GE users] Decoding "failed" messages

Iwona Sakrejda isakrejda at lbl.gov
Wed Apr 13 19:57:00 BST 2005


I hit <CR> too soon on my previous e-mail.
Of course, thanks to everybody who helped and got us thinking
in the right direction.

Thanks again,
Iwona


Iwona Sakrejda wrote:

> Hi,
> 
> I owe everybody an explanation, as we got to the bottom of the
> "failed 100: assumedly after job" mystery.
> 
> The user in question set a wall-clock limit for his jobs after
> timing them. He ran several different sets, so the limits were
> different for each set. Those limits were much shorter than
> the queue limits. He timed his jobs too well, so the limit was
> really close to how much time a job needed. Sometimes a job
> would slow down when the NFS-mounted file system he was reading from
> was under heavier load, and it would be killed. The same job, re-run,
> would be just fine if the diskvault was responding faster.
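> 
> (Illustrative sketch only, assuming he used SGE's hard wall-clock
> resource h_rt; the limit value and script name here are hypothetical:
> 
>     qsub -l h_rt=00:30:00 analysis_set_A.sh
> 
> When h_rt is exceeded, sge_execd kills the job with SIGKILL, which is
> consistent with the "died through signal KILL (9)" entry quoted below.)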
> 
> The user did not pay attention to his own limits, I could not see
> from qacct that the job had these extra requirements, and since the
> user had several sets of limits I did not see any systematics
> in the run times...
> 
> So in a way, yes, the user was killing his own job.
> Couldn't there be a message saying the job exceeded self-imposed
> limits, or something similar?
> 
> Is there a way to figure out, after job completion, what requirements
> the user set for his job? qstat shows all that, but not qacct.
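> 
> (One possible workaround, assuming the accounting(5) file format: the
> raw accounting file keeps a per-job "category" field that records hard
> resource requests such as -l h_rt=..., so something like
> 
>     grep ':404796:' $SGE_ROOT/default/common/accounting
> 
> and inspecting the category field might recover the limits after the
> fact. The cell name "default" is an assumption about this setup.)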
> 
> 
> Iwona Sakrejda wrote:
> 
>> Hi,
>>
>> My reply is inserted after your question below to maintain the context.
>> But the short answer is no. There is actually not a single entry
>> on the worker node that ran the job for the whole day, while
>> the master was reporting that "assumedly after job" failure...
>>
>> Iwona
>>
>> Reuti wrote:
>>
>>>>>>> Was it killed by the user directly (outside of SGE)?
>>>>>>>
>>>>>> The user thinks he did not do it, and he has no direct
>>>>>> access to the batch node; however, under some circumstances he
>>>>>> might be able to run two jobs on the same host, so one job
>>>>>> could kill the other in principle. However, he swears that
>>>>>
>>>>>
>>>>>
>>>>> Why is it a problem for your application to run twice on the 
>>>>> same node - can this be adjusted? Did the job exceed any requested 
>>>>> limits? - Reuti
>>>>>
>>>> There is no problem for the application to run twice on the same
>>>> node, and the job did not exceed any limits as far as I can tell.
>>>> I mentioned the two-job scenario because that would be the only way
>>>> a user could kill a process belonging to a job from outside of that
>>>> job. That was a reply to the question of whether the user could
>>>> kill a process directly (and not by killing the job).
>>>>
>>>> Actually, I see this message for different users. I have about 44k
>>>> entries in the messages file, and 1.5k of them are about killing
>>>> "assumedly after job".
>>>
>>>
>>>
>>>
>>> can you have a look in the messages file of the node where the job 
>>> was running, to see whether there is also something stated about the
>>> job abort?
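>>>
>>> Untested sketch for that check, assuming a default local execd spool
>>> layout and the cell name "default" (pc2515 is the host named in the
>>> qmaster entry quoted below):
>>>
>>>     grep 404796 $SGE_ROOT/default/spool/pc2515/messages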
>>>
>> I checked, and there is not a single message for a whole day (and the
>> day before that) on that node (as written above)...
>>
>> Iwona
>>
>>
>>> CU - Reuti
>>>
>>>
>>>>
>>>> Iwona
>>>>
>>>>>
>>>>>> no kill is issued in any of his jobs.
>>>>>>
>>>>>> Iwona
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -Ron
>>>>>>>
>>>>>>>
>>>>>>> --- Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Where can I look up the error codes shown after the "failed"
>>>>>>>> entry from qacct -j?
>>>>>>>>
>>>>>>>> Some of the jobs are failing with a message:
>>>>>>>> "failed       100 : assumedly after job"
>>>>>>>>
>>>>>>>> I was trying to find some more info about them, and the
>>>>>>>> qmaster/messages file has the following entry for this job:
>>>>>>>>
>>>>>>>> Tue Apr  5 12:29:51 2005|qmaster|pdsfcore03|W|job
>>>>>>>> 404796.1 failed on host pc2515.nersc.gov assumedly after job 
>>>>>>>> because: job 404796.1 died
>>>>>>>> through signal KILL (9)
>>>>>>>>
>>>>>>>> That did not explain much either. The job runs just
>>>>>>>> fine if resubmitted.
>>>>>>>> How can I figure out why those jobs are dying?
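>>>>>>>>
>>>>>>>> (A first, non-authoritative check might be to pull the relevant
>>>>>>>> accounting fields for the job, e.g.
>>>>>>>>
>>>>>>>>     qacct -j 404796 | egrep 'failed|exit_status|ru_wallclock'
>>>>>>>>
>>>>>>>> and compare ru_wallclock against the queue's, or the job's own,
>>>>>>>> time limits.)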
>>>>>>>>
>>>>>>>> Any assistance in sorting this out would be appreciated...
>>>>>>>>
>>>>>>>> Iwona Sakrejda
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list