[GE users] Decoding "failed" messages

Iwona Sakrejda isakrejda at lbl.gov
Tue Apr 12 19:44:34 BST 2005


Hi,

My reply is inserted after your question below to keep the context,
but the short answer is no. The messages file on the worker node
that ran the job has not a single entry for the whole day on which
the master reported the "assumedly after job" failure.

Iwona

Reuti wrote:

>>>>> Was it killed by the user directly (outside of SGE)?
>>>>>
>>>> The user thinks he did not do it, and he has no direct
>>>> access to the batch node. Under some circumstances, however, he
>>>> might be able to run two jobs on the same host, so one job
>>>> could in principle kill the other. However he swears that
>>>
>>> Why is it a problem for your application to run two times on the same 
>>> node - can this be adjusted? Did the job exceed any requested limits? 
>>> - Reuti
>>>
>> There is no problem for the application to run twice on the same node,
>> and the job did not exceed any limits as far as I can tell. I mentioned
>> the two-job scenario because that would be the only way a user could
>> kill a process belonging to a job from outside of that job. That was a
>> reply to the question whether the user could kill a process directly
>> (and not by killing the job).
>>
>> Actually I see this message for different users. I have about 44k
>> entries in the messages file, and 1.5k of them are about killing
>> "assumedly after job".
> 
> 
> Can you have a look in the messages file of the node where the job was
> running, to see whether it also says something about the job abort?
> 
I checked, and there is not a single message on that node for that whole
day (or for the day before), as noted above.
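
For anyone who wants to tally these entries per host, here is a minimal
sketch in Python. It assumes the pipe-separated layout of the one
qmaster/messages entry quoted further down (timestamp, daemon, master
host, level, text); the wording of the text field may differ between
Grid Engine versions, and the function names and the file path argument
are made up for illustration.

```python
import re
import sys
from collections import Counter

# Pattern modeled on the single qmaster/messages entry quoted in this
# thread; other versions may word the message differently (assumption).
FAILED_RE = re.compile(
    r"job (?P<job>\d+\.\d+) failed on host (?P<host>\S+) assumedly after job"
    r"(?: because: .*? died through signal (?P<signal>\w+) \((?P<signo>\d+)\))?"
)


def parse_line(line):
    """Return a dict for an 'assumedly after job' entry, else None."""
    # Assumed layout: timestamp|daemon|master host|level|free text
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, _daemon, _master, _level, text = parts
    match = FAILED_RE.search(text)
    if match is None:
        return None
    return {
        "time": timestamp,
        "job": match.group("job"),
        "host": match.group("host"),
        "signal": match.group("signal"),  # None if no signal was reported
    }


def tally_by_host(lines):
    """Count 'assumedly after job' failures per execution host."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["host"]] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python tally_failed.py /path/to/qmaster/messages
    with open(sys.argv[1], errors="replace") as handle:
        for host, count in tally_by_host(handle).most_common():
            print(f"{count:6d}  {host}")
```

If the failures pile up on a few hosts, that would point at those nodes;
an even spread across hosts and users would rather point at something
central. The tally only says where the failures happened, not why.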

Iwona


> CU - Reuti
> 
> 
>>
>> Iwona
>>
>>>
>>>> no kill is issued in any of his jobs.
>>>>
>>>> Iwona
>>>>
>>>>
>>>>
>>>>> -Ron
>>>>>
>>>>>
>>>>> --- Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Where can I look up the error codes shown after the
>>>>>> "failed" entry from qacct -j?
>>>>>>
>>>>>> Some of the jobs are failing with a message:
>>>>>> "failed       100 : assumedly after job"
>>>>>>
>>>>>> I was trying to find some more info about them, and the
>>>>>> qmaster/messages file has the following entry for this job:
>>>>>>
>>>>>> Tue Apr  5 12:29:51 2005|qmaster|pdsfcore03|W|job 404796.1 failed on host pc2515.nersc.gov assumedly after job because: job 404796.1 died through signal KILL (9)
>>>>>>
>>>>>> That did not explain much either. The job runs just
>>>>>> fine if resubmitted.
>>>>>> How can I figure out why those jobs are dying?
>>>>>>
>>>>>> Any assistance in sorting this out would be
>>>>>> appreciated.
>>>>>>
>>>>>> Iwona Sakrejda
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>>
>>>>>
>>>>>> To unsubscribe, e-mail:
>>>>>> users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail:
>>>>>> users-help at gridengine.sunsource.net


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



