[GE users] Decoding "failed" messages

Reuti reuti at staff.uni-marburg.de
Tue Apr 12 16:45:49 BST 2005


Hi again,

Iwona Sakrejda wrote:
> 
> 
> Reuti wrote:
> 
>> Quoting Iwona Sakrejda <isakrejda at lbl.gov>:
>>
>>
>>>
>>> Ron Chen wrote:
>>>
>>>
>>>> Was it killed by the user directly (outside of SGE)?
>>>>
>>>
>>> The user thinks he did not do it and he has no direct
>>> access to the batch node, however under some circumstances he
>>> might be able to run two jobs on the same host, so one job
>>> could kill the other in principle. However he swears that
>>
>>
>>
>> Why is it a problem for your application to run two times on the same 
>> node - can this be adjusted? Did the job exceed any requested limits? 
>> - Reuti
>>
> There is no problem for the application to run twice on the same node,
> and the job did not exceed any limits as far as I can tell. I mentioned
> the two-job scenario because that would be the only way a user could
> kill a process belonging to a job from outside of that job. That was a
> reply to the question whether the user could kill a process directly
> (and not by killing the job).
> 
> Actually I see this message for different users. I have about 44k entries
> in the messages file and 1.5k of them are about killing "assumedly after
> job".

Can you have a look in the messages file of the node where the job was
running, to see whether it also says something about the job abort?
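
With about 1.5k such entries, it may be easier to tally them by host and
signal first and then check the messages files of the nodes that show up
most often. Below is a minimal Python sketch of that idea. The pattern is
modelled only on the single qmaster entry quoted further down in this
thread, so other entries may be worded differently and will simply be
skipped. The function names are made up for illustration, and the
locations of the messages files depend on your installation, so they are
left to the caller.

```python
import re
from collections import Counter

# Assumption: pattern derived from the one qmaster entry quoted in this
# thread ("job <id> failed on host <host> <phase> because: ... died
# through signal <NAME> (<number>)"). Other wordings are not matched.
FAILED_RE = re.compile(
    r"job (?P<job>\d+\.\d+) failed on host (?P<host>\S+) "
    r"(?P<phase>.+?) because: .*?"
    r"died through signal (?P<signal>\w+) \((?P<signo>\d+)\)"
)


def parse_failures(lines):
    """Yield one dict per 'failed on host' entry found in lines."""
    for line in lines:
        # Entries look like: time|daemon|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not in the expected five-field layout
        match = FAILED_RE.search(fields[4])
        if match:
            yield {"time": fields[0], **match.groupdict()}


def tally(failures):
    """Count failures per (host, signal name) pair."""
    return Counter((f["host"], f["signal"]) for f in failures)


def lines_for_job(lines, job_id):
    """Return the lines that mention job_id, e.g. from a node's messages file."""
    needle = re.compile(r"\b" + re.escape(job_id) + r"\b")
    return [line for line in lines if needle.search(line)]
```

You would feed parse_failures() the open qmaster messages file, pass the
result to tally(), and then run lines_for_job() over the messages file of
a suspicious node with one of the affected job ids.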

CU - Reuti


> 
> Iwona
> 
>>
>>> in none of his jobs any kill is issued.
>>>
>>> Iwona
>>>
>>>
>>>
>>>> -Ron
>>>>
>>>>
>>>> --- Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>> Where can I look up error codes shown after the
>>>>> "failed" entry from qacct -j?
>>>>>
>>>>> Some of the jobs are failing with a message:
>>>>> "failed       100 : assumedly after job"
>>>>>
>>>>> I was trying to find some more info about them and
>>>>> the qmaster/messages file
>>>>> has the following entry for this job:
>>>>>
>>>>> Tue Apr  5 12:29:51 2005|qmaster|pdsfcore03|W|job 404796.1 failed on host pc2515.nersc.gov assumedly after job because: job 404796.1 died through signal KILL (9)
>>>>>
>>>>> That did not explain much either. The job runs just
>>>>> fine if resubmitted.
>>>>> How can I figure out why those jobs are dying?
>>>>>
>>>>> Any assistance in sorting this out would be
>>>>> appreciated.....
>>>>>
>>>>> Iwona Sakrejda
>>>>>
>>>>>
>>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>>
>>>>
>>>>> To unsubscribe, e-mail:
>>>>> users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail:
>>>>> users-help at gridengine.sunsource.net
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
> 
> 





