[GE users] FW: my sge system is not working with the fault tolerance

reuti reuti at staff.uni-marburg.de
Tue Mar 3 19:40:34 GMT 2009

Hi Tamara,

On 03.03.2009 at 12:32, tamara wrote:

>> you could either define it in $SGE_ROOT/default/common/sge_request or
>> in the queue definition in the entry "rerun                 TRUE"
> the only place i could find rerun is in the complex configuration
> in qmon (is it the same queue configuration that you mentioned?)

Yes, it's in the pane "General Configuration", bottom line "Rerun Jobs". In the
text-based edit mode "qconf -mq ..." it's literally the notation I

> and i make it requestable

No, you can revert this. It's not a resource request.

> also i open sge_request but i can't find rerun there

The file is usually empty apart from comments, and you have to add a line:

-r y
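
As a rough sketch of adding that line from the shell: the helper name
add_rerun_default is made up for illustration and is not an SGE command; it
only appends "-r y" when no such uncommented line is present yet, so running
it twice is harmless.

```shell
#!/bin/sh
# Sketch: add "-r y" to a default request file, at most once.
# add_rerun_default is a hypothetical helper name, not part of SGE.
add_rerun_default() {
    request_file="$1"
    # Create the file if it is missing; normally it exists and holds only comments.
    touch "$request_file" || return 1
    # Append the option only if no uncommented "-r y" line exists yet.
    if ! grep -Eq '^[[:space:]]*-r[[:space:]]+y' "$request_file"; then
        printf '%s\n' '-r y' >> "$request_file"
    fi
}

# Typical call on the qmaster host (path as given earlier in this thread):
# add_rerun_default "$SGE_ROOT/default/common/sge_request"
```

For a single job the same option can be given on the command line
("qsub -r y ..."), and "qconf -sq <queue>" should show the queue's current
"rerun" entry.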

> but i faced a lot of problems:
> for example job5 been sent to host1 and during its running the  
> network is disconnected and instead of job5 goes to pending then to  
> another host,

Are you sure the job was actually marked as rerunnable, as described above?

> the job goes to finished jobs directly and when i tried to know  
> more about the job (qacct -j job5) it's says: error job id 5 is not  
> found

Yes, the records are generated on the node. You might look into  

> another problem is hosts in queue control is missing a lot of  
> information:
> Arch, MemUsed, VirtUsed and VirtTotal
> (but I'm not sure if this problem connected to configuring rerun)

This is just the output of "qstat -f", not "qhost". You can press the  
"Load" button to get more information.
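
The same distinction exists on the command line. A minimal sketch, assuming
the SGE binaries are on your PATH (the wrapper name show_host_details is only
for illustration):

```shell
#!/bin/sh
# Sketch: print per-host details from the command line, if qhost is available.
# show_host_details is a hypothetical wrapper name, not part of SGE.
show_host_details() {
    if command -v qhost >/dev/null 2>&1; then
        # qhost prints one line per execution host, including architecture,
        # load, memory and swap figures.
        qhost
    else
        echo "qhost not found; source the SGE settings file first" >&2
        return 1
    fi
}
```

"qstat -f" by contrast lists queue instances and the jobs in them, which is
why the host columns are missing there.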

-- Reuti

