[GE users] FW: my sge system is not working with the fault tolerance
reuti at staff.uni-marburg.de
Tue Mar 3 19:40:34 GMT 2009
On 03.03.2009 at 12:32, tamara wrote:
> you could either define it in $SGE_ROOT/default/common/sge_request or
> in the queue definition in the entry "rerun TRUE"
> the only place I could find rerun is in the complex configuration
> in qmon (is that the same queue configuration that you mentioned?)
Yes, in the queue configuration: pane "General Configuration", bottom
line "Rerun Jobs". In the text-based edit mode "qconf -mq ..." it's
literally the notation I gave above.
> and I made it requestable
No, you can revert that. It's not handled like a resource request.
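For reference, a sketch of how the relevant line would look in the editor opened by "qconf -mq <queue>". The queue name is a placeholder, and all other queue attributes are omitted here:

```
rerun                 TRUE
```

The default value is FALSE. With TRUE, jobs in this queue are treated as rerunnable by default, unless a job overrides this itself at submission time.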
> I also opened sge_request, but I can't find rerun there
The file is usually empty apart from comments, and you have to add a line:
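The line itself did not survive in this archived copy. As an assumption based on the standard qsub option for marking a job as rerunnable ("-r y"), such an entry could look like the following sketch:

```
# $SGE_ROOT/default/common/sge_request
# Default options applied to every submitted job.
# Assumption: "-r y" marks jobs as rerunnable, as with "qsub -r y".
-r y
```

The same option can also be given per job on the qsub command line instead of cluster-wide.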
> but I faced a lot of problems:
> for example, job5 was sent to host1, and while it was running the
> network was disconnected; instead of job5 going back to pending and
> then to another host,
Are you sure the job was actually rerunnable, as described above?
> the job goes directly to the finished jobs, and when I tried to find
> out more about the job (qacct -j job5) it says: error job id 5 is not
Yes, the records are generated on the node. You might look into
> another problem is that the hosts in queue control are missing a lot of
> Arch, MemUsed, VirtUsed and VirtTotal
> (but I'm not sure if this problem is connected to configuring rerun)
This is just the output of "qstat -f", not "qhost". You can press the
"Load" button to get more information.