[GE users] my sge system is not working with the fault tolerance

reuti reuti at staff.uni-marburg.de
Sun Mar 1 22:47:03 GMT 2009


On 01.03.2009 at 18:21, tamara wrote:

> I need to tune the cluster configuration parameters to make the
> system react faster with fault tolerance, but my system does not
> work with fault tolerance at all. For example:
> I have 5 execution hosts.
> I submit 3 jobs, which run on exec1, exec2 and exec3. While the job
> on exec1 is running I disconnect that host and wait about 10 to 15
> minutes, and still nothing happens. Then I reconnect it, and after
> about 2 minutes the job that was running on exec1 finishes.
> I changed Loadreport Time (I am still not sure what this parameter
> does), Max. Unheard and Reschedule Unknown (under General Settings
> of global in the Cluster Configuration).
> I then repeated the same test and again nothing happened (by
> "nothing" I mean the job is not killed and rescheduled on another
> host, as it is supposed to be).
> Also, when I reconnected exec1, the job went to the pending jobs
> list instead of being executed.
> I need help tuning the parameters of the Cluster Configuration (or
> any other configuration) so that the job is killed and resent to
> another host.
> Any advice?

Do you submit the jobs as rerunnable? Either with "qsub -r y" or by
setting it in the queue configuration?
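
For reference, a minimal sketch of both options. The names job.sh and
all.q are placeholders and not taken from this thread, and the run
helper is only there so the commands can be previewed without a live
cluster. As far as the sge_conf(5) man page describes it, Reschedule
Unknown only acts on jobs that carry the rerun flag, which would
explain why changing it alone had no effect.

```shell
#!/bin/sh
# Sketch only: job.sh and all.q are placeholder names (assumptions).
# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Option 1: mark a single job as rerunnable when submitting it.
run qsub -r y job.sh

# Option 2: make rerunnable the default for all jobs in one queue
# by setting the queue's "rerun" attribute.
run qconf -mattr queue rerun TRUE all.q
```

Run it with DRY_RUN=0 on a submit/admin host to actually execute the
commands. The queue attribute can also be edited interactively with
"qconf -mq all.q".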

-- Reuti

> --tamara

