[GE users] my sge system is not working with the fault tolerance

tamara sgesystem at live.com
Sun Mar 1 17:21:55 GMT 2009


I need to tune the cluster configuration parameters to make the system faster with fault tolerance,
but at the moment my system does not work with fault tolerance at all. For example:
I have 5 execution hosts.
I submit 3 jobs, which run on exec1, exec2 and exec3. While the job on exec1 is running, I disconnect that host and wait about 10 to 15 minutes, but nothing happens. When I reconnect it, the job on exec1 finishes after about 2 minutes.

I changed Loadreport Time (I am still not sure what this parameter does), Max. Unheard and Reschedule Unknown (under General Settings of global in the Cluster Configuration).
I then repeated the same test, but again nothing happened (by "nothing" I mean that the job was not killed and rescheduled on another host, as it is supposed to be).
Also, when I reconnected exec1, the job went to the pending jobs list instead of being executed.
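
A rough sketch of how the Max. Unheard and Reschedule Unknown settings mentioned above interact, assuming the behaviour described in sge_conf(5): a host is marked unknown once nothing has been heard from it for `max_unheard`, and jobs on it become candidates for rescheduling `reschedule_unknown` after that, with `00:00:00` meaning rescheduling is switched off. The helper names and the sample values are illustrative assumptions, not settings taken from this cluster.

```python
from typing import Optional


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string (the time format used in sge_conf) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def earliest_reschedule(max_unheard: str, reschedule_unknown: str) -> Optional[int]:
    """Approximate delay, in seconds, between losing contact with a host
    and its jobs being rescheduled.

    Returns None when reschedule_unknown is 00:00:00, which is assumed
    here to mean that rescheduling is disabled.
    """
    wait_after_unknown = to_seconds(reschedule_unknown)
    if wait_after_unknown == 0:
        return None
    return to_seconds(max_unheard) + wait_after_unknown


# Illustrative values only (assumptions, not this cluster's settings).
print(earliest_reschedule("00:05:00", "00:02:00"))  # 420
print(earliest_reschedule("00:05:00", "00:00:00"))  # None
```

With these sample values a disconnected host would need to stay unreachable for roughly 7 minutes before its job could move. sge_conf(5) also states that rescheduling is only initiated for jobs with the rerun flag set (`qsub -r y`, or the queue's `rerun` attribute), so that may be worth checking alongside the timers.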

I need help tuning the Cluster Configuration parameters (or any other configuration) so that the job is killed and resubmitted to another host.
Any advice?


