[GE users] sge 6u6 not recovering from shepherd failure
Federico.Sacerdoti at deshaw.com
Wed Jan 18 19:06:22 GMT 2006
My SGE installation on a 1000 node cluster has been working quite well.
It actually has difficulty when the number of nodes increases past 1024,
due to limits on the Linux select() call, but that is another issue that
I believe has been fixed with the USE_POLL compile option.
The reason for my mail is that I have recently noticed a failure state that
sge seems unable to recover from. On Jan 8th I hard-rebooted some nodes
that had initiated several qlogin interactive sessions. The jobs from
these sessions were never detected as dead, although the sge_shepherd
processes on the target nodes had died.
Days later, I can still see the job in the queue and query its state.
Are there no keepalive messages sent from the qmaster to its shepherds?
How can the qmaster learn of failed/dead jobs?
Some evidence is below.
qstat from today showing dead job. Started 1/6.
3195 0.50659 interactiv moraes r 01/06/2006 13:49:03
drda-interactive.q at drda0011.ny 2
qstat -j showing sge_o_host desrad9 that had been rebooted on 1/8
# qstat -j 3195 | head -15
submission_time: Fri Jan 6 13:49:00 2006
mail_list: moraes at desrad9.nyc.deshaw.com
No shepherd running on drda0011, contrary to what the qmaster believes:
[fds at drda0011 ~]$ ps aux | grep sge
sge 3430 0.0 0.0 9224 2568 ? S Jan08 0:46
fds 9628 0.0 0.0 42300 604 pts/0 S+ 14:03 0:00 grep sge
[fds at drda0011 ~]$ date
Wed Jan 18 14:03:33 EST 2006
[fds at drda0011 ~]$
qacct does not know about it (presumably because the job has not ended):
# qacct -j 3195
error: job id 3195 not found
The job is in a zombie state and will never recover.
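In case it helps anyone in the same spot: as a workaround (not a fix for the underlying detection problem), SGE's qdel supports a force flag that tells the qmaster to delete the job record even when the execution side no longer responds. A sketch, using the job id from above:

```shell
# Force-delete the zombie job at the qmaster, skipping the
# normal shepherd/execd acknowledgement (requires manager rights).
qdel -f 3195
```

This only clears the stale entry from the queue; it does not explain why the qmaster never noticed the shepherd's death in the first place.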