My SGE installation on a 1000-node cluster has been working quite well.
It does have difficulty once the number of nodes grows past 1024, due
to limits on the Linux select() call, but that is a separate issue that
I believe has been fixed with the USE_POLL compile option.
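
As background on that limit: select() tracks descriptors in a fixed-size
fd_set (FD_SETSIZE, typically 1024 on Linux), while poll() takes a plain
list of descriptors and has no such compile-time cap. A minimal sketch of
the poll() style, written in Python for brevity rather than taken from
the SGE source (the helper name wait_readable is made up for this
illustration):

```python
import os
import select


def wait_readable(fds, timeout_ms):
    """Return the descriptors in fds that become readable within timeout_ms.

    Hypothetical helper for illustration only; it is not part of SGE.
    poll() accepts any descriptor number, so nothing here depends on
    FD_SETSIZE the way a select()-based loop would.
    """
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    ready = poller.poll(timeout_ms)
    return sorted(fd for fd, event in ready if event & select.POLLIN)


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"ping")
    print(wait_readable([r], 100))
    os.close(r)
    os.close(w)
```

I have not checked how USE_POLL is actually implemented, so treat this
only as an illustration of the general technique.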

The reason for my mail is that I have recently noticed a failure state
that SGE seems unable to recover from. On Jan 8th I hard rebooted some
nodes that had initiated several qlogin interactive sessions. The jobs
from these sessions were never detected as dead, although the
sge_shepherd processes on the target nodes have died.

Days later, I can still see the job in the queue, and query its state.
Are there no keepalive messages sent from the qmaster to its shepherds?
How can the qmaster know of failed/dead jobs?
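
To make the question concrete, this is the kind of check I would have
expected somewhere. It is a hypothetical sketch in Python, not a
description of what the qmaster actually does; the function name, the
last_seen table and the timeout value are all assumptions made up for
illustration:

```python
def find_stale_jobs(last_seen, now, timeout):
    """Return the ids of jobs whose last keepalive is older than timeout.

    last_seen maps job id -> time (in seconds) of the most recent
    keepalive received for that job. All names here are hypothetical.
    """
    return sorted(job for job, ts in last_seen.items() if now - ts > timeout)


if __name__ == "__main__":
    # Job 3195 was last heard from long ago; job 3200 reported recently.
    last_seen = {3195: 0.0, 3200: 950.0}
    print(find_stale_jobs(last_seen, now=1000.0, timeout=300.0))
```

A job flagged this way could then be marked as failed instead of staying
in the running state indefinitely.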

Some evidence is below.


qstat from today, showing the dead job (started 1/6):

   3195 0.50659 interactiv moraes       r     01/06/2006 13:49:03 drda-interactive.q at drda0011.ny    2

qstat -j showing sge_o_host desrad9, which was rebooted on 1/8:
# qstat -j 3195 | head -15
job_number:                 3195
submission_time:            Fri Jan  6 13:49:00 2006
owner:                      moraes
uid:                        10085
group:                      wheel
gid:                        10
sge_o_home:                 /u/moraes
sge_o_log_name:             moraes
sge_o_shell:                /usr/local/bin/tcsh
sge_o_workdir:              /u/moraes/src/snippets/vapi/comm_ib
sge_o_host:                 desrad9.nyc.deshaw.com
account:                    sge
mail_list:                  moraes at desrad9.nyc.deshaw.com
notify:                     FALSE

No shepherd is running on drda0011, contrary to what the qmaster believes:
[fds at drda0011 ~]$ ps aux | grep sge
sge       3430  0.0  0.0  9224 2568 ?        S    Jan08   0:46
fds       9628  0.0  0.0 42300  604 pts/0    S+   14:03   0:00 grep sge
[fds at drda0011 ~]$ date
Wed Jan 18 14:03:33 EST 2006
[fds at drda0011 ~]$

qacct does not know about it (presumably because the job has not ended):
# qacct -j 3195
error: job id 3195 not found

The job is in a zombie state and will never recover.

