[GE users] sge 6u6 not recovering from shepherd failure

christian reissmann Christian.Reissmann at Sun.COM
Mon Jan 23 09:06:19 GMT 2006


Hi,

The USE_POLL compile-time switch is not tested, so there may still be
problems. If you want to be sure, follow the instructions in the issue
below and apply the workaround in your Linux system header:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1502
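
For background, here is a minimal sketch (not Grid Engine code) of the
difference behind that switch: select() on Linux can only watch
descriptors numbered below FD_SETSIZE, normally 1024, while poll() takes
a list of descriptors and has no such cap. The helper name wait_readable
is made up for illustration, and Python is used only to keep it short.

```python
import os
import select


def wait_readable(fds, timeout_ms=1000):
    """Return the descriptors in fds that are readable, using poll()."""
    poller = select.poll()
    for fd in fds:
        # poll() is handed each descriptor by value, so its number is not
        # bounded by FD_SETSIZE the way a select() fd_set bitmap is.
        poller.register(fd, select.POLLIN)
    # poll() returns (fd, event_mask) pairs for descriptors with events.
    return [fd for fd, events in poller.poll(timeout_ms)
            if events & select.POLLIN]


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"ping")          # make the read end readable
    print(wait_readable([r]) == [r])
    os.close(r)
    os.close(w)
```

A daemon that talks to more than about 1000 hosts at once needs this
style of call (or the header workaround from the issue), because its
socket descriptors will be numbered above 1023.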

Best regards,

Christian


Sacerdoti, Federico wrote on 01/18/06 20:06:
> Hi,
> 
> My SGE installation on a 1000 node cluster has been working quite well.
> It actually has difficulty when the number of nodes increases past 1024,
> due to limits on the Linux select() call, but that is another issue that
> I believe has been fixed with the USE_POLL compile option.
> 
> The reason for my mail is that I recently noticed a failure state that
> SGE seems unable to recover from. On Jan 8th I hard-rebooted some nodes
> that had initiated several qlogin interactive sessions. The jobs from
> these sessions were never detected as dead, although the sge_shepherd
> processes on the target nodes have died.
> 
> Days later, I can still see the job in the queue, and query its state.
> Are there no keepalive messages sent from the qmaster to its shepherds?
> How can the qmaster know of failed/dead jobs?
> 
> Some evidence is below.
> 
> Thanks,
> -Federico
> 
> qstat from today showing dead job. Started 1/6.
> 
>    3195 0.50659 interactiv moraes       r     01/06/2006 13:49:03 drda-interactive.q at drda0011.ny    2
> 
> qstat -j shows sge_o_host desrad9, which was rebooted on 1/8:
> # qstat -j 3195 | head -15
> job_number:                 3195
> submission_time:            Fri Jan  6 13:49:00 2006
> owner:                      moraes
> uid:                        10085
> group:                      wheel
> gid:                        10
> sge_o_home:                 /u/moraes
> sge_o_log_name:             moraes
> sge_o_path:                 /proj/desrad/opt/Linux-x86_64/mvapich-0.9.5-topspin-small/bin:/usr/local/topspin/bin:/proj/desrad/opt/Linux-x86_64/icewm-1.2.23/bin:/proj/desrad/opt/drda-tools/bin:/opt/gridengine/bin/lx26-amd64:/proj/desrad/opt/Linux-x86_64/distcc-2.18.3/bin:/proj/desrad/opt/Linux-x86_64/gdb-6.3/bin:/proj/desrad/opt/Linux-x86_64/binutils-2.15/bin:/proj/desrad/opt/Linux-x86_64/gcc-3.4.4/bin:/proj/desrad/opt/Linux-x86_64/jove-4.16/bin:/proj/desrad/opt/Linux-x86_64/bin:/proj/desrad/opt/bin:/u/moraes/bin/x86_64-Linux:/u/moraes/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/local/vnc/bin:/usr/local/etc:/usr/sbin:/sbin
> sge_o_shell:                /usr/local/bin/tcsh
> sge_o_workdir:              /u/moraes/src/snippets/vapi/comm_ib
> sge_o_host:                 desrad9.nyc.deshaw.com
> account:                    sge
> mail_list:                  moraes at desrad9.nyc.deshaw.com
> notify:                     FALSE
> ]#
> 
> No shepherd is running on drda0011, although the qmaster thinks one is:
> [fds at drda0011 ~]$ ps aux | grep sge
> sge       3430  0.0  0.0  9224 2568 ?        S    Jan08   0:46 /opt/gridengine/bin/lx26-amd64/sge_execd
> fds       9628  0.0  0.0 42300  604 pts/0    S+   14:03   0:00 grep sge
> [fds at drda0011 ~]$ date
> Wed Jan 18 14:03:33 EST 2006
> [fds at drda0011 ~]$
> 
> qacct does not know about it (presumably because it has not ended):
> # qacct -j 3195
> error: job id 3195 not found
> #
> 
> The job is in a zombie state and will never recover.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 

-- 
Christian Reissmann    Tel: +49 (0)941 3075 112  mailto:crei at sun.com
Software Engineer      Fax: +49 (0)941 3075 222  http://www.sun.com/gridengine
Sun Microsystems GmbH, Dr.-Leo-Ritter-Str. 7,
D-93049 Regensburg,    Tel: +49 (0)941 3075 0
