[GE users] sge 6u6 not recovering from shepherd failure

Reuti reuti at staff.uni-marburg.de
Wed Jan 18 19:51:35 GMT 2006


Hi,

you can remove the jobs and free the slots on these machines by
deleting the jobs with the -f flag to qdel.
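
A minimal sketch of that cleanup, assuming it is run by an SGE manager or operator (ordinary users can normally use -f only if the cluster configuration allows forced deletion). The function name force_delete_jobs is made up for illustration; only qdel -f comes from the advice above.

```shell
# Hypothetical helper: force-delete every job ID passed as an argument.
# "qdel -f" removes the job from qmaster's job list even when the
# execution daemon cannot confirm the deletion.
force_delete_jobs() {
    for job_id in "$@"; do
        qdel -f "$job_id" || echo "qdel -f failed for job $job_id" >&2
    done
}

# Usage, with the stuck job ID from the qstat output quoted below:
# force_delete_jobs 3195
```

Wrapping the call in a loop keeps one failing job ID from stopping the cleanup of the others.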

For conventional batch jobs, you could also look at the parameters:

reschedule_unknown
max_unheard

in the SGE configuration. - Reuti
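
To check what these two parameters are currently set to, something like the following should work, assuming they live in the global cluster configuration printed by qconf -sconf. The function name show_unheard_settings is hypothetical.

```shell
# Hypothetical helper: print max_unheard and reschedule_unknown from the
# global cluster configuration.
show_unheard_settings() {
    qconf -sconf | grep -E '^(max_unheard|reschedule_unknown)[[:space:]]'
}

# Usage:
# show_unheard_settings
# To change the values, a manager can edit the global configuration
# interactively with: qconf -mconf
```

Note that reschedule_unknown does not apply to interactive jobs such as qlogin sessions, which is why it is only relevant for conventional batch jobs.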


On 18.01.2006 at 20:06, Sacerdoti, Federico wrote:

> Hi,
>
> My SGE installation on a 1000 node cluster has been working quite well.
> It actually has difficulty when the number of nodes increases past 1024,
> due to limits on the Linux select() call, but that is another issue that
> I believe has been fixed with the USE_POLL compile option.
>
> The reason for my mail is that I have recently noticed a failure state
> that SGE seems unable to recover from. On Jan 8th I hard rebooted some
> nodes that had initiated several qlogin interactive sessions. The jobs
> from these sessions were never detected to be dead, although the
> sge_shepherd processes on the target nodes have died.
>
> Days later, I can still see the job in the queue and query its state.
> Are there no keepalive messages sent from the qmaster to its shepherds?
> How can the qmaster know of failed/dead jobs?
>
> Some evidence is below.
>
> Thanks,
> -Federico
>
> qstat from today showing dead job. Started 1/6.
>
>    3195 0.50659 interactiv moraes       r     01/06/2006 13:49:03 drda-interactive.q@drda0011.ny    2
>
> qstat -j showing sge_o_host desrad9 that had been rebooted on 1/8
> # qstat -j 3195 | head -15
> job_number:                 3195
> submission_time:            Fri Jan  6 13:49:00 2006
> owner:                      moraes
> uid:                        10085
> group:                      wheel
> gid:                        10
> sge_o_home:                 /u/moraes
> sge_o_log_name:             moraes
> sge_o_path:                 /proj/desrad/opt/Linux-x86_64/mvapich-0.9.5-topspin-small/bin:/usr/local/topspin/bin:/proj/desrad/opt/Linux-x86_64/icewm-1.2.23/bin:/proj/desrad/opt/drda-tools/bin:/opt/gridengine/bin/lx26-amd64:/proj/desrad/opt/Linux-x86_64/distcc-2.18.3/bin:/proj/desrad/opt/Linux-x86_64/gdb-6.3/bin:/proj/desrad/opt/Linux-x86_64/binutils-2.15/bin:/proj/desrad/opt/Linux-x86_64/gcc-3.4.4/bin:/proj/desrad/opt/Linux-x86_64/jove-4.16/bin:/proj/desrad/opt/Linux-x86_64/bin:/proj/desrad/opt/bin:/u/moraes/bin/x86_64-Linux:/u/moraes/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/local/vnc/bin:/usr/local/etc:/usr/sbin:/sbin
> sge_o_shell:                /usr/local/bin/tcsh
> sge_o_workdir:              /u/moraes/src/snippets/vapi/comm_ib
> sge_o_host:                 desrad9.nyc.deshaw.com
> account:                    sge
> mail_list:                  moraes at desrad9.nyc.deshaw.com
> notify:                     FALSE
> ]#
>
> No shepherd is running on drda0011, contrary to what qmaster believes:
> [fds@drda0011 ~]$ ps aux | grep sge
> sge       3430  0.0  0.0  9224 2568 ?        S    Jan08   0:46 /opt/gridengine/bin/lx26-amd64/sge_execd
> fds       9628  0.0  0.0 42300  604 pts/0    S+   14:03   0:00 grep sge
> [fds@drda0011 ~]$ date
> Wed Jan 18 14:03:33 EST 2006
> [fds@drda0011 ~]$
>
> qacct does not know about it (presumably because it has not ended)
> # qacct -j 3195
> error: job id 3195 not found
> #
>
> The job is in a zombie state and will never recover.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
