[GE users] sge 6u6 not recovering from shepherd failure

Reuti reuti at staff.uni-marburg.de
Mon Jan 23 21:34:35 GMT 2006


Hi again,

On 23.01.2006, at 20:43, Sacerdoti, Federico wrote:

> Reuti,
> Thanks for the reply. It seems that SGE can detect failed execution
> hosts, via the max_unheard heartbeat message. However, my data
> suggests that it cannot detect when the _submit_ host goes down.
>

You stated that the shepherd died. Was the submit host also an exec
host, since the shepherd is started on the exec host? Can you provide
more details on what exactly happened? Maybe this could become an RFE.

-- Reuti

> I am writing a script to detect stale jobs using the logic:
>
>   submit_host = qstat -j ID | awk '/^sge_o_host/ {print $2}'
>   if uptime of submit_host < age of job: flag the job as stale
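>
> A rough, untested sketch of that logic (assumes a Linux submit host
> with /proc/stat, GNU date, and passwordless ssh; the script name is
> made up for illustration):
>
>   #!/bin/sh
>   # stale-job.sh <jobid>: flag a job whose submit host rebooted after submission
>   ID=$1
>   submit_host=$(qstat -j "$ID" | awk '/^sge_o_host/ {print $2}')
>   sub_time=$(qstat -j "$ID" | sed -n 's/^submission_time: *//p')
>   job_start=$(date -d "$sub_time" +%s)
>   # boot time (epoch seconds) of the submit host, from /proc/stat
>   boot_time=$(ssh "$submit_host" "awk '/btime/ {print \$2}' /proc/stat")
>   # submit host came up after the job was submitted => the job is stale
>   [ "$boot_time" -gt "$job_start" ] && echo "stale: job $ID (submit host $submit_host rebooted)"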
>
> -fds
>
> -----Original Message-----
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Wednesday, January 18, 2006 2:52 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] sge 6u6 not recovering from shepherd failure
>
>
> Hi,
>
> you can remove the job and free slots on these machines by deleting
> the jobs with the -f flag to qdel.
>
> For conventional batch jobs, you could also look at the parameters:
>
> reschedule_unknown
> max_unheard
>
> in the SGE configuration (example below). - Reuti
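>
> For example (illustrative only; check the qdel(1) and sge_conf(5) man
> pages before changing anything; 3195 is the job id from this thread):
>
>   qdel -f 3195                                            # force deletion of the stuck job, freeing its slots
>   qconf -sconf | egrep 'max_unheard|reschedule_unknown'   # show the current global settings
>   qconf -mconf                                            # edit the global cluster configuration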
>
>
> On 18.01.2006, at 20:06, Sacerdoti, Federico wrote:
>
>> Hi,
>>
>> My SGE installation on a 1000 node cluster has been working quite
>> well. It actually has difficulty when the number of nodes increases
>> past 1024, due to limits on the Linux select() call, but that is
>> another issue that I believe has been fixed with the USE_POLL
>> compile option.
>>
>> The reason for my mail is that I have recently noticed a failure
>> state that SGE seems unable to recover from. On Jan 8th I hard
>> rebooted some nodes that had initiated several qlogin interactive
>> sessions. The jobs from these sessions were never detected to be
>> dead, although the sge_shepherd processes on the target nodes had
>> died.
>>
>> Days later, I can still see the job in the queue and query its
>> state. Are there no keepalive messages sent from the qmaster to its
>> shepherds? How can the qmaster learn of failed/dead jobs?
>>
>> Some evidence is below.
>>
>> Thanks,
>> -Federico
>>
>> qstat output from today, showing the dead job (started 1/6):
>>
>>    3195 0.50659 interactiv moraes       r     01/06/2006 13:49:03 drda-interactive.q@drda0011.ny    2
>>
>> qstat -j showing sge_o_host desrad9, which had been rebooted on 1/8:
>> # qstat -j 3195 | head -15
>> job_number:                 3195
>> submission_time:            Fri Jan  6 13:49:00 2006
>> owner:                      moraes
>> uid:                        10085
>> group:                      wheel
>> gid:                        10
>> sge_o_home:                 /u/moraes
>> sge_o_log_name:             moraes
>> sge_o_path:                 /proj/desrad/opt/Linux-x86_64/mvapich-0.9.5-topspin-small/bin:/usr/local/topspin/bin:/proj/desrad/opt/Linux-x86_64/icewm-1.2.23/bin:/proj/desrad/opt/drda-tools/bin:/opt/gridengine/bin/lx26-amd64:/proj/desrad/opt/Linux-x86_64/distcc-2.18.3/bin:/proj/desrad/opt/Linux-x86_64/gdb-6.3/bin:/proj/desrad/opt/Linux-x86_64/binutils-2.15/bin:/proj/desrad/opt/Linux-x86_64/gcc-3.4.4/bin:/proj/desrad/opt/Linux-x86_64/jove-4.16/bin:/proj/desrad/opt/Linux-x86_64/bin:/proj/desrad/opt/bin:/u/moraes/bin/x86_64-Linux:/u/moraes/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/local/vnc/bin:/usr/local/etc:/usr/sbin:/sbin
>> sge_o_shell:                /usr/local/bin/tcsh
>> sge_o_workdir:              /u/moraes/src/snippets/vapi/comm_ib
>> sge_o_host:                 desrad9.nyc.deshaw.com
>> account:                    sge
>> mail_list:                  moraes at desrad9.nyc.deshaw.com
>> notify:                     FALSE
>> ]#
>>
>> No shepherd is running on drda0011, although the qmaster thinks there is one:
>> [fds at drda0011 ~]$ ps aux | grep sge
>> sge       3430  0.0  0.0  9224 2568 ?        S    Jan08   0:46
>> /opt/gridengine/bin/lx26-amd64/sge_execd
>> fds       9628  0.0  0.0 42300  604 pts/0    S+   14:03   0:00 grep
>> sge
>> [fds at drda0011 ~]$ date
>> Wed Jan 18 14:03:33 EST 2006
>> [fds at drda0011 ~]$
>>
>> qacct does not know about the job (presumably because it has not ended):
>> # qacct -j 3195
>> error: job id 3195 not found
>> #
>>
>> The job is in a zombie state and will never recover.
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list