[GE users] Queue in Alarm State for no reason

Reuti reuti at staff.uni-marburg.de
Tue Aug 15 16:27:25 BST 2006


Hi,

Am 15.08.2006 um 16:47 schrieb Richard Hobbs:

> Hello,
>
> We are running SGE 5.3p6 on RedHat 8.0 at the moment and we recently
> experienced a problem where nearly all of the queues went into alarm
> state.
>
> Performing a softstop and start on the qmaster made no difference, and
> restarting rcsge on one of the broken execution hosts made no
> difference.
>
> Some of the queues were ok, but most of them were in alarm state.
>

Does

qstat -alarm

show anything?

> The qmaster logs did not show much useful information, but what we did
> see was this:
>
> ============================================================
> Sat Aug 12 20:12:01 2006|qmaster|stg2|E|format error while packing gdi request
> ============================================================
>
> We then restarted the qmaster, and saw this:
>
> ============================================================
> Sun Aug 13 10:24:26 2006|qmaster|stg2|W|starting program: /rmt/stg2_1/sge/bin/glinux/sge_commd
> Sun Aug 13 10:24:32 2006|qmaster|stg2|I|starting up 5.3p6 (sge)
> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter

I don't think this should happen. What is max_u_jobs set to?
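
One way to look the value up is sketched below. It assumes that "qconf -sconf" prints the cluster configuration as one "name value" pair per line and that max_u_jobs appears there; the helper names and the sample text are made up for illustration, and continuation lines are not handled.

```python
import subprocess


def parse_sconf(text):
    """Turn 'name value' lines (assumed qconf -sconf layout) into a dict."""
    settings = {}
    for line in text.splitlines():
        fields = line.split(None, 1)  # split on the first run of whitespace
        if len(fields) == 2:
            settings[fields[0]] = fields[1].strip()
    return settings


def read_max_u_jobs():
    """Run qconf on a host with SGE installed and return max_u_jobs, or None."""
    result = subprocess.run(
        ["qconf", "-sconf"], capture_output=True, text=True, check=True
    )
    return parse_sconf(result.stdout).get("max_u_jobs")


# Made-up sample text, not output from the cluster discussed in this thread.
sample = "qmaster_spool_dir   /example/spool/qmaster\nmax_u_jobs          0\n"
print(parse_sconf(sample).get("max_u_jobs"))
```

Only parse_sconf runs against the sample here; read_max_u_jobs needs a working qconf in the PATH.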

-- Reuti

> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:24:37 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> ============================================================
>
> We always seem to get a lot of the "max_u_jobs" warnings, so I don't
> think they're relevant.
>
> This did not solve the problem. All of the queues that were previously
> in alarm state were still in alarm state, and the queues that were
> working fine were still working fine.
>
> I then tried to shut down the qmaster again (softstop, I think) and start
> it again, and I saw this:
>
> ============================================================
> Sun Aug 13 10:38:22 2006|qmaster|stg2|E|enrolled, but leave_commd() call failed with status: CANNOT CONNECT
> Sun Aug 13 10:38:26 2006|qmaster|stg2|E|enroll failed with status: CANNOT CONNECT
> Sun Aug 13 10:38:30 2006|qmaster|stg2|E|commd is down: CANNOT CONNECT
> Sun Aug 13 10:38:30 2006|qmaster|stg2|E|can't send asynchronous message to commproc (schedd:1) on host "stg2.crl.toshiba.co.uk": CANNOT CONNECT
> Sun Aug 13 10:38:30 2006|qmaster|stg2|I|controlled shutdown 5.3p6 (sge)
> Sun Aug 13 10:38:48 2006|qmaster|stg2|W|starting program: /rmt/stg2_1/sge/bin/glinux/sge_commd
> Sun Aug 13 10:38:54 2006|qmaster|stg2|I|starting up 5.3p6 (sge)
> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> Sun Aug 13 10:39:28 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
> ============================================================
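
To see at a glance which messages dominate excerpts like the ones above, the records can be tallied by level and message. This is only a sketch: it assumes each record has the shape timestamp|daemon|host|level|message (my reading of the excerpts, with E, W and I taken to mean error, warning and info) and that wrapped lines have been rejoined first. The function names are made up for illustration.

```python
from collections import Counter


def parse_qmaster_line(line):
    """Split one 'timestamp|daemon|host|level|message' record into a dict.

    Returns None for lines that do not have five '|'-separated fields.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, message = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def summarize(lines):
    """Count how often each (level, message) pair occurs."""
    counts = Counter()
    for line in lines:
        record = parse_qmaster_line(line)
        if record is not None:
            counts[(record["level"], record["message"])] += 1
    return counts


# A few records taken from the excerpts quoted in this thread.
sample = [
    'Sat Aug 12 20:12:01 2006|qmaster|stg2|E|format error while packing gdi request',
    'Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter',
    'Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter',
    'Sun Aug 13 10:38:30 2006|qmaster|stg2|E|commd is down: CANNOT CONNECT',
]

for (level, message), count in summarize(sample).most_common():
    print(f"{count:3d} {level} {message}")
```

Filtering the result on level "E" would separate the commd and gdi errors from the repeated max_u_jobs warnings.
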
>
> I think the above is normal, but what is going on here?
>
> Why did 95% of our queues go into alarm state? The execution hosts were
> working fine, and restarting the daemon on the exec host did nothing (it
> restarted ok), so what happened?
>
> Does anyone have any ideas?
>
> Thanks in advance,
> Hobbs.
>
> -- 
> Richard Hobbs (Systems Administrator)
> Toshiba Research Europe Ltd. - Speech Technology Group
> Web: http://www.toshiba-europe.com/research/
> Normal Email: richard.hobbs at crl.toshiba.co.uk
> Mobile Email: mobile at mongeese.co.uk
> Tel: +44 1223 376964        Mobile: +44 7811 803377
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
