[GE users] Queue in Alarm State for no reason

Richard Hobbs richard.hobbs at crl.toshiba.co.uk
Wed Aug 16 11:33:40 BST 2006


Hello,

Reuti wrote:
> Hi,
> 
> On 15.08.2006 at 16:47, Richard Hobbs wrote:
> 
>> Hello,
>>
>> We are running SGE 5.3p6 on RedHat 8.0 at the moment and we recently
>> experienced a problem where nearly all of the queues went into alarm
>> state.
>>
>> Performing a softstop and start on the qmaster made no difference, and
>> restarting rcsge on one of the broken execution hosts made no difference.
>>
>> Some of the queues were ok, but most of them were in alarm state.
>>
> 
> Does the following show anything?
> 
> qstat -alarm

Unfortunately, after rebooting the server everything is working again,
so there is nothing left to look at. I shall be sure to run this command
if it happens again, though.
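
Since a reboot clears the evidence, one option is to save the output of
the suggested command before restarting anything. The following is only a
minimal sketch: it assumes qstat is on the PATH, and the function name
capture() and the output file name are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def capture(cmd=("qstat", "-alarm"), out_dir="."):
    """Run cmd and save its combined stdout/stderr to a timestamped file.

    Returns the path of the file that was written. The default command is
    the one suggested in this thread; everything else is hypothetical.
    """
    result = subprocess.run(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # keep error text next to normal output
        universal_newlines=True,    # decode bytes to str
    )
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = Path(out_dir) / ("alarm-state-%s.txt" % stamp)
    header = "$ %s\n(exit status %d)\n" % (" ".join(cmd), result.returncode)
    path.write_text(header + result.stdout)
    return path
```

Calling capture() on the qmaster host while the queues are still in alarm
state would leave a file that can be posted to the list afterwards.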

>> The qmaster logs did not show much useful information, but what we did
>> see was this:
>>
>> ============================================================
>> Sat Aug 12 20:12:01 2006|qmaster|stg2|E|format error while packing gdi
>> request
>> ============================================================
>>
>> We then restarted the qmaster, and saw this:
>>
>> ============================================================
>> Sun Aug 13 10:24:26 2006|qmaster|stg2|W|starting program:
>> /rmt/stg2_1/sge/bin/glinux/sge_commd
>> Sun Aug 13 10:24:32 2006|qmaster|stg2|I|starting up 5.3p6 (sge)
>> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
> 
> I don't think this should happen. What is the setting of max_u_jobs?

According to the global config in qmon's "Cluster Configuration" screen,
"max_u_jobs" is set to 0 (zero).

If this parameter limits the number of jobs a user can have running at
any one time, we do not want any such limit in our environment. If zero
is not a good value, what should it be set to?
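
Because the "max_u_jobs" warnings are so frequent, the rare error lines
are easy to miss in the qmaster log. The lines quoted in this thread use
a pipe-separated layout (timestamp|daemon|host|level|message). The sketch
below assumes that layout and one entry per line in the real file (the
mail client wrapped the quoted lines); the function names are invented
for illustration. It groups identical messages and lists errors first.

```python
from collections import Counter


def parse_line(line):
    """Split one log line into its five pipe-separated fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, message = parts
    return {"timestamp": timestamp, "daemon": daemon, "host": host,
            "level": level, "message": message}


def summarize(lines):
    """Count entries per (level, message) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[(entry["level"], entry["message"])] += 1
    return counts


def report(counts):
    """Return summary rows: E first, then W, then I, most frequent first."""
    order = {"E": 0, "W": 1, "I": 2}
    rows = sorted(counts.items(),
                  key=lambda kv: (order.get(kv[0][0], 3), -kv[1], kv[0][1]))
    return ["%s x%d  %s" % (level, n, msg) for (level, msg), n in rows]


if __name__ == "__main__":
    # Lines taken from this thread, rejoined where the mail wrapped them.
    sample = [
        'Sat Aug 12 20:12:01 2006|qmaster|stg2|E|format error while packing gdi request',
        'Sun Aug 13 10:24:32 2006|qmaster|stg2|I|starting up 5.3p6 (sge)',
        'Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter',
        'Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter',
    ]
    for row in report(summarize(sample)):
        print(row)
```

Passing an open log file to summarize() instead of the sample list works
the same way, since a file object yields one line at a time.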

Thanks again,
Richard.

> -- Reuti
> 
>> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:24:37 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> ============================================================
>>
>> We always seem to get a lot of the "max_u_jobs" warnings, so I don't
>> think they're relevant.
>>
>> This did not solve the problem. All of the queues that were previously
>> in alarm state were still in alarm state, and the queues that were
>> working fine were still working fine.
>>
>> I then tried to shut down the qmaster again (softstop, I think) and
>> start it again, and I saw this:
>>
>> ============================================================
>> Sun Aug 13 10:38:22 2006|qmaster|stg2|E|enrolled, but leave_commd() call
>> failed with status: CANNOT CONNECT
>> Sun Aug 13 10:38:26 2006|qmaster|stg2|E|enroll failed with status:
>> CANNOT CONNECT
>> Sun Aug 13 10:38:30 2006|qmaster|stg2|E|commd is down: CANNOT CONNECT
>> Sun Aug 13 10:38:30 2006|qmaster|stg2|E|can't send asynchronous message
>> to commproc (schedd:1) on host "stg2.crl.toshiba.co.uk": CANNOT CONNECT
>> Sun Aug 13 10:38:30 2006|qmaster|stg2|I|controlled shutdown 5.3p6 (sge)
>> Sun Aug 13 10:38:48 2006|qmaster|stg2|W|starting program:
>> /rmt/stg2_1/sge/bin/glinux/sge_commd
>> Sun Aug 13 10:38:54 2006|qmaster|stg2|I|starting up 5.3p6 (sge)
>> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> Sun Aug 13 10:39:28 2006|qmaster|stg2|W|could not decrease "max_u_jobs"
>> job counter
>> ============================================================
>>
>> I think the above is normal, but what is going on here?
>>
>> Why did 95% of our queues go into alarm state? The execution hosts were
>> working fine, and restarting the daemon on the exec host did nothing (it
>> restarted ok), so what happened?
>>
>> Does anyone have any ideas?
>>
>> Thanks in advance,
>> Hobbs.
>>
>> -- 
>> Richard Hobbs (Systems Administrator)
>> Toshiba Research Europe Ltd. - Speech Technology Group
>> Web: http://www.toshiba-europe.com/research/
>> Normal Email: richard.hobbs at crl.toshiba.co.uk
>> Mobile Email: mobile at mongeese.co.uk
>> Tel: +44 1223 376964        Mobile: +44 7811 803377
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 

-- 
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Speech Technology Group
Web: http://www.toshiba-europe.com/research/
Normal Email: richard.hobbs at crl.toshiba.co.uk
Mobile Email: mobile at mongeese.co.uk
Tel: +44 1223 376964        Mobile: +44 7811 803377




