[GE users] Queue in Alarm State for no reason

Richard Hobbs richard.hobbs at crl.toshiba.co.uk
Tue Aug 15 15:47:43 BST 2006


Hello,

We are currently running SGE 5.3p6 on Red Hat 8.0, and we recently hit a
problem where nearly all of the queues went into alarm state.

Doing a softstop and start of the qmaster made no difference, nor did
restarting rcsge on one of the affected execution hosts.

Some of the queues were OK, but most of them were in alarm state.

The qmaster logs did not show much useful information, but what we did
see was this:

============================================================
Sat Aug 12 20:12:01 2006|qmaster|stg2|E|format error while packing gdi request
============================================================

We then restarted the qmaster, and saw this:

============================================================
Sun Aug 13 10:24:26 2006|qmaster|stg2|W|starting program: /rmt/stg2_1/sge/bin/glinux/sge_commd
Sun Aug 13 10:24:32 2006|qmaster|stg2|I|starting up 5.3p6 (sge)
Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:24:37 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
============================================================

We always seem to get a lot of the "max_u_jobs" warnings, so I don't
think they're relevant.
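
To check whether anything other than those warnings is hiding in the log,
a small filter along these lines might help. It is only a sketch: it
assumes the pipe-delimited layout seen in the excerpts above
(timestamp|daemon|host|level|message) and treats any line with fewer than
five fields as the wrapped tail of the previous entry. The function names
are made up for illustration and are not part of SGE.

```python
from collections import Counter


def parse_qmaster_log(lines):
    """Return a list of (timestamp, daemon, host, level, message) tuples.

    Assumption: entries look like the excerpts in this mail. A line that
    does not split into five pipe-separated fields is appended to the
    message of the previous entry (mail clients wrap long log lines).
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the "=====" separators used in this mail.
        if not line or set(line) == {"="}:
            continue
        parts = line.split("|", 4)
        if len(parts) == 5:
            entries.append(parts)
        elif entries:
            entries[-1][4] += " " + line
    return [tuple(entry) for entry in entries]


def summarize(entries, ignore=()):
    """Count (level, message) pairs, skipping messages containing any
    of the substrings in `ignore`."""
    counts = Counter()
    for _timestamp, _daemon, _host, level, message in entries:
        if any(pattern in message for pattern in ignore):
            continue
        counts[(level, message)] += 1
    return counts


# A few lines copied from the excerpts in this mail, wrapped as mailed.
sample = [
    "============================================================",
    "Sun Aug 13 10:24:32 2006|qmaster|stg2|I|starting up 5.3p6 (sge)",
    'Sun Aug 13 10:24:34 2006|qmaster|stg2|W|could not decrease "max_u_jobs"',
    "job counter",
    "Sun Aug 13 10:38:22 2006|qmaster|stg2|E|enrolled, but leave_commd() call",
    "failed with status: CANNOT CONNECT",
]

entries = parse_qmaster_log(sample)
for (level, message), count in summarize(entries, ignore=("max_u_jobs",)).items():
    print(level, count, message)
```

In practice the lines would come from the qmaster messages file rather
than a hard-coded list, e.g. by passing an open file object to
parse_qmaster_log().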

This did not solve the problem. All of the queues that were previously
in alarm state were still in alarm state, and the queues that were
working fine were still working fine.

I then tried to shut down the qmaster again (softstop, I think) and start
it again, and I saw this:

============================================================
Sun Aug 13 10:38:22 2006|qmaster|stg2|E|enrolled, but leave_commd() call failed with status: CANNOT CONNECT
Sun Aug 13 10:38:26 2006|qmaster|stg2|E|enroll failed with status: CANNOT CONNECT
Sun Aug 13 10:38:30 2006|qmaster|stg2|E|commd is down: CANNOT CONNECT
Sun Aug 13 10:38:30 2006|qmaster|stg2|E|can't send asynchronous message to commproc (schedd:1) on host "stg2.crl.toshiba.co.uk": CANNOT CONNECT
Sun Aug 13 10:38:30 2006|qmaster|stg2|I|controlled shutdown 5.3p6 (sge)
Sun Aug 13 10:38:48 2006|qmaster|stg2|W|starting program: /rmt/stg2_1/sge/bin/glinux/sge_commd
Sun Aug 13 10:38:54 2006|qmaster|stg2|I|starting up 5.3p6 (sge)
Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:38:54 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
Sun Aug 13 10:39:28 2006|qmaster|stg2|W|could not decrease "max_u_jobs" job counter
============================================================

I think the above is normal, but what is going on here?

Why did 95% of our queues go into alarm state? The execution hosts were
working fine, and restarting the daemon on the exec host did nothing (it
restarted OK), so what happened?

Does anyone have any ideas?

Thanks in advance,
Hobbs.

-- 
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Speech Technology Group
Web: http://www.toshiba-europe.com/research/
Normal Email: richard.hobbs at crl.toshiba.co.uk
Mobile Email: mobile at mongeese.co.uk
Tel: +44 1223 376964        Mobile: +44 7811 803377
