[GE users] Upgrade to 6.2.u1. Unable to submit jobs: qmaster dies

Serge Nosov serge.nosov at gmail.com
Tue Dec 30 23:47:11 GMT 2008


Hi all,

I have been trying to upgrade from 6.1.u5 to 6.2.u1 for a couple of days
now and have attempted the upgrade several times. Each upgrade went fine:
the configuration was imported into the new SGE, the qmaster started, qmon
showed all the configuration correctly, and the execution hosts were
communicating properly.
The problem occurred when I tried to submit a job.

There are two queues: short.q and long.q. There are two complexes: "short"
and "long". They are boolean, forced, and the default is "false". The short.q
queue satisfies the "short" complex, and the long.q queue satisfies the "long"
complex. So the submission looks like this:

        qsub -l long -j y script

This works without a problem. However, when I submit to the short.q queue:

        qsub -l short -j y script

I get an error:
-------------------------------
error: commlib error: got read error (closing "sge/qmaster/1")
error: commlib error: can't connect to service (Connection refused)
Unable to run job: unable to send message to qmaster using port 536 on host "sge": got send error.
Exiting.
-------------------------------

qmaster wrote to the "messages" file:
-------------------------------
12/30/2008 14:55:03|worker|sge|C|!!!!!!!!!! got NULL element for SME_message_list !!!!!!!!!!
-------------------------------

When I started qmaster with debug level 1, I got the following:
-------------------------------
  2000  30100  listener000     listener000 added new packet (packet_queue->counter = 1)
  2001  30100  listener000     listener000 notifys one worker
  2002  30100    worker000     worker000 takes packet from priority queue. (packet_queue->counter = 0; packet_queue->waiting = 1)
  2003  30100    worker000     GDI ADD job (sge/qsub/1) (...snip...)
  2004  30100    worker000     job has access to queue "long.q"
  2005  30100    worker000     user <snip> got department "defaultdepartment"
  2006  30100    worker000     verify schedulability = e
  2007  30100    worker000     cluster queue "long.q" might be suited according -l short=TRUE
  2031  30100    worker000     cluster queue "short.q" might be suited according -l short=TRUE
  2041  30100    worker000     global_time_by_slots() returns <at specified time>
  2042  30100    worker000     rqs_by_slots(long.q@sge1) returns <at specified time> 1140876096
  2043  30100    worker000     cluster queue "long.q" might be suited according -l short=TRUE
  2044  30100    worker000     job 68020 does not request 'forced' resource "long" of long.q@sge1
  2045  30100    worker000     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for SME_message_list !!!!!!!!!!
/etc/init.d/sgemaster: line 613: 30100 Aborted $bin_dir/sge_qmaster

sge_qmaster didn't start!
Please check the messages file
-------------------------------

So for some reason the scheduler _incorrectly_ believes that it can satisfy
the "short" complex by running the job in the long.q queue. Then it does not
see the forced complex "long" being requested for the queue "long.q" and,
instead of rejecting the job, dies.
This configuration worked fine with 6.1.u5.
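
To make the expected behaviour concrete, here is a minimal Python sketch of
how forced boolean complexes would be expected to match in this setup. It is
only a model of the rules described above, not Grid Engine's actual code: the
names Queue, queue_accepts and suitable_queues are hypothetical, and the
short.q entry assumes "complex_values short=TRUE".

```python
from dataclasses import dataclass, field

# Both complexes are boolean, forced, with default FALSE (as described above).
FORCED = {"short", "long"}


@dataclass
class Queue:
    name: str
    # Hypothetical stand-in for the queue's complex_values line.
    complex_values: dict = field(default_factory=dict)


def queue_accepts(queue, requested):
    """Return (ok, reason) for a job requesting the given boolean resources."""
    # Every requested resource must be TRUE on the queue (default is FALSE).
    for res in sorted(requested):
        if not queue.complex_values.get(res, False):
            return False, f'queue "{queue.name}" does not offer "{res}"'
    # Every forced resource offered by the queue must be requested by the job.
    for res, value in queue.complex_values.items():
        if res in FORCED and value and res not in requested:
            return False, f'job does not request forced resource "{res}" of {queue.name}'
    return True, "ok"


def suitable_queues(queues, requested):
    """Names of all queues that would accept the job; an empty list means reject."""
    return [q.name for q in queues if queue_accepts(q, requested)[0]]


# Assumed model of the two queues in this cluster.
QUEUES = [
    Queue("short.q", {"short": True}),
    Queue("long.q", {"long": True}),
]
```

Under this model, a job submitted with "-l short" is matched to short.q only,
and long.q is turned down with a reason instead of being considered suitable.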

The long.q queue configuration looks as follows:
-------------------------------
qconf -sq long.q
qname                 long.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/sh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        long=TRUE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
-------------------------------

I then created a new queue, test.q, without any complexes and attempted to
submit a job to it:

        qsub -q test.q -j y script

The qmaster died with the following output:
-------------------------------
  2003  30934  listener000     listener000 added new packet (packet_queue->counter = 1)
  2004  30934  listener000     listener000 notifys one worker
  2005  30934    worker000     worker000 takes packet from priority queue. (packet_queue->counter = 0; packet_queue->waiting = 1)
  2006  30934    worker000     GDI ADD job (sge/qsub/1) (snip)
  2007  30934    worker000     after sge_resolve_host() which returned no error happened
  2008  30934    worker000     after sge_resolve_host() - II
  2009  30934    worker000     job has access to queue "long.q"
  2010  30934    worker000     user <snip> got department "defaultdepartment"
  2011  30934    worker000     verify schedulability = e
  2012  30934    worker000     Cluster Queue "long.q" is not contained in the hard queue list (-q) that was requested by job 68021
  2013  30934    worker000     ../libs/cull/cull_multitype.c 153 !!!!!!!!!! got NULL element for SME_message_list !!!!!!!!!!
/etc/init.d/sgemaster: line 613: 30934 Aborted $bin_dir/sge_qmaster

sge_qmaster didn't start!
Please check the messages file
-------------------------------

Any ideas? Suggestions?

Thank you,
Serge.
