[GE users] sge_qmaster abort with "lGetList(): got NULL element for SME_message_list"

Jesse Becker jbecker at northwestern.edu
Wed May 11 00:06:19 BST 2005


I've been pulling my hair out over this for about a week now, and humbly
submit my problem to the list.

I have a situation where sge_qmaster is failing with an abort() call.
Running jobs continue to run on the nodes, but commd will fail when
trying to report back to the qmaster.  Likewise, anything that needs
to speak to qmaster directly, or indirectly through commd, will fail.
There is no shadow master.

I am running SGE 5.3p6, as shipped with ROCKS 3.3.0.  The problem also
appears under SGE 5.3p5.  The base OS was recently upgraded, and had been
running fine with SGE 5.3p4 or 5.3p5 before that (I can't recall which
specifically).  In each case, I am running in SGEEE mode.

In the qmaster/messages file, this entry is of interest:

  Tue May 10 13:43:08 2005|qmaster|hydra|C|!!!!!!!!!! lGetList(): got NULL element for SME_message_list !!!!!!!!!!

Logging is set to "log_info".  Sometimes, but not always, the entry above
is preceded by a message along the lines of 'could not decrease
"max_u_jobs" job counter'.

The error is consistent, and here's a backtrace from running sge_qmaster
under gdb:

  Program received signal SIGABRT, Aborted.
  0x00bb8cdf in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:52
  52	  ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
    	  in ../nptl/sysdeps/unix/sysv/linux/raise.c
  (gdb) bt
  #0  0x00bb8cdf in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:52
  #1  0x00bba4e5 in abort () at ../sysdeps/generic/abort.c:88
  #2  0x080e3d27 in lGetList ()
  #3  0x080c603a in schedd_mes_rollback_job ()
  #4  0x080bb731 in available_slots_global ()
  #5  0x080ba5c8 in sge_replicate_queues_suitable4job ()
  #6  0x0806dde1 in verify_suitable_queues ()
  #7  0x08065cbf in sge_gdi_add_job ()
  #8  0x08050025 in sge_c_gdi_add ()
  #9  0x0804f426 in sge_c_gdi ()
  #10 0x0804b086 in main ()
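
Reading the trace, the pattern seems to be that schedd_mes_rollback_job()
hands lGetList() a NULL element, and the getter responds by logging the
critical message and calling abort(), which takes the whole qmaster down.
Below is a minimal, self-contained C sketch of that pattern as I
understand it; the names (elem_t, get_list) are made-up stand-ins, not
the actual CULL source:

  /* Illustration only: a CULL-style getter that is handed a NULL
   * element logs a critical message and aborts the process. */
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct elem {
      struct elem *message_list;        /* stand-in for SME_message_list */
  } elem_t;

  static elem_t *get_list(const elem_t *ep, const char *field)
  {
      if (ep == NULL) {
          /* analogous to "lGetList(): got NULL element for ..." */
          fprintf(stderr, "get_list(): got NULL element for %s\n", field);
          abort();                      /* frame #1 in the backtrace above */
      }
      return ep->message_list;
  }

  int main(void)
  {
      elem_t *sme = NULL;                 /* element never created */
      get_list(sme, "SME_message_list");  /* roughly what frame #3 does */
      return 0;                           /* never reached */
  }

Compiled and run, this dies on SIGABRT in much the same way the daemon does.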

After the abort(), sge_schedd and sge_commd (on the qmaster host)
usually both still run, although sge_schedd can be hard to kill sometimes.

I tried turning on various SGE_DEBUG_LEVEL settings and got a few
interesting results (and large log files).  The tail end looks similar
in all of the logs:


  396897  24491    GDI ADD job (hydra.local/qsub/101) (myuser/608/mygroup/1000)
  396898  24491    pe max = 10, pe min = 10
  396899  24491    job has access to queue "c0-1.q"
  396900  24491    user myuser got department "defaultdepartment"
  396901  24491    verify schedulability = e
  396902  24491    0: global 10 slots
  396903  24491    CAN'T SERVE MORE THAN 10 SLOTS AT HOST compute-0-2.local
  396904  24491    CAN'T SERVE MORE THAN 9 SLOTS AT HOST compute-0-2.local
  396905  24491    CAN'T SERVE MORE THAN 8 SLOTS AT HOST compute-0-2.local
  396906  24491    CAN'T SERVE MORE THAN 7 SLOTS AT HOST compute-0-2.local
  396907  24491    CAN'T SERVE MORE THAN 6 SLOTS AT HOST compute-0-2.local
  396908  24491    CAN'T SERVE MORE THAN 5 SLOTS AT HOST compute-0-2.local

<...snip...  300+ lines of this, iterating over all 32 compute nodes...>

  397211  24491    CAN'T SERVE MORE THAN 2 SLOTS AT HOST compute-0-1.local
  397212  24491    CAN'T SERVE MORE THAN 1 SLOTS AT HOST compute-0-1.local
  397213  24491    schedd_mes_rollback_job(0)
  397214  24491    ../libs/cull/cull_multitype.c 873 !!!!!!!!!! lGetList(): got NULL element for SME_message_list !!!!!!!!!!
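
My working guess (and it is only a guess) is that the scheduler-message
collection is created lazily, when the first message is actually stored,
while the rollback path walks it unconditionally; a job rejected without
any message ever being stored would then hand a NULL element to
lGetList().  Here is a self-contained sketch of that pattern, with
made-up names rather than the real schedd_message code:

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical message store, allocated only when first used. */
  struct msg_list { int count; };
  static struct msg_list *messages = NULL;

  static void add_message(const char *text)
  {
      if (messages == NULL)
          messages = calloc(1, sizeof(*messages));
      messages->count++;
      fprintf(stderr, "scheduler message: %s\n", text);
  }

  static void rollback_job_messages(unsigned job_id)
  {
      /* Unconditional access: if nothing was ever stored for this
       * verification pass, "messages" is still NULL here, which is
       * the same class of access that makes lGetList() abort. */
      printf("rolling back %d messages for job %u\n",
             messages->count, job_id);
      messages->count = 0;
  }

  int main(int argc, char **argv)
  {
      if (argc > 1)
          add_message(argv[1]);   /* with a message stored, rollback is fine */
      rollback_job_messages(0);   /* without one, this dereferences NULL */
      return 0;
  }

Run with an argument it behaves; run without one it falls over in the
rollback, which seems to be roughly the shape of what the trace above shows.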


The SGE configuration is mostly the same as it was a week ago; I have
pasted various config files at the end of this email.

Not all jobs cause this condition.  None of my standard test jobs trigger
it, including those using a fairly standard MPI parallel environment.
That said, I have a few users who can trigger it without fail (I think
I need to expand my test job collection).  Sometimes the offending job is
a PE job, but not always; non-PE jobs do seem less likely to cause
problems, though.  The wrapper scripts are all sane, with nothing
obviously out of the ordinary.

There are projects, userlists, and users defined and used in the share
tree.  Each userlist refers to a Unix group.  Each project has a single
userlist ACL, and each user is a member of a single project.  The share
tree is set up so that all projects have an equal number of shares.

There are no custom complexes in use.

Any help or pointers would be greatly appreciated, and I can easily provide
additional configuration settings or logs as needed.



============================================================
===  Various configuration settings are listed below    ===
============================================================

[root@hydra qmaster]# qconf -sconf
global:
qmaster_spool_dir         /opt/gridengine/default/spool/qmaster
execd_spool_dir           /opt/gridengine/default/spool
binary_path               /opt/gridengine/bin
mailer                    /bin/mail
xterm                     /usr/bin/X11/xterm
load_sensor               none
prolog                    none
epilog                    none
shell_start_mode          unix_behavior
login_shells              sh,ksh,csh,tcsh,bash
min_uid                   0
min_gid                   0
user_lists                none
xuser_lists               none
projects                  none
xprojects                 none
enforce_project           true
enforce_user              true
load_report_time          00:00:40
stat_log_time             48:00:00
max_unheard               00:05:00
reschedule_unknown        00:00:00
loglevel                  log_info
administrator_mail        none
set_token_cmd             none
pag_cmd                   none
token_extend_time         none
shepherd_cmd              none
qmaster_params            none
schedd_params             SHARE_FUNCTIONAL_SHARES=1,POLICY_HIERARCHY=ODSF
execd_params              none
finished_jobs             100
gid_range                 20000-20100
admin_user                sge
qlogin_command            telnet
qlogin_daemon             /usr/sbin/in.telnetd
rlogin_daemon             /usr/sbin/sshd -i
rlogin_command            /usr/bin/ssh -t
rsh_daemon                /usr/sbin/sshd -i
rsh_command               /usr/bin/ssh -t
default_domain            none
ignore_fqdn               true
max_aj_instances          2000
max_aj_tasks              75000
max_u_jobs                1000



============================================================

[root@hydra qmaster]# qconf -ssconf
algorithm                  default
schedule_interval          0:0:15
maxujobs                   10
queue_sort_method          share
user_sort                  false
job_load_adjustments       np_load_avg=0.50
load_adjustment_decay_time 00:02:00
load_formula               np_load_avg
schedd_job_info            true
sgeee_schedule_interval    0:2:0
halftime                   672
usage_weight_list          cpu=0.735,mem=0.159,io=0.106
compensation_factor        10
weight_user                0.242
weight_project             0.758
weight_jobclass            0
weight_department          0
weight_job                 0
weight_tickets_functional  100000
weight_tickets_share       100000
weight_tickets_deadline    50000

============================================================
(All queues are the same)

[root@hydra qmaster]# qconf -sq c0-1.q
qname                c0-1.q
hostname             compute-0-1.local
seq_no               100
load_thresholds      np_load_avg=1.25
suspend_thresholds   NONE
nsuspend             1
suspend_interval     00:05:00
priority             0
min_cpu_interval     00:05:00
processors           2
qtype                BATCH INTERACTIVE PARALLEL 
rerun                TRUE
slots                2
tmpdir               /state/partition1
shell                /bin/sh
shell_start_mode     unix_behavior
prolog               NONE
epilog               NONE
starter_method       NONE
suspend_method       NONE
resume_method        NONE
terminate_method     NONE
notify               00:00:60
owner_list           NONE
user_lists           NONE
xuser_lists          NONE
subordinate_list     NONE
complex_list         NONE
complex_values       NONE
projects             NONE
xprojects            NONE
calendar             NONE
initial_state        default
fshare               0
oticket              0
s_rt                 INFINITY
h_rt                 INFINITY
s_cpu                INFINITY
h_cpu                INFINITY
s_fsize              INFINITY
h_fsize              INFINITY
s_data               INFINITY
h_data               INFINITY
s_stack              INFINITY
h_stack              INFINITY
s_core               INFINITY
h_core               INFINITY
s_rss                INFINITY
h_rss                INFINITY
s_vmem               INFINITY
h_vmem               INFINITY

============================================================
(All hosts are the same)

[root@hydra qmaster]# qconf -se compute-0-1
hostname                   compute-0-1.local
load_scaling               NONE
complex_list               NONE
complex_values             NONE
load_values                arch=glinux,num_proc=2,mem_total=2007.277344M,swap_total=996.207031M,virtual_total=3003.484375M,load_avg=0.000000,load_short=0.000000,load_medium=0.000000,load_long=0.000000,mem_free=1899.593750M,swap_free=996.207031M,virtual_free=2895.800781M,mem_used=107.683594M,swap_used=0.000000M,virtual_used=107.683594M,cpu=0.000000,np_load_avg=0.000000,np_load_short=0.000000,np_load_medium=0.000000,np_load_long=0.000000
processors                 2
user_lists                 NONE
xuser_lists                NONE
projects                   NONE
xprojects                  NONE
usage_scaling              NONE
resource_capability_factor 0.000000

============================================================



-- 
Jesse Becker
GPG-fingerprint: BD00 7AA4 4483 AFCC 82D0  2720 0083 0931 9A2B 06A2




