[GE users] SGE 6.2: jobs queued indefinitely

Bart Willems b-willems at northwestern.edu
Tue Sep 23 14:37:08 BST 2008



Hi All,

We have just upgraded from SGE 6.1u4 to SGE 6.2. All backed-up
configuration settings were restored successfully, but we are having
trouble getting jobs to run. In particular, submitted jobs remain in the
queued (qw) state even though the cluster is empty:

$ qstat -u bart
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  46003 0.00000 submit_hel bart         qw    09/23/2008 08:25:02                                    1


Using qstat -j to get some more info starts off with a gdi error message
(more on that below the output):

$ qstat -j 46003
error: can't unpack gdi request
error: error unpacking gdi request: bad argument
failed receiving gdi request
==============================================================
job_number:                 46003
exec_file:                  job_scripts/46003
submission_time:            Tue Sep 23 08:25:02 2008
owner:                      bart
uid:                        505
group:                      bart
gid:                        505
sge_o_home:                 /home/bart
sge_o_log_name:             bart
sge_o_path:                
/export/apps/sm/bin:/opt/gridengine/bin/lx26-amd64:/opt/nwu/bin:/export/apps/mpich2/bin:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/jdk1.5.0_10/bin:/export/apps/condor/bin:/export/apps/condor/sbin:/opt/atipa/acms/bin:/opt/atipa/acms/lib:/usr/local/bin:/bin:/usr/bin:/opt/Bio/ncbi/bin:/opt/Bio/mpiblast/bin/:/opt/Bio/hmmer/bin:/opt/Bio/EMBOSS/bin:/opt/Bio/clustalw/bin:/opt/Bio/t_coffee/bin:/opt/Bio/phylip/exe:/opt/Bio/mrbayes:/opt/Bio/fasta:/opt/Bio/glimmer/bin://opt/Bio/glimmer/scripts:/opt/Bio/gromacs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/pathscale/bin:/opt/rocks/bin:/opt/rocks/sbin:/home/bart/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /bigdisk/bart/test
sge_o_host:                 fugu
account:                    sge
cwd:                        /bigdisk/bart/test
merge:                      y
hard resource_list:         h_cpu=36000
mail_list:                  bart at fugu.local
notify:                     FALSE
job_name:                   submit_helloworld_short.sh
jobshare:                   0
shell_list:                 /bin/bash
env_list:
script_file:                submit_helloworld_short.sh
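
The gdi errors make me suspect either a version mismatch between the client
binaries and the new qmaster, or stale 6.1 state left in the spool. This is
roughly how I am checking that (just a sketch; the paths assume our standard
/opt/gridengine install and the usual qmaster spool location):

$ which qstat              # confirm the 6.2 binaries come first in $PATH
$ qstat -help              # the first line of the usage output shows the client version
$ tail -50 /opt/gridengine/default/spool/qmaster/messages   # qmaster log around submission time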


Apart from the gdi errors, there is no information on why the job won't run,
even though job scheduling info is set to true in qmon. I also don't see the
associated variable in the output of qconf -sconf:

# qconf -sconf
global:
execd_spool_dir              /opt/gridengine/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true \
                             sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /opt/gridengine/bin/rocks-qlogin.sh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i -o Protocol=2
qlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
rlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             1000
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
qrsh_command                 /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd
qrsh_daemon                  /usr/sbin/sshd
reprioritize                 0
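
If I understand correctly, the flag set in qmon is schedd_job_info, which is
part of the scheduler configuration rather than the global configuration, so
it should show up in qconf -ssconf instead (just my guess):

$ qconf -ssconf | grep schedd_job_info   # "true" or "job_list" enables scheduling info in qstat -j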


The output of qstat -g c (some nodes are down, so AVAIL < TOTAL):

# qstat -g c
CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
-------------------------------------------------------------------------------
conference.q                      0.00      0    392    416      0     24
debug.q                           0.00      0    392    416      0     24
longserial.q                      0.00      1    392    416      0     24
shortparallel.q                   0.00      0     24     24      0      0
shortserial.q                     0.00      0    392    416      0     24


I also checked that /opt/gridengine/bin/lx26-amd64/sge_execd is running on
the compute nodes.
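Roughly like this (compute-0-0 is just an example node name):

$ qhost                                    # hosts whose execd is unreachable show "-" in the load columns
$ ssh compute-0-0 'pgrep -fl sge_execd'    # spot-check a single node directly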

In case it helps: we also seem to have retained jobs that used
checkpointing and were running before the upgrade. These are now also in
the queued state.
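They show up alongside the newly submitted jobs with something like:

$ qstat -s p -u '*'                        # all pending jobs, all users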

Any help would be most appreciated.

Thanks,
Bart

