Hi All,

We have just upgraded from SGE 6.1u4 to SGE 6.2. All backed-up
configuration settings were restored successfully, but we are having
problems getting jobs to run. In particular, submitted jobs remain in the
queued state even with the cluster empty:

$ qstat -u bart
job-ID  prior   name       user         state submit/start at     queue   slots ja-task-ID
  46003 0.00000 submit_hel bart         qw    09/23/2008 08:25:02

Using qstat -j to get more information starts off with a GDI error message:

$ qstat -j 46003
error: can't unpack gdi request
error: error unpacking gdi request: bad argument
failed receiving gdi request
job_number:                 46003
exec_file:                  job_scripts/46003
submission_time:            Tue Sep 23 08:25:02 2008
owner:                      bart
uid:                        505
group:                      bart
gid:                        505
sge_o_home:                 /home/bart
sge_o_log_name:             bart
sge_o_shell:                /bin/bash
sge_o_workdir:              /bigdisk/bart/test
sge_o_host:                 fugu
account:                    sge
cwd:                        /bigdisk/bart/test
merge:                      y
hard resource_list:         h_cpu=36000
mail_list:                  bart at fugu.local
notify:                     FALSE
job_name:                   submit_helloworld_short.sh
jobshare:                   0
shell_list:                 /bin/bash
script_file:                submit_helloworld_short.sh

So the output gives no information on why the job won't run, even though
"job scheduling info" is set to true in qmon. I also don't see the
associated variable in the output of qconf -sconf:

# qconf -sconf
execd_spool_dir              /opt/gridengine/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true
finished_jobs                100
gid_range                    20000-20100
qlogin_command               /opt/gridengine/bin/rocks-qlogin.sh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd -i -o Protocol=2
qlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
rlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
auto_user_oticket            0
auto_user_fshare             1000
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
qrsh_command                 /usr/bin/ssh
rsh_command                  /usr/bin/ssh
rlogin_command               /usr/bin/ssh
rsh_daemon                   /usr/sbin/sshd
qrsh_daemon                  /usr/sbin/sshd
reprioritize                 0
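
A possible explanation, stated as an assumption rather than something I
have confirmed on this cluster: the qmon "job scheduling info" setting
appears to correspond to schedd_job_info, which sched_conf(5) documents as
part of the scheduler configuration (shown by qconf -ssconf), not the
global configuration printed above. A minimal sketch for pulling a single
parameter out of that kind of "name value" output follows. The helper name
get_param is made up for illustration, and the qconf call is left in a
comment because it needs a live cluster.

```shell
# Hypothetical helper: read "name value" lines on stdin and print the
# value of the parameter named in $1. Exits non-zero if it is not found.
get_param() {
    awk -v key="$1" '
        $1 == key { $1 = ""; sub(/^ +/, ""); print; found = 1 }
        END       { exit !found }
    '
}

# Assumed usage on a host with the SGE client tools installed:
#   qconf -ssconf | get_param schedd_job_info
```

If the parameter is missing from the input, the function prints nothing
and returns a non-zero status, so it can be used in a shell conditional.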

The output of qstat -g c (some nodes are down, so AVAIL < TOTAL):

# qstat -g c
conference.q                      0.00      0    392    416      0     24
debug.q                           0.00      0    392    416      0     24
longserial.q                      0.00      1    392    416      0     24
shortparallel.q                   0.00      0     24     24      0      0
shortserial.q                     0.00      0    392    416      0     24
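
To make the table above easier to scan, here is a small sketch that lists
the queues reporting slots in use. It assumes the columns are queue name,
CQLOAD, USED, AVAIL, TOTAL, aoACDS and cdsuE, in that order; this is a
guess, since the header line is missing from the paste. busy_queues is a
hypothetical helper name.

```shell
# Hypothetical helper: read "qstat -g c" style rows on stdin and print the
# names of queues whose USED count (assumed to be field 3) is non-zero.
busy_queues() {
    awk 'NF == 7 && $3 + 0 > 0 { print $1 }'
}

# Assumed usage on a host with the SGE client tools installed:
#   qstat -g c | busy_queues
```

Fed the five rows above, this prints longserial.q, the only queue showing
a non-zero value in that column.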

I also checked that /opt/gridengine/bin/lx26-amd64/sge_execd is running on
the compute nodes.

In case it helps: we also seem to have retained jobs that used
checkpointing and were running before the upgrade. These are now also in
the queued state.

Any help would be most appreciated.


