[GE users] SGE 6.2: jobs queued indefinitely

Bart Willems b-willems at northwestern.edu
Tue Sep 23 17:15:41 BST 2008


Hi Lubos,

> qstat -j 46003
> error: can't unpack gdi request
> error: error unpacking gdi request: bad argument
> failed receiving gdi request

Yes, I can reproduce this error every time.
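
In case it matters, I have also been checking what version the qmaster daemon
itself reports at startup by grepping its messages file. A rough sketch,
assuming our default cell spool location (the path may differ elsewhere):

# the qmaster logs its version string in its messages file at startup
grep -i "starting up" /opt/gridengine/default/spool/qmaster/messages | tail -2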

> it suggests that you might be using incompatible versions (client,
> qmaster). Maybe a mix of 6.1u4 and 6.2 binaries?

It seems like it: see below.

> Also you may try to restart the qmaster or just the scheduler thread via
> qconf -kt scheduler ; qconf -at scheduler.

I get an error message when I try this:

#qconf -kt scheduler
error: "-kt" is not a valid option 2
GE 6.1u4

So this seems to refer to 6.1u4. Also, if I list the binary directories:

# ls -l /opt/gridengine/bin/
total 12
drwxr-xr-x 2 root root 4096 Jul 23 04:44 lx24-amd64
drwxr-xr-x 2 root root 4096 May 30 16:09 lx26-amd64
-rwxr-xr-x 1 root root   54 Apr 28 21:43 rocks-qlogin.sh

The lx24-amd64 directory was not there before the upgrade. Is this ok?
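
In case it helps to pin down which directory holds which release, this is
roughly what I have been running (as far as I remember, the first line of the
-help output is the version banner, and util/arch prints the architecture
string the installer detects):

# version banner of the binaries in each arch directory
/opt/gridengine/bin/lx26-amd64/qconf -help 2>&1 | head -1
/opt/gridengine/bin/lx24-amd64/qconf -help 2>&1 | head -1
# architecture string detected on this host (should match one of the two)
/opt/gridengine/util/arch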

How would you suggest un-mixing the two versions?
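
For what it is worth, once I know which directory holds the stale 6.1u4
binaries, my rough plan would be something like the following (lx24-amd64 is
only a placeholder here for whichever directory turns out to be the old one).
Please tell me if this is the wrong approach:

# see which copies of the client tools users actually pick up first
echo $SGE_ROOT
which qconf qstat qsub
# move the stale arch directory aside rather than deleting it
mv /opt/gridengine/bin/lx24-amd64 /opt/gridengine/bin/lx24-amd64.sge61u4
# restart the qmaster so only one set of binaries is in play
# (sgemaster is the init script name on our install; adjust if different)
/etc/init.d/sgemaster stop
/etc/init.d/sgemaster start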

Thanks,
Bart


> On 09/23/08 16:33, Bart Willems wrote:
>> Hi Lubos,
>>
>> the job stayed in the qw state, but there were no errors/warnings
>>> Hi Bart,
>>> can you try the following?
>>>
>>> Run the following commands:
>>> qsub -b y sleep 5
>>> qconf -tsm
>>> qstat
>>> sleep 5
>>> qstat
>>>
>>> Did the job stay in the qw state? Were there any error/warning messages?
>>> If there were, try qping on the master host and the execd hosts. Does it work?
>>>
>>> Please attach the bootstrap file and qhost -q output.
>>>
>>> Lubos.
>>>
>>>
>>> On 09/23/08 15:37, Bart Willems wrote:
>>>
>>>> Hi All,
>>>>
>>>> we have just upgraded from SGE 6.1u4 to SGE 6.2. All backed-up
>>>> configuration settings were restored successfully, but we are having
>>>> problems getting jobs to run. In particular, submitted jobs remain in the
>>>> queued state even with the cluster empty:
>>>>
>>>> $ qstat -u bart
>>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>>> -----------------------------------------------------------------------------------------------------------------
>>>>   46003 0.00000 submit_hel bart         qw    09/23/2008 08:25:02                                    1
>>>>
>>>>
>>>> Using qstat -j to get some more info starts off with a gdi error
>>>> message:
>>>>
>>>> $ qstat -j 46003
>>>> error: can't unpack gdi request
>>>> error: error unpacking gdi request: bad argument
>>>> failed receiving gdi request
>>>> ==============================================================
>>>> job_number:                 46003
>>>> exec_file:                  job_scripts/46003
>>>> submission_time:            Tue Sep 23 08:25:02 2008
>>>> owner:                      bart
>>>> uid:                        505
>>>> group:                      bart
>>>> gid:                        505
>>>> sge_o_home:                 /home/bart
>>>> sge_o_log_name:             bart
>>>> sge_o_path:
>>>> /export/apps/sm/bin:/opt/gridengine/bin/lx26-amd64:/opt/nwu/bin:/export/apps/mpich2/bin:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/jdk1.5.0_10/bin:/export/apps/condor/bin:/export/apps/condor/sbin:/opt/atipa/acms/bin:/opt/atipa/acms/lib:/usr/local/bin:/bin:/usr/bin:/opt/Bio/ncbi/bin:/opt/Bio/mpiblast/bin/:/opt/Bio/hmmer/bin:/opt/Bio/EMBOSS/bin:/opt/Bio/clustalw/bin:/opt/Bio/t_coffee/bin:/opt/Bio/phylip/exe:/opt/Bio/mrbayes:/opt/Bio/fasta:/opt/Bio/glimmer/bin://opt/Bio/glimmer/scripts:/opt/Bio/gromacs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/pathscale/bin:/opt/rocks/bin:/opt/rocks/sbin:/home/bart/bin
>>>> sge_o_shell:                /bin/bash
>>>> sge_o_workdir:              /bigdisk/bart/test
>>>> sge_o_host:                 fugu
>>>> account:                    sge
>>>> cwd:                        /bigdisk/bart/test
>>>> merge:                      y
>>>> hard resource_list:         h_cpu=36000
>>>> mail_list:                  bart at fugu.local
>>>> notify:                     FALSE
>>>> job_name:                   submit_helloworld_short.sh
>>>> jobshare:                   0
>>>> shell_list:                 /bin/bash
>>>> env_list:
>>>> script_file:                submit_helloworld_short.sh
>>>>
>>>>
>>>> So there is no info on why the job won't run, even though job scheduling
>>>> info is set to true in qmon. But I don't see the associated variable in
>>>> the output of qconf -sconf:
>>>>
>>>> # qconf -sconf
>>>> global:
>>>> execd_spool_dir              /opt/gridengine/default/spool
>>>> mailer                       /bin/mail
>>>> xterm                        /usr/bin/X11/xterm
>>>> load_sensor                  none
>>>> prolog                       none
>>>> epilog                       none
>>>> shell_start_mode             posix_compliant
>>>> login_shells                 sh,ksh,csh,tcsh
>>>> min_uid                      0
>>>> min_gid                      0
>>>> user_lists                   none
>>>> xuser_lists                  none
>>>> projects                     none
>>>> xprojects                    none
>>>> enforce_project              false
>>>> enforce_user                 auto
>>>> load_report_time             00:00:40
>>>> max_unheard                  00:05:00
>>>> reschedule_unknown           00:00:00
>>>> loglevel                     log_warning
>>>> administrator_mail           none
>>>> set_token_cmd                none
>>>> pag_cmd                      none
>>>> token_extend_time            none
>>>> shepherd_cmd                 none
>>>> qmaster_params               none
>>>> execd_params                 none
>>>> reporting_params             accounting=true reporting=true \
>>>>                              flush_time=00:00:15 joblog=true \
>>>>                              sharelog=00:00:00
>>>> finished_jobs                100
>>>> gid_range                    20000-20100
>>>> qlogin_command               /opt/gridengine/bin/rocks-qlogin.sh
>>>> rsh_command                  /usr/bin/ssh
>>>> rlogin_command               /usr/bin/ssh
>>>> rsh_daemon                   /usr/sbin/sshd -i -o Protocol=2
>>>> qlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
>>>> rlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
>>>> max_aj_instances             2000
>>>> max_aj_tasks                 75000
>>>> max_u_jobs                   0
>>>> max_jobs                     0
>>>> auto_user_oticket            0
>>>> auto_user_fshare             1000
>>>> auto_user_default_project    none
>>>> auto_user_delete_time        86400
>>>> delegated_file_staging       false
>>>> qrsh_command                 /usr/bin/ssh
>>>> rsh_command                  /usr/bin/ssh
>>>> rlogin_command               /usr/bin/ssh
>>>> rsh_daemon                   /usr/sbin/sshd
>>>> qrsh_daemon                  /usr/sbin/sshd
>>>> reprioritize                 0
>>>>
>>>>
>>>> The output of qstat -g c (some nodes are down, so AVAIL < TOTAL) is:
>>>>
>>>> # qstat -g c
>>>> CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>>>> -------------------------------------------------------------------------------
>>>> conference.q                      0.00      0    392    416      0     24
>>>> debug.q                           0.00      0    392    416      0     24
>>>> longserial.q                      0.00      1    392    416      0     24
>>>> shortparallel.q                   0.00      0     24     24      0      0
>>>> shortserial.q                     0.00      0    392    416      0     24
>>>>
>>>>
>>>> I also checked that /opt/gridengine/bin/lx26-amd64/sge_execd is running
>>>> on the compute nodes.
>>>>
>>>> In case it helps: we also seem to have retained jobs that used
>>>> checkpointing and were running before the upgrade. These are now also in
>>>> the queued state.
>>>>
>>>> Any help would be most appreciated.
>>>>
>>>> Thanks,
>>>> Bart
>>>>



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



