[GE users] SGE 6.2: jobs queued indefinitely

Lubomir Petrik Lubomir.Petrik at Sun.COM
Tue Sep 23 16:17:06 BST 2008



That's really strange. If you can reproduce this error message

qstat -j 46003
error: can't unpack gdi request
error: error unpacking gdi request: bad argument
failed receiving gdi request


it suggests that you might be using incompatible versions (client, 
qmaster). Maybe a mix of 6.1u4 and 6.2 binaries?
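One quick way to confirm or rule out a version mismatch is to compare the version string reported by the client binaries on the submit host with that of the binaries installed on the qmaster host. A sketch; the host name "master" and the lx26-amd64 arch directory are examples and may differ on your cluster:

```shell
# The version appears on the first line of any client's -help output,
# e.g. "SGE 6.2" vs. "SGE 6.1u4".
qstat -help 2>&1 | head -1

# Same check against the binaries on the qmaster host
# ("master" is a placeholder host name).
ssh master "$SGE_ROOT/bin/lx26-amd64/qconf -help 2>&1 | head -1"
```

If the two lines differ, make sure $PATH on the submit hosts points at the 6.2 binaries before anything else.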

You may also try restarting the qmaster, or just the scheduler thread, via
qconf -kt scheduler ; qconf -at scheduler.
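For the qping check suggested earlier in the thread, the usual invocation takes a host, a port, a component name, and an id. The host names below are placeholders; the ports come from your cluster's setup, typically via $SGE_QMASTER_PORT/$SGE_EXECD_PORT or /etc/services:

```shell
# Check that the qmaster responds (component name "qmaster", id 1):
qping master $SGE_QMASTER_PORT qmaster 1

# Check an execution daemon on a compute node
# (component name "execd", id 1):
qping node01 $SGE_EXECD_PORT execd 1
```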

Lubos.


On 09/23/08 16:33, Bart Willems wrote:
> Hi Lubos,
>
> The job stayed in the qw state, but there were no errors/warnings.
>> Hi Bart,
>> can you try the following?
>>
>> Run these commands:
>> qsub -b y sleep 5
>> qconf -tsm
>> qstat
>> sleep 5
>> qstat
>>
>> Did the job stay in the qw state? Were there any error/warning messages?
>> If there were, try qping on master host and execd hosts. Does it work?
>>
>> Please attach the bootstrap file and qhost -q output.
>>
>> Lubos.
>>
>>
>> On 09/23/08 15:37, Bart Willems wrote:
>>     
>>> Hi All,
>>>
>>> we have just upgraded from SGE 6.1u4 to SGE 6.2. All backed-up
>>> configuration settings were restored successfully, but we are having
>>> problems getting jobs to run. In particular, submitted jobs remain in
>>> the
>>> queued state even with the cluster empty:
>>>
>>> $ qstat -u bart
>>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>   46003 0.00000 submit_hel bart         qw    09/23/2008 08:25:02                                    1
>>>
>>>
>>> Using qstat -j to get some more info starts off with a gdi error message:
>>>
>>> $ qstat -j 46003
>>> error: can't unpack gdi request
>>> error: error unpacking gdi request: bad argument
>>> failed receiving gdi request
>>> ==============================================================
>>> job_number:                 46003
>>> exec_file:                  job_scripts/46003
>>> submission_time:            Tue Sep 23 08:25:02 2008
>>> owner:                      bart
>>> uid:                        505
>>> group:                      bart
>>> gid:                        505
>>> sge_o_home:                 /home/bart
>>> sge_o_log_name:             bart
>>> sge_o_path:
>>> /export/apps/sm/bin:/opt/gridengine/bin/lx26-amd64:/opt/nwu/bin:/export/apps/mpich2/bin:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/jdk1.5.0_10/bin:/export/apps/condor/bin:/export/apps/condor/sbin:/opt/atipa/acms/bin:/opt/atipa/acms/lib:/usr/local/bin:/bin:/usr/bin:/opt/Bio/ncbi/bin:/opt/Bio/mpiblast/bin/:/opt/Bio/hmmer/bin:/opt/Bio/EMBOSS/bin:/opt/Bio/clustalw/bin:/opt/Bio/t_coffee/bin:/opt/Bio/phylip/exe:/opt/Bio/mrbayes:/opt/Bio/fasta:/opt/Bio/glimmer/bin://opt/Bio/glimmer/scripts:/opt/Bio/gromacs/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/pathscale/bin:/opt/rocks/bin:/opt/rocks/sbin:/home/bart/bin
>>> sge_o_shell:                /bin/bash
>>> sge_o_workdir:              /bigdisk/bart/test
>>> sge_o_host:                 fugu
>>> account:                    sge
>>> cwd:                        /bigdisk/bart/test
>>> merge:                      y
>>> hard resource_list:         h_cpu=36000
>>> mail_list:                  bart at fugu.local
>>> notify:                     FALSE
>>> job_name:                   submit_helloworld_short.sh
>>> jobshare:                   0
>>> shell_list:                 /bin/bash
>>> env_list:
>>> script_file:                submit_helloworld_short.sh
>>>
>>>
>>> So there is no info on why the job won't run, even though the job
>>> scheduling info setting is enabled in qmon. I also don't see the
>>> corresponding variable in the output of qconf -sconf:
>>>
>>> # qconf -sconf
>>> global:
>>> execd_spool_dir              /opt/gridengine/default/spool
>>> mailer                       /bin/mail
>>> xterm                        /usr/bin/X11/xterm
>>> load_sensor                  none
>>> prolog                       none
>>> epilog                       none
>>> shell_start_mode             posix_compliant
>>> login_shells                 sh,ksh,csh,tcsh
>>> min_uid                      0
>>> min_gid                      0
>>> user_lists                   none
>>> xuser_lists                  none
>>> projects                     none
>>> xprojects                    none
>>> enforce_project              false
>>> enforce_user                 auto
>>> load_report_time             00:00:40
>>> max_unheard                  00:05:00
>>> reschedule_unknown           00:00:00
>>> loglevel                     log_warning
>>> administrator_mail           none
>>> set_token_cmd                none
>>> pag_cmd                      none
>>> token_extend_time            none
>>> shepherd_cmd                 none
>>> qmaster_params               none
>>> execd_params                 none
>>> reporting_params             accounting=true reporting=true \
>>>                              flush_time=00:00:15 joblog=true \
>>>                              sharelog=00:00:00
>>> finished_jobs                100
>>> gid_range                    20000-20100
>>> qlogin_command               /opt/gridengine/bin/rocks-qlogin.sh
>>> rsh_command                  /usr/bin/ssh
>>> rlogin_command               /usr/bin/ssh
>>> rsh_daemon                   /usr/sbin/sshd -i -o Protocol=2
>>> qlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
>>> rlogin_daemon                /usr/sbin/sshd -i -o Protocol=2
>>> max_aj_instances             2000
>>> max_aj_tasks                 75000
>>> max_u_jobs                   0
>>> max_jobs                     0
>>> auto_user_oticket            0
>>> auto_user_fshare             1000
>>> auto_user_default_project    none
>>> auto_user_delete_time        86400
>>> delegated_file_staging       false
>>> qrsh_command                 /usr/bin/ssh
>>> rsh_command                  /usr/bin/ssh
>>> rlogin_command               /usr/bin/ssh
>>> rsh_daemon                   /usr/sbin/sshd
>>> qrsh_daemon                  /usr/sbin/sshd
>>> reprioritize                 0
>>>
>>>
>>> The output of qstat -g c (some nodes are down, so AVAIL < TOTAL):
>>>
>>> # qstat -g c
>>> CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>>> -------------------------------------------------------------------------------
>>> conference.q                      0.00      0    392    416      0     24
>>> debug.q                           0.00      0    392    416      0     24
>>> longserial.q                      0.00      1    392    416      0     24
>>> shortparallel.q                   0.00      0     24     24      0      0
>>> shortserial.q                     0.00      0    392    416      0     24
>>>
>>>
>>> I also checked that /opt/gridengine/bin/lx26-amd64/sge_execd is
>>> running on the compute nodes.
>>>
>>> In case it helps: we also seem to have retained jobs that used
>>> checkpointing and were running before the upgrade. These are now also in
>>> the queued state.
>>>
>>> Any help would be most appreciated.
>>>
>>> Thanks,
>>> Bart
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>

