[GE users] core duo systems not accepting jobs

flengyel flengyel at gc.cuny.edu
Sun Jul 12 23:34:53 BST 2009



I'll start with a new run. We have plenty of machines in x86_64.q, which lists the gauss parallel
environment in its pe_list. The queue offers 2 slots per host (the queue configuration is below).
(Am I mistaken about this? Is the slots parameter the total across all hosts, or the number of
slots per host?)

The submission wrapper script is as follows:

 more /usr/local/bin/gsub
#!/bin/bash
if [ $# -lt 1 ]; then
  echo "Usage: gsub gaussianfile [qsub options]"
  exit 1
fi
# First argument is the Gaussian input file; the rest are passed
# through to qsub unchanged (quoted, so options with spaces survive).
ARGS=("$@")
QOPTS=("${ARGS[@]:1}")
qsub "${QOPTS[@]}" <<__HereDocument__
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N $1
#$ -pe gauss 2
#$ -q x86_64.q

# Gaussian 03 and SGE environment on the execution host
export g03root=/usr/local/gaussian
. /usr/local/gaussian/g03/bsd/g03.profile
export SGE_ROOT=/usr/local/sge
. /usr/local/sge/default/common/settings.sh
export GAUSS_SCRDIR=/tmp

g03 $1
__HereDocument__
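
For reference, a submission looks like this (the trailing qsub option is just an illustration of
the pass-through arguments, not something this job needs):

   gsub trop.com -l h_vmem=2G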


The gauss parallel execution environment allows for plenty of slots:

[flengyel at nept olavinda]$ qconf -sp gauss
pe_name           gauss
slots             9999
user_lists        Research deadlineusers
xuser_lists       NONE
start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min
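
Since the PE restricts access through user_lists, one thing worth double-checking (just a guess on
my part, I haven't ruled it out) is whether the submitting account actually appears in those
access lists:

   qconf -su Research
   qconf -su deadlineusers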



Let's look at x86_64.q:

[root at nept nept]# qconf -sq x86_64.q
qname                 x86_64.q
hostlist              @coreduos
seq_no                0
load_thresholds       np_load_avg=4.0
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               gauss mpich namd ompi
rerun                 FALSE
slots                 2
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            Research deadlineusers
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY


The gauss parallel environment is attached to this queue (see pe_list above).
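
For the per-host picture, qhost -q would list the queue instances and slot counts on each
execution host (output omitted here):

   qhost -q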

Now let's observe that the scheduling info from qstat -j does not mention the idle hosts in
x86_64.q. First, the pending job:

[flengyel at nept olavinda]$ qstat -u flengyel
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  23526 0.25000 trop.com   flengyel     qw    07/12/2009 18:26:56                                    2

Most of m31-m60 are not mentioned below; only m31 and m49 (full) and m47 (overloaded) appear.

[flengyel at nept olavinda]$ qstat -j 23526
==============================================================
job_number:                 23526
exec_file:                  job_scripts/23526
submission_time:            Sun Jul 12 18:26:56 2009
owner:                      flengyel
uid:                        1007
group:                      domusers
gid:                        1000
sge_o_home:                 /home/nept/flengyel
sge_o_log_name:             flengyel
sge_o_path:                 /usr/lib64/qt-3.3/bin:/usr/kerberos/bin:/usr/lib/jvm/jdk1.6.0_05/bin:/home/nept/apps64/bin:/home/nept/apps64/ocaml/bin:/home/nept/apps64/emergent/bin:/usr/local/Trolltech/Qt-4.4.3/bin:/home/nept/apps64/R/bin:/home/nept/apps64/pgi/linux86-64/7.2/bin:/usr/lib/jvm/jdk1.6.0_12/bin:/home/nept/apps64/openmpi/bin:/usr/lib/jvm/jdk1.6.0_05/bin:/usr/local/sge/bin/lx24-amd64:/usr/local/bin:/bin:/usr/bin:/home/nept/apps64/postgres/bin:/home/nept/flengyel/bin:/home/nept/apps64/ILOG/cplex/bin/x86-64_sles9.0_3.3:/home/nept/apps64/ILOG/ampl
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/nept/flengyel/gaussian/olavinda
sge_o_host:                 nept
account:                    sge
cwd:                        /home/nept/flengyel/gaussian/olavinda
path_aliases:               /tmp_mnt/ * * /
mail_list:                  flengyel at nept.gc.cuny.edu
notify:                     FALSE
job_name:                   trop.com
jobshare:                   0
hard_queue_list:            x86_64.q
shell_list:                 /bin/bash
env_list:
script_file:                STDIN
parallel environment:  gauss range: 2
scheduling info:            queue instance "instructional.q at m61.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m62.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m63.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m64.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m65.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m66.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m67.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m69.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "test.q at m62.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "quad.q at m08.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m08.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m61.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m62.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m63.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m64.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m65.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m66.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m67.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "all.q at m69.gc.cuny.edu" dropped because it is temporarily not available
                            queue instance "instructional.q at m68.gc.cuny.edu" dropped because it is overloaded: np_load_avg=15.690000 (no load adjustment) >= 3.0
                            queue instance "x86_64.q at m47.gc.cuny.edu" dropped because it is overloaded: np_load_avg=616.745000 (no load adjustment) >= 4.0
                            queue instance "all.q at m47.gc.cuny.edu" dropped because it is overloaded: np_load_avg=616.745000 (no load adjustment) >= 3.0
                            queue instance "all.q at m68.gc.cuny.edu" dropped because it is overloaded: np_load_avg=15.690000 (no load adjustment) >= 3.0
                            queue instance "x86_64.q at m31.gc.cuny.edu" dropped because it is full
                            queue instance "x86_64.q at m49.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m01.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m02.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m03.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m04.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m05.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m06.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m07.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m09.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m10.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m11.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m12.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m13.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m14.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m15.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m16.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m17.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m18.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m19.gc.cuny.edu" dropped because it is full
                            queue instance "quad.q at m20.gc.cuny.edu" dropped because it is full
                            cannot run because no access to pe "gauss"
                            cannot run in PE "gauss" because it only offers 0 slots
[flengyel at nept olavinda]$
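
The last two lines look like the real problem. A verification-only submission should reproduce the
same complaint without actually queueing anything; qsub's -w v option asks for validation, and
/bin/true here is just a dummy command for the test:

   qsub -w v -pe gauss 2 -q x86_64.q -b y /bin/true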

> -Chris

Thanks - FL

> On Jul 12, 2009, at 5:42 PM, flengyel wrote:
>
> > I have a number of Intel E6600 core duo systems sitting idle while
> > jobs languish in the queues:
> >
> > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > m31                     lx24-amd64      2  4.00    7.7G    6.0G   16.0G   13.8G
> > m32                     lx24-amd64      2  0.00    7.7G  149.8M   16.0G     0.0
> > m33                     lx24-amd64      2  0.00    7.7G  320.1M   16.0G   22.8M
> > m34                     lx24-amd64      2  0.00    7.7G  209.2M   16.0G   23.6M
> > m35                     lx24-amd64      2  0.00    7.7G  151.1M   16.0G     0.0
> > m36                     lx24-amd64      2  1.01    7.7G  394.7M   16.0G    9.1G
> > m37                     lx24-amd64      2  0.00    7.7G  251.2M   16.0G   24.6M
> > m38                     lx24-amd64      2  0.00    7.7G  215.7M   16.0G   24.1M
> > m39                     lx24-amd64      2  0.00    7.7G  299.8M   16.0G   23.3M
> > m40                     lx24-amd64      2  0.00    7.7G  150.3M   16.0G     0.0
> > m41                     lx24-amd64      2  0.00    7.7G  156.8M   16.0G     0.0
> > m42                     lx24-amd64      2  0.00    7.7G  184.0M   16.0G     0.0
> > m43                     lx24-amd64      2  0.00    7.7G  232.7M   16.0G   23.5M
> > m44                     lx24-amd64      2  0.00    7.7G  152.6M   16.0G     0.0
> > m45                     lx24-amd64      2  0.00    7.7G  151.7M   16.0G     0.0
> > m46                     lx24-amd64      2  0.00    7.7G  219.5M   16.0G   23.6M
> > m47                     lx24-amd64      2 1.20K    7.7G    3.5G   16.0G     0.0
> > m48                     lx24-amd64      2  0.00    7.7G  215.6M   16.0G   23.0M
> > m49                     lx24-amd64      2  4.00    7.7G    6.4G   16.0G   13.4G
> > m50                     lx24-amd64      2  0.00    7.7G  145.6M   16.0G     0.0
> > m51                     lx24-amd64      2  0.00    7.7G  199.3M   16.0G   23.7M
> > m52                     lx24-amd64      2  0.00    5.8G  151.4M   16.0G     0.0
> > m53                     lx24-amd64      2  0.00    7.7G  222.5M   16.0G   23.0M
> > m54                     lx24-amd64      2  0.00    7.7G  224.2M   16.0G   23.6M
> > m55                     lx24-amd64      2  0.00    7.7G  222.6M   16.0G   24.4M
> > m56                     lx24-amd64      2  0.00    7.7G  149.0M   16.0G     0.0
> > m57                     lx24-amd64      2  0.00    7.7G  319.5M   16.0G   23.8M
> > m58                     lx24-amd64      2  0.00    7.7G  118.0M   16.0G     0.0
> > m59                     lx24-amd64      2  0.00    7.7G  157.5M   16.0G     0.0
> > m60                     lx24-amd64      2  0.00    7.7G  206.4M   16.0G   24.1M
> >
> > I'm wondering about how to diagnose and correct this. Perhaps it's
> > time to give up
> > on SGE 6.0u10 and upgrade to SGE 6.2...
> >
> > FL
> >
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=206712

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



