[GE users] PE only offers 0 slots

Bart Willems b-willems at northwestern.edu
Sun Jun 29 23:18:09 BST 2008



Hi Reuti,

> And the qstat output for the parallel queue(s)? Is the queue disabled
> or in error state?

Sorry, I missed that one in the output:

----------------------------------------------------------------------------
shortparallel.q@compute-0-60.l PC    0/8       7.00     lx26-amd64    E
----------------------------------------------------------------------------
shortparallel.q@compute-0-61.l PC    0/8       6.00     lx26-amd64    E
----------------------------------------------------------------------------
shortparallel.q@compute-0-62.l PC    0/8       8.05     lx26-amd64    E
----------------------------------------------------------------------------

When I cleared the error, the job started to run on compute-0-61 even
though I got

  Error, unable to open machine file '/home/bart/tmp/mpihello/machines.9447'

in the stdout/stderr file. Is it normal that the job runs without this file?
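
For the record, to see why the instances had gone into error and to clear
them again, I used roughly the following (queue and host names are of course
specific to my setup):

  # show the reason for the E state next to each queue instance
  qstat -f -explain E

  # clear the error state on all instances of the parallel queue
  qmod -cq shortparallel.q

  # or just on a single instance
  qmod -cq shortparallel.q@compute-0-60.local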

Thanks,
Bart

>> queuename                      qtype used/tot. load_avg arch          states
>> ----------------------------------------------------------------------------
>> debug.q@compute-0-60.local     BIC   0/8       7.00     lx26-amd64
>> ----------------------------------------------------------------------------
>> debug.q@compute-0-61.local     BIC   0/8       6.00     lx26-amd64
>> ----------------------------------------------------------------------------
>> debug.q@compute-0-62.local     BIC   0/8       8.05     lx26-amd64
>> ----------------------------------------------------------------------------
>>
>> ...
>>
>> ----------------------------------------------------------------------------
>> longserial.q@compute-0-60.loca BIC   8/8       7.00     lx26-amd64    S
>>    7930 0.59198 migr27_nga thommes      S     06/22/2008 08:54:14     1
>>    8108 0.55312 GRHZLong1i ryo          S     06/24/2008 11:11:59     1
>>    8272 0.55302 GRHZLong1i ryo          S     06/24/2008 11:21:14     1
>>    8475 0.54718 GRHZLong1i ryo          S     06/25/2008 02:23:25     1
>>    8068 0.55312 GRHZLong1i ryo          S     06/24/2008 11:11:59     1
>>    8230 0.55307 GRHZLong1i ryo          S     06/24/2008 11:16:14     1
>>    8489 0.54718 GRHZLong1i ryo          S     06/25/2008 07:33:55     1
>>    8004 0.55313 GRHZLong1i ryo          S     06/24/2008 11:10:44     1
>> ----------------------------------------------------------------------------
>> longserial.q@compute-0-61.loca BIC   8/8       6.00     lx26-amd64    S
>>    8962 0.50716 GRHZLong1i ryo          S     06/27/2008 17:59:01     1
>>    8003 0.55313 GRHZLong1i ryo          S     06/24/2008 11:10:44     1
>>    8215 0.55307 GRHZLong1i ryo          S     06/24/2008 11:16:14     1
>>    8083 0.55312 GRHZLong1i ryo          S     06/24/2008 11:11:59     1
>>    7926 0.59198 migr27_nga thommes      S     06/22/2008 08:54:14     1
>>    8113 0.55312 GRHZLong1i ryo          S     06/24/2008 11:11:59     1
>>    8184 0.55307 GRHZLong1i ryo          S     06/24/2008 11:16:14     1
>>    8490 0.54718 GRHZLong1i ryo          S     06/25/2008 07:33:55     1
>> ----------------------------------------------------------------------------
>> longserial.q@compute-0-62.loca BIC   8/8       8.05     lx26-amd64    S
>>    8483 0.54712 GRHZLong1i ryo          S     06/25/2008 04:13:25     1
>>    8300 0.55295 GRHZLong1i ryo          S     06/24/2008 13:59:11     1
>>    7878 0.60600 migr27_nga thommes      S     06/21/2008 06:25:44     1
>>    8477 0.54712 GRHZLong1i ryo          S     06/25/2008 03:28:25     1
>>    7993 0.55306 GRHZLong1i ryo          S     06/24/2008 11:10:44     1
>>    8295 0.55295 GRHZLong1i ryo          S     06/24/2008 13:19:27     1
>>    8109 0.55305 GRHZLong1i ryo          S     06/24/2008 11:11:59     1
>>    8269 0.55295 GRHZLong1i ryo          S     06/24/2008 11:21:14     1
>> ----------------------------------------------------------------------------
>>
>> ...
>>
>> ----------------------------------------------------------------------------
>> medserial.q@compute-0-60.local BIC   7/8       7.00     lx26-amd64
>>    9226 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9233 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9240 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9248 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9284 0.51369 GRHZLong1i ryo          r     06/28/2008 06:58:06     1
>>    9294 0.51369 GRHZLong1i ryo          r     06/28/2008 08:51:21     1
>>    9300 0.51369 GRHZLong1i ryo          r     06/28/2008 09:28:06     1
>> ----------------------------------------------------------------------------
>> medserial.q@compute-0-61.local BIC   6/8       6.00     lx26-amd64
>>    9222 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9230 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9236 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9245 0.51369 GRHZLong1i ryo          r     06/28/2008 05:48:36     1
>>    9278 0.51369 GRHZLong1i ryo          r     06/28/2008 06:19:21     1
>>    9414 0.50016 GRHZLong1i ryo          r     06/29/2008 07:58:36     1
>> ----------------------------------------------------------------------------
>> medserial.q@compute-0-62.local BIC   8/8       8.05     lx26-amd64
>>    9146 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:36     1
>>    9141 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:36     1
>>    9152 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:36     1
>>    9138 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:36     1
>>    9165 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:36     1
>>    9179 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:51     1
>>    9159 0.51621 GRHZLong1i ryo          r     06/28/2008 03:57:36     1
>>    9440 0.50016 GRHZLong1i ryo          r     06/29/2008 07:58:36     1
>> ----------------------------------------------------------------------------
>>
>> ...
>>
>> ----------------------------------------------------------------------------
>> shortserial.q@compute-0-60.loc BIC   0/8       7.00     lx26-amd64
>> ----------------------------------------------------------------------------
>> shortserial.q@compute-0-61.loc BIC   0/8       6.00     lx26-amd64
>> ----------------------------------------------------------------------------
>> shortserial.q@compute-0-62.loc BIC   0/8       8.05     lx26-amd64
>> ----------------------------------------------------------------------------
>>
>> I made the serial queues subordinate to the parallel queue, so the running
>> serial jobs should get suspended. Nevertheless, I also tried suspending
>> all serial queues, but this had no effect on the 0 slots problem in the
>> parallel queue.
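>>
>> (By "suspending all serial queues" I mean doing it by hand, roughly like
>> this, with the queue names from my setup:
>>
>>   # suspend all instances of the serial queues
>>   qmod -sq debug.q,shortserial.q,medserial.q,longserial.q
>>
>>   # and release them again afterwards
>>   qmod -usq debug.q,shortserial.q,medserial.q,longserial.q
>>
>> If I read queue_conf(5) right, the "longserial.q=2" style subordination
>> should do the same automatically once two or more shortparallel.q slots
>> are busy on a host.)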
>>
>> Thanks,
>> Bart
>>
>>
>>
>>
>>> Hi,
>>>
>>> Am 29.06.2008 um 22:27 schrieb Bart Willems:
>>>
>>>> I am trying to set up tight integration between SGE 6.1 and MPICH2 1.0.7
>>>> using the daemon-based smpd startup method described by Reuti:
>>>>
>>>> http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
>>>>
>>>> My problem is that I keep running into a "cannot run in PE "mpich2_smpd"
>>>> because it only offers 0 slots" error. My test job is the mpihello
>>>> program.
>>>>
>>>> I currently have 4 serial job queues (debug.q, shortserial.q, medserial.q,
>>>> longserial.q) and 1 parallel job queue (shortparallel.q). For testing
>>>> purposes, I am trying out the parallel queue on three nodes: compute-0-60,
>>>> compute-0-61, and compute-0-62. All three nodes have two quad-core CPUs.
>>>> More details are included below.
>>>>
>>>> I am new to both SGE and MPICH2, so any insight into this problem would be
>>>> most appreciated!
>>>>
>>>> Thanks,
>>>> Bart
>>>>
>>>>
>>>> My job submission file:
>>>> =======================
>>>>
>>>> #!/bin/bash
>>>>
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -S /bin/bash
>>>> #$ -l h_cpu=10:00:00
>>>> #$ -pe mpich2_smpd 2
>>>> #$ -P Parallel
>>>>
>>>> export PATH=/share/apps/mpich2/bin:$PATH
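>>>> # derive a per-job port for the smpd daemons: JOB_ID % 5000 + 20000
>>>> # always lands in 20000-24999, so each job normally gets its own port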
>>>> port=$((JOB_ID % 5000 + 20000))
>>>> mpiexec -n $NSLOTS -machinefile $PWD/machines.$JOB_ID -port $port
>>>> ./mpihello
>>>>
>>>> exit
>>>>
>>>>
>>>> =========================================
>>>> Last few lines of output from "qstat -f":
>>>> =========================================
>>>>
>>>> ############################################################################
>>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>>> ############################################################################
>>>>    9446 0.50208 submit_mpi bart         qw    06/29/2008 14:50:46     1
>>>>
>>>>
>>>> ==============================================
>>>> Last few lines of output from "qstat -j 9446":
>>>> ==============================================
>>>>
>>>>                             (no project) does not have the correct project to run in cluster queue "medserial.q"
>>>>                             (no project) does not have the correct project to run in cluster queue "shortserial.q"
>>>>                             (no project) does not have the correct project to run in cluster queue "debug.q"
>>>>                             (no project) does not have the correct project to run in cluster queue "longserial.q"
>>>>                             cannot run in PE "mpich2_smpd" because it only offers 0 slots
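>>>>
>>>> (As an aside, "qsub -w v <script>" should print this kind of validation
>>>> report without actually submitting anything, e.g.
>>>>
>>>>   # dry-run check of the submit script against the configuration
>>>>   qsub -w v submit_mpihello.sh
>>>>
>>>> where submit_mpihello.sh stands for the submission file shown above.)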
>>>>
>>>>
>>>> =========================
>>>> Output from "qstat -g c":
>>>> =========================
>>>>
>>>> CLUSTER QUEUE                   CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>>>> --------------------------------------------------------------------------
>>>> debug.q                           0.98      0    372    444      0     72
>>>> longserial.q                      0.98    294      6    444    312     72
>>>> medserial.q                       0.98    308     64    444      0     72
>>>> shortparallel.q                   0.96      0      0     24      0     24
>>>
>>> can you check in qhost (or qstat -f) the state of your three nodes?
>>> They appear to be in error state or unavailable here.
>>>
>>>> shortserial.q                     0.98      0    372    444      0     72
>>>>
>>>>
>>>> ========================================
>>>> Output from "qconf -sq shortparallel.q":
>>>> ========================================
>>>>
>>>> qname                 shortparallel.q
>>>> hostlist              @parallelhosts
>>>> seq_no                0
>>>> load_thresholds       np_load_avg=1.4
>>>> suspend_thresholds    NONE
>>>> nsuspend              1
>>>> suspend_interval      00:05:00
>>>> priority              0
>>>> min_cpu_interval      00:15:00
>>>> processors            UNDEFINED
>>>> qtype                 BATCH INTERACTIVE
>>>
>>> qtype NONE
>>>
>>> (to avoid serial jobs being scheduled into this queue)
>>>
>>>> ckpt_list             nwu
>>>> pe_list               mpich2_smpd
>>>> rerun                 FALSE
>>>> slots                 4,[@parallelhosts=8]
>>>> tmpdir                /tmp
>>>> shell                 /bin/csh
>>>> prolog                NONE
>>>> epilog                NONE
>>>> shell_start_mode      posix_compliant
>>>
>>> shell_start_mode unix_behavior
>>>
>>> (otherwise the defined csh will always be used)
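>>>
>>> (Either edit the queue interactively with "qconf -mq shortparallel.q", or
>>> set the two attributes directly, something like:
>>>
>>>   qconf -mattr queue qtype NONE shortparallel.q
>>>   qconf -mattr queue shell_start_mode unix_behavior shortparallel.q
>>>
>>> should do it.)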
>>>
>>> -- Reuti
>>>
>>>> starter_method        NONE
>>>> suspend_method        NONE
>>>> resume_method         NONE
>>>> terminate_method      NONE
>>>> notify                00:00:60
>>>> owner_list            NONE
>>>> user_lists            parallelusers
>>>> xuser_lists           NONE
>>>> subordinate_list      longserial.q=2, medserial.q=2, shortserial.q=2
>>>> complex_values        NONE
>>>> projects              Parallel
>>>> xprojects             NONE
>>>> calendar              NONE
>>>> initial_state         default
>>>> s_rt                  INFINITY
>>>> h_rt                  INFINITY
>>>> s_cpu                 INFINITY
>>>> h_cpu                 48:00:00
>>>> s_fsize               INFINITY
>>>> h_fsize               INFINITY
>>>> s_data                INFINITY
>>>> h_data                INFINITY
>>>> s_stack               INFINITY
>>>> h_stack               INFINITY
>>>> s_core                INFINITY
>>>> h_core                INFINITY
>>>> s_rss                 INFINITY
>>>> h_rss                 INFINITY
>>>> s_vmem                INFINITY
>>>> h_vmem                INFINITY
>>>>
>>>>
>>>> ==========================================
>>>> Output from "qconf -shgrp @parallelhosts":
>>>> ==========================================
>>>>
>>>> group_name @parallelhosts
>>>> hostlist compute-0-60.local compute-0-61.local compute-0-62.local
>>>>
>>>>
>>>> ======================================
>>>> Output from "qconf -su parallelusers":
>>>> ======================================
>>>>
>>>> name    parallelusers
>>>> type    ACL
>>>> fshare  0
>>>> oticket 0
>>>> entries bart
>>>>
>>>>
>>>> ===================================
>>>> Output from "qconf -sprj Parallel":
>>>> ===================================
>>>>
>>>> name Parallel
>>>> oticket 0
>>>> fshare 0
>>>> acl parallelusers
>>>> xacl NONE
>>>>
>>>>
>>>> ====================================
>>>> Output from "qconf -sp mpich2_smpd:"
>>>> ====================================
>>>>
>>>> pe_name           mpich2_smpd
>>>> slots             9999
>>>> user_lists        parallelusers
>>>> xuser_lists       NONE
>>>> start_proc_args   /opt/gridengine/mpich2_smpd/startmpich2.sh -catch_rsh \
>>>>                   $pe_hostfile /share/apps/mpich2
>>>> stop_proc_args    /opt/gridengine/mpich2_smpd/stopmpich2.sh -catch_rsh \
>>>>                   /share/apps/mpich2
>>>> allocation_rule   $round_robin
>>>> control_slaves    TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots     min
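>>>>
>>>> (For what it's worth, "qselect -pe mpich2_smpd" should list the queue
>>>> instances that reference this PE:
>>>>
>>>>   qselect -pe mpich2_smpd
>>>>
>>>> If it prints nothing, no queue is associated with the PE and the
>>>> scheduler has nowhere to place the job.)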