[GE users] [OT] Running R under SGE and OpenMPI

Sean Davis sdavis2 at mail.nih.gov
Thu Oct 9 16:10:05 BST 2008



On Thu, Oct 9, 2008 at 10:45 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Am 09.10.2008 um 16:16 schrieb Sean Davis:
>
>> On Thu, Oct 9, 2008 at 9:55 AM, Ron Chen <ron_chen_123 at yahoo.com> wrote:
>>>
>>> --- On Thu, 10/9/08, Davide Cittaro wrote:
>>>>
>>>> I've used Rmpi with SGE and LAM tight integration.
>>>> I've had some issues with batch and interactive jobs:
>>>> - batch jobs: I had to specify in the R script how many
>>>> nodes to spawn with Rmpi... if you submit your job with a
>>>> slot range, you have to work out within R how many slots
>>>> you are actually using. Otherwise you can run qsub with a
>>>> fixed -pe value, but the nodes may not be available...
>>>
>>> Can you pass the number of slots in to R, or does R need to have the
>>> number of slots hard-coded somewhere before it even starts?
>>>
>>
>> No.  Under ordinary MPI, there is a function mpi.spawn.Rslaves() that
>> will spawn the maximum number of slaves possible given the hostfile.
>> Under tight integration, I would expect the behavior to be the same
>> with the number of slaves and the machines they are on defaulting to
>> the SGE-defined hostfile equivalent.
>>
>>> In the job, SGE sets the number of slots the current job has; can you
>>> just pass that into R? Or are we talking about different things?
>>>
>>>
>>>> - interactive jobs: I'm still working on making R aware
>>>> that lam has
>>>> been launched... :-( probably OpenMPI is a different
>>>> situation...
>>>
>>> LAM-MPI is not in active development. But OpenMPI is actively developed,
>>> and supports tight integration with SGE. Can't we treat R/Rmpi as yet
>>> another OpenMPI application?
>>>
>>> Or is there something special in Rmpi?
>>
>> There shouldn't be anything special about Rmpi, except that only the
>> master runs interactively; the slaves are actually started from
>> within that interactive session.  So, the simple question
>> is, if I do:
>>
>> qrsh -pe orte 16
>
> Which version of the Open MPI / SGE are you using? - Reuti

openmpi 1.2.7 (built with --with-sge)
sge 6.2
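
(To Ron's earlier question about passing the slot count into R: SGE exports
NSLOTS into the job environment, so the R script can read it instead of
hard-coding anything.  A minimal sketch, assuming Rmpi is installed; the
functions below are the stock Rmpi API:)

# Sketch: spawn as many slaves as SGE granted, minus one slot for the master.
# NSLOTS is set by SGE in the job's environment.
library(Rmpi)
nslots <- as.integer(Sys.getenv("NSLOTS", unset = "1"))
mpi.spawn.Rslaves(nslaves = nslots - 1)
print(mpi.remote.exec(Sys.info()[["nodename"]]))  # one hostname per slave
mpi.close.Rslaves()
mpi.quit()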

Here is what I have been able to do so far:

> mpirun -hostfile hostfile -np 12 hostname
[pressa:20247] mca: base: component_find: unable to open ras tm: file
not found (ignored)
[pressa:20247] mca: base: component_find: unable to open pls tm: file
not found (ignored)
pressa
pressa
pressa
pressa
shakespeare
shakespeare
shakespeare
shakespeare
shakespeare
shakespeare
shakespeare
shakespeare


Simple script:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
mpirun -np $NSLOTS hostname

> qsub -pe orte 4 -q all.q at octopus.nci.nih.gov -l arch=lx24-amd64 junksub.sh

which gives as output (in the output file):
octopus
octopus
octopus
octopus
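
If the job is submitted with a slot range rather than a fixed -pe value, the
script can also recover its allocation from $PE_HOSTFILE, which SGE points at
a file with one line per host: hostname, slots, queue, processor range.  A
sketch of summing the slot column (the sample file below just stands in for
the real $PE_HOSTFILE, so the hostnames are placeholders):

```shell
# Sketch: count total slots from an SGE PE hostfile.
# Format per line: <hostname> <slots> <queue> <processor-range>
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
octopus.nci.nih.gov 4 all.q@octopus.nci.nih.gov UNDEFINED
shakespeare.nci.nih.gov 4 all.q@shakespeare.nci.nih.gov UNDEFINED
EOF
nslots=$(awk '{sum += $2} END {print sum}' "$pe_hostfile")
echo "total slots: $nslots"   # in a real job: mpirun -np "$nslots" ...
rm -f "$pe_hostfile"
```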

If, however, I try the same script with two queue instances:
~> qsub -pe orte 8 -q \
   all.q at octopus.nci.nih.gov,all.q at shakespeare.nci.nih.gov \
   -l arch=lx24-amd64 junksub.sh

I get no output but this in the error file:
[octopus:21290] mca: base: component_find: unable to open ras tm: file
not found (ignored)
[octopus:21290] mca: base: component_find: unable to open pls tm: file
not found (ignored)
[octopus:21303] mca: base: component_find: unable to open ras tm: file
not found (ignored)
[octopus:21303] mca: base: component_find: unable to open pls tm: file
not found (ignored)
error: got no connection within 60 seconds
[octopus:21290] ERROR: A daemon on node shakespeare.nci.nih.gov failed
to start as expected.
[octopus:21290] ERROR: There may be more information available from
[octopus:21290] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[octopus:21290] ERROR: If the problem persists, please restart the
[octopus:21290] ERROR: Grid Engine PE job
[octopus:21290] ERROR: The daemon exited unexpectedly with status 1.

And for completeness, the qstat -t output:

job-ID  prior    name        user    state  submit/start at      queue                           master
-------------------------------------------------------------------------------------------------------
  3565  0.50000  junksub.sh  sdavis  r      10/09/2008 11:03:10  all.q at octopus.nci.nih.gov      MASTER
                                                                 all.q at octopus.nci.nih.gov      SLAVE
                                                                 all.q at octopus.nci.nih.gov      SLAVE
                                                                 all.q at octopus.nci.nih.gov      SLAVE
                                                                 all.q at octopus.nci.nih.gov      SLAVE
  3565  0.50000  junksub.sh  sdavis  r      10/09/2008 11:03:10  all.q at shakespeare.nci.nih.gov  SLAVE
                                                                 all.q at shakespeare.nci.nih.gov  SLAVE
                                                                 all.q at shakespeare.nci.nih.gov  SLAVE
                                                                 all.q at shakespeare.nci.nih.gov  SLAVE
So, it appears that I can use multiple nodes with a plain MPI
hostfile.  I can also run a simple parallel job with mpirun when
submitted via SGE, provided I use only one node (though I can use
multiple processors on that node).  However, a parallel job spanning
multiple nodes submitted via SGE always fails with the master process
(in the case above, running on the host octopus) unable to start a
daemon on the other node.
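
The "got no connection within 60 seconds" message appears to come from SGE's
built-in rsh mechanism while Open MPI tries to start its daemon on the remote
node via qrsh -inherit, so one thing worth checking is whether the PE is
actually configured for tight integration (on the cluster: qconf -sp orte,
and manually, qrsh -inherit shakespeare.nci.nih.gov hostname).  A sketch of
the check; the sample PE definition below is an assumption, not copied from
this cluster:

```shell
# Sketch: verify the PE allows the master to start tasks on slave nodes.
# On a real cluster this would be:  qconf -sp orte | grep control_slaves
# Here a sample PE definition stands in for the qconf output (an assumption).
pe_def=$(cat <<'EOF'
pe_name            orte
slots              999
control_slaves     TRUE
job_is_first_task  FALSE
EOF
)
if echo "$pe_def" | grep -q '^control_slaves *TRUE'; then
    echo "control_slaves is TRUE: qrsh -inherit should be allowed"
else
    echo "control_slaves is FALSE: remote daemon startup will fail"
fi
```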

Any suggestions?

Thanks,
Sean

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



