[GE users] [OT] Running R under SGE and OpenMPI

Sean Davis sdavis2 at mail.nih.gov
Thu Oct 9 15:16:03 BST 2008



On Thu, Oct 9, 2008 at 9:55 AM, Ron Chen <ron_chen_123 at yahoo.com> wrote:
> --- On Thu, 10/9/08, Davide Cittaro wrote:
>> I've used Rmpi with SGE and LAM tight integration, and I've had some
>> issues with both batch and interactive jobs:
>> - batch jobs: I had to specify in the R script how many nodes to
>> spawn with Rmpi... if you submit your job with a slot range, you have
>> to work out within R how many slots you were actually given.
>> Otherwise you can run qsub with a fixed -pe value, but then the nodes
>> may not be available...
>
> Can you pass the number of slots to R, or does R need to have it hard-coded somewhere before it even starts?
>

No.  Under ordinary MPI, there is a function mpi.spawn.Rslaves() that
will spawn the maximum number of slaves possible given the hostfile.
Under tight integration, I would expect the same behavior, with the
number of slaves and the machines they run on defaulting to whatever
SGE provides in place of the hostfile.
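Roughly, what I would hope works on the R side is something like this
(an untested sketch; it assumes Rmpi is built against the OpenMPI that
SGE launched, and it falls back to SGE's NSLOTS environment variable):

library(Rmpi)

# Under tight integration, mpi.universe.size() should reflect the slots
# SGE granted; fall back to NSLOTS if it does not.
n <- mpi.universe.size()
if (n <= 1) n <- as.integer(Sys.getenv("NSLOTS", unset = "1"))

# Keep one slot for the master and spawn the rest as slaves.
mpi.spawn.Rslaves(nslaves = n - 1)
mpi.remote.exec(paste("slave", mpi.comm.rank(), "on", Sys.info()["nodename"]))
mpi.close.Rslaves()
mpi.quit()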

> Within the job, SGE sets the number of slots the job has; can you just pass that into R? Or are we talking about different things!?
>
>
>> - interactive jobs: I'm still working on making R aware that LAM has
>> been launched... :-( probably OpenMPI is a different situation...
>
> LAM-MPI is not in active development. But OpenMPI is actively developed, and supports tight integration with SGE. Can't we treat R/Rmpi as yet another OpenMPI application?
>
> Or is there something special in Rmpi?

There shouldn't be anything special about Rmpi, except that only the
master should run interactively and the slaves are started from within
that interactive session.  So the simple question is: if I do

qrsh -pe orte 16

can I then start R on a single processor, leaving the other slots open
for the slaves?

mpirun -np 1 R --vanilla
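and then, inside that R session, something like the following (again a
sketch, assuming Rmpi actually sees the 16-slot allocation):

library(Rmpi)
# One slot is taken by this interactive master, so spawn slaves
# onto the remaining slots on the SGE-allocated machines.
mpi.spawn.Rslaves(nslaves = mpi.universe.size() - 1)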

In any case, while troubleshooting this, I have been trying simpler
things like:

qrsh -pe orte 16 mpirun -np 16 hostname

and I get:
[shakespeare:13857] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:13857] mca: base: component_find: unable to open pls tm:
file not found (ignored)
[shakespeare:13870] mca: base: component_find: unable to open ras tm:
file not found (ignored)
[shakespeare:13870] mca: base: component_find: unable to open pls tm:
file not found (ignored)
^Terror: executing task of job 3556 failed: failed sending task to
execd at octopus.nci.nih.gov: can't find connection
[shakespeare:13857] ERROR: A daemon on node octopus.nci.nih.gov failed
to start as expected.
[shakespeare:13857] ERROR: There may be more information available from
[shakespeare:13857] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[shakespeare:13857] ERROR: If the problem persists, please restart the
[shakespeare:13857] ERROR: Grid Engine PE job
[shakespeare:13857] ERROR: The daemon exited unexpectedly with status 1.

A qstat -t shows:
job-ID  prior    name    user    state  submit/start at      queue                          master  ja-task-ID  task-ID        state  cpu       mem      io       stat failed
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  3556  0.50000  mpirun  sdavis  r      10/09/2008 10:04:42  all.q@shakespeare.nci.nih.gov  MASTER                             r      00:00:00  0.00692  0.00000
                                                             all.q@shakespeare.nci.nih.gov  SLAVE               1.shakespeare  r      00:00:00  0.00139  0.00000
                                                             all.q@shakespeare.nci.nih.gov  SLAVE   (7 more SLAVE rows on shakespeare)
  3556  0.50000  mpirun  sdavis  r      10/09/2008 10:04:42  all.q@octopus.nci.nih.gov      SLAVE   (8 SLAVE rows on octopus)

So, it appears I probably have larger problems.

Sean
