[GE users] OpenMPI job stays on one node

sgexav xaviercouvelard at gmail.com
Mon Sep 7 12:32:23 BST 2009



reuti wrote:
> Hi,
>
> On 07.09.2009, at 13:05, sgexav wrote:
>
>   
>> reuti wrote:
>>     
>>> Hi,
>>>
>>> On 07.09.2009, at 12:04, sgexav wrote:
>>>
>>>
>>>       
>>>> Hi,
>>>> I am running a parallel code with Open MPI.
>>>> The code needs 12 processes, but each node has only 8 processors.
>>>>
>>>> With the classic mpirun -np 12 ........ it works perfectly.
>>>>
>>>> But with SGE, all 12 processes stay on one node.
>>>>
>>>> qsub script.sh
>>>> Here is my submission script:
>>>> #!/bin/sh
>>>> #
>>>> # job name
>>>> #$ -N ROMS_TEST
>>>> #
>>>> # Use current working directory
>>>> #$ -cwd
>>>> #
>>>> # Join stdout and stderr
>>>> #$ -j y
>>>>
>>>> # define queue
>>>> #$ -q all.q
>>>>
>>>> #$ -pe mpi 12
>>>>
>>>> # Run job through bash shell
>>>> #$ -S /bin/bash
>>>> #
>>>> echo "Got $NSLOTS processors."
>>>> echo "Machines:"
>>>> cat $TMPDIR/machines
>>>>
>>>> export PATH=/opt/openmpi-1.3.3/bin/:$PATH
>>>> export LD_LIBRARY_PATH=/opt/openmpi-1.3.3/lib:/opt/intel/Compiler/11.1/046/lib/intel64:$LD_LIBRARY_PATH
>>>> #export OMPI_MCA_pls_rsh_agent=rsh
>>>>
>>>> mpirun -np $NSLOTS  -hostfile $TMPDIR/machines ./roms roms.in
>>>>
>>>>         
>>> a) Did you compile Open MPI with SGE support?
>>>
>>> b) What does your PE look like? Did you follow the Howto for SGE on
>>> the Open MPI site (http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge)?
>>>
>>>
>>>       
>> And if I use "-pe orte 12", which seems to be the right one,
>>
>> I get:
>> Got 12 processors.
>> Machines:
>> cat: /tmp/38.1.all.q/machines: No such file or directory
>>     
>
> As Lydia wrote: you don't need this argument, just leave the option
> -machinefile ... out. Open MPI will detect the granted nodes on its
> own from the original pe_hostfile. The $TMPDIR/machines file would be
> created by the start_proc_args for other MPI libraries, but that can be
> left out here, hence the file won't be created.
>   
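To make Reuti's point concrete: with the orte PE, Open MPI takes the granted hosts from the pe_hostfile that SGE writes for the job (its path is in $PE_HOSTFILE), so no machines file is needed. The sketch below only illustrates roughly what a start_proc_args script for other MPI libraries does with that file. It assumes the usual pe_hostfile layout of one line per host (host name, granted slots, queue, processor range); the function names total_slots and pe_to_machines are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.
# Assumed pe_hostfile line layout: <host> <slots> <queue> <processor range>

# Print the total number of granted slots (normally the same as $NSLOTS).
total_slots() {
    awk '{ sum += $2 } END { print sum + 0 }' "$1"
}

# Expand the pe_hostfile into a machines file with one line per slot,
# roughly what $TMPDIR/machines looks like when a start_proc_args
# script creates it for other MPI libraries.
pe_to_machines() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Usage inside a job script (not needed with the orte PE):
#   total_slots "$PE_HOSTFILE"
#   pe_to_machines "$PE_HOSTFILE" > "$TMPDIR/machines"
```

With tight integration Open MPI does this bookkeeping itself, which is why "-hostfile $TMPDIR/machines" can simply be dropped from the mpirun line.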

OK, doing it that way, with "-pe orte" and without the machinefile in the
mpirun command, I see my run starting on the nodes, but I get this error:
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
error:
error reading job context from "qlogin_starter"
--------------------------------------------------------------------------
A daemon (pid 11082) died unexpectedly with status 1 while attempting
to launch so we are aborting.

What does it mean?
Xav.
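Reuti's two questions earlier in the thread (is SGE support compiled into Open MPI, and does the PE follow the Open MPI howto) can be checked from the shell. A minimal sketch, assuming that ompi_info lists an "MCA ras: gridengine" component when SGE support is built in, and that the howto's example PE sets control_slaves TRUE and job_is_first_task FALSE; the function names are hypothetical. These checks only cover the two points Reuti raised; they do not by themselves diagnose the qlogin_starter errors.

```shell
#!/bin/sh
# Hypothetical helpers that read command output on stdin.

# a) Succeeds if "ompi_info" output lists the gridengine RAS component,
#    i.e. (assumed) Open MPI was built with SGE support.
has_gridengine_support() {
    grep -q 'ras: gridengine'
}

# b) Succeeds if "qconf -sp <pe>" output has the two settings assumed
#    here from the Open MPI howto for tight integration.
pe_allows_tight_integration() {
    awk '
        $1 == "control_slaves"    { cs  = $2 }
        $1 == "job_is_first_task" { jft = $2 }
        END { exit !(cs == "TRUE" && jft == "FALSE") }
    '
}

# Usage:
#   ompi_info | has_gridengine_support && echo "SGE support: yes"
#   qconf -sp orte | pe_allows_tight_integration && echo "PE orte: ok"
```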
>
>   
>> --------------------------------------------------------------------------
>> Open RTE was unable to open the hostfile:
>>     /tmp/38.1.all.q/machines
>> Check to make sure the path and filename are correct.
>> .....
>> Xav
>>     
>>> -- Reuti
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216223
>>>
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>
>>>       