[GE users] run time intel compiler library libsvml not found

Reuti reuti at staff.uni-marburg.de
Tue Jan 29 01:08:38 GMT 2008


Hi,

Am 28.01.2008 um 16:58 schrieb SLIM H.A.:

> The application is an MPI example, a Monte Carlo calculation of pi.
> This is the error:
>
> /usr/local/Cluster-Apps/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node229 /data/hamilton/drk1has/_tests/montepi/amd64_lnx_intel/./monte node229 46990 \-p4amslave \-p4yourname node229 \-p4rmrank 1
> error: executing task of job 95889 failed:
> p0_18320:  p4_error: Child process exited while making connection to remote process on node229: 0
> forrtl: error (69): process interrupted (SIGINT)
> p0_18320: (8.101562) net_send: could not write to fd=4, errno = 32

the name "devmyri" in your ps listing has nothing to do with Myrinet?  
To me this looks like a qrsh from node229 to node229 (the node229  
after the programname and before the port number is the origin) which  
is done by Myrinet.
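
For reference, such a ch_p4 slave start line can be read roughly like
this (my annotation, following the convention that the host after the
program name is the origin and the number after it the port):

    .../qrsh -V -inherit -nostdin node229 \
        .../monte node229 46990 \-p4amslave \-p4yourname node229 \-p4rmrank 1

    node229 (1st)  - the host qrsh starts the task on
    node229 (2nd)  - the origin host the slave connects back to
    46990          - the port on the origin host

so here the task is indeed started from node229 onto node229 itself.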

-- Reuti


> The ps -e f result from the master node (node229) is attached.
>
> These are the settings in the pe file:
> control_slaves    FALSE
> job_is_first_task TRUE
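
For comparison, a tight-integration PE per the mpich-integration HOWTO
would look roughly like this (a sketch only: the PE name, slot count
and paths are placeholders for your site, and note that control_slaves
must be TRUE for qrsh -inherit to be accepted at all):

    $ qconf -sp mpich
    pe_name           mpich
    slots             64
    user_lists        NONE
    xuser_lists       NONE
    start_proc_args   /usr/local/Cluster-Apps/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args    /usr/local/Cluster-Apps/sge/mpi/stopmpi.sh
    allocation_rule   $fill_up
    control_slaves    TRUE
    job_is_first_task FALSE
    urgency_slots     min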
>
> Thanks
>
> Henk
>
> From: Reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: 18 January 2008 10:40
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] run time intel compiler library libsvml not  
> found
>
> Am 18.01.2008 um 00:29 schrieb SLIM H.A.:
>
>> I am using mpich1 over Ethernet here. job_is_first_task is FALSE,
>> and that gives me n-1 instances of qrsh on the master node. This has
>> been the setup all the time. If I change job_is_first_task to TRUE,
>> the job crashes. This behaviour contradicts the section "Number of
>> tasks spread to the nodes". The device is ch_p4.
>
> Then there are more options:
>
> - What application is it? E.g. Turbomole always needs one process
> more than the user wants to use.
> - Are you using -nolocal with mpirun?
> - Can you please post the relevant lines of a `ps -e f` (blank
> between -e and f) from the master node?
> - The job crashes with what type of failure, i.e. which error message?
>
> -- Reuti
>
>
>> Thanks
>> Henk
>>
>> From: Reuti [mailto:reuti at Staff.Uni-Marburg.DE]
>> Sent: Thu 1/17/2008 6:36 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] run time intel compiler library libsvml  
>> not found
>>
>> Hi,
>>
>> Am 17.01.2008 um 18:29 schrieb SLIM H.A.:
>>
>> > Apologies for the long delay to reply. I checked the web page you
>> > referred to and the -V option solves the problem, thanks.
>> > However I noticed something curious: we use standard MPICH over
>> > ethernet
>> > with sge/mpi/startmpi.sh -catch_rsh $pe_hostfile as the PE start
>> > script.
>> > If I set
>> >
>> > job_is_first_task TRUE
>>
>> this just adjusts the number of qrsh calls allowed under the
>> control of SGE: "n" (job_is_first_task FALSE) or
>> "n-1" (job_is_first_task TRUE).
>>
>> Are you using plain MPICH(1) on a) Ethernet or b) on Myrinet?
>>
>> -- Reuti
>>
>>
>> > in the definition of the PE, as suggested on the web page, then MPICH
>> > generates error messages. I do have to set
>> >
>> > control_slaves    TRUE
>> > job_is_first_task FALSE
>> >
>> > to get it to work. Why should this be?
>> >
>> > Thanks
>> >
>> > Henk
>> >
>> >>
>> >> Aha, the slave task might not have the LD_LIBRARY_PATH.
>> >> Please add a -V to the rsh wrapper:
>> >>
>> >> http://gridengine.sunsource.net/howto/mpich-integration.html
>> >>
>> >> which will also solve other issues. And be sure to have a
>> >> Tight Integration, i.e. "setenv P4_RSHCOMMAND rsh" to use the
>> >> rsh-wrapper.
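
A minimal sketch of both pieces (the wrapper script from the HOWTO
differs between SGE versions, so treat the variable names as
illustrative; the essential change is only the added -V on the qrsh
call):

    # at the end of the rsh wrapper: exec qrsh with -V so that the
    # job's environment (incl. LD_LIBRARY_PATH) reaches the slave task
    exec $SGE_ROOT/bin/$ARC/qrsh -V -inherit -nostdin $rhost $cmd

and in the csh job script, before mpirun:

    # let MPICH's ch_p4 device call "rsh", which should resolve to the
    # wrapper that startmpi.sh -catch_rsh placed in $TMPDIR (as far as
    # I know $TMPDIR is already first in the job's PATH)
    setenv P4_RSHCOMMAND rsh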
>> >>
>> >> -- Reuti
>> >>
>> >
>> >> -----Original Message-----
>> >> From: Reuti [mailto:reuti at staff.uni-marburg.de]
>> >> Sent: 21 December 2007 13:36
>> >> To: users at gridengine.sunsource.net
>> >> Subject: Re: [GE users] run time intel compiler library
>> >> libsvml not found
>> >>
>> >> Am 21.12.2007 um 13:04 schrieb SLIM H.A.:
>> >>
>> >>> Maybe it clarifies if I show the script:
>> >>>
>> >>> #!/bin/csh
>> >>> ... some standard sge options here
>> >>> #$ -cwd
>> >>> setenv MPICH_PROCESS_GROUP no
>> >>> # request submission to a queue for parallel jobs
>> >>> #$ -q par.q
>> >>> ##$ -S /bin/csh
>> >>
>> >> This will be just a plain comment, since the line doesn't start with #$.
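
To illustrate the difference (my example, not from the script above):

    #$ -S /bin/csh      <- parsed by qsub as an embedded option
    ##$ -S /bin/csh     <- plain comment, silently ignored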
>> >>
>> >>> #   ^^ no effect
>> >>> # set up the mpich version to use
>> >>> # load the modules
>> >>> module purge
>> >>> module load intel/fce/9.0.032 mpich/ge/intel/64/1.2.7 sge/6.0u7_1
>> >>> ldd ./monte
>> >>> echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH
>> >>> # $ -v LD_LIBRARY_PATH=$LD_LIBRARY_PATH
>> >>
>> >> This you can only use on the command line, where
>> >> $LD_LIBRARY_PATH will be expanded by the shell. Here you
>> >> should see a literal $LD_LIBRARY_PATH echoed, unless -V is
>> >> used (the space between # and $ is also not allowed).
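
The working variants would be along these lines (the script name is a
placeholder):

    # expand at submission time: the submitting shell substitutes
    # the current value into the -v option
    qsub -v LD_LIBRARY_PATH=$LD_LIBRARY_PATH job.csh

    # or simply export the complete submission environment to the job
    qsub -V job.csh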
>> >>
>> >>> #   ^^ no effect
>> >>> #$ -V
>> >>> #   ^^ only works if the session shell has the module loaded as well
>> >>
>> >> Seems okay.
>> >>
>> >>> # execute command
>> >>> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./monte
>> >>>
>> >>> I built monte with
>> >>>
>> >>> module purge
>> >>> module load intel/fce/9.0.032 mpich/ge/intel/64/1.2.7
>> >>> mpif90 monte.f90 -o monte
>> >>>
>> >>> These are snippets from the output file ...
>> >>>         libsvml.so => /usr/local/Cluster-Apps/intel/fce/9.0//lib/libsvml.so (0x00002b21417de000)
>> >>> ...
>> >>> LD_LIBRARY_PATH=/usr/local/lib:/usr/X11R6/lib:/usr/local/Cluster-Apps/intel/fce/9.0//lib:/usr/local/Cluster-Apps/mpich/ge/intel/64/1.2.7/lib/shared:/usr/local/Cluster-Apps/sge/lib/lx26-amd64
>> >>> /usr/local/Cluster-Apps/sge/bin/lx24-amd64/qrsh -inherit -nostdin node231 /data/hamilton/drk1has/hamilton_montepi/amd64_lnx_ifort/./monte node231 50375 \-p4amslave \-p4yourname node231 \-p4rmrank 1
>> >>> /data/hamilton/drk1has/hamilton_montepi/amd64_lnx_ifort/./monte: error while loading shared libraries: libsvml.so: cannot open shared object file: No such file or directory
>> >>> ...
>> >>
>> >> Aha, the slave task might not have the LD_LIBRARY_PATH.
>> >> Please add a -V to the rsh wrapper:
>> >>
>> >> http://gridengine.sunsource.net/howto/mpich-integration.html
>> >>
>> >> which will also solve other issues. And be sure to have a
>> >> Tight Integration, i.e. "setenv P4_RSHCOMMAND rsh" to use the
>> >> rsh-wrapper.
>> >>
>> >> -- Reuti
>> >>
>
> <ps-ef-failed-job-95889.txt>



