[GE users] run time intel compiler library libsvml not found

SLIM H.A. h.a.slim at durham.ac.uk
Mon Jan 28 15:58:16 GMT 2008


The application is an MPI example, a Monte Carlo calculation of pi. This is the
error:
 
/usr/local/Cluster-Apps/sge/bin/lx24-amd64/qrsh -V -inherit -nostdin node229 /data/hamilton/drk1has/_tests/montepi/amd64_lnx_intel/./monte node229 46990 \-p4amslave \-p4yourname node229 \-p4rmrank 1
error: executing task of job 95889 failed: 
p0_18320:  p4_error: Child process exited while making connection to remote process on node229: 0
forrtl: error (69): process interrupted (SIGINT)
p0_18320: (8.101562) net_send: could not write to fd=4, errno = 32
 
The ps -e f result from the master node (node229) is attached. 
 
These are the settings in the pe file:
control_slaves    FALSE
job_is_first_task TRUE
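
As a rough aid for comparing PE definitions, the sketch below reads text in the `name value` layout shown above and reports the two settings discussed in this thread. The function name `pe_flags` is hypothetical, and the script is written in POSIX sh rather than csh. It only parses text and does not contact SGE.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): extract control_slaves and
# job_is_first_task from a PE definition read on stdin in "name value" form.
pe_flags() {
    awk '
        $1 == "control_slaves"    { cs = toupper($2) }
        $1 == "job_is_first_task" { jf = toupper($2) }
        END { printf "control_slaves=%s job_is_first_task=%s\n", cs, jf }
    '
}

# Example input: the two lines quoted in this message.
printf 'control_slaves    FALSE\njob_is_first_task TRUE\n' | pe_flags
```

On a live system the input would presumably come from `qconf -sp <pe_name>`. The only combination reported as working further down in this thread is control_slaves TRUE with job_is_first_task FALSE.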

 
Thanks
 
Henk


________________________________

	From: Reuti [mailto:reuti at staff.uni-marburg.de] 
	Sent: 18 January 2008 10:40
	To: users at gridengine.sunsource.net
	Subject: Re: [GE users] run time intel compiler library libsvml not found
	
	
	Am 18.01.2008 um 00:29 schrieb SLIM H.A.:


		I am using mpich1 over Ethernet here. job_is_first_task is FALSE
		and that gives me n-1 instances of the qrsh on the master node. This
		has been the setup all the time. If I change job_is_first_task to
		TRUE the job crashes. This behaviour contradicts the section "Number
		of tasks spread to the nodes". The device is ch_p4.
		
		


	Then there are more options:

	- What application is it? E.g. Turbomole always needs one process
	more than the user wants to use.
	- Are you using -nolocal with mpirun?
	- Can you please post the relevant lines of a `ps -e f` (blank
	between -e and f) from the master node?
	- The job crashes with what type of failure, i.e. which error message?

	-- Reuti



		Thanks
		Henk

________________________________

		From: Reuti [mailto:reuti at Staff.Uni-Marburg.DE]
		Sent: Thu 1/17/2008 6:36 PM
		To: users at gridengine.sunsource.net
		Subject: Re: [GE users] run time intel compiler library libsvml not found
		
		

		Hi,
		
		Am 17.01.2008 um 18:29 schrieb SLIM H.A.:
		
		> Apologies for the long delay in replying. I checked the web page you
		> referred to and the -V option solves the problem, thanks.
		> However I noticed something curious: we use standard MPICH over
		> Ethernet with sge/mpi/startmpi.sh -catch_rsh $pe_hostfile as the PE
		> start script.
		> If I set
		>
		> job_is_first_task TRUE
		
		this will just adjust the number of allowed qrsh calls under control
		of SGE, whether it will be "n" (job_is_first_task FALSE) or
		"n-1" (job_is_first_task TRUE).
		
		Are you using plain MPICH(1) on a) Ethernet or b) on Myrinet?
		
		-- Reuti
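
The rule stated above can be written down as a tiny sketch. The function name `allowed_qrsh_calls` is made up for illustration; it only encodes the "n" versus "n-1" statement for a job granted n slots and is not part of SGE.

```shell
#!/bin/sh
# Hypothetical illustration of the rule stated above:
#   job_is_first_task FALSE -> n   qrsh calls allowed
#   job_is_first_task TRUE  -> n-1 qrsh calls allowed
allowed_qrsh_calls() {
    slots=$1
    first_task=$2
    case "$first_task" in
        TRUE)  echo $((slots - 1)) ;;
        FALSE) echo "$slots" ;;
        *)     echo "unknown job_is_first_task value: $first_task" >&2
               return 1 ;;
    esac
}

allowed_qrsh_calls 4 TRUE    # prints 3
allowed_qrsh_calls 4 FALSE   # prints 4
```

Henk reports seeing n-1 qrsh instances on the master node, which by this rule alone would fit within either setting.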
		
		
		> in the definition of the PE, as suggested on the web page, then MPICH
		> generates error messages. I do have to set
		>
		> control_slaves    TRUE
		> job_is_first_task FALSE
		>
		> to get it to work. Why should this be?
		>
		> Thanks
		>
		> Henk
		>
		>>
		>> Aha, the slave task might not have the LD_LIBRARY_PATH.
		>> Please add a -V to the rsh wrapper:
		>>
		>> http://gridengine.sunsource.net/howto/mpich-integration.html
		>>
		>> which will also solve other issues. And be sure to have a
		>> Tight Integration, i.e. "setenv P4_RSHCOMMAND rsh" to use the
		>> rsh-wrapper.
		>>
		>> -- Reuti
		>>
		>
		>> -----Original Message-----
		>> From: Reuti [mailto:reuti at staff.uni-marburg.de]
		>> Sent: 21 December 2007 13:36
		>> To: users at gridengine.sunsource.net
		>> Subject: Re: [GE users] run time intel compiler library
		>> libsvml not found
		>>
		>> Am 21.12.2007 um 13:04 schrieb SLIM H.A.:
		>>
		>>> Maybe it clarifies if I show the script:
		>>>
		>>> #!/bin/csh
		>>> ... some standard sge options here
		>>> #$ -cwd
		>>> setenv MPICH_PROCESS_GROUP no
		>>> # request submission to a queue for parallel jobs
		>>> #$ -q par.q
		>>> ##$ -S /bin/csh
		>>
		>> This will be just a real comment, as it does not have #$ at the
		>> beginning.
		>>
		>>> #   ^^ no effect
		>>> # set up the mpich version to use
		>>> # load the modules
		>>> module purge
		>>> module load intel/fce/9.0.032 mpich/ge/intel/64/1.2.7 sge/6.0u7_1
		>>> ldd ./monte
		>>> echo LD_LIBRARY_PATH=$LD_LIBRARY_PATH
		>>> # $ -v LD_LIBRARY_PATH=$LD_LIBRARY_PATH
		>>
		>> This you can only use on the command line, where
		>> $LD_LIBRARY_PATH will be expanded by the shell. Here you
		>> should see a literal $LD_LIBRARY_PATH echoed, unless -V is
		>> used (a space between # and $ is also not allowed).
		>>
		>>> #   ^^ no effect
		>>> #$ -V
		>>> #   ^^ only works if the session shell has the module loaded as well
		>>
		>> Seems okay.
		>>
		>>> # execute command
		>>> mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./monte
		>>>
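
The inline remarks on the script above come down to one rule: only lines that begin with exactly `#$` are read as embedded options. A minimal sketch of that rule follows, assuming the default `#$` prefix. The function name `is_sge_directive` is hypothetical, and the sketch is in POSIX sh rather than csh.

```shell
#!/bin/sh
# Hypothetical classifier: would a job-script line be read as an
# embedded option? Assumes the default directive prefix "#$".
is_sge_directive() {
    case "$1" in
        '#$'*) return 0 ;;   # starts with exactly "#$"
        *)     return 1 ;;   # "##$", "# $" and everything else
    esac
}

for line in '#$ -cwd' '##$ -S /bin/csh' '# $ -v LD_LIBRARY_PATH=x' '#$ -V'; do
    if is_sge_directive "$line"; then
        echo "directive: $line"
    else
        echo "comment:   $line"
    fi
done
```

Under this assumption, `##$ -S /bin/csh` and `# $ -v ...` are both classified as plain comments, matching the "no effect" notes in the script.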
		>>> I built monte with
		>>>
		>>> module purge
		>>> module load intel/fce/9.0.032 mpich/ge/intel/64/1.2.7
		>>> mpif90 monte.f90 -o monte
		>>>
		>>> These are snippets from the output file ...
		>>>         libsvml.so => /usr/local/Cluster-Apps/intel/fce/9.0//lib/libsvml.so (0x00002b21417de000)
		>>> ...
		>>> LD_LIBRARY_PATH=/usr/local/lib:/usr/X11R6/lib:/usr/local/Cluster-Apps/intel/fce/9.0//lib:/usr/local/Cluster-Apps/mpich/ge/intel/64/1.2.7/lib/shared:/usr/local/Cluster-Apps/sge/lib/lx26-amd64
		>>> /usr/local/Cluster-Apps/sge/bin/lx24-amd64/qrsh -inherit -nostdin node231 /data/hamilton/drk1has/hamilton_montepi/amd64_lnx_ifort/./monte node231 50375 \-p4amslave \-p4yourname node231 \-p4rmrank 1
		>>>
		>>> /data/hamilton/drk1has/hamilton_montepi/amd64_lnx_ifort/./monte: error while loading shared libraries: libsvml.so: cannot open shared object file: No such file or directory ...
		>>
		>> Aha, the slave task might not have the LD_LIBRARY_PATH.
		>> Please add a -V to the rsh wrapper:
		>>
		>> http://gridengine.sunsource.net/howto/mpich-integration.html
		>>
		>> which will also solve other issues. And be sure to have a
		>> Tight Integration, i.e. "setenv P4_RSHCOMMAND rsh" to use the
		>> rsh-wrapper.
		>>
		>> -- Reuti
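
To make the suggested change concrete, here is a minimal sketch of the command line an rsh wrapper would end up running once -V is added. It mirrors the qrsh call shown in the output above. The function name `build_qrsh_cmd` is hypothetical, it is not the wrapper shipped with SGE, and it only prints the command instead of executing it.

```shell
#!/bin/sh
# Hypothetical sketch, not the wrapper shipped with SGE: print the
# qrsh command line that a wrapper would run for one slave task.
build_qrsh_cmd() {
    rhost=$1
    shift
    # -V exports the caller's environment (including LD_LIBRARY_PATH)
    # to the task started on the remote host.
    echo "qrsh -V -inherit -nostdin $rhost $*"
}

build_qrsh_cmd node231 ./monte node231 50375
# prints: qrsh -V -inherit -nostdin node231 ./monte node231 50375
```

In the csh job script, `setenv P4_RSHCOMMAND rsh` is what makes MPICH call the wrapper in the first place, as noted above.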
		>>
		>>
		>> ---------------------------------------------------------------------
		>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
		>> For additional commands, e-mail: users-help at gridengine.sunsource.net
		>>
		>>
		>
		>
		>
		
		
	
		
		




    [ Part 2, "ps-ef-failed-job-95889.txt", Text/PLAIN, ~886 bytes. ]
    [ Unable to print this part. ]




