[GE users] Integration of the MPICH2 and SGE

gqc606 gqc606 at hotmail.com
Fri May 28 04:20:05 BST 2010


Hi Reuti:
  My operating system is Rocks 5.3, the MPICH2 version is mpich2-1.0.8p1-0, and I can start MPICH2 manually without problems.
 
  I found the SGE "messages" logs on the frontend and on both compute nodes:
  On the frontend(/opt/gridengine/default/spool/qmaster/messages):
05/27/2010 16:26:29|worker|cluster|E|tightly integrated parallel task 256.1 task 1.compute-0-1 failed - killing job
05/27/2010 16:27:33|worker|cluster|W|job 256.1 failed on host compute-0-1.local general in pestart because: 05/27/2010 16:26:16 [400:24438]: exit_status of pe_start = 1
05/27/2010 16:27:33|worker|cluster|E|queue all.q marked QERROR as result of job 256's failure at host compute-0-1.local
05/27/2010 16:27:42|worker|cluster|E|denied: job "256" does not exist

  On compute-0-1 node(/opt/gridengine/default/spool/compute-0-1/messages):
05/27/2010 16:26:17|  main|compute-0-1|E|shepherd of job 256.1 exited with exit status = 10
05/27/2010 16:26:17|  main|compute-0-1|W|reaping job "256" ptf complains: Job does not exist
05/27/2010 16:26:29|  main|compute-0-1|W|reaping job "256" ptf complains: Job does not exist
05/27/2010 16:26:29|  main|compute-0-1|E|can't open file active_jobs/256.1/1.compute-0-1/error: No such file or directory
 
I can find the directory /opt/gridengine/default/spool/compute-0-0/active_jobs; it seems that Rocks does not create a file named "/256.1/1.compute-0-1" when I submit the script.
 
 
 There is another problem: when I submitted the script, it seems that all the processes were assigned to only one of the compute nodes. For job "256" above, the error messages appear only in compute-0-1's "messages" log; I cannot find any similar messages for job "256" in compute-0-0's "messages" log. I don't know whether this is a problem with SGE itself or whether my configuration files need to be changed. Can you give me some advice? Thanks!
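[Editor's note: the pe_hostfile echoed in the quoted output below lists each granted host with its slot count. A minimal sketch of how a PE start script would walk that file and confirm the slot distribution across nodes; the sample file path and contents are illustrative, not taken from the cluster:

```shell
# Hypothetical sample of the host:slots lines shown in the quoted job output.
# In a real job the path comes from SGE via the $PE_HOSTFILE variable.
cat > /tmp/pe_hostfile.sample <<'EOF'
compute-0-0:3
compute-0-1:3
EOF

# The first host listed is where pe_start runs, so the local mpd must
# come up there before the remaining hosts are contacted via qrsh -inherit.
while IFS=: read -r host slots; do
    echo "host=$host slots=$slots"
done < /tmp/pe_hostfile.sample
```

If all processes really land on one node, the first thing to check is whether this file lists both hosts with the expected slot counts.]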






> Hi,
> 
> Am 27.05.2010 um 15:07 schrieb gqc606:
> 
> > Hi reuti:
> >  That was my mistake: I had only modified the script (startmpich2.sh) on the front-end node and forgot to edit it on the compute nodes. Now I have modified the script on all nodes.
> > 
> > But when I submit the script, it still produces the following error:
> > -catch_rsh /opt/gridengine/default/spool/compute-0-0/active_jobs/254.1/pe_hostfile /opt/mpich2/gnu
> > compute-0-0:3
> > compute-0-1:3
> > startmpich2.sh: check for local mpd daemon (1 of 10)
> > /opt/gridengine/bin/lx26-x86/qrsh -inherit -V compute-0-0 /opt/mpich2/gnu/bin/mpd
> 
> So the loop is doing the right thing in the first iteration and tries to start the local daemon on node compute-0-0. The question is why it's failing. Can you start an mpd by hand on this machine? Which version of MPICH2 is installed?
> 
> 
> > startmpich2.sh: check for local mpd daemon (2 of 10)
> > startmpich2.sh: check for local mpd daemon (3 of 10)
> > startmpich2.sh: check for local mpd daemon (4 of 10)
> > startmpich2.sh: check for local mpd daemon (5 of 10)
> > startmpich2.sh: check for local mpd daemon (6 of 10)
> > startmpich2.sh: check for local mpd daemon (7 of 10)
> > startmpich2.sh: check for local mpd daemon (8 of 10)
> > startmpich2.sh: check for local mpd daemon (9 of 10)
> > startmpich2.sh: check for local mpd daemon (10 of 10)
> > startmpich2.sh: local mpd could not be started, aborting
> > -catch_rsh /opt/mpich2/gnu
> > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_test_sge_254.undefined); possible causes:
> >  1. no mpd is running on this host
> >  2. an mpd is running but was started without a "console" (-n option)
> > In case 1, you can start an mpd on this host with:
> >    mpd &
> > and you will be able to run jobs just on this host.
> > For more details on starting mpds on a set of hosts, see
> > the MPICH2 Installation Guide.
> > error: error: ending connection before all data received
> > error: 
> > error reading job context from "qlogin_starter"
> > 
> > And I searched for "qlogin_starter" but could not find it at all. I don't know whether this is a problem with my MPICH2 or with SGE. Can you give me some advice? Thanks
> 
> To achieve a Tight Integration, the necessary setup will make a local `qrsh` call. Is "job_is_first_task FALSE" set in the PE? What type of communication did you set up in SGE's configuration: -builtin-, classic rsh, or ssh?
> 
> -- Reuti
> 
> 
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259007
> > 
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=259214

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
