[GE users] suspension under MPICH2 tight integration

Jason Crane jasonc at mrsc.ucsf.edu
Thu May 18 00:20:18 BST 2006


On Wed, 2006-05-17 at 23:58 +0200, Reuti wrote:
> On 17.05.2006 at 23:37, Jason Crane wrote:
> 
> > Hi,
> >
> > On Tue, 2006-05-16 at 23:35 +0200, Reuti wrote:
> >
> >>> 1. The MPICH2 user's guide documentation indicates that it is possible
> >>> to suspend and continue MPICH2 jobs, at least under mpd process
> >>> management (nothing explicit about smpd).  However, in a previous post
> >>> it was mentioned that MPI suspend isn't supported for slave tasks under
> >>> SGE because of timing problems:
> >>> (http://gridengine.sunsource.net/servlets/
> >>> ReadMsglistName=users&msgNo=15354)
> >>> If standalone MPICH2 suspension is supported, then is the "timing
> >>> problem" introduced by the integration with SGE, or perhaps it's related
> >>> to using smpd?  Is there anything I need to worry about if I attempt to
> >>> implement a custom suspend/resume method for suspending slave tasks
> >>> under tight integration with the MPICH2 smpd daemonless parallel
> >>> environment?
> >>
> >> just try and let us know your results. Do you want to suspend it by
> >> hand or with another parallel job (with the same allocation of nodes
> >> - how?)?
> >
> > I'm observing (in the trace file) that the custom suspend_method is not
> > executed for slave nodes within an MPI job, but rather only for the
> > master node.  Do you know if there is a way to override this behavior at
> 
> This is the intended behavior. You have to use rsh/ssh inside the
> master node's custom suspend_method to do something on the slave
> nodes. What is MPICH2 expecting: that all involved processes get a
> SIGSTOP at nearly the same time?
Hi,

I don't know the specific MPICH2 job suspension requirements just yet.
However, the trouble is that I would like to be able to suspend an MPI
job on a subordinate queue when a batch job is submitted to a
higher-priority queue, but the batch job may be running on an arbitrary
node, not necessarily the master node of the subordinate PE job.  In that
case, if the custom suspend_methods on the subordinate queues are not
executed for slave nodes, I'm not sure how to initiate the signaling.
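
For reference, the rsh/ssh fan-out that Reuti describes for the master
node's suspend_method might look roughly like the sketch below. It is only
an illustration under several assumptions, none of them confirmed in this
thread: that the script can read the PE hostfile (one line per host, host
name in the first column), that passwordless ssh to the slave nodes works,
and that matching processes by owner and exact program name with pkill is
good enough. The function names, host names and "my_mpi_app" are
hypothetical. It does not address the subordinate-queue case above, where
the suspension is triggered on a slave node.

```python
import subprocess


def parse_pe_hostfile(text):
    """Return the unique host names (first column) in order of appearance."""
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] not in hosts:
            hosts.append(fields[0])
    return hosts


def build_signal_commands(hosts, master, owner, program, signal_name="STOP"):
    """Build one ssh command per host other than the master.

    Stopping the processes on the master host itself is left out of this
    sketch. Matching by owner and exact program name is an assumption and
    would also hit unrelated processes with the same name.
    """
    commands = []
    for host in hosts:
        if host == master:
            continue
        commands.append(
            ["ssh", host, "pkill", "-" + signal_name, "-u", owner, "-x", program]
        )
    return commands


def signal_slaves(commands, dry_run=True):
    """Print the commands, or run them one after another if dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: pkill exits non-zero when nothing matched.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # Hypothetical hostfile contents and job details, for illustration only.
    hostfile = "node1 2 all.q@node1 UNDEFINED\nnode2 2 all.q@node2 UNDEFINED\n"
    hosts = parse_pe_hostfile(hostfile)
    signal_slaves(build_signal_commands(hosts, "node1", "jasonc", "my_mpi_app"))
```

A matching resume method could reuse the same functions with
signal_name="CONT". Because the ssh calls run one after another, the
signals do not arrive at the same time on all nodes, which may or may not
matter for MPICH2.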

-Jason


> 
> -- Reuti
> 
> > run-time, or does it need to be handled at the source code level?  If
> > so, do you have any hints about where to look?
> >
> > thanks,
> > Jason
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
-- 
___________________________________________
Jason Crane, Ph.D.
UCSF Radiology MC 2532
CA Institute for Quantitative Biomedical Research
Byers Hall Suite 301
1700 4th St.
San Francisco, CA 94158-2330
e-mail: jasonc at mrsc.ucsf.edu
tel: 415.514.4426
FAX: 415.514.2550

