[GE users] problem when parallel environment startup script fails

reuti reuti at staff.uni-marburg.de
Tue Sep 1 16:27:31 BST 2009


Hi,

On 01.09.2009 at 17:15, cjf001 wrote:

> SGEers:
>
> I'm implementing a new parallel environment - all its startup script has
> to do is make sure that only one host has been allocated by SGE (*) - so
> that's easy. However, if the script finds otherwise, I want to signal
> an error to SGE.
>
> The manual says that exiting the parallel environment startup script
> with an exit code other than 0 will cause SGE to "report the error and

What exit value do you use exactly? If it's 100, the job (and not the
queue) should go into error state, so the job won't be rescheduled.
There is also a setting in sge_conf to allow or disallow this
behavior.
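
To illustrate, here is a minimal sketch of such a check in Python. It assumes the PE's start_proc_args passes the path of $pe_hostfile as the first argument, and that each line of that file starts with the host name. The function names and the sample host names are made up for illustration, and whether exit code 100 really leaves the queue alone should be verified on your installation.

```python
#!/usr/bin/env python3
"""Sketch: decide the exit code of a PE start script that allows one host only."""

# Value mentioned above; assumed to put the job (not the queue) into error state.
JOB_ERROR_EXIT = 100


def allocated_hosts(hostfile_text):
    """Return the distinct host names found in pe_hostfile-style text.

    Assumes one allocation per line, with the host name in the first column.
    """
    hosts = []
    for line in hostfile_text.splitlines():
        fields = line.split()
        if fields and fields[0] not in hosts:
            hosts.append(fields[0])
    return hosts


def exit_code_for(hostfile_text):
    """Return 0 if exactly one host was allocated, JOB_ERROR_EXIT otherwise."""
    if len(allocated_hosts(hostfile_text)) == 1:
        return 0
    return JOB_ERROR_EXIT
```

A start script built on this would read the file named by its first argument and call sys.exit(exit_code_for(text)), so a multi-host allocation ends with 100 instead of some other non-zero value.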

-- Reuti

> not start the parallel job". Sounds good, so I tried it - and it *does*
> report an error by putting the host queue that would have gotten the
> job into ERROR state. Kind of drastic, but I guess I can live with that.
>
> However, it leaves the job in the pending list. So, on the next scheduler
> run, it assigns the job to *another* host queue, where it fails again, and
> leaves that host queue in ERROR state. And on and on until all the
> host queues I have permission for are in ERROR state. That I cannot
> live with!
>
> So, my question is, has anyone successfully signalled SGE when a parallel
> environment startup script fails? If so, how'd you do it? Also, is
> this a bug, or is it working as designed? Am I missing something?
>
> I thought about "qdel'ing" the job from within the startup script, but I
> don't think that will work, since the execute hosts (which I think it's
> running on at that point) are not submit hosts, so such commands are not
> allowed. Any other thoughts? I'm using SGE v6.2u2.
>
>      Thanks !
>
>        John
>
>
>
> (*) - why, you ask? Because the application (Momentum) takes over all
> the cores on the assigned host, but doesn't run across hosts - so, it's not
> a *real* parallel job, but I need it to be assigned to all the cores on
> the host.
>
>
> -- 
> ###########################################################################
> # John Foley                          # Location:  IL93-E1-21S            #
> # IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
> # Antenna & Mechanical Simulation Grp #    Email:  john.foley at motorola.com #
> # Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
> # 600 North US Highway 45             #      Fax: (847) 523-5767          #
> # Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
> ###########################################################################
>                  (this email sent using Mozilla on Windows)
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=215323
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=215324
