[GE users] problem when parallel environment startup script fails

cjf001 john.foley at motorola.com
Tue Sep 1 16:58:03 BST 2009


Ah ha !  You are correct, sir ! ;)

I was exiting with "1" when I detected an error - when I exit with "100"
it works as you describe - the job goes into ERROR state, and not
the queues.
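
For anyone finding this in the archive, here is a minimal sketch of the kind of check I mean, not my actual script. It assumes the hostfile path is handed to the script (for example via `$pe_hostfile` in the PE's `start_proc_args`) and that the host name is the first whitespace-separated field on each line. The function names are made up for illustration.

```python
import sys

# Exit code 100 puts the job (not the queue) into error state, as
# discussed in this thread. Any name below is illustrative only.
EXIT_JOB_ERROR = 100


def allocated_hosts(hostfile_path):
    """Return the set of distinct host names listed in a PE hostfile.

    Assumption: the host name is the first field on each non-empty line.
    """
    hosts = set()
    with open(hostfile_path) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                hosts.add(fields[0])
    return hosts


def check_single_host(hostfile_path):
    """Return 0 if exactly one host was allocated, else EXIT_JOB_ERROR."""
    try:
        hosts = allocated_hosts(hostfile_path)
    except OSError as err:
        print("cannot read hostfile: %s" % err, file=sys.stderr)
        return EXIT_JOB_ERROR
    if len(hosts) != 1:
        print("expected exactly one host, got %d: %s"
              % (len(hosts), sorted(hosts)), file=sys.stderr)
        return EXIT_JOB_ERROR
    return 0
```

To use it as a script, one would end the file with `sys.exit(check_single_host(sys.argv[1]))`, so that a failed check exits with 100 rather than 1.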

   Thanks very much !

       John


reuti wrote:

> Hi,
> 
> Am 01.09.2009 um 17:15 schrieb cjf001:
> 
> 
>>SGEers:
>>
>>I'm implementing a new parallel environment - all its startup script has
>>to do is make sure that only one host has been allocated by SGE (*) - so
>>that's easy. However, if the script finds otherwise, I want to signal
>>an error to SGE.
>>
>>The manual says that exiting the parallel environment startup script
>>with a code other than 0 will cause SGE to "report the error and
> 
> 
> What value do you use exactly? If it's 100, the job (and not the
> queue) should go into error state, so the job won't be rescheduled
> again. There is also a setting in sge_conf to allow or disallow this
> behavior.
> 
> -- Reuti
> 
> 
>>not start the parallel job". Sounds good, so I tried it - and it *does*
>>report an error by putting the host queue that would have gotten the
>>job into ERROR state. Kind of drastic, but I guess I can live with that.
>>
>>However, it leaves the job in the pending list. So, on the next scheduler
>>run, it assigns the job to *another* host queue, where it fails again, and
>>leaves that host queue in ERROR state. And on and on until all the
>>host queues I have permission for are in ERROR state. That I cannot
>>live with !
>>
>>So, my question is, has anyone successfully signalled SGE when a parallel
>>environment startup script fails ? If so, how'd you do it ?! Also, is
>>this a bug, or is it working as designed ? Am I missing something ?
>>
>>I thought about "qdel'ing" the job from within the startup script, but I
>>don't think that will work, since the execute hosts (which I think it's
>>running on at that point) are not submit hosts, so such commands are not
>>allowed. Any other thoughts ?  I'm using SGE v6.2u2.
>>
>>     Thanks !
>>
>>       John
>>
>>
>>
>>(*) - why, you ask ?  Because the application (Momentum) takes over all
>>the cores on the assigned host, but doesn't run across hosts - so, it's not
>>a *real* parallel job, but I need it to be assigned to all the cores on
>>the host.
>>
>>



-- 
###########################################################################
# John Foley                          # Location:  IL93-E1-21S            #
# IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
# Antenna & Mechanical Simulation Grp #    Email: john.foley at motorola.com #
# Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
# 600 North US Highway 45             #      Fax: (847) 523-5767          #
# Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
###########################################################################
                 (this email sent using Mozilla on Windows)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=215330

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


