[GE users] Suspending Parallel Jobs

Shannon V. Davidson svdavidson at charter.net
Fri Sep 26 15:11:14 BST 2008



Ron,

Good catch.  Looks like you found a bug.  I'll give it a try.

Shannon

Ron Chen wrote:
> I think the problem is in signal_slave_tasks_of_job():
>
>    /* do not signal slave tasks in case of checkpointing jobs with
>       STOP/CONT when suspending means migration */
>    if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
>       (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND)!=0) {
>       ...
>       return;
>    }
>
> I think it's meant for checkpointing jobs only, but "(lGetUlong(jep,JB_checkpoint_attr) | CHECKPOINT_SUSPEND)!=0" is always true, so the early return is taken for every STOP/CONT.
>
> The reason is that CHECKPOINT_SUSPEND (#defined to 0x00000004) is non-zero, so even if lGetUlong(jep,JB_checkpoint_attr) is 0, the bitwise OR still gives a non-zero value.
>
> I think the fix is to replace "|" with "&":
>
>    if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
>       (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND)!=0) {
>       ...
>       return;
>    }
>
> Let me know if changing this code fixes anything.
>
>  -Ron
>
>
>
> --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> wrote:
>> Thanks Ron - I'll dig through the code and see if I can find it.
>>
>> Shannon
>>
>> Ron Chen wrote:
>>> I remember seeing SGE code that specifically blocks sending the
>>> suspend signal to the MPI tasks. From the list discussions, the
>>> reason is that if an MPI job is suspended, then the TCP/IP network
>>> socket calls will time out, and the job will then fail.
>>>
>>> I think if we comment out a few lines of code, or only enable that
>>> code by a switch, then it will make many people on this list happy,
>>> as it is a FAQ.
>>>
>>>  -Ron
>>>
>>>
>>> --- On Fri, 9/26/08, Shannon V. Davidson <svdavidson at charter.net> wrote:
>>>> I'm trying to suspend a parallel job using a tight PE integration,
>>>> but the non-local MPI tasks are not being suspended. Is the tight PE
>>>> integration code supposed to send the SIGSTOP signal to every MPI
>>>> task in the job?  Is the suspend method executed on every execution
>>>> host in a parallel job?
>>>>
>>>> Thanks,
>>>> Shannon
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net

-- 
_________________________________________

Shannon V. Davidson <sdavidson at appro.com>
Software Engineer     Appro International
636-633-0380 (office)  443-383-0331 (fax)
_________________________________________




