[GE users] Spectre checkpoint

veerendra_n veerendra at yashasvi.co.in
Wed Sep 9 12:36:52 BST 2009


Hi Reuti,

You had suggested the following solution:

"it will be scheduled to any node, which is free. You can either use a setup to copy the file from the local scratch space to a common scratch space (and again to the local node the next time the job starts) to avoid using NFS (which is of course an option, and writing the checkpoint file one time shouldn't put much load on the NFS server)."

How do I configure this in Sun Grid Engine so that Spectre runs successfully on another node after a reschedule?
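
For illustration only, the copy step described in the quoted suggestion could look roughly like the sketch below. This is not a Grid Engine or Spectre API. The function names, the directory arguments, and the "*.ckpt" file pattern are all hypothetical placeholders, since the thread does not say what Spectre names its checkpoint files or where it writes them.

```python
import shutil
from pathlib import Path


def copy_checkpoints(src_dir, dst_dir, pattern="*.ckpt"):
    """Copy files matching `pattern` from src_dir to dst_dir; return their names.

    The "*.ckpt" pattern is an assumption made for this sketch only.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    if not src_dir.is_dir():
        return []  # nothing saved yet, e.g. on the very first run of the job
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        if src.is_file():
            shutil.copy2(src, dst_dir / src.name)  # copy2 also keeps timestamps
            copied.append(src.name)
    return copied


def stage_out(local_scratch, shared_scratch):
    """Before the job leaves a node: save checkpoints to the common scratch."""
    return copy_checkpoints(local_scratch, shared_scratch)


def stage_in(shared_scratch, local_scratch):
    """When the job starts on a node: fetch any previously saved checkpoints."""
    return copy_checkpoints(shared_scratch, local_scratch)
```

A job wrapper would presumably call stage_in() before launching the simulator and stage_out() once the checkpoint file has been written. Whether Spectre then actually resumes depends on starting it with its own restart options, which are not covered here.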


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Wednesday, September 09, 2009 4:51 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Spectre checkpoint

On 09.09.2009 at 13:05, veerendra_n wrote:

> Hi,
>
> Let me know if I can find some solution.

I already answered this two days ago - please check the archive for
my email on forcing a job to run on the same host again.

-- Reuti

>
> Regards
> veerendra
>
> -----Original Message-----
> From: Veerendra [mailto:veerendra at yashasvi.co.in]
> Sent: Tuesday, September 08, 2009 6:57 PM
> To: 'users'
> Subject: RE: [GE users] Spectre checkpoint
>
> Here is the test setup
>
> Qmaster - Host A
> Execution host - Host B
> Execution host - Host C
>
> Configuration: in the queue configuration, under Execution Method, I
> have configured
>
> SUSPEND METHOD - SIGTSTP
> RESUME METHOD - SIGCONT  (This is based on spectre documentation)
>
> When I submit a job using qsub, the job starts executing on HOST B.
> When I reschedule the job midway and HOST B is not free, the job
> starts on HOST C from the beginning (it does not resume).
>
> However, if HOST B is available, the job resumes from where it was
> stopped.
>
> My requirement is to resume the job on HOST C also. (I have not  
> configured checkpoint as yet, only Execution method has been  
> configured on the queue).
>
> Regards
> Veeru!
>
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, September 08, 2009 6:42 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Spectre checkpoint
>
> On 08.09.2009 at 14:36, veerendra_n wrote:
>
>> Hi Reuti
>>
>> Thanks for the response.
>>
>> My requirement is that when I reschedule a Spectre job running on
>> host x, it resumes on host y.
>
> This I answered yesterday.
>
>
>> To achieve that, what configuration needs to be in place? If
>> checkpoint configuration is the answer, how do I go about it?
>
> I still don't get it: you have a working checkpointing facility right
> now by just setting up the suspend_- and resume_method? Suspended
> jobs are still on the same machine and will continue at a later point
> in time on this machine.
>
> -- Reuti
>
>
>> Regards
>> Veeru!
>>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, September 08, 2009 5:32 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] Spectre checkpoint
>>
>> Hi,
>>
>> On 08.09.2009 at 12:01, veerendra_n wrote:
>>
>>> I'm trying to configure checkpointing for a Spectre job. I pass
>>> SIGTSTP and SIGCONT in the execution method and it works very
>>> well when the job reschedules on the same host.
>>>
>>> However, the problem arises when the rescheduled job resumes on
>>> a different host from where it started. It restarts from the
>>> beginning instead of resuming. Right now we have just configured
>>> the Execution method in the queue configuration (Suspend method
>>> SIGTSTP, Resume method SIGCONT).
>>>
>>> How should I configure checkpointing?
>>
>> Does the job quit by itself after writing the checkpoint file on
>> SIGTSTP? When you have only defined the suspend and resume methods,
>> the job stays on the node and won't get rescheduled at all. Therefore
>> I don't understand your question in detail.
>>
>> -- Reuti
>>
>> ------------------------------------------------------
>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216398
>>
>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>
