[GE users] Spectre checkpoint

veerendra_n veerendra at yashasvi.co.in
Tue Sep 8 14:27:19 BST 2009



Here is the test setup:

Qmaster - Host A
Execution host - Host B
Execution host - Host C

Configuration - In the queue configuration, I have set the following execution methods:

SUSPEND METHOD - SIGTSTP
RESUME METHOD - SIGCONT  (this is based on the Spectre documentation)

When I submit a job using qsub, the job starts executing on Host B. When I reschedule the job midway and Host B is not free, the job starts on Host C from the beginning (it does not resume).

However, if Host B is available, the job resumes from where it left off.

My requirement is to resume the job on Host C as well. (I have not configured checkpointing yet; only the execution methods have been configured on the queue.)
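
As a note on the requirement above: SIGTSTP and SIGCONT only pause and continue a process that is still in memory on one host, so resuming on Host C needs the job's state to be saved somewhere Host C can read. The following is a minimal Python sketch of that idea, not Spectre's or Grid Engine's actual mechanism. It assumes the checkpoint file lives on a filesystem shared by both execution hosts; `run_job`, `ckpt_path` and `stop_after` are hypothetical names used only for illustration.

```python
import json
import os


def run_job(ckpt_path, total_steps, stop_after=None):
    """Sum 0..total_steps-1, saving progress to ckpt_path after every step.

    stop_after is a stand-in for the job being killed for rescheduling
    after that many steps in this run. Returns the sum, or None if the
    run was cut short.
    """
    # Start fresh unless an earlier run (possibly on another host that
    # shares this filesystem) left a checkpoint behind.
    state = {"next_step": 0, "total": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)

    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated reschedule: the process goes away here
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        # Write to a temporary file and rename it, so a kill in the middle
        # of a write never leaves a half-written checkpoint.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state["total"]
```

Calling `run_job` once with `stop_after=4` and then again with the same `ckpt_path` continues at step 4 instead of step 0, which is the behaviour wanted when the job lands on a different host.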

Regards
Veeru!


-----Original Message-----
From: reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: Tuesday, September 08, 2009 6:42 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Spectre checkpoint

On 08.09.2009 at 14:36, veerendra_n wrote:

> Hi Reuti
>
> Thanks for the response.
>
> My requirement is that when I reschedule a Spectre job running on
> host x, it resumes on host y.

This I answered yesterday.


> To achieve that, what configuration needs to be in place? If
> checkpoint configuration is the answer, how do I go about it?

I still don't get it: do you already have a working checkpointing
facility just by setting up the suspend_method and resume_method?
Suspended jobs stay on the same machine and will continue there at a
later point in time.
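
The difference between a job that is merely suspended and one that writes a checkpoint and quits when it receives SIGTSTP can be sketched in Python. This is only an illustration under stated assumptions: it does not describe what Spectre actually does on SIGTSTP, it needs a POSIX system, and `run_until_signalled`, `ckpt_path` and `signal_self_at` are hypothetical names. The `signal_self_at` parameter makes the process send SIGTSTP to itself, standing in for the execution daemon delivering the configured suspend signal.

```python
import json
import os
import signal

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only record that a stop was requested."""
    global stop_requested
    stop_requested = True


def run_until_signalled(ckpt_path, total_steps, signal_self_at=None):
    """Count up to total_steps; on SIGTSTP write a checkpoint and quit.

    Returns True if the work finished, False if it stopped early after
    saving its position to ckpt_path.
    """
    global stop_requested
    stop_requested = False
    # Catch SIGTSTP instead of letting the default action stop the process.
    signal.signal(signal.SIGTSTP, request_stop)

    # Resume from an earlier checkpoint if one exists.
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["next_step"]

    while step < total_steps:
        step += 1  # one unit of (placeholder) work
        if signal_self_at is not None and step == signal_self_at:
            os.kill(os.getpid(), signal.SIGTSTP)  # demo only
        if stop_requested:
            # Save the position at the first step boundary after the
            # handler has run, then give up the host.
            with open(ckpt_path, "w") as f:
                json.dump({"next_step": step}, f)
            return False
    return True
```

A job that behaves like this no longer occupies the original machine after the signal, so it can be restarted elsewhere from the saved file. A job that simply stops on SIGTSTP keeps its state in memory on that one host, which matches the behaviour described above.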

-- Reuti


> Regards
> Veeru!
>
> -----Original Message-----
> From: reuti [mailto:reuti at staff.uni-marburg.de]
> Sent: Tuesday, September 08, 2009 5:32 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] Spectre checkpoint
>
> Hi,
>
> On 08.09.2009 at 12:01, veerendra_n wrote:
>
>> I'm trying to configure checkpoint for a spectre job. I pass
>> SIGTSTP and SIGCONT  in the  execution method and it works very
>> well when the job reschedules on the same host.
>>
>> However the problem arises when the rescheduled job resumes on
>> different host from where it started. It restarts from the
>> beginning instead of resuming. Right now we have just configured
>> Execution method in queue configuration (Suspend method SIGTSTP,
>> Resume method SIGCONT).
>>
>> How should I configure checkpointing?
>
> Does the job quit by itself after writing the checkpoint file when it
> receives SIGTSTP? If you only defined the suspend and resume methods,
> then the job stays on the node and won't get rescheduled at all.
> Therefore I don't understand your question in detail.
>
> -- Reuti
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216398
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=216407

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



