[GE users] about the qmod command

craffi dag at sonsorol.org
Thu Dec 10 14:40:48 GMT 2009


The consensus is correct - there is no easy or magical way at the OS or 
SGE level to do application level checkpointing.

You don't mention the application type. It's often easier to checkpoint 
standalone applications than parallel MPI programs, which have lots 
of open network connections and in-flight messages that need to be 
grabbed, frozen and resumed as well.
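
To make "application level checkpointing" concrete, here is a minimal 
sketch in Python of what a standalone program has to do for itself. 
Nothing in it comes from SGE: the function name, the JSON checkpoint 
file and the stop_after argument (which only simulates the job being 
killed) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoint(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress after every step.

    stop_after is a hypothetical knob that simulates the job being killed
    (for example by qdel) after that many steps in this run.
    """
    state = {"next_step": 0, "total": 0}

    # Resume from the last checkpoint if one exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # simulated kill: the job did not finish

        # One unit of "real" work.
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_done += 1

        # Write to a temporary file and rename it, so that a crash in the
        # middle of a write cannot corrupt the existing checkpoint.
        directory = os.path.dirname(os.path.abspath(checkpoint_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["total"]
```

If such a job is deleted and later resubmitted with the same checkpoint 
path, it carries on from the last saved step instead of from zero. The 
hard part in practice is that the application's whole state has to fit 
into something you can write out like this, which is exactly why MPI 
programs are so much harder.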

There are, however, some commercial technologies in this area if you 
find you have a need for this. I'm about to start testing some products 
from Librato.com on local SGE clusters as well as SGE and non-SGE 
running inside Amazon EC2. Whatever I learn will eventually be posted up 
at gridengine.info

-Chris



reuti wrote:
> On 08.12.2009 at 21:05, wagoodman wrote:
>
>> I have users that run jobs for weeks and sometimes months on our grid.
>> We experienced some "GDI errors" and traced them down to slow SATA
>> disks.
>> We need to move our SGE installation to Fiber Channel disks, so before
>> we move the storage, we're informing users that all jobs will be
>> deleted with qdel at a specific time. One user asked about a job that
>> has been running for one month: would he be able to restart the job
>> from where it was suspended?
>>
>> Sun tech wrote:
>>
>> It will not work.
>> The jobs need to be able to be checkpointed. Even with checkpointing,
>> it may not work unless the app is inherently checkpointable.
>>
>> I was wondering if anyone else has had a scenario like this, and what
>> a solution would be.
>
> Sun tech is correct: it won't work.
>
> The only thing that would work is to keep the jobs in the job list, so
> they restart automatically from the beginning after the reboot.
>
> Do you mean the local disks in the nodes are too slow, or something on
> the file server?
>
> -- Reuti
>
>
>> Bill
>>
>> -----Original Message-----
>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>> Sent: Tuesday, December 08, 2009 2:16 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] about the qmod command
>>
>> On 08.12.2009 at 17:56, wagoodman wrote:
>>
>>> I'm familiar with the qmod command. My real question is: if I issue,
>>> let's say, a qhold or qmod -sj to suspend the job,
>>> then shut down the daemons on the execution hosts and submit hosts
>>> and shut down the qmaster and the shadow,
>>> and when I finish the work on the servers (move storage) I issue
>>> qmod -rj to reschedule the job, would that
>>> work once the daemons on the execution hosts, submit hosts and the
>>> qmaster and the shadow are restarted?
>> It depends on what you are trying to achieve by "will it work".
>>
>> -- Reuti
>>
>>
>>> Bill
>>>
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?
>>> dsForumId=38&dsMessageId=232271
>>>
>>> To unsubscribe from this discussion, e-mail: [users-
>>> unsubscribe at gridengine.sunsource.net].
>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232631

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


