[GE users] about the qmod command

reuti reuti at staff.uni-marburg.de
Thu Dec 10 15:07:55 GMT 2009



On 10.12.2009 at 15:40, craffi wrote:

> The consensus is correct - there is no easy or magical way at the OS or
> SGE level to do application-level checkpointing.
>
> You don't mention the application type. It's often easier to checkpoint
> standalone applications than parallel MPI programs, which have lots of
> open network connections and in-flight messages that need to be grabbed,
> frozen and resumed as well.
>
> There are, however, some commercial technologies in this area if you
> find you have a need for this. I'm about to start testing some products
> from Librato.com on local SGE clusters as well as SGE and non-SGE
> running inside Amazon EC2. Whatever I learn will eventually be posted up
> at gridengine.info

Some checkpointing solutions are listed here:

http://shum.huji.ac.il/~agay/act/

Whether an application will work with a particular checkpointing
solution often depends on the type of application and its behavior
(i.e. the resources it uses, like sockets, pipes, threads, shared
memory, ...).
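
To make "application-level checkpointing" concrete for the simplest case (a single process whose whole state fits in a small file), here is a minimal Python sketch. It is only an illustration under stated assumptions, not something SGE provides: the names `load_checkpoint`, `save_checkpoint`, `run` and the `stop_after` parameter (which simulates the job being killed) are hypothetical, and a real application with sockets, pipes or shared memory needs far more than this.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"next_step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename is atomic, so a job killed in the middle of a write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def run(path, n_steps, every=100, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    `stop_after` is a hypothetical knob that simulates the job being
    killed (e.g. by qdel) before it reaches that step.
    """
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_after is not None and step >= stop_after:
            return None  # simulated kill: work since the last checkpoint is lost
        state["total"] += step
        state["next_step"] = step + 1
        if state["next_step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

A job written this way can simply be started again from the beginning after the outage: the first call to `run()` finds the checkpoint file and continues from the last saved step, losing only the work done since then.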

Does anyone know what happened to the Meiosys MetaCluster solution
after the company was bought by IBM? It looks like a product was never
released thereafter.

-- Reuti


> -Chris
>
>
>
> reuti wrote:
>> On 08.12.2009 at 21:05, wagoodman wrote:
>>
>>> I have users that run jobs for weeks and sometimes months on our grid.
>>> We experienced some "GDI errors" and traced them down to slow SATA
>>> disks.
>>> We need to move our SGE installation to Fibre Channel disks, so before
>>> we move the storage, we're informing users that all jobs will be
>>> removed with qdel at a specific time. One user asked about a job that
>>> he's been running for one month: would he be able to restart the job
>>> where it was suspended?
>>>
>>> Sun tech wrote:
>>>
>>> It will not work.
>>> The jobs need to be able to be checkpointed. Even with checkpointing
>>> it may not work unless the app is inherently checkpointable.
>>>
>>> I was wondering if anyone else had a scenario like this, and what
>>> a solution would be.
>>
>> Sun tech is correct: it won't work.
>>
>> The only thing that would work is to keep the jobs in the job list, so
>> they restart automatically from the beginning after the reboot.
>>
>> Do you mean the local disks in the nodes are too slow, or something on
>> the file server?
>>
>> -- Reuti
>>
>>
>>> Bill
>>>
>>> -----Original Message-----
>>> From: reuti [mailto:reuti at staff.uni-marburg.de]
>>> Sent: Tuesday, December 08, 2009 2:16 PM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] about the qmod command
>>>
>>> On 08.12.2009 at 17:56, wagoodman wrote:
>>>
>>>> I'm familiar with the qmod command. My real question is: if I issue,
>>>> let's say, a qhold or qmod -sj to suspend the job,
>>>> then shut down the daemons on the execution hosts and submit hosts,
>>>> and shut down the qmaster and the shadow,
>>>> when I finish the work on the servers (move storage) and then issue
>>>> qmod -rj to reschedule the job, would that
>>>> work once the daemons on the execution hosts and submit hosts and the
>>>> qmaster and the shadow are restarted?
>>> That depends on what you are trying to achieve when you ask "will it work".
>>>
>>> -- Reuti
>>>
>>>
>>>> Bill
>>>>
>>>> ------------------------------------------------------
>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232271
>>>>
>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=232638

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


