[GE users] Checkpointing howto example n°1

reuti reuti at staff.uni-marburg.de
Mon Jul 12 15:55:15 BST 2010


Hi,

On 12.07.2010 at 16:03, spow_ wrote:

> Thanks for your help, it does work now.
> The permissions of the output folder were not set correctly, which blocked the creation of the file.

Fine.

> As for the blank directory, it happened when I forgot to add the -ckpt option. (don't laugh)
> 
> However, I'm still wondering how user-level checkpointing works... I'm stuck at understanding the basics of it.
> 
> My guess is that the admin will distribute two files: a C program (the one in example n°3 of the howto) and a script which mostly locates the C program. The user has nothing to do except modify the script to execute his code.
> Does the checkpointing figure out by itself which variables need to be saved?

No. SGE's checkpointing only supports a checkpointing facility that already works outside of SGE. Hence your code needs some modifications, so that it saves the state of the program and all necessary variables to a file on its own. SGE can then trigger this.
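
A minimal sketch of what "saving the state on its own" could look like, written in Python for brevity. The file name myjob.ckpt.json, the function names and the summing loop are hypothetical stand-ins; only $SGE_CKPT_DIR comes from SGE, and it is empty unless the job was submitted with -ckpt. Whether and when a signal such as USR2 is delivered depends on how the checkpoint object is configured, so treat that part as an assumption.

```python
import json
import os
import signal
import tempfile

# Hypothetical file name. SGE_CKPT_DIR is only set for jobs submitted
# with -ckpt, so fall back to the current directory otherwise.
CKPT_DIR = os.environ.get("SGE_CKPT_DIR") or "."
CKPT_FILE = os.path.join(CKPT_DIR, "myjob.ckpt.json")

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag, the main loop does the writing."""
    global checkpoint_requested
    checkpoint_requested = True


def save_state(state, path=CKPT_FILE):
    """Write the state to a temporary file, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp, path)


def load_state(path=CKPT_FILE):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(steps, path=CKPT_FILE, save_every=10):
    """Resume from the last checkpoint and continue up to `steps`."""
    global checkpoint_requested
    state = load_state(path)
    for step in range(state["next_step"], steps):
        state["total"] += step          # stand-in for the real work
        state["next_step"] = step + 1
        if checkpoint_requested or state["next_step"] % save_every == 0:
            save_state(state, path)
            checkpoint_requested = False
    save_state(state, path)
    return state["total"]


def main():
    # Assumption: the checkpoint object sends USR2 to the job.
    signal.signal(signal.SIGUSR2, request_checkpoint)
    print(run(1000))
```

A job script would call main(). If the job is rescheduled, the restarted program finds the file and continues from the saved step instead of starting at zero. This only works across nodes when the checkpoint directory is shared.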

The examples in the Howto just write a single piece of information to a file, which is sufficient to restart the small demonstration program. They are meant to give the reader an idea of how this is supposed to work in general, so that he can implement something on his own. Ideally, a program is designed from the initial sketch to resemble a finite state machine, so that any state of the machine can easily be saved alongside its memory content.
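
To illustrate the finite state machine idea, here is a rough Python sketch. The three phases, their names and the small matrix are invented for illustration. The point is only that the current state name plus the data dictionary are everything needed to resume, so the machine can be dumped after any transition.

```python
import json


# Hypothetical three-phase job: each phase is one state of the machine.
def phase_setup(data):
    data["matrix"] = [[i * j for j in range(3)] for i in range(3)]
    return "compute"


def phase_compute(data):
    data["trace"] = sum(data["matrix"][i][i] for i in range(3))
    return "report"


def phase_report(data):
    data["message"] = "trace = %d" % data["trace"]
    return "done"


TRANSITIONS = {
    "setup": phase_setup,
    "compute": phase_compute,
    "report": phase_report,
}


def step(machine):
    """Run the handler of the current state and move to the next state."""
    handler = TRANSITIONS[machine["state"]]
    machine["state"] = handler(machine["data"])
    return machine


def dump(machine):
    """Serialize the state name together with its memory content."""
    return json.dumps(machine)


def restore(text):
    """Rebuild a machine from a previously dumped string."""
    return json.loads(text)
```

A restarted job would call restore() on the saved text and keep calling step() until the state is "done". Phases that already completed are not repeated.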

If you want a checkpointing feature without rewriting your applications, you can look into the Condor library (last example in the Howto) or BLCR (second checkpointing Howto).

-- Reuti


> Is it really possible that it works for two very different programs without changing anything? If there is a huge matrix created from simple initial conditions, will it be saved? And how does the checkpointing know that the matrix exists and needs to be saved?
> 
> What I do not want is for end users to have to modify the programs they currently run so that they can checkpoint.
> But I seriously doubt that user-level checkpointing works that way; kernel-level seems more appropriate.
> 
> GQ
> 
> 
> 
> 
> > Date: Thu, 8 Jul 2010 15:54:20 +0200
> > From: reuti at staff.uni-marburg.de
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] Checkpointing howto example n°1
> > 
> > Hi,
> > 
> > On 08.07.2010 at 14:50, spow_ wrote:
> > 
> > > It appears I was badly mistaken about kernel-level versus user-level checkpointing. Thanks for the clarification.
> > > I am following Reuti's howto: http://gridengine.sunsource.net/howto/checkpointing.html and I am trying to get the first script (example 1) running.
> > > I have set all options according to the howto:
> > > 
> > > queue parameters :
> > > shell : bin/bash
> > > shell_start_mode : unix_behaviour
> > > referenced_checkpoint_objects : check_transparent
> > > flush_submit_sec=4
> > > min_cpu_interval 00:00:15
> > > 
> > > check_transparent parameters :
> > > interface transparent
> > > commands none
> > > ckpt_dir /tmp/checkpoint
> > 
> > Often /tmp is local, i.e. the file will exist only on the compute node. It will be neither on the head node nor copied to the node where the rescheduled job starts. If you have more than one compute node, it's best to use a shared directory like the /home/checkpoint I suggested.
> > 
> > That the name is not set is of course a different issue. Did you specify:
> > 
> > $ qsub -ckpt check_transparent test.sh
> > ...
> > $ qstat -j <jobid>
> > ...
> > checkpoint_object: check_transparent
> > checkpoint_attr: sx 
> > 
> > for the job submission?
> > 
> > -- Reuti
> > 
> > 
> > > signal USR2
> > > when xmr
> > > 
> > > Unfortunately, the file that should be created under $SGE_CKPT_DIR isn't, though the script does execute until the end, without any errors.
> > > If I echo $SGE_CKPT_DIR, it echoes blank, even though it is correctly specified in the checkpoint params.
> > > If I set SGE_CKPT_DIR=/tmp/checkpoint in the script right before the above echo, it does display /tmp/checkpoint in the output file.
> > > But in both cases, no checkpoint file is created, and I cannot witness the dates being printed in the checkpoint file (for it doesn't exist).
> > > 
> > > A few years back, a user named Sangamesh had a similar problem (though his file got created), but I found no leads in the answers he was given.
> > > Maybe there are external modules I have to install? I have a 'fresh' install of SGE, nothing else.
> > > 
> > > 
> > > Thanks for your help.
> > > Guillaume Quéré
> > > 
> > > PS: sorry for not posting the private messages, Reuti, but I get an error any time I try to post from my previous account.
> > > 
> > 
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=266721
> > 
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> 

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=267527

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


