[GE users] checkpointing with blcr

Daniel Templeton Dan.Templeton at Sun.COM
Tue Dec 11 16:51:38 GMT 2007



Jerry,

One solution might be to have your job exit with status 99.  A job exit
code of 99 tells the qmaster to reschedule the job.  That's the main
mechanism for a job to say, "I landed in a bad place.  Please move me
somewhere else."
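
As a rough sketch of how that could look in the batch script (not taken
from the how-to): the function names and the SAVED_PID variable below are
made up for illustration, and the /proc check assumes a Linux execution
host, which BLCR requires anyway.

```shell
#!/bin/sh
# Hypothetical helpers for a restart script; names are illustrative only.

# Succeeds if some process on this host currently owns the given PID.
# /proc/<pid> also reveals processes owned by other users; kill -0 is
# only a fallback for hosts without /proc.
pid_in_use() {
    [ -d "/proc/$1" ] || kill -0 "$1" 2>/dev/null
}

# Returns 99 (the "please reschedule me" exit code) when the PID recorded
# at checkpoint time is taken on this host, and 0 when it is free.
restart_or_reschedule() {
    if pid_in_use "$1"; then
        echo "PID $1 is busy on $(hostname); asking for a reschedule" >&2
        return 99
    fi
    return 0
}
```

In the job script this would run just before the BLCR restart, along the
lines of: restart_or_reschedule "$SAVED_PID" || exit $?
where SAVED_PID is assumed to hold the PID saved with the checkpoint.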

Daniel

Jerry Mersel wrote:
> Hi:
>
>  I managed to checkpoint and rerun an application successfully, with
> migration.
>  But I won't be able to do that if the PID is already in use on the
> other machine (the one the process migrated to).
>
>  What I want to do is have the job wait on its queue until the PID 
> becomes free.
>  I simulated a situation where the PID is in use. When I find that it
> is in use, I call
>  qalter -q $QUEUE $JOB_ID from the batch script.
>
> But it didn't work. The job was just killed.
>
> Any ideas?
>
>                               Regards,
>                                 Jerry
>
> P.S. I use BLCR and application_level checkpointing as in the how-to.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>




