[GE users] SGE checkpointing for Fluent
Thamizhannal.Paramasiuam at Honeywell.com
Fri Aug 28 14:45:34 BST 2009
I am trying to implement the checkpoint option for Fluent jobs.
As described in the Fluent user manual, I followed the steps given there.
Refer: http://www.aeromech.usyd.edu.au/AMME4210/documents/manuals/fluent_help/html/ug/node1140.htm#parallel-sge
I created the /home/checkpoint/ directory, copied sge1.0 into it from /Fluent.Inc/addons/sge1.0, and made this folder available to all cluster users.
Here are the steps I followed:
# qconf -sq | grep min_cpu_interval
# qconf -sp fluent_pe
# qconf -sckpt fluent_ckpt
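When reviewing the output of `qconf -sckpt`, the `when` and `signal` attributes are worth checking, since SGE's checkpoint(5) man page describes `when` as the list of occasions on which a checkpoint is initiated. The sketch below pulls a single attribute out of that output. `ckpt_field` is a hypothetical helper name, not an SGE command, and it assumes the usual "attribute value" line layout.

```shell
#!/bin/sh
# ckpt_field is a hypothetical helper, not part of SGE.
# It reads "attribute value" lines on stdin and prints the value
# of the attribute named in its first argument.
ckpt_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Usage on a host where qconf is available:
#   qconf -sckpt fluent_ckpt | ckpt_field when
#   qconf -sckpt fluent_ckpt | ckpt_field signal
```

Reading from stdin keeps the helper independent of qconf, so it can also be run against a saved copy of the configuration.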
qsub command - Fluent checkpointing
qsub -pe fluent_pe 4 -ckpt fluent_ckpt FluentParallel
Fluent command - in job submission script
/opt/apps/ansys_inc/v120/fluent_release/fluent12.0.16/bin/fluent -r12.0.16 3d -gu -driver null -sge -sgeckpt fluent_ckpt -sgeq all.q -sgepe fluent_pe $NSLOTS -i FLUENT_TEST.jou -t4 -cnf=/tmp/1515.1.all.q/machines
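The command above hard-codes the process count (-t4) and a machine file path that belongs to one particular job (1515.1.all.q). A minimal sketch of deriving both from the job environment follows. It assumes that SGE sets NSLOTS and TMPDIR for the job and that the parallel environment writes a file named machines into TMPDIR; the fallback values and the function name build_fluent_cmd are illustrative only. The function only prints the command so it can be inspected before use.

```shell
#!/bin/sh
# Sketch only: build the Fluent command line from the SGE job environment.
# Assumptions: NSLOTS and TMPDIR are set by SGE, and the parallel
# environment places a "machines" file in TMPDIR.
FLUENT=/opt/apps/ansys_inc/v120/fluent_release/fluent12.0.16/bin/fluent

build_fluent_cmd() {
    # Illustrative fallbacks for running outside of an SGE job.
    slots=${NSLOTS:-4}
    machines=${TMPDIR:-/tmp}/machines
    echo "$FLUENT -r12.0.16 3d -gu -driver null -sge -sgeckpt fluent_ckpt" \
         "-sgeq all.q -sgepe fluent_pe $slots -i FLUENT_TEST.jou" \
         "-t$slots -cnf=$machines"
}

# Print the command for inspection; it is not executed here.
build_fluent_cmd
```

All Fluent options are copied unchanged from the command above; only the slot count and the machine file location are taken from the environment.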
After I submit the job with the qsub command, I can see that a folder named after the job ID gets created in the job directory. Once the "min_cpu_interval" time has elapsed, an empty file named "check" is created in the job directory.
Only after I create two empty files, (1) /home/user1/check-fluent-28233 and (2) /home/user1/exit-fluent-28233, does SGE start to checkpoint the .case and .data files of Fluent.
Please let me know how to trigger the SGE checkpoint for Fluent in cases such as (1) host down, (2) bad network connection, (3) execution node storage full, etc.
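For reference, the manual step described above can be sketched as a small shell helper. The function name trigger_fluent_ckpt is hypothetical; the directory and the numeric suffix are the ones from this job (/home/user1 and 28233) and will differ for other runs.

```shell
#!/bin/sh
# Hypothetical helper: create the two empty marker files described above.
#   $1 = directory that holds the markers (here /home/user1)
#   $2 = numeric suffix of the Fluent run (here 28233)
trigger_fluent_ckpt() {
    marker_dir=$1
    fluent_id=$2
    touch "$marker_dir/check-fluent-$fluent_id" \
          "$marker_dir/exit-fluent-$fluent_id"
}

# Usage for the job in this post:
#   trigger_fluent_ckpt /home/user1 28233
```

This only reproduces what I did by hand; it does not make SGE create the files on its own.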