thamizhannal Thamizhannal.Paramasiuam at Honeywell.com
Fri Aug 28 14:45:34 BST 2009

Hi All,
I am trying to set up checkpointing for Fluent jobs under SGE.
I followed the steps described in the Fluent user manual.
Reference: http://www.aeromech.usyd.edu.au/AMME4210/documents/manuals/fluent_help/html/ug/node1140.htm#parallel-sge

I created the /home/checkpoint/ directory, copied sge1.0 into it from /Fluent.Inc/addons/sge1.0, and made the folder accessible to all cluster users.

Here is the configuration I have in place:

#qconf -sq |grep min_cpu_interval
 min_cpu_interval      00:05:00

# qconf -sp fluent_pe
pe_name           fluent_pe
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /home/checkpoint/sge1.0/kill-fluent
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task TRUE
urgency_slots     min

# qconf -sckpt fluent_ckpt
interface          APPLICATION-LEVEL
ckpt_command       /home/checkpoint/sge1.0/ckpt_command.fluent
migr_command       /home/checkpoint/sge1.0/migr_command.fluent
restart_command    true
clean_command      NONE
ckpt_dir           NONE
signal             USR1
when               xsm
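
For reference, the letters in the "when" field can be spelled out with a small helper like the one below. This is only a sketch: the function name describe_when is made up for illustration, and the meanings are paraphrased from my reading of the SGE checkpoint(5) man page, so please check them against the man page of your SGE version.

```shell
#!/bin/sh
# Hypothetical helper: explain each letter of a checkpoint "when" string.
# Meanings are paraphrased from checkpoint(5); verify against your SGE version.
describe_when() {
    flags=$1
    while [ -n "$flags" ]; do
        rest=${flags#?}          # everything after the first character
        letter=${flags%"$rest"}  # the first character itself
        case $letter in
            s) echo "s: checkpoint when sge_execd on the job's host is shut down" ;;
            m) echo "m: checkpoint periodically, every min_cpu_interval" ;;
            x) echo "x: checkpoint when the job is suspended" ;;
            r) echo "r: reschedule (no checkpoint) when the host goes into unknown state" ;;
            *) echo "$letter: unknown flag" ;;
        esac
        flags=$rest
    done
}

# The value from the fluent_ckpt environment shown above.
describe_when xsm
```

If that reading is right, "xsm" covers suspension, execd shutdown and the periodic interval, but does not include the "r" letter.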

QSUB command - fluent check pointing
qsub -pe fluent_pe 4 -ckpt fluent_ckpt FluentParallel

fluent command - in job submission script
/opt/apps/ansys_inc/v120/fluent_release/fluent12.0.16/bin/fluent -r12.0.16 3d -gu -driver null -sge -sgeckpt fluent_ckpt -sgeq all.q -sgepe fluent_pe $NSLOTS -i FLUENT_TEST.jou -t4 -cnf=/tmp/1515.1.all.q/machines

After I submit the job with qsub, I can see that a folder named after the job ID is created in the job directory. Once the "min_cpu_interval" time has elapsed, an empty file named "check" is created in the job directory.

Only after I manually create the two empty files (1) /home/user1/check-fluent-28233 and (2) /home/user1/exit-fluent-28233 does SGE start to checkpoint the .case and .data files of Fluent.
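
The manual step above amounts to roughly the following. It is a sketch only: the function name create_markers and its two arguments are made up for illustration, and the directory and numeric suffix are simply the values from my run.

```shell
#!/bin/sh
# Sketch: create the two empty marker files by hand.
# create_markers is an illustrative name, not part of the Fluent SGE kit.
# $1 = directory to create the files in, $2 = numeric suffix of the file names.
create_markers() {
    dir=$1
    suffix=$2
    for name in check exit; do
        marker="$dir/$name-fluent-$suffix"
        # touch creates the file empty if it does not exist yet
        touch "$marker" && echo "created $marker"
    done
}
```

In my case the call would be: create_markers /home/user1 28233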

Please let me know how to trigger the SGE checkpoint for Fluent in cases such as (1) a host going down, (2) a bad network connection, (3) the execution node's storage filling up, etc.

Thamizhannal P

