[GE users] Epilog scripts and job abort/deletion

Jesse Becker jbecker at northwestern.edu
Mon Jul 12 17:12:12 BST 2004

Greetings all,

I am having a few problems with epilog scripts and job deletion.  My
cluster is running SGE 5.3p5 under the Rocks distribution (kernel
2.4.21-4.0.1.ELsmp), and I've configured my own queues and PEs.

My first question:  Are epilog scripts run when a job is deleted (using
qdel) or aborts for some reason?  Specifically, I would like to run a
small script to clean up the IPC mess (as listed by ipcs) left behind
when MPICH jobs exit improperly.
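
For reference, the kind of cleanup I have in mind could be sketched
roughly as below.  This is only a sketch under assumptions: it assumes the
script runs as the job owner, that the node has Linux-style ipcs output
(data rows starting with a hex key, then the id, then the owner), and that
ipcrm accepts -m and -s.  The function names and the --force flag are made
up for illustration.  It also removes everything the user owns on the
node, which would be wrong if that user has another job running there.

```python
#!/usr/bin/env python3
"""Sketch: remove System V IPC objects owned by one user (Linux ipcs/ipcrm)."""
import getpass
import shutil
import subprocess
import sys


def parse_ipcs(text, owner):
    """Return the ids of IPC objects in `ipcs` output that belong to `owner`.

    Assumes data rows look like "0x... <id> <owner> ..."; header lines,
    separator lines and blank lines are skipped.
    """
    ids = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith("0x") and fields[2] == owner:
            ids.append(fields[1])
    return ids


def cleanup_commands(shm_text, sem_text, owner):
    """Build the ipcrm command lines for shared memory and semaphores."""
    cmds = [["ipcrm", "-m", i] for i in parse_ipcs(shm_text, owner)]
    cmds += [["ipcrm", "-s", i] for i in parse_ipcs(sem_text, owner)]
    return cmds


def main(argv):
    if shutil.which("ipcs") is None:
        return 0  # nothing to do on a host without ipcs
    owner = getpass.getuser()
    shm = subprocess.run(["ipcs", "-m"], capture_output=True, text=True).stdout
    sem = subprocess.run(["ipcs", "-s"], capture_output=True, text=True).stdout
    for cmd in cleanup_commands(shm, sem, owner):
        if "--force" in argv:
            subprocess.run(cmd, check=False)  # actually remove the object
        else:
            print(" ".join(cmd))  # dry run: only show what would be removed
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Run without arguments it only prints the ipcrm commands it would issue;
with --force it executes them.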

Second question/problem:  I have found that processes involved in jobs
are not always killed, and continue to hang around, even after the job
is finished.  I believe that I am using tight integration (partly to try
and avoid this problem with MPI jobs).  I should note that many of these
"mpi" jobs are actually not parallel at all; the users are recycling the
wrapper scripts that have "#$ -pe mpi" embedded in them.  The relevant
parts of the process tree look like this:


If I delete job 2029, there is a fair chance that it won't actually
die, but merely become a child of init (the process named "2029" does go
away properly).  This has become enough of a problem that I have written
a few tools to find nodes with high loads but no jobs assigned.  Is
there a way around this?
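
To illustrate what I am looking for on a node, here is a rough sketch (not
one of my actual tools).  It relies on an assumption: that SGE tags every
process of a job with an additional group id taken from gid_range
(20000-20100 in the global configuration), and that this id stays on the
process after it is reparented to init.  It reads /proc/<pid>/status on
Linux; the function names are made up for illustration.  Anything it
prints is only a candidate, since a job that deliberately daemonizes would
also show up while the job is still running.

```python
#!/usr/bin/env python3
"""Sketch: list processes reparented to init that carry an SGE job gid."""
import os

# gid_range from the global configuration (20000-20100, inclusive).
GID_RANGE = range(20000, 20101)


def parse_status(text):
    """Extract (ppid, groups) from the text of a /proc/<pid>/status file."""
    ppid, groups = None, []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "PPid":
            ppid = int(value)
        elif key == "Groups":
            groups = [int(g) for g in value.split()]
    return ppid, groups


def is_orphan_candidate(text, gid_range=GID_RANGE):
    """True if the process is a child of init and has a gid in gid_range."""
    ppid, groups = parse_status(text)
    return ppid == 1 and any(g in gid_range for g in groups)


def find_orphans(proc="/proc"):
    """Return the sorted pids of all orphan candidates under `proc`."""
    pids = []
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "status")) as fh:
                if is_orphan_candidate(fh.read()):
                    pids.append(int(name))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid in find_orphans():
            print(pid)
```

The pids it prints would still need to be checked against the jobs that
are actually running on the node before killing anything.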

Third question:  Is there any harm in running non-MPI jobs under a PE
designed for MPI?  I don't really see how it could be an issue, apart
from bumping into whatever accounting limits are in place for the number
of jobs in the queue...

Various configuration details:

The 'mpi' PE:
	[root@hydra Hydra]# qconf -sp mpi
	pe_name           mpi
	queue_list        all
	slots             16
	user_lists        NONE
	xuser_lists       NONE
	start_proc_args   /opt/gridengine/mpi/startmpi.sh -catch_rsh
	stop_proc_args    /opt/gridengine/mpi/stopmpi.sh
	allocation_rule   $fill_up
	control_slaves    TRUE
	job_is_first_task FALSE

One of the queues (one queue per host, all hosts and queues are the same):

	[root@hydra Hydra]# qconf -sq cp0-20.q
	qname                cp0-20.q
	hostname             comp-pvfs-0-20.local
	seq_no               0
	load_thresholds      np_load_avg=1.75
	suspend_thresholds   NONE
	nsuspend             1
	suspend_interval     00:05:00
	priority             0
	min_cpu_interval     00:05:00
	processors           UNDEFINED
	qtype                BATCH INTERACTIVE PARALLEL 
	rerun                TRUE
	slots                2
	tmpdir               /tmp
	shell                /bin/sh
	shell_start_mode     NONE
	prolog               NONE
	epilog               NONE
	starter_method       NONE
	suspend_method       NONE
	resume_method        NONE
	terminate_method     NONE
	notify               00:00:60
	owner_list           NONE
	user_lists           NONE
	xuser_lists          NONE
	subordinate_list     NONE
	complex_list         NONE
	complex_values       NONE
	projects             NONE
	xprojects            NONE
	calendar             NONE
	initial_state        default
	fshare               0
	oticket              0
	s_rt                 INFINITY
	h_rt                 INFINITY
	s_cpu                INFINITY
	h_cpu                INFINITY
	s_fsize              INFINITY
	h_fsize              INFINITY
	s_data               INFINITY
	h_data               INFINITY
	s_stack              INFINITY
	h_stack              INFINITY
	s_core               INFINITY
	h_core               INFINITY
	s_rss                INFINITY
	h_rss                INFINITY
	s_vmem               INFINITY
	h_vmem               INFINITY

Global configuration:
	[root@hydra Hydra]# qconf -sconf
	qmaster_spool_dir         /opt/gridengine/default/spool/qmaster
	execd_spool_dir           /opt/gridengine/default/spool
	binary_path               /opt/gridengine/bin
	mailer                    /bin/mail
	xterm                     /usr/bin/X11/xterm
	load_sensor               none
	prolog                    none
	epilog                    none
	shell_start_mode          unix_behavior
	login_shells              sh,ksh,csh,tcsh
	min_uid                   0
	min_gid                   0
	user_lists                none
	xuser_lists               none
	projects                  none
	xprojects                 none
	enforce_project           true
	enforce_user              true
	load_report_time          00:00:40
	stat_log_time             48:00:00
	max_unheard               00:05:00
	reschedule_unknown        00:00:00
	loglevel                  log_warning
	administrator_mail        none
	set_token_cmd             none
	pag_cmd                   none
	token_extend_time         none
	shepherd_cmd              none
	qmaster_params            none
	schedd_params             SHARE_FUNCTIONAL_SHARES=1
	execd_params              PTF_MIN_PRIORITY=10,PTF_MAX_PRIORITY=-5
	finished_jobs             1000
	gid_range                 20000-20100
	admin_user                sge
	qlogin_command            telnet
	qlogin_daemon             /usr/sbin/in.telnetd
	rlogin_daemon             /usr/sbin/sshd -i
	default_domain            none
	ignore_fqdn               true
	max_aj_instances          200
	max_aj_tasks              7500
	max_u_jobs                64
	rsh_daemon                /usr/sbin/sshd -i
	rsh_command               /usr/bin/ssh
	rlogin_command            /usr/bin/ssh

I am aware that there are no epilog or prolog scripts configured in the
examples above; I tried adding them and deleting a test job, but to no
avail.  I reset the configuration after testing, but can change it back.

Thanks for any suggestions anyone can offer.

Jesse Becker
GPG-fingerprint: BD00 7AA4 4483 AFCC 82D0  2720 0083 0931 9A2B 06A2
