[GE users] checking job return status in epilog script

cjf001 john.foley at motorola.com
Wed Jun 10 14:59:19 BST 2009


Mathias -

here's a couple of thoughts/ideas for you -

first, when I was experimenting with this yesterday, I put the
following line in my epilog script:

   cp -r "$SGE_JOB_SPOOL_DIR" /tmp/sge_job_spool_dir

this copies the entire job spool directory to /tmp
*on the machine where the job is running* before it's
removed. Then you can look through the files there
and see if there's anything that might help in your
debugging, without worrying about it disappearing on
you.
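
One caveat with a fixed destination is that copies from different jobs land in the same place. A minimal sketch of a per-job variant follows; the function name save_spool_dir and the destination layout are my own assumptions for illustration, not anything Grid Engine provides:

```shell
#!/bin/bash
# Sketch: keep one copy of the spool directory per job, so a later job
# does not overwrite or mix with an earlier copy.
# save_spool_dir and the "sge_job_spool_dir.<job id>" layout are hypothetical.
save_spool_dir() {
    local src="$1" dest_root="$2" job_id="$3"
    local dest="$dest_root/sge_job_spool_dir.$job_id"

    # Nothing to copy if the spool directory is already gone.
    [ -d "$src" ] || return 1

    # Copy the directory contents into the per-job destination.
    mkdir -p "$dest" && cp -r "$src/." "$dest/"
}

# In an epilog this could be called as:
#   save_spool_dir "$SGE_JOB_SPOOL_DIR" /tmp "$JOB_ID"
```

Remember to clean these copies up yourself, since nothing removes them from /tmp automatically.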

second, below is the current epilog script that I'm using.
It writes some status messages to a "job log" that I keep
in a common spot (accessible to all hosts via NFS), and
then appends the pe_hostfile and trace files from the
job_spool_directory to this "job log" - good info if there's
a problem with the job. Note that I extract the exit
status from the "usage" file.
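
For the exit status extraction on its own, here is a small sketch. It assumes the "usage" file holds key=value lines (which is what the grep/cut in my script relies on); get_exit_status is just a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: read the exit_status value from a usage file of key=value lines.
# get_exit_status is a hypothetical helper, not a Grid Engine command.
get_exit_status() {
    local usage_file="$1"
    # Match the key exactly (not just any line containing the string)
    # and stop after the first match.
    awk -F= '$1 == "exit_status" { print $2; exit }' "$usage_file"
}

# In an epilog this could be called as:
#   estatus=$(get_exit_status "$SGE_JOB_SPOOL_DIR/usage")
```

Matching the key exactly avoids picking up some other line that merely contains the text "exit_status".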

     Hope this helps a bit -

         John

PS - oh, one other thing I noticed in your post --
you mentioned that your problem jobs are in state "S", which
you called "sleeping" -- from the way I understand the qstat
output, capital s ("S") means that the queue is suspended
(as opposed to a small s ("s") which means that the job is
suspended). Not sure if that's just the term you use for this
or not, but I thought I'd point it out - it could be that
your problem is that the queue is getting suspended for some
reason....
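
To make the "S" versus "s" distinction concrete, here is a rough sketch that classifies a qstat state string. It only looks at those two letters and ignores everything else; describe_suspend_state is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch: describe the suspend-related letters in a qstat state string.
# describe_suspend_state is a hypothetical helper; it only handles the
# two letters discussed above and treats every other state as
# "not suspended".
describe_suspend_state() {
    local state="$1"
    case "$state" in
        *S*) echo "queue suspended" ;;  # capital S: the queue is suspended
        *s*) echo "job suspended" ;;    # small s: the job itself is suspended
        *)   echo "not suspended" ;;
    esac
}
```

You would feed it the state column from your qstat output, for example describe_suspend_state "S". Which column holds the state depends on the qstat options you use, so check that against your own output.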



my epilog file:

#!/bin/bash
# Epilog: append job status, the pe_hostfile and the trace file
# to a per-job log kept on a shared (NFS) directory.

# No trailing colon: an empty PATH entry means the current directory.
PATH=$SGE_BINARY_PATH:/bin:/usr/bin
OUTFILE=/appl/sun/grid_engine/site_PCSRL/sge_logs/$JOB_ID
myname=$(basename "$0")
me=$(id | awk '{print $1 " " $2}')   # uid/gid of the epilog user (not used below)

{
    echo "$(date) : $myname  : completing job id '$JOB_ID'"
    # The job's exit status is recorded in the "usage" file.
    estatus=$(grep exit_status "$SGE_JOB_SPOOL_DIR/usage" | cut -d '=' -f 2)
    echo "$(date) : $myname  : exit status is '$estatus'"

    echo ""
    echo ""
    echo "pe_hostfile file follows : "
    cat "$SGE_JOB_SPOOL_DIR/pe_hostfile"
    echo ""
    echo ""
    echo "trace file follows : "
    cat "$SGE_JOB_SPOOL_DIR/trace"
} >> "$OUTFILE"



madpower wrote:
>>What exactly are you trying to find out??
> 
> Well, the problem that arises in our environment is that sometimes jobs
> stop being executed on the execution hosts but they are not killed.
> I.e., they are in status "S" (sleeping). We, however, do not know why
> this happens or what to do so that it does not happen again.
> Unfortunately, the behavior cannot be reproduced since it is non
> deterministic.
> 
> So my hope was, when reading the original post(s), that there might be
> some information in the usage file which indicates further details on
> the failure (maybe mem-usage, cpu-usage, i/o, etc.).
> 
> 
>>The job directory is created on the execution host, and when the job
>>finishes, the directory is cleaned up after the job data is sent to
>>qmaster.
> 
> So this means that while a job is executed there should be some
> information (in some files) on the execution host. Is there any default
> directory where these files are stored, or any default names for these
> files, e.g., $TASK_ID.usage?
> Because then I can search for these files and have a look at them. I did
> not find, however, any file on my execution hosts having "usage" in
> their name. Or are they created only on finish of jobs - as described
> above, our jobs do not finish (neither clean nor unclean) but they sleep.
> 
> Thanks,
> Matthias
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201388
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


-- 
###########################################################################
# John Foley                          # Location:  IL93-E1-21S            #
# IT & Systems Administration         # Maildrop:  IL93-E1-35O            #
# Antenna & Mechanical Simulation Grp #    Email: john.foley at motorola.com #
# Motorola, Inc. -  Mobile Devices    #    Phone: (847) 523-8719          #
# 600 North US Highway 45             #      Fax: (847) 523-5767          #
# Libertyville, IL. 60048  (USA)      #     Cell: (847) 460-8719          #
###########################################################################
                 (this email sent using Mozilla on VPC)

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201426
