[GE users] "failed to deliver job" warnings?

John Saalwaechter bababooey182 at yahoo.com
Mon Apr 25 18:17:29 BST 2005


I'm getting a lot of "failed to deliver job..." warnings in the SGE
messages file lately.  Oddly, the jobs seem unaffected, and in fact
the accounting file shows that they completed minutes before the
warning's time stamp.  Any ideas on troubleshooting this?

========== Details: ==========
A. SGEEE 5.3p6

B. Example messages warnings (hostnames changed):
Sun Apr 24 11:15:50 2005|qmaster|xxxxxx|W|failed to deliver job 861774.332 to queue "yyyyyy21.low.q"
Sun Apr 24 11:15:50 2005|qmaster|xxxxxx|W|failed to deliver job 861774.331 to queue "yyyyyy15.low.q"
Sun Apr 24 11:15:50 2005|qmaster|xxxxxx|W|failed to deliver job 861774.328 to queue "yyyyyy19.low.q"
Sun Apr 24 11:16:03 2005|qmaster|xxxxxx|W|failed to deliver job 861774.334 to queue "yyyyyy34.low.q"
Sun Apr 24 11:16:04 2005|qmaster|xxxxxx|W|failed to deliver job 861774.339 to queue "yyyyyy10.low.q"
Sun Apr 24 11:16:04 2005|qmaster|xxxxxx|W|failed to deliver job 861774.338 to queue "yyyyyy26.low.q"
Sun Apr 24 11:16:04 2005|qmaster|xxxxxx|W|failed to deliver job 861774.337 to queue "yyyyyy20.low.q"
Sun Apr 24 11:16:23 2005|qmaster|xxxxxx|W|failed to deliver job 861774.345 to queue "yyyyyy01.low.q"
Sun Apr 24 11:16:23 2005|qmaster|xxxxxx|W|failed to deliver job 861774.344 to queue "yyyyyy09.low.q"
Sun Apr 24 11:16:23 2005|qmaster|xxxxxx|W|failed to deliver job 861774.343 to queue "yyyyyy23.low.q"
Sun Apr 24 11:16:23 2005|qmaster|xxxxxx|W|failed to deliver job 861774.342 to queue "yyyyyy11.low.q"
Sun Apr 24 11:16:23 2005|qmaster|xxxxxx|W|failed to deliver job 861774.341 to queue "yyyyyy18.low.q"

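To compare these warnings against the accounting data, it may help to
reduce them to a timestamp, job.task and queue per line.  The following is
only a sketch: it assumes each entry is a single '|'-separated line in the
messages file itself, with the fields in the order shown above, and the
function name list_delivery_warnings is made up for illustration.

```shell
#!/bin/sh
# Reads qmaster messages entries on stdin and prints, for each
# "failed to deliver job" warning: timestamp <TAB> job.task <TAB> queue.
list_delivery_warnings() {
    awk -F'|' '
        $4 == "W" && $5 ~ /^failed to deliver job / {
            n = split($5, w, " ")   # w[5] is job.task, w[n] is the quoted queue
            gsub(/"/, "", w[n])     # drop the quotes around the queue name
            print $1 "\t" w[5] "\t" w[n]
        }'
}

# Example with the last warning from the list above:
printf '%s\n' 'Sun Apr 24 11:16:23 2005|qmaster|xxxxxx|W|failed to deliver job 861774.341 to queue "yyyyyy18.low.q"' |
    list_delivery_warnings
```

Against the real file this would be run as
"list_delivery_warnings < /path/to/qmaster/messages", where the location of
the messages file depends on the installation.
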
C. Looking at that last one, the accounting file has this information about
job 861774.341 (and it actually ran on the queue listed in the warning):

submission_time: 1114354724 (Sun Apr 24 09:58:44 2005)
start_time:      1114359083 (Sun Apr 24 11:11:23 2005)
end_time:        1114359140 (Sun Apr 24 11:12:20 2005)
failed:          0
exit_status:     0
ru_wallclock:    57 (0 hrs 0 min 57 sec)

D. So this job finished at 11:12:20, but a "failed to deliver job" warning
for it was logged at 11:16:23, about four minutes after it had completed!
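
The arithmetic behind C and D can be checked with a few lines of shell.
This is just a sketch using the numbers quoted above; the helper names
to_secs and gap_secs are made up, and it assumes both clock times fall on
the same day.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight.  awk is used so that
# fields like "09" are read as decimal rather than octal.
to_secs() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds between two same-day clock times (earlier, later).
gap_secs() {
    echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

# end_time - start_time from the accounting record; matches ru_wallclock.
echo $((1114359140 - 1114359083))   # prints 57

# Job end (11:12:20) versus warning time stamp (11:16:23).
gap_secs 11:12:20 11:16:23          # prints 243
```

So the warning trails the recorded end of the job by 243 seconds, a
little over four minutes.
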

E. We do have an epilog script turned on.  It contains:
#!/bin/sh
#
# $Id: gridepilog.sh,v 1.1 2004/03/04 22:59:28 sgeadmin Exp $
#

if [ "${SGE_EPILOGUE}" != "" ]; then
        $SGE_EPILOGUE
fi
exit 0

The vast majority of our jobs do not set SGE_EPILOGUE, but the skeleton
above still has to run for every job.  Could the warning be coming from
the epilog?
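
One way to test the epilog theory would be to have the skeleton record
when it starts and finishes, then compare those times with the warning
time stamps.  The following is a rough sketch, not something we run today:
EPILOG_LOG and run_epilog are made-up names, it assumes JOB_ID and
SGE_TASK_ID are visible in the epilog's environment, and it relies on
"date +%s", which not every platform's date supports.

```shell
#!/bin/sh
# Hypothetical timing wrapper around the existing epilog logic.
# EPILOG_LOG is an assumed log location, not an SGE setting.
EPILOG_LOG="${EPILOG_LOG:-/tmp/gridepilog-timing.log}"

run_epilog() {
    start=$(date +%s)               # assumes date supports %s
    if [ -n "${SGE_EPILOGUE}" ]; then
        $SGE_EPILOGUE               # same call as the current skeleton
    fi
    end=$(date +%s)
    # JOB_ID and SGE_TASK_ID are assumed to come from the job environment.
    echo "job=${JOB_ID:-?}.${SGE_TASK_ID:-?} epilog_start=$start epilog_end=$end" >> "$EPILOG_LOG"
}
```

The epilog itself would then just call run_epilog and finish with
"exit 0" as before.  If the logged epilog times sit right next to the job's
end_time, the epilog is probably not what delays things.
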

--
John Saalwaechter <bababooey182 at yahoo.com>
