[GE users] Job Aborted under Irix 6.5.26

Patrick Lesher patrick.lesher at ecs.gatech.edu
Wed Jan 19 19:21:26 GMT 2005


Hello, I have added an Origin2k running Irix 6.5.26 to my SGE 6.0u1
cluster, and I keep getting job failed/aborted emails for jobs on it
even though the jobs run fine.

The Suns don't have this problem, and their queues are
basically copies of the SGI's.

I submitted a very simple MATLAB script and everything appears
to run fine; I get the expected results.

prompt>>qsub -q large-sgi qsub.txt

qsub.txt: 
----------------------------------------------------------------- 
#$-j y
#$-cwd
#$-m bea
#$-Mpl25 at ecs.gatech.edu
/software/bin/matlab < input.m > out
-----------------------------------------------------------------



In the $SGE_ROOT/spool/$HOST/messages file:
01/19/2005 14:06:21|execd|nighthawk|E|abnormal termination of shepherd
for job 34.1: "exit_status" file is empty
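
For anyone who wants to pull entries like this out of a larger messages
file, here is a minimal Python sketch. It assumes every entry has the
pipe-separated layout of the one quoted above (timestamp, daemon, host,
level, text) and that the mailer's line wrap has been undone. The function
names and the sample string are only illustrative; they are not part of SGE.

```python
def parse_message_line(line):
    """Split one messages-file entry into its five pipe-separated fields.

    Assumed layout (inferred from the entry quoted above, not from SGE docs):
    timestamp|daemon|host|level|text
    """
    # maxsplit=4 keeps any '|' inside the free-form text intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, text = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "text": text,
    }


def error_entries(lines):
    """Yield parsed entries whose level field is 'E'."""
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and entry["level"] == "E":
            yield entry


# The entry from above, with the wrapped line rejoined.
sample = ('01/19/2005 14:06:21|execd|nighthawk|E|abnormal termination of '
          'shepherd for job 34.1: "exit_status" file is empty')

for entry in error_entries([sample]):
    print(entry["timestamp"], entry["host"], entry["text"])
```

To run it against a real spool directory, the list passed to
error_entries() could be replaced by an open file object for the
messages file.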

The user gets the job submitted email and then a job
Aborted email:
Job 34 (qsub.txt) Aborted
 Exit Status      = 0
 Signal           = unknown signal
 User             = pl25
 Queue            = large-sgi at HOSTNAME-REMOVED
 Host             = HOSTNAME-REMOVED
 Start Time       = 01/19/2005 14:06:02
 End Time         = 01/19/2005 14:06:19
 CPU              = 00:00:04
 Max vmem         = NA
failed before writing exit_status because:
shepherd exited with exit status 19



The email sent to the administrator address contains:

Job 34 caused action: none
 User        = pl25
 Queue       = large-sgi at HOSTNAME-REMOVED
 Host        = HOSTNAME-REMOVED
 Start Time  = 01/19/2005 14:06:02
 End Time    = 01/19/2005 14:06:19
failed before writing exit_status:shepherd exited with exit status 19
Shepherd trace:
01/19/2005 14:06:02 [0:5433]: in irix code
01/19/2005 14:06:02 [0:5433]: 399
01/19/2005 14:06:02 [0:5433]: can't get id for project "none"
01/19/2005 14:06:02 [100:5433]: RLIMIT_CPU setting: (soft
9223372036854775807 hard 9223372036854775807) resulting: (soft
9223372036854775807 hard 9223372036854775807)
01/19/2005 14:06:02 [100:5433]: RLIMIT_FSIZE setting: (soft
9223372036854775807 hard 9223372036854775807) resulting: (soft
9223372036854775807 hard 9223372036854775807)
01/19/2005 14:06:02 [100:5433]: RLIMIT_DATA setting: (soft 6442450944
hard 6442450944) resulting: (soft 6442450944 hard 6442450944)
01/19/2005 14:06:02 [100:5433]: RLIMIT_STACK setting: (soft 6442450944
hard 6442450944) resulting: (soft 6442450944 hard 6442450944)
01/19/2005 14:06:02 [100:5433]: RLIMIT_CORE setting: (soft
9223372036854775807 hard 9223372036854775807) resulting: (soft
9223372036854775807 hard 9223372036854775807)
01/19/2005 14:06:02 [100:5433]: RLIMIT_VMEM setting: (soft 6442450944
hard 6442450944) resulting: (soft 6442450944 hard 6442450944)
01/19/2005 14:06:02 [100:5433]: RLIMIT_RSS setting: (soft 4294967296
hard 4294967296) resulting: (soft 4294967296 hard 4294967296)
01/19/2005 14:06:02 [500:5433]: closing all filedescriptors
01/19/2005 14:06:02 [500:5433]: further messages are in "error" and
"trace"
01/19/2005 14:06:02 [500:5433]: using stdout as stderr
01/19/2005 14:06:02 [500:5433]:
execvp(/opt/sge/ecs1/spool/nighthawk/job_scripts/34,
"/opt/sge/ecs1/spool/nighthawk/job_scripts/34")
01/19/2005 14:06:19 [100:5430]: wait3 returned 5433 (status: 0;
WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
01/19/2005 14:06:19 [100:5430]: job exited with exit status 0
01/19/2005 14:06:19 [100:5430]: reaped "job" with pid 5433
01/19/2005 14:06:19 [100:5430]: job exited not due to signal
01/19/2005 14:06:19 [100:5430]: job exited with status 0
01/19/2005 14:06:20 [100:5430]: now sending signal KILL to pid -5433
01/19/2005 14:06:20 [100:5430]: writing usage file to "usage"
01/19/2005 14:06:20 [100:5430]: no tasker to notify
01/19/2005 14:06:20 [100:5430]: no epilog script to start

Shepherd pe_hostfile:
HOSTNAME-REMOVED 1 large-sgi at HOSTNAME-REMOVED UNDEFINED
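
To check what the shepherd recorded for the job process itself, the trace
can be scanned for its wait3 line. This is only a rough sketch: it assumes
the line keeps the "WIFSIGNALED: n, WIFEXITED: n, WEXITSTATUS: n" wording
seen in the trace above, which may differ in other SGE versions. The
regular expression and the function name are my own, not anything SGE
provides.

```python
import re

# Pattern modelled on the wait3 line in the shepherd trace above; the exact
# wording is an assumption and may vary between SGE versions.
WAIT3_RE = re.compile(
    r"wait3 returned (?P<pid>\d+) \(status: (?P<status>\d+);\s*"
    r"WIFSIGNALED: (?P<signaled>\d+),\s*WIFEXITED: (?P<exited>\d+),\s*"
    r"WEXITSTATUS: (?P<exitstatus>\d+)\)"
)


def job_wait_result(trace_text):
    """Return the wait3 fields as ints, or None if no wait3 line is found."""
    # Collapse all whitespace so a line wrapped by the mailer still matches.
    flat = " ".join(trace_text.split())
    match = WAIT3_RE.search(flat)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


# Two lines copied from the trace above, wrapped as they arrived by email.
trace = """01/19/2005 14:06:19 [100:5430]: wait3 returned 5433 (status: 0;
WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
01/19/2005 14:06:19 [100:5430]: job exited with exit status 0"""

result = job_wait_result(trace)
print(result)
```

On the trace above this gives WIFEXITED 1 and WEXITSTATUS 0 for pid 5433,
so the job itself ended normally and the failure is reported only afterwards.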

Any help would be greatly appreciated.

Thanks,
Patrick


