[GE users] shepherd problem with local spool directory

Robert Dahlke r.d at lmu.de
Tue Jun 22 22:48:14 BST 2004


Hi,

I have a problem that is probably related to the use of local spool
directories. (For testing purposes I am using just one host and a simple
default queue.)

I can submit jobs, but once they are dispatched to a matching queue they
cannot be executed, because the job script has not been copied to the
local spool directory "job_scripts". This directory is empty. I don't
think the problem is caused by wrong permissions, as the execd can
happily write to the local spool directory.
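
The symptom can be double-checked by looking at the spool directory
directly. The following is only a minimal Python sketch and not part of
SGE: the helper name inspect_job_scripts is made up for illustration, and
the layout <execd spool dir>/job_scripts/<job id> is assumed from the
path in the error message quoted in this mail.

```python
import os


def inspect_job_scripts(spool_dir, job_id):
    """Report whether the script for job_id exists under spool_dir/job_scripts.

    Hypothetical helper: the directory layout is an assumption based on the
    path shown in the shepherd error, not on SGE documentation.
    """
    scripts_dir = os.path.join(spool_dir, "job_scripts")
    script_path = os.path.join(scripts_dir, str(job_id))
    dir_exists = os.path.isdir(scripts_dir)
    return {
        "scripts_dir_exists": dir_exists,
        # Note: os.access() always reports True for root on most systems.
        "scripts_dir_writable": os.access(scripts_dir, os.W_OK),
        "entries": sorted(os.listdir(scripts_dir)) if dir_exists else [],
        "script_present": os.path.isfile(script_path),
    }


if __name__ == "__main__":
    # Path and job id taken from the error message in this mail.
    print(inspect_job_scripts("/var/spool/sge/theorie/boe", 10))
```

Run it as the user the execd runs under, so that the writability check
reflects what the daemon itself would see.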

The execd message says: 

Tue Jun 22 22:49:00 2004|execd|boe|E|shepherd of job 10.1 exited with exit status = 11

The qmaster message says:

Tue Jun 22 22:49:00 2004|qmaster|boe|W|job 10.1 failed on host boe.severin.local general before job because: 06/22/2004 22:48:58 [1000:3427]: unable to find job file "/var/spool/sge/theorie/boe/job_scripts/10"
Tue Jun 22 22:49:00 2004|qmaster|boe|W|rescheduling job 10.1
Tue Jun 22 22:49:00 2004|qmaster|boe|E|queue q4 marked QERROR as result of job 10's failure
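
For anyone who wants to filter these messages files, the lines quoted
above look like five fields separated by "|": timestamp, daemon, host, a
one-letter level (E and W here) and the message text. A minimal Python
sketch under that assumption follows; the names LogEntry and
parse_message_line are invented for illustration, and the field meanings
are inferred from these sample lines only.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_message_line(line):
    """Split one messages-file line into its five '|'-separated fields.

    The message itself may contain '|', so split at most four times.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError("unexpected line format: %r" % line)
    return LogEntry(*parts)


if __name__ == "__main__":
    sample = ("Tue Jun 22 22:49:00 2004|execd|boe|E|"
              "shepherd of job 10.1 exited with exit status = 11")
    entry = parse_message_line(sample)
    print(entry.daemon, entry.level, entry.message)
```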

The output of the debug e-mail can be found in the remainder of this
mail.

Thanks for any help,
Robert.

-- 
Encrypted mail is welcome. My PGP-Key fingerprint: 
8FA1 35B3 8A70 57CD 1F2E  E58A A863 A88F F127 8E93

---------------------- MAIL -------------------------
Job 10 caused action: Queue "q4" set to ERROR
 User        = bob
 Queue       = q4
 Host        = boe.severin.local
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:06/22/2004 22:48:58 [1000:3427]: unable to find job file "/var/spool/sge/theorie/boe/job_scripts/10"
Shepherd trace:
06/22/2004 22:48:58 [1002:3426]: shepherd called with uid = 0, euid = 1002
06/22/2004 22:48:58 [1002:3426]: starting up 5.3p6
06/22/2004 22:48:58 [1002:3426]: setpgid(3426, 3426) returned 0
06/22/2004 22:48:58 [1002:3426]: no prolog script to start
06/22/2004 22:48:58 [1002:3427]: pid=3427 pgrp=3427 sid=3427 old pgrp=3426 getlogin()=<no login set>
06/22/2004 22:48:58 [1002:3427]: setosjobid: uid = 0, euid = 1002
06/22/2004 22:48:58 [1002:3427]: RLIMIT_CPU setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1002:3427]: RLIMIT_FSIZE setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1002:3427]: RLIMIT_DATA setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1002:3427]: RLIMIT_STACK setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1002:3427]: RLIMIT_CORE setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1002:3427]: RLIMIT_VMEM/RLIMIT_AS setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1002:3427]: RLIMIT_RSS setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
06/22/2004 22:48:58 [1000:3427]: closing all filedescriptors
06/22/2004 22:48:58 [1000:3427]: further messages are in "error" and "trace"
06/22/2004 22:48:58 [1002:3426]: forked "job" with pid 3427
06/22/2004 22:48:58 [1002:3426]: child: job - pid: 3427
06/22/2004 22:48:58 [1002:3426]: wait3 returned 3427 (status: 2816; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
06/22/2004 22:48:58 [1002:3426]: job exited with exit status 11
06/22/2004 22:48:58 [1002:3426]: reaped "job" with pid 3427
06/22/2004 22:48:58 [1002:3426]: job exited not due to signal
06/22/2004 22:48:58 [1002:3426]: now sending signal 9 to pid -3427
06/22/2004 22:48:58 [1002:3426]: job exited with status 11
06/22/2004 22:48:58 [1002:3426]: no tasker to notify
06/22/2004 22:48:58 [1002:3426]: failed starting job
06/22/2004 22:48:58 [1002:3426]: no epilog script to start

Shepherd error:
06/22/2004 22:48:58 [1000:3427]: unable to find job file "/var/spool/sge/theorie/boe/job_scripts/10"

Shepherd pe_hostfile:
boe.severin.local 1 q4 UNDEFINED

