[GE users] Fwd: GE 6.1u2: Job 166 failed

Sean Davis sdavis2 at mail.nih.gov
Fri Apr 18 22:43:14 BST 2008


I am running SGE 6.1u2 with Open MPI and am trying to run a simple
test.  I am a complete novice at both administering and using SGE.  I
have a parallel environment (PE) set up:

sdavis at shakespeare:~> qconf -sp orte
pe_name           orte
slots             2
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $pe_slots
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
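
For anyone comparing settings, a minimal sketch of pulling individual fields out of that listing follows. The helper name pe_field and the pe_sample variable are hypothetical, and the sample text is simply a subset of the lines above; on a live cluster one would presumably pipe the output of qconf -sp orte into the function instead.

```shell
# Hypothetical helper: print the value of one field from
# `qconf -sp <pe>` style output read on stdin.
pe_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Sample input copied from the PE listing above (single quotes keep
# the literal $pe_slots from being expanded by the shell).
pe_sample='pe_name           orte
slots             2
allocation_rule   $pe_slots
control_slaves    TRUE
job_is_first_task FALSE'

# Print the fields that matter most for a parallel job.
for key in slots allocation_rule control_slaves job_is_first_task; do
    printf '%s=%s\n' "$key" "$(printf '%s\n' "$pe_sample" | pe_field "$key")"
done
```

This only reads and prints values; it does not judge whether they are right for a given setup.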

I have also added the orte PE to the pe_list for all.q.  I have two
hosts, each with eight processors.  I start a parallel interactive
job:

sdavis at shakespeare:~> qsh -pe orte 2
Your job 166 ("INTERACTIVE") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 166 has been successfully scheduled.

In the new interactive shell, I do:

/home/sdavis> mpirun --mca pls_gridengine_verbose 1 -np 2 hostname
local configuration shakespeare.nci.nih.gov not defined - using global configuration
Starting server daemon at host "shakespeare.nci.nih.gov"
Server daemon successfully started with task id "1.shakespeare"
Establishing /usr/bin/ssh -o StrictHostKeyChecking=no session to host shakespeare.nci.nih.gov ...
/usr/bin/ssh -o StrictHostKeyChecking=no exited on signal 13 (PIPE)
reading exit code from shepherd ...  timeout (60 s) expired while waiting on socket fd 6
error: error reading returncode of remote command
[shakespeare:04628] ERROR: A daemon on node shakespeare.nci.nih.gov failed to start as expected.
[shakespeare:04628] ERROR: There may be more information available from
[shakespeare:04628] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[shakespeare:04628] ERROR: If the problem persists, please restart the
[shakespeare:04628] ERROR: Grid Engine PE job
[shakespeare:04628] ERROR: The daemon exited unexpectedly with status 255.

And the output of qstat -t:

sdavis at shakespeare:~> qstat -t
job-ID  prior   name       user         state submit/start at     queue                            master ja-task-ID task-ID state cpu        mem     io      stat failed
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    167 0.55500 INTERACTIV sdavis       r     04/18/2008 17:37:23 all.q at shakespeare.nci.nih.gov MASTER                    r     00:00:00   0.01696 0.00000
                                                                  all.q at shakespeare.nci.nih.gov SLAVE
                                                                  all.q at shakespeare.nci.nih.gov SLAVE


Below is the status email from the host.  I'm not sure what is broken
here, as there are several pieces to the puzzle.  Can someone give me
a hint or two?

Thanks,
Sean


---------- Forwarded message ----------
From: root <root at shakespeare.nci.nih.gov>
Date: Fri, Apr 18, 2008 at 5:16 PM
Subject: GE 6.1u2: Job 166 failed
To: sdavis2 at mail.nih.gov


Job 166 caused action: PE Job 166 will be deleted
  User        = sdavis
  Queue       = all.q at shakespeare.nci.nih.gov
  Host        = shakespeare.nci.nih.gov
  Start Time  = <unknown>
  End Time    = <unknown>
 failed before job:04/18/2008 17:16:02 [0:4517]: can't open file
/tmp/166.1.all.q/pid.1.shakespeare: No such file or di
 Shepherd trace:
 04/18/2008 17:14:52 [10020:4464]: shepherd called with uid = 0, euid = 10020
 04/18/2008 17:14:52 [10020:4464]: starting up 6.1u2
 04/18/2008 17:14:52 [10020:4464]: setpgid(4464, 4464) returned 0
 04/18/2008 17:14:52 [10020:4464]: no prolog script to start
 04/18/2008 17:14:52 [10020:4464]: /bin/true
 04/18/2008 17:14:52 [10020:4464]: /bin/true
 04/18/2008 17:14:52 [10020:4465]: pid=4465 pgrp=4465 sid=4465 old pgrp=4464 getlogin()=<no login set>
 04/18/2008 17:14:52 [10020:4465]: reading passwd information for user 'sdavis'
 04/18/2008 17:14:52 [10020:4464]: forked "pe_start" with pid 4465
 04/18/2008 17:14:52 [10020:4464]: using signal delivery delay of 120 seconds
 04/18/2008 17:14:52 [10020:4464]: child: pe_start - pid: 4465
 04/18/2008 17:14:52 [10020:4465]: setting limits
 04/18/2008 17:14:52 [10020:4465]: setting environment
 04/18/2008 17:14:52 [10020:4465]: Initializing error file
 04/18/2008 17:14:52 [10020:4465]: switching to intermediate/target user
 04/18/2008 17:14:52 [10005:4465]: closing all filedescriptors
 04/18/2008 17:14:52 [10005:4465]: further messages are in "error" and "trace"
 04/18/2008 17:14:52 [10005:4465]: using "/bin/bash" as shell of user "sdavis"
 04/18/2008 17:14:52 [10005:4465]: now running with uid=10005, euid=10005
 04/18/2008 17:14:52 [10005:4465]: execvp(/bin/true, "/bin/true")
 04/18/2008 17:14:52 [10005:4465]: not a GUI job, starting directly
 04/18/2008 17:14:52 [10020:4464]: wait3 returned 4465 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
 04/18/2008 17:14:52 [10020:4464]: pe_start exited with exit status 0
 04/18/2008 17:14:52 [10020:4464]: reaped "pe_start" with pid 4465
 04/18/2008 17:14:52 [10020:4464]: pe_start exited not due to signal
 04/18/2008 17:14:52 [10020:4464]: pe_start exited with status 0
 04/18/2008 17:14:52 [10020:4464]: forked "job" with pid 4466
 04/18/2008 17:14:52 [10020:4464]: child: job - pid: 4466
 04/18/2008 17:14:52 [10020:4466]: processing interactive job
 04/18/2008 17:14:52 [10020:4466]: pid=4466 pgrp=4466 sid=4466 old pgrp=4464 getlogin()=<no login set>
 04/18/2008 17:14:52 [10020:4466]: reading passwd information for user 'sdavis'
 04/18/2008 17:14:52 [10020:4466]: setosjobid: uid = 0, euid = 10020
 04/18/2008 17:14:52 [10020:4466]: setting limits
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_DATA setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_STACK setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_CORE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
 04/18/2008 17:14:52 [10020:4466]: setting environment
 04/18/2008 17:14:52 [10020:4466]: Initializing error file
 04/18/2008 17:14:52 [10020:4466]: switching to intermediate/target user
 04/18/2008 17:14:52 [10005:4466]: closing all filedescriptors
 04/18/2008 17:14:52 [10005:4466]: further messages are in "error" and "trace"
 04/18/2008 17:14:52 [10005:4466]: now running with uid=10005, euid=10005
 04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm, "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE Interactive Job 166 on shakespeare.nci.nih.gov in Queue all.q" "-e" "/bin/csh")
 04/18/2008 17:14:52 [10005:4466]: not a GUI job, starting directly

 Shepherd pe_hostfile:
 shakespeare.nci.nih.gov 2 all.q at shakespeare.nci.nih.gov <NULL>
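
As a rough cross-check of how many slots the job was actually granted, the hostfile can be summed with a few lines of shell. This is only a sketch: the function name pe_total_slots is hypothetical, and it assumes the second column of each line is the slot count, as in the single line shown above. Inside a running job the input would normally come from the file named by $PE_HOSTFILE rather than from a pasted string.

```shell
# Hypothetical helper: sum the slot counts (second column) of a
# pe_hostfile read on stdin; prints 0 for empty input.
pe_total_slots() {
    awk 'NF >= 2 { total += $2 } END { print total + 0 }'
}

# Sample line copied verbatim from the hostfile in the status email.
printf '%s\n' 'shakespeare.nci.nih.gov 2 all.q at shakespeare.nci.nih.gov <NULL>' \
    | pe_total_slots
```

For the hostfile in this email the sum should match the 2 slots requested with qsh -pe orte 2.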
