[GE users] Fwd: GE 6.1u2: Job 166 failed

Sean Davis sdavis2 at mail.nih.gov
Sat Apr 19 13:18:34 BST 2008



On Sat, Apr 19, 2008 at 6:00 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
>  On 18.04.2008 at 23:43, Sean Davis wrote:
>
>
>
>
> > I am running sge 6.1u2 and openmpi and attempting to run a simple
> > test.  I am a pure novice at both administration and use of SGE.  I
> > have a pe set up:
> >
> > sdavis at shakespeare:~> qconf -sp orte
> > pe_name           orte
> > slots             2
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /bin/true
> > stop_proc_args    /bin/true
> > allocation_rule   $pe_slots
> > control_slaves    TRUE
> > job_is_first_task FALSE
> > urgency_slots     min
> >
> > And I have added the orte pe to the pe_list for all.q.  I have two
> > hosts, each with eight processors.  I start a parallel interactive
> > job:
> >
> > sdavis at shakespeare:~> qsh -pe orte 2
> > Your job 166 ("INTERACTIVE") has been submitted
> > waiting for interactive job to be scheduled ...
> > Your interactive job 166 has been successfully scheduled.
> >
> > In the new interactive shell, I do:
> >
> > /home/sdavis> mpirun --mca pls_gridengine_verbose 1 -np 2 hostname
> > local configuration shakespeare.nci.nih.gov not defined - using global
> > configuration
> > Starting server daemon at host "shakespeare.nci.nih.gov"
> > Server daemon successfully started with task id "1.shakespeare"
> > Establishing /usr/bin/ssh -o StrictHostKeyChecking=no session to host
> >
>
>  SGE's qrsh should be used automatically instead of the default ssh when
> Open MPI discovers that it's running under SGE (also, the hostname needn't
> be specified in the mpirun call). Are the $SGE_* variables defined in your
> interactive shell?

Thanks, Reuti, for taking a look at this.  I started a new interactive job:

qsh -pe orte 1

In the interactive shell:

/home/sdavis> env | grep SGE
SGE_CELL=default
SGE_ROOT=/usr/local/sge
SGE_BINARY_PATH=/usr/local/sge/bin/lx24-amd64
SGE_O_HOME=/home/sdavis
SGE_O_LOGNAME=sdavis
SGE_O_PATH=/usr/local/sge/bin/lx26-amd64:/home/sdavis/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin
SGE_O_SHELL=/bin/bash
SGE_O_MAIL=/var/mail/sdavis
SGE_O_HOST=shakespeare
SGE_O_WORKDIR=/home/sdavis
SGE_TASK_ID=undefined
SGE_TASK_FIRST=undefined
SGE_TASK_LAST=undefined
SGE_TASK_STEPSIZE=undefined
SGE_ARCH=lx24-amd64
SGE_ACCOUNT=sge
SGE_JOB_SPOOL_DIR=/var/spool/sge//shakespeare/active_jobs/169.1
SGE_STDIN_PATH=/dev/null
SGE_STDOUT_PATH=/dev/null
SGE_STDERR_PATH=/dev/null
SGE_CWD_PATH=/home/sdavis
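
The grep above only shows names containing SGE. As a minimal sketch of a
fuller check: to my knowledge Open MPI's gridengine support looks for
SGE_ROOT, ARC, PE_HOSTFILE and JOB_ID, but that list is an assumption that
should be verified against the Open MPI documentation for the installed
version. The function name check_sge_env is made up for illustration.

```shell
#!/bin/sh
# Report which of the variables (assumed to be the ones Open MPI inspects)
# are set in the current job environment.
check_sge_env() {
    missing=0
    for var in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
        # Indirect lookup in portable sh: expand the variable named by $var.
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            echo "$var=$val"
        else
            echo "$var is not set"
            missing=1
        fi
    done
    # Non-zero return status means at least one variable was missing.
    return $missing
}

check_sge_env || echo "at least one variable is missing"
```

Run inside the qsh xterm, this would show whether the job environment
carries more than the SGE_* names.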



> >  04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm,
> > "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE Interactive
> >
>
>  How did you set up qsh to use ssh?

rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh -o StrictHostKeyChecking=no
rlogin_command               /usr/bin/ssh -o StrictHostKeyChecking=no
qlogin_daemon                /usr/sbin/sshd -i
rlogin_daemon                /usr/sbin/sshd -i
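
For reference, entries like these can be pulled out of the cluster
configuration in one step. This is a minimal sketch, assuming that
`qconf -sconf` prints the configuration as one `name value` pair per line;
the helper name remote_startup_entries is made up here.

```shell
#!/bin/sh
# Keep only the rsh/rlogin/qlogin daemon and command entries
# from configuration text read on stdin.
remote_startup_entries() {
    grep -E '^(rsh|rlogin|qlogin)_(daemon|command)[[:space:]]'
}

# Usage inside a cluster (requires the SGE qconf binary on PATH):
#   qconf -sconf | remote_startup_entries
```

Note that the listing above has no qlogin_command entry, so a filter like
this would print five lines for that configuration.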


> >
> >
> >
> > shakespeare.nci.nih.gov ...
> > /usr/bin/ssh -o StrictHostKeyChecking=no exited on signal 13 (PIPE)
> > reading exit code from shepherd ...  timeout (60 s) expired while
> > waiting on socket fd 6
> > error: error reading returncode of remote command
> > [shakespeare:04628] ERROR: A daemon on node shakespeare.nci.nih.gov
> > failed to start as expected.
> > [shakespeare:04628] ERROR: There may be more information available from
> > [shakespeare:04628] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> > [shakespeare:04628] ERROR: If the problem persists, please restart the
> > [shakespeare:04628] ERROR: Grid Engine PE job
> > [shakespeare:04628] ERROR: The daemon exited unexpectedly with status 255.
> >
> > And the output of qstat -t:
> >
> > sdavis at shakespeare:~> qstat -t
> > job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> > -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >    167 0.55500 INTERACTIV sdavis       r     04/18/2008 17:37:23 all.q at shakespeare.nci.nih.gov  MASTER                        r     00:00:00 0.01696 0.00000
> >                                                                  all.q at shakespeare.nci.nih.gov  SLAVE
> >                                                                  all.q at shakespeare.nci.nih.gov  SLAVE
> >
> >
> > Below is the email status from the host.  I'm not sure what is broken
> > here, as there are several pieces to the puzzle.  Can someone give me
> > a hint or two?
> >
> > Thanks,
> > Sean
> >
> >
> > ---------- Forwarded message ----------
> > From: root <root at shakespeare.nci.nih.gov>
> > Date: Fri, Apr 18, 2008 at 5:16 PM
> > Subject: GE 6.1u2: Job 166 failed
> > To: sdavis2 at mail.nih.gov
> >
> >
> > Job 166 caused action: PE Job 166 will be deleted
> >  User        = sdavis
> >  Queue       = all.q at shakespeare.nci.nih.gov
> >  Host        = shakespeare.nci.nih.gov
> >  Start Time  = <unknown>
> >  End Time    = <unknown>
> >  failed before job:04/18/2008 17:16:02 [0:4517]: can't open file
> > /tmp/166.1.all.q/pid.1.shakespeare: No such file or di
> >  Shepherd trace:
> >  04/18/2008 17:14:52 [10020:4464]: shepherd called with uid = 0, euid = 10020
> >  04/18/2008 17:14:52 [10020:4464]: starting up 6.1u2
> >  04/18/2008 17:14:52 [10020:4464]: setpgid(4464, 4464) returned 0
> >  04/18/2008 17:14:52 [10020:4464]: no prolog script to start
> >  04/18/2008 17:14:52 [10020:4464]: /bin/true
> >  04/18/2008 17:14:52 [10020:4464]: /bin/true
> >  04/18/2008 17:14:52 [10020:4465]: pid=4465 pgrp=4465 sid=4465 old
> > pgrp=4464 getlogin()=<no login set>
> >  04/18/2008 17:14:52 [10020:4465]: reading passwd information for user 'sdavis'
> >  04/18/2008 17:14:52 [10020:4464]: forked "pe_start" with pid 4465
> >  04/18/2008 17:14:52 [10020:4464]: using signal delivery delay of 120 seconds
> >  04/18/2008 17:14:52 [10020:4464]: child: pe_start - pid: 4465
> >  04/18/2008 17:14:52 [10020:4465]: setting limits
> >  04/18/2008 17:14:52 [10020:4465]: setting environment
> >  04/18/2008 17:14:52 [10020:4465]: Initializing error file
> >  04/18/2008 17:14:52 [10020:4465]: switching to intermediate/target user
> >  04/18/2008 17:14:52 [10005:4465]: closing all filedescriptors
> >  04/18/2008 17:14:52 [10005:4465]: further messages are in "error" and "trace"
> >  04/18/2008 17:14:52 [10005:4465]: using "/bin/bash" as shell of user "sdavis"
> >  04/18/2008 17:14:52 [10005:4465]: now running with uid=10005, euid=10005
> >  04/18/2008 17:14:52 [10005:4465]: execvp(/bin/true, "/bin/true")
> >  04/18/2008 17:14:52 [10005:4465]: not a GUI job, starting directly
> >  04/18/2008 17:14:52 [10020:4464]: wait3 returned 4465 (status: 0;
> > WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> >  04/18/2008 17:14:52 [10020:4464]: pe_start exited with exit status 0
> >  04/18/2008 17:14:52 [10020:4464]: reaped "pe_start" with pid 4465
> >  04/18/2008 17:14:52 [10020:4464]: pe_start exited not due to signal
> >  04/18/2008 17:14:52 [10020:4464]: pe_start exited with status 0
> >  04/18/2008 17:14:52 [10020:4464]: forked "job" with pid 4466
> >  04/18/2008 17:14:52 [10020:4464]: child: job - pid: 4466
> >  04/18/2008 17:14:52 [10020:4466]: processing interactive job
> >  04/18/2008 17:14:52 [10020:4466]: pid=4466 pgrp=4466 sid=4466 old
> > pgrp=4464 getlogin()=<no login set>
> >  04/18/2008 17:14:52 [10020:4466]: reading passwd information for user 'sdavis'
> >  04/18/2008 17:14:52 [10020:4466]: setosjobid: uid = 0, euid = 10020
> >  04/18/2008 17:14:52 [10020:4466]: setting limits
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_CPU setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_FSIZE setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_DATA setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_STACK setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_CORE setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_VMEM/RLIMIT_AS setting:
> > (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: RLIMIT_RSS setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> >  04/18/2008 17:14:52 [10020:4466]: setting environment
> >  04/18/2008 17:14:52 [10020:4466]: Initializing error file
> >  04/18/2008 17:14:52 [10020:4466]: switching to intermediate/target user
> >  04/18/2008 17:14:52 [10005:4466]: closing all filedescriptors
> >  04/18/2008 17:14:52 [10005:4466]: further messages are in "error" and "trace"
> >  04/18/2008 17:14:52 [10005:4466]: now running with uid=10005, euid=10005
> >  04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm,
> > "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE Interactive
> > Job 166 on shakespeare.nci.nih.gov in Queue all.q" "-e" "/bin/csh")
> >  04/18/2008 17:14:52 [10005:4466]: not a GUI job, starting directly
> >
> >  Shepherd pe_hostfile:
> >  shakespeare.nci.nih.gov 2 all.q at shakespeare.nci.nih.gov <NULL>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
