[GE users] Fwd: GE 6.1u2: Job 166 failed

Reuti reuti at staff.uni-marburg.de
Sat Apr 19 11:00:29 BST 2008


Hi,

On 18.04.2008 at 23:43, Sean Davis wrote:

> I am running sge 6.1u2 and openmpi and attempting to run a simple
> test.  I am a pure novice at both administration and use of SGE.  I
> have a pe set up:
>
> sdavis@shakespeare:~> qconf -sp orte
> pe_name           orte
> slots             2
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   $pe_slots
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> And I have added the orte pe to the pe_list for all.q.  I have two
> hosts, each with eight processors.  I start a parallel interactive
> job:
>
> sdavis@shakespeare:~> qsh -pe orte 2
> Your job 166 ("INTERACTIVE") has been submitted
> waiting for interactive job to be scheduled ...
> Your interactive job 166 has been successfully scheduled.
>
> In the new interactive shell, I do:
>
> /home/sdavis> mpirun --mca pls_gridengine_verbose 1 -np 2 hostname
> local configuration shakespeare.nci.nih.gov not defined - using global configuration
> Starting server daemon at host "shakespeare.nci.nih.gov"
> Server daemon successfully started with task id "1.shakespeare"
> Establishing /usr/bin/ssh -o StrictHostKeyChecking=no session to host

SGE's qrsh should be used automatically instead of the default ssh
when Open MPI discovers that it's running under SGE (also, the
hostname need not be specified in the mpirun call). Are the $SGE_*
variables defined in your interactive shell?
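
A quick check from inside the interactive shell could look like the
sketch below. The variable list (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) is
an assumption, based on what Open MPI's Grid Engine support is usually
described as looking for; check_sge_env is a name made up for this
example.

```shell
#!/bin/sh
# Print each variable a tight SGE integration is assumed to rely on,
# and flag the ones that are missing. Returns 1 if any is unset.
check_sge_env() {
    missing=0
    for var in SGE_ROOT ARC PE_HOSTFILE JOB_ID; do
        # Indirect lookup of the variable named in $var (POSIX sh).
        eval "val=\${$var:-}"
        if [ -n "$val" ]; then
            echo "$var=$val"
        else
            echo "$var is NOT set"
            missing=1
        fi
    done
    return "$missing"
}

check_sge_env || echo "At least one variable is missing in this shell."
```

If any of these come back unset inside the qsh xterm, that would be a
plausible reason for mpirun not recognising the SGE environment.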

>  04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm,
> "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE Interactive

How did you set up qsh to use ssh?
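
The remote-startup settings can be read out of the cluster
configuration. The sketch below assumes the usual sge_conf parameter
names (qlogin_*, rlogin_*, rsh_*); show_remote_startup is a hypothetical
helper that only filters text piped into it.

```shell
#!/bin/sh
# Keep only the remote-startup lines from "qconf -sconf" style output.
show_remote_startup() {
    grep -E '^(qlogin|rlogin|rsh)_(command|daemon)[[:space:]]'
}

# Only query the cluster if qconf is actually on the PATH.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | show_remote_startup \
        || echo "no remote startup settings found"
fi
```

An rsh_command entry pointing at /usr/bin/ssh would typically account
for the ssh session shown in the mpirun output above.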

-- Reuti


> shakespeare.nci.nih.gov ...
> /usr/bin/ssh -o StrictHostKeyChecking=no exited on signal 13 (PIPE)
> reading exit code from shepherd ...  timeout (60 s) expired while waiting on socket fd 6
> error: error reading returncode of remote command
> [shakespeare:04628] ERROR: A daemon on node shakespeare.nci.nih.gov failed to start as expected.
> [shakespeare:04628] ERROR: There may be more information available from
> [shakespeare:04628] ERROR: the 'qstat -t' command on the Grid Engine tasks.
> [shakespeare:04628] ERROR: If the problem persists, please restart the
> [shakespeare:04628] ERROR: Grid Engine PE job
> [shakespeare:04628] ERROR: The daemon exited unexpectedly with status 255.
>
> And the output of qstat -t:
>
> sdavis@shakespeare:~> qstat -t
> job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID task-ID state cpu        mem     io      stat failed
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
>     167 0.55500 INTERACTIV sdavis       r     04/18/2008 17:37:23 all.q@shakespeare.nci.nih.gov  MASTER                        r 00:00:00 0.01696 0.00000
>                                                                   all.q@shakespeare.nci.nih.gov  SLAVE
>                                                                   all.q@shakespeare.nci.nih.gov  SLAVE
>
>
> Below is the email status from the host.  I'm not sure what is broken
> here, as there are several pieces to the puzzle.  Can someone give me
> a hint or two?
>
> Thanks,
> Sean
>
>
> ---------- Forwarded message ----------
> From: root <root at shakespeare.nci.nih.gov>
> Date: Fri, Apr 18, 2008 at 5:16 PM
> Subject: GE 6.1u2: Job 166 failed
> To: sdavis2 at mail.nih.gov
>
>
> Job 166 caused action: PE Job 166 will be deleted
>   User        = sdavis
>   Queue       = all.q@shakespeare.nci.nih.gov
>   Host        = shakespeare.nci.nih.gov
>   Start Time  = <unknown>
>   End Time    = <unknown>
>  failed before job:04/18/2008 17:16:02 [0:4517]: can't open file
> /tmp/166.1.all.q/pid.1.shakespeare: No such file or di
>  Shepherd trace:
>  04/18/2008 17:14:52 [10020:4464]: shepherd called with uid = 0, euid = 10020
>  04/18/2008 17:14:52 [10020:4464]: starting up 6.1u2
>  04/18/2008 17:14:52 [10020:4464]: setpgid(4464, 4464) returned 0
>  04/18/2008 17:14:52 [10020:4464]: no prolog script to start
>  04/18/2008 17:14:52 [10020:4464]: /bin/true
>  04/18/2008 17:14:52 [10020:4464]: /bin/true
>  04/18/2008 17:14:52 [10020:4465]: pid=4465 pgrp=4465 sid=4465 old pgrp=4464 getlogin()=<no login set>
>  04/18/2008 17:14:52 [10020:4465]: reading passwd information for user 'sdavis'
>  04/18/2008 17:14:52 [10020:4464]: forked "pe_start" with pid 4465
>  04/18/2008 17:14:52 [10020:4464]: using signal delivery delay of 120 seconds
>  04/18/2008 17:14:52 [10020:4464]: child: pe_start - pid: 4465
>  04/18/2008 17:14:52 [10020:4465]: setting limits
>  04/18/2008 17:14:52 [10020:4465]: setting environment
>  04/18/2008 17:14:52 [10020:4465]: Initializing error file
>  04/18/2008 17:14:52 [10020:4465]: switching to intermediate/target user
>  04/18/2008 17:14:52 [10005:4465]: closing all filedescriptors
>  04/18/2008 17:14:52 [10005:4465]: further messages are in "error" and "trace"
>  04/18/2008 17:14:52 [10005:4465]: using "/bin/bash" as shell of user "sdavis"
>  04/18/2008 17:14:52 [10005:4465]: now running with uid=10005, euid=10005
>  04/18/2008 17:14:52 [10005:4465]: execvp(/bin/true, "/bin/true")
>  04/18/2008 17:14:52 [10005:4465]: not a GUI job, starting directly
>  04/18/2008 17:14:52 [10020:4464]: wait3 returned 4465 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
>  04/18/2008 17:14:52 [10020:4464]: pe_start exited with exit status 0
>  04/18/2008 17:14:52 [10020:4464]: reaped "pe_start" with pid 4465
>  04/18/2008 17:14:52 [10020:4464]: pe_start exited not due to signal
>  04/18/2008 17:14:52 [10020:4464]: pe_start exited with status 0
>  04/18/2008 17:14:52 [10020:4464]: forked "job" with pid 4466
>  04/18/2008 17:14:52 [10020:4464]: child: job - pid: 4466
>  04/18/2008 17:14:52 [10020:4466]: processing interactive job
>  04/18/2008 17:14:52 [10020:4466]: pid=4466 pgrp=4466 sid=4466 old pgrp=4464 getlogin()=<no login set>
>  04/18/2008 17:14:52 [10020:4466]: reading passwd information for user 'sdavis'
>  04/18/2008 17:14:52 [10020:4466]: setosjobid: uid = 0, euid = 10020
>  04/18/2008 17:14:52 [10020:4466]: setting limits
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_DATA setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_STACK setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_CORE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
>  04/18/2008 17:14:52 [10020:4466]: setting environment
>  04/18/2008 17:14:52 [10020:4466]: Initializing error file
>  04/18/2008 17:14:52 [10020:4466]: switching to intermediate/target user
>  04/18/2008 17:14:52 [10005:4466]: closing all filedescriptors
>  04/18/2008 17:14:52 [10005:4466]: further messages are in "error" and "trace"
>  04/18/2008 17:14:52 [10005:4466]: now running with uid=10005, euid=10005
>  04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm, "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE Interactive Job 166 on shakespeare.nci.nih.gov in Queue all.q" "-e" "/bin/csh")
>  04/18/2008 17:14:52 [10005:4466]: not a GUI job, starting directly
>
>  Shepherd pe_hostfile:
>  shakespeare.nci.nih.gov 2 all.q@shakespeare.nci.nih.gov <NULL>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net





