[GE users] Fwd: GE 6.1u2: Job 166 failed

Reuti reuti at staff.uni-marburg.de
Sat Apr 19 16:05:35 BST 2008


Am 19.04.2008 um 14:18 schrieb Sean Davis:
> On Sat, Apr 19, 2008 at 6:00 AM, Reuti <reuti at staff.uni-marburg.de>  
> wrote:
>>  Am 18.04.2008 um 23:43 schrieb Sean Davis:
>>> I am running sge 6.1u2 and openmpi and attempting to run a simple
>>> test.  I am a pure novice at both administration and use of SGE.  I
>>> have a pe set up:
>>>
>>> sdavis at shakespeare:~> qconf -sp orte
>>> pe_name           orte
>>> slots             2
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /bin/true
>>> stop_proc_args    /bin/true
>>> allocation_rule   $pe_slots
>>> control_slaves    TRUE
>>> job_is_first_task FALSE
>>> urgency_slots     min
>>>
>>> And I have added the orte pe to the pe_list for all.q.  I have two
>>> hosts, each with eight processors.  I start a parallel interactive
>>> job:
>>>
>>> sdavis at shakespeare:~> qsh -pe orte 2
>>> Your job 166 ("INTERACTIVE") has been submitted
>>> waiting for interactive job to be scheduled ...
>>> Your interactive job 166 has been successfully scheduled.
>>>
>>> In the new interactive shell, I do:
>>>
>>> /home/sdavis> mpirun --mca pls_gridengine_verbose 1 -np 2 hostname
>>> local configuration shakespeare.nci.nih.gov not defined - using  
>>> global
>>> configuration
>>> Starting server daemon at host "shakespeare.nci.nih.gov"
>>> Server daemon successfully started with task id "1.shakespeare"
>>> Establishing /usr/bin/ssh -o StrictHostKeyChecking=no session to  
>>> host
>>>
>>
>>  SGE's qrsh should be used automatically instead of the default
>> ssh when Open MPI discovers that it's running under SGE (also, the
>> hostname needn't be specified in the mpirun call). Are the $SGE_*
>> variables defined in your interactive shell?
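A quick way to check from inside the interactive shell is a small helper like the one below. This is only a sketch: which environment variables Open MPI's gridengine support actually tests depends on the Open MPI version, so treat the variable list as an assumption and verify it against your version's documentation.

```shell
# check_sge_env: report SGE-related variables that Open MPI's
# gridengine support is commonly said to look for (an assumption --
# verify the exact list against your Open MPI version).
check_sge_env() {
    for v in SGE_ROOT SGE_CELL JOB_ID PE_HOSTFILE; do
        eval "val=\$$v"
        if [ -z "$val" ]; then
            echo "$v: NOT SET"
        else
            echo "$v: $val"
        fi
    done
}
```

In particular, if PE_HOSTFILE is missing in the shell, mpirun has no way to learn which nodes were granted and is likely to fall back to its ssh startup.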
>
> Thanks, Reuti, for taking a look at this.
>
> qsh -pe orte 1
>
> In the interactive shell:
>
> /home/sdavis> env | grep SGE
> SGE_CELL=default
> SGE_ROOT=/usr/local/sge
> SGE_BINARY_PATH=/usr/local/sge/bin/lx24-amd64
> SGE_O_HOME=/home/sdavis
> SGE_O_LOGNAME=sdavis
> SGE_O_PATH=/usr/local/sge/bin/lx26-amd64:/home/sdavis/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/opt/kde3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin
> SGE_O_SHELL=/bin/bash
> SGE_O_MAIL=/var/mail/sdavis
> SGE_O_HOST=shakespeare
> SGE_O_WORKDIR=/home/sdavis
> SGE_TASK_ID=undefined
> SGE_TASK_FIRST=undefined
> SGE_TASK_LAST=undefined
> SGE_TASK_STEPSIZE=undefined
> SGE_ARCH=lx24-amd64
> SGE_ACCOUNT=sge
> SGE_JOB_SPOOL_DIR=/var/spool/sge//shakespeare/active_jobs/169.1
> SGE_STDIN_PATH=/dev/null
> SGE_STDOUT_PATH=/dev/null
> SGE_STDERR_PATH=/dev/null
> SGE_CWD_PATH=/home/sdavis
>
>
>
>>>  04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm,
>>> "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE  
>>> Interactive
>>>
>>
>>  How did you setup qsh to use ssh?
>
> rsh_daemon                   /usr/sbin/sshd -i
> rsh_command                  /usr/bin/ssh -o StrictHostKeyChecking=no
> rlogin_command               /usr/bin/ssh -o StrictHostKeyChecking=no
> qlogin_daemon                /usr/sbin/sshd -i
> rlogin_daemon                /usr/sbin/sshd -i

This is not used for qsh. So I wonder how you could get an xterm
there with localhost:14. The execd is running on the same machine as
your terminal - in this case it works, as the already established
connection can be reused (although I wonder how the "localhost:14"
made it to this xterm process. Are you using -V by default?). But
this will not work in the cluster from another machine.
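For reference, the mapping between the interactive clients and the entries in the global configuration (qconf -mconf) is roughly the following. This is a sketch of the SGE 6.x behavior; the xterm path shown is illustrative.

```shell
# Which global configuration entry each interactive client uses:
#   qsh        -> "xterm" (started directly by the execd;
#                 rsh_*/qlogin_* entries are not consulted)
#   qlogin     -> qlogin_command / qlogin_daemon
#   qrsh       -> rlogin_command / rlogin_daemon
#   qrsh <cmd> (and tight PE task startup) -> rsh_command / rsh_daemon
xterm                        /usr/bin/X11/xterm
```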


>>> shakespeare.nci.nih.gov ...
>>> /usr/bin/ssh -o StrictHostKeyChecking=no exited on signal 13 (PIPE)
>>> reading exit code from shepherd ...  timeout (60 s) expired while
>>> waiting on socket fd 6
>>> error: error reading returncode of remote command
>>> [shakespeare:04628] ERROR: A daemon on node shakespeare.nci.nih.gov
>>> failed to start as expected.
>>> [shakespeare:04628] ERROR: There may be more information  
>>> available from
>>> [shakespeare:04628] ERROR: the 'qstat -t' command on the Grid Engine
>> tasks.
>>> [shakespeare:04628] ERROR: If the problem persists, please  
>>> restart the
>>> [shakespeare:04628] ERROR: Grid Engine PE job
>>> [shakespeare:04628] ERROR: The daemon exited unexpectedly with  
>>> status 255.

Any firewall on the machine? The SGE startup of the processes will
use a random port, not the usual 22. So at least local connections
from any port must be allowed. (You could even revert to the usual
rsh startup for local use: as SGE starts its own daemon per task, you
don't have to have rshd running all the time or have it enabled in
/etc/xinetd.d.)
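A sketch of such a configuration (via qconf -mconf), using the $SGE_ROOT and architecture string visible in your environment output above - adjust both to match your installation. Equivalently, simply deleting the rsh_* overrides should restore these defaults.

```shell
# Revert qrsh task startup to SGE's bundled rsh/rshd: SGE launches the
# daemon itself for each task on a random port, so no system-wide rshd
# (or /etc/xinetd.d entry) is needed.
rsh_command                  /usr/local/sge/utilbin/lx24-amd64/rsh
rsh_daemon                   /usr/local/sge/utilbin/lx24-amd64/rshd -l
```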

-- Reuti


>>> And the output of qstat -t:
>>>
>>> sdavis at shakespeare:~> qstat -t
>>> job-ID  prior   name       user         state submit/start at
>>> queue                          master ja-task-ID task-ID state cpu
>>>   mem     io      stat failed
>>>
>> -----------------------------------------------------------------------------
>>>    167 0.55500 INTERACTIV sdavis       r     04/18/2008 17:37:23
>>> all.q at shakespeare.nci.nih.gov  MASTER                        r
>>> 00:00:00 0.01696 0.00000
>>>
>>> all.q at shakespeare.nci.nih.gov  SLAVE
>>>
>>> all.q at shakespeare.nci.nih.gov  SLAVE
>>>
>>>
>>> Below is the email status from the host.  I'm not sure what is  
>>> broken
>>> here, as there are several pieces to the puzzle.  Can someone  
>>> give me
>>> a hint or two?
>>>
>>> Thanks,
>>> Sean
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: root <root at shakespeare.nci.nih.gov>
>>> Date: Fri, Apr 18, 2008 at 5:16 PM
>>> Subject: GE 6.1u2: Job 166 failed
>>> To: sdavis2 at mail.nih.gov
>>>
>>>
>>> Job 166 caused action: PE Job 166 will be deleted
>>>  User        = sdavis
>>>  Queue       = all.q at shakespeare.nci.nih.gov
>>>  Host        = shakespeare.nci.nih.gov
>>>  Start Time  = <unknown>
>>>  End Time    = <unknown>
>>>  failed before job:04/18/2008 17:16:02 [0:4517]: can't open file
>>> /tmp/166.1.all.q/pid.1.shakespeare: No such file or di
>>>  Shepherd trace:
>>>  04/18/2008 17:14:52 [10020:4464]: shepherd called with uid = 0,  
>>> euid =
>> 10020
>>>  04/18/2008 17:14:52 [10020:4464]: starting up 6.1u2
>>>  04/18/2008 17:14:52 [10020:4464]: setpgid(4464, 4464) returned 0
>>>  04/18/2008 17:14:52 [10020:4464]: no prolog script to start
>>>  04/18/2008 17:14:52 [10020:4464]: /bin/true
>>>  04/18/2008 17:14:52 [10020:4464]: /bin/true
>>>  04/18/2008 17:14:52 [10020:4465]: pid=4465 pgrp=4465 sid=4465 old
>>> pgrp=4464 getlogin()=<no login set>
>>>  04/18/2008 17:14:52 [10020:4465]: reading passwd information for  
>>> user
>> 'sdavis'
>>>  04/18/2008 17:14:52 [10020:4464]: forked "pe_start" with pid 4465
>>>  04/18/2008 17:14:52 [10020:4464]: using signal delivery delay of  
>>> 120
>> seconds
>>>  04/18/2008 17:14:52 [10020:4464]: child: pe_start - pid: 4465
>>>  04/18/2008 17:14:52 [10020:4465]: setting limits
>>>  04/18/2008 17:14:52 [10020:4465]: setting environment
>>>  04/18/2008 17:14:52 [10020:4465]: Initializing error file
>>>  04/18/2008 17:14:52 [10020:4465]: switching to intermediate/ 
>>> target user
>>>  04/18/2008 17:14:52 [10005:4465]: closing all filedescriptors
>>>  04/18/2008 17:14:52 [10005:4465]: further messages are in  
>>> "error" and
>> "trace"
>>>  04/18/2008 17:14:52 [10005:4465]: using "/bin/bash" as shell of  
>>> user
>> "sdavis"
>>>  04/18/2008 17:14:52 [10005:4465]: now running with uid=10005,  
>>> euid=10005
>>>  04/18/2008 17:14:52 [10005:4465]: execvp(/bin/true, "/bin/true")
>>>  04/18/2008 17:14:52 [10005:4465]: not a GUI job, starting directly
>>>  04/18/2008 17:14:52 [10020:4464]: wait3 returned 4465 (status: 0;
>>> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
>>>  04/18/2008 17:14:52 [10020:4464]: pe_start exited with exit  
>>> status 0
>>>  04/18/2008 17:14:52 [10020:4464]: reaped "pe_start" with pid 4465
>>>  04/18/2008 17:14:52 [10020:4464]: pe_start exited not due to signal
>>>  04/18/2008 17:14:52 [10020:4464]: pe_start exited with status 0
>>>  04/18/2008 17:14:52 [10020:4464]: forked "job" with pid 4466
>>>  04/18/2008 17:14:52 [10020:4464]: child: job - pid: 4466
>>>  04/18/2008 17:14:52 [10020:4466]: processing interactive job
>>>  04/18/2008 17:14:52 [10020:4466]: pid=4466 pgrp=4466 sid=4466 old
>>> pgrp=4464 getlogin()=<no login set>
>>>  04/18/2008 17:14:52 [10020:4466]: reading passwd information for  
>>> user
>> 'sdavis'
>>>  04/18/2008 17:14:52 [10020:4466]: setosjobid: uid = 0, euid = 10020
>>>  04/18/2008 17:14:52 [10020:4466]: setting limits
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_CPU setting: (soft
>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_FSIZE setting: (soft
>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_DATA setting: (soft
>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_STACK setting: (soft
>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_CORE setting: (soft
>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_VMEM/RLIMIT_AS setting:
>>> (soft 18446744073709551615 hard 18446744073709551615) resulting:  
>>> (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: RLIMIT_RSS setting: (soft
>>> 18446744073709551615 hard 18446744073709551615) resulting: (soft
>>> 18446744073709551615 hard 18446744073709551615)
>>>  04/18/2008 17:14:52 [10020:4466]: setting environment
>>>  04/18/2008 17:14:52 [10020:4466]: Initializing error file
>>>  04/18/2008 17:14:52 [10020:4466]: switching to intermediate/ 
>>> target user
>>>  04/18/2008 17:14:52 [10005:4466]: closing all filedescriptors
>>>  04/18/2008 17:14:52 [10005:4466]: further messages are in  
>>> "error" and
>> "trace"
>>>  04/18/2008 17:14:52 [10005:4466]: now running with uid=10005,  
>>> euid=10005
>>>  04/18/2008 17:14:52 [10005:4466]: execvp(/usr/bin/X11/xterm,
>>> "/usr/bin/X11/xterm" "-display" "localhost:14.0" "-n" "SGE  
>>> Interactive
>>> Job 166 on shakespeare.nci.nih.gov in Queue all.q" "-e" "/bin/csh")
>>>  04/18/2008 17:14:52 [10005:4466]: not a GUI job, starting directly
>>>
>>>  Shepherd pe_hostfile:
>>>  shakespeare.nci.nih.gov 2 all.q at shakespeare.nci.nih.gov <NULL>
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>
>>
>





