[GE users] Jobs Pending and qrsh fails

neoideo axischire at gmail.com
Fri Apr 2 18:00:51 BST 2010


I made some progress.

I deleted the all.q queue and created a new one.
Submitting batch or interactive jobs now works fine.

However, the problem comes back as soon as I try to submit an Open MPI job.

For example:

ijorge:~ cristobal$ qrsh -V -pe pempi 2 mpirun -np 4 hostname
ijorge:~ cristobal$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
cola.q@ijorge.local            BIP   0/2/2          0.13     darwin-x86
     49 0.50000 mpirun     cristobal    r     04/02/2010 12:45:46     2
ijorge:~ cristobal$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
cola.q@ijorge.local            BIP   0/2/2          0.13     darwin-x86
     49 0.50000 mpirun     cristobal    r     04/02/2010 12:45:46     2
ijorge:~ cristobal$ sudo nano /etc/profile
Password:
ijorge:~ cristobal$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
cola.q@ijorge.local            BIP   0/0/2          0.08     darwin-x86    E

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     49 0.50000 mpirun     cristobal    qw    04/02/2010 12:45:46     2
ijorge:~ cristobal$ qstat -j 49
==============================================================
job_number:                 49
exec_file:                  job_scripts/49
submission_time:            Fri Apr  2 12:45:46 2010
owner:                      cristobal
uid:                        503
group:                      staff
gid:                        20
sge_o_home:                 /Users/cristobal
sge_o_log_name:             cristobal
sge_o_path:                 /common/bin/darwin-x86:/opt/openmpi-1.4.1/bin:/usr/local/bin:/common/bin/darwin-x86:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /Users/cristobal
sge_o_host:                 ijorge
account:                    sge
stderr_path_list:           NONE:NONE:/dev/null
mail_list:                  cristobal at ijorge.local
notify:                     FALSE
job_name:                   mpirun
stdout_path_list:           NONE:NONE:/dev/null
jobshare:                   0
restart:                    n
env_list:                   MANPATH=/common/man:/usr/share/man:/usr/local/share/man:/usr/X11/man,TERM_PROGRAM=Apple_Terminal,TERM=xterm-color,SHELL=/bin/bash,TMPDIR=/var/folders/ll/llO6JraKH3yaexMAxqjz4E+++TQ/-Tmp-/,Apple_PubSub_Socket_Render=/tmp/launch-qHGuGc/Render,SGE_CELL=default,TERM_PROGRAM_VERSION=240.2,USER=cristobal,COMMAND_MODE=unix2003,SSH_AUTH_SOCK=/tmp/launch-GI2rHY/Listeners,__CF_USER_TEXT_ENCODING=0x1F7:0:86,XGRID_CONTROLLER_PASSWORD=NONE,PATH=/common/bin/darwin-x86:/opt/openmpi-1.4.1/bin:/usr/local/bin:/common/bin/darwin-x86:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin,PWD=/Users/cristobal,SGE_ROOT=/common,SHLVL=1,HOME=/Users/cristobal,DYLD_LIBRARY_PATH=/common/lib/darwin-x86,LOGNAME=cristobal,XGRID_CONTROLLER_HOSTNAME=ijorge.local,LC_CTYPE=UTF-8,SGE_CLUSTER_NAME=p6444,DISPLAY=/tmp/launch-t6fUwO/:0,SECURITYSESSIONID=daf940,_=/common/bin/darwin-x86/qrsh,QRSH_PORT=ijorge.local:49552,QRSH_COMMAND=mpirun?-np?4?hostname
script_file:                mpirun
parallel environment:  pempi range: 2
error reason    1:          04/02/2010 12:45:46 [503:499]: execvp(/bin/true, "/bin/true") failed: No such file or directory
scheduling info:            queue instance "cola.q@ijorge.local" dropped because it is temporarily not available
                            All queues dropped because of overload or full


I'm on a Mac running Leopard 10.5.6. The interesting part is where it says "execvp(/bin/true, "/bin/true") failed".
After this error the queue goes into the error state (E) again, and every job I submit stays pending forever.
What do you suggest?
PS: I have Open MPI 1.4.1 installed in /opt/openmpi-1.4.1/

Thanks in advance,
Cristobal
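
One possible reading of the execvp error above, offered as a sketch and not a confirmed diagnosis: a parallel environment created from the SGE template often has start_proc_args and stop_proc_args set to /bin/true, while on Mac OS X `true` normally lives in /usr/bin/true. Assuming pempi still carries that default, the check below reports where `true` actually is on the host and prints the PE's current settings. The helper name find_true is made up for illustration.

```shell
# Print the first executable `true` binary among the usual locations.
find_true() {
  for candidate in /bin/true /usr/bin/true; do
    if [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

echo "true found at: $(find_true)"

# Only query SGE when its tools are on the PATH.
if command -v qconf >/dev/null 2>&1; then
  # Show what the PE runs before and after each parallel job.
  qconf -sp pempi | grep -E 'start_proc_args|stop_proc_args'
fi
```

If start_proc_args does point at a path that does not exist, `qconf -mp pempi` opens the PE definition in an editor so it can be changed to the path printed above, and `qmod -cq cola.q@ijorge.local` should clear the E state of the queue afterwards.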




On Fri, Apr 2, 2010 at 1:39 PM, Cristobal Navarro <axischire at gmail.com> wrote:
Are my replies reaching the mailing list?
Could someone please confirm? I solved the problem, but I need to know this first.
Thanks,
Cristobal




On Fri, Apr 2, 2010 at 12:45 PM, rayson <rayrayson at gmail.com> wrote:
On 4/2/10, neoideo <axischire at gmail.com> wrote:
> Yesterday I was able to submit jobs with qsub or qrsh without problems;
> everything worked fine.
> Now all my jobs just show up as pending when I check them with qstat -f. This
> started after a shutdown and power-on.
> The list of running jobs is also empty, which is odd.

You can turn on "schedd_job_info" (see sched_conf(5)), and then do a
qstat -j to find out why SGE is not scheduling jobs for you.

Rayson
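
To make the suggestion above concrete: schedd_job_info is part of the scheduler configuration, which `qconf -ssconf` prints and `qconf -msconf` opens for editing. Once it is set to true, the lines of `qstat -j` output that matter most are "error reason" and "scheduling info". A small filter for them, as a sketch only (the function name why_pending is made up here, and job number 49 is the one from this thread):

```shell
# Keep only the "error reason" and "scheduling info" fields of `qstat -j`
# output, together with their indented continuation lines.
why_pending() {
  awk '
    /^(error reason|scheduling info)/ { keep = 1; print; next }
    /^[ \t]/ && keep                  { print; next }
                                      { keep = 0 }
  '
}

# Only call qstat when SGE is actually installed on this machine.
if command -v qstat >/dev/null 2>&1; then
  qstat -j 49 | why_pending
fi
```

The filter reads from standard input, so it can also be run on a saved copy of the output, for example `why_pending < job49.txt` (a hypothetical file name).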



>
> For example, if I run:
>
> ijorge:/ cristobal$ qrsh -verbose -q all.q hostname
> Your job 38 ("hostname") has been submitted
> waiting for interactive job to be scheduled ...
>
> Your "qrsh" request could not be scheduled, try again later.
>
> This is a test cluster, so I only have one node, with the qmaster and
> execd on the same machine. Again, this was working yesterday; no
> update was installed, I only shut down the Mac.
> I noticed that the qmaster and execd daemons no longer start as
> "root". Could that be the problem, and how do I fix it?
>
> ijorge:/ cristobal$ ps aux | grep sge
> cristobal   225   0.2  0.1   610088   3368   ??  Ss   10:35AM   0:00.77 /common/bin/darwin-x86/sge_qmaster
> cristobal    55   0.0  0.0   603492   1408   ??  S<s  10:26AM   0:00.31 /common/bin/darwin-x86/sge_execd
> cristobal   555   0.0  0.0   599780    456 s000  R+   11:10AM   0:00.00 grep sge
>
> ijorge:/ cristobal$ ls -la /common/bin/darwin-x86/ | grep sge
> -rwxr-xr-x@  1 root  wheel   158584 Mar 31 18:45 sge_coshepherd
> -rwxr-xr-x@  1 root  wheel  1697376 Mar 31 18:45 sge_execd
> -rwxr-xr-x@  1 root  wheel  2583644 Mar 31 18:45 sge_qmaster
> -rwxr-xr-x@  1 root  wheel  1398848 Mar 31 18:45 sge_shadowd
> -rwxr-xr-x@  1 root  wheel  2499752 Mar 31 18:45 sge_shepherd
> -rwxr-xr-x@  1 root  wheel     2115 Mar 31 18:45 sgeinspect
> -r-s--x--x   1 root  wheel   975652 Mar 31 18:45 sgepasswd
>
>
> ijorge:/ cristobal$ ls -la /Library/LaunchDaemons/ | grep sge
> -rw-r--r--   1 root  wheel   782 Mar 31 23:58 net.sunsource.gridengine.sgeexecd.plist
> -rw-r--r--   1 root  wheel   786 Mar 31 23:58 net.sunsource.gridengine.sgeqmaster.plist
>
>
> more info from messages
>
> Any help is welcome. Thanks in advance!
>
> Cristobal
>
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252141
