[GE users] Jobs Pending and qrsh fails

templedf daniel.templeton at oracle.com
Fri Apr 2 18:14:21 BST 2010


Are your execds running? Are your queues enabled?
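
One rough way to check both things is to look at the states column of "qstat -f". The sketch below is only an illustration: explain_queue_state is a hypothetical helper, not part of Grid Engine, and the letter meanings are taken from the list in qstat(1). The live-cluster commands in the trailing comments are shown but not run, and all.q is just the queue name used in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): expand the state letters
# shown in the "states" column of `qstat -f` into words, per qstat(1).
explain_queue_state() {
    states=$1
    if [ -z "$states" ]; then
        echo "no state flags set"
        return 0
    fi
    out=""
    while [ -n "$states" ]; do
        rest=${states#?}            # everything after the first letter
        letter=${states%"$rest"}    # the first letter itself
        states=$rest
        case $letter in
            u) word="unknown (execd cannot be contacted)" ;;
            a) word="load alarm" ;;
            A) word="suspend alarm" ;;
            s) word="suspended" ;;
            S) word="suspended by subordination" ;;
            C) word="suspended by calendar" ;;
            d) word="disabled" ;;
            D) word="disabled by calendar" ;;
            E) word="error" ;;
            *) word="unrecognized flag '$letter'" ;;
        esac
        out="${out:+$out, }$word"
    done
    echo "$out"
}

# On a live cluster (not run here) the checks would look like:
#   ps aux | grep '[s]ge_execd'   # is the execution daemon up?
#   qstat -f                      # per-queue-instance states
#   qmod -e all.q                 # re-enable all.q if it shows "d"
```

For example, "explain_queue_state au" prints "load alarm, unknown (execd cannot be contacted)", which would point at the execd rather than at a disabled queue.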

Daniel

On 04/02/10 08:54, neoideo wrote:
> OK,
>
> I managed to get the job info for an example job, submitted with
>
> $ qsub -q all.q /common/examples/jobs/simple.sh
>
> This is the job info:
> ijorge:~ cristobal$ qstat -j 41
> ==============================================================
> job_number:                 41
> exec_file:                  job_scripts/41
> submission_time:            Fri Apr  2 11:44:24 2010
> owner:                      cristobal
> uid:                        503
> group:                      staff
> gid:                        20
> sge_o_home:                 /Users/cristobal
> sge_o_log_name:             cristobal
> sge_o_path:                 /opt/openmpi-1.4.1/bin:/usr/local/bin:/common/bin/darwin-x86:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
> sge_o_shell:                /bin/bash
> sge_o_workdir:              /Users/cristobal
> sge_o_host:                 ijorge
> account:                    sge
> mail_list:                  cristobal at ijorge.local
> notify:                     FALSE
> job_name:                   simple.sh
> jobshare:                   0
> hard_queue_list:            all.q
> shell_list:                 NONE:/bin/sh
> env_list:                   MANPATH=/usr/share/man:/usr/local/share/man:/usr/X11/man,TERM_PROGRAM=Apple_Terminal,TERM=xterm-color,SHELL=/bin/bash,TMPDIR=/var/folders/ll/llO6JraKH3yaexMAxqjz4E+++TQ/-Tmp-/,Apple_PubSub_Socket_Render=/tmp/launch-qHGuGc/Render,TERM_PROGRAM_VERSION=240.2,USER=cristobal,COMMAND_MODE=unix2003,SSH_AUTH_SOCK=/tmp/launch-GI2rHY/Listeners,__CF_USER_TEXT_ENCODING=0x1F7:0:86,XGRID_CONTROLLER_PASSWORD=NONE,PATH=/opt/openmpi-1.4.1/bin:/usr/local/bin:/common/bin/darwin-x86:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin,PWD=/Users/cristobal,SGE_ROOT=/common,SHLVL=1,HOME=/Users/cristobal,LOGNAME=cristobal,XGRID_CONTROLLER_HOSTNAME=ijorge.local,LC_CTYPE=UTF-8,DISPLAY=/tmp/launch-t6fUwO/:0,SECURITYSESSIONID=daf940,_=/common/bin/darwin-x86/qsub
> script_file:                /common/examples/jobs/simple.sh
> scheduling info:            queue instance "all.q@ijorge.local" dropped because it is temporarily not available
>                              All queues dropped because of overload or full
>
>
>
>
> Cristobal
>
>
>
>
> On Fri, Apr 2, 2010 at 12:45 PM, rayson <rayrayson at gmail.com> wrote:
>
>     On 4/2/10, neoideo <axischire at gmail.com> wrote:
>      > Yesterday I was able to submit jobs with qsub or qrsh without
>      > problems; everything worked fine.
>      > Now all my jobs just stay pending when I check them with qstat -f.
>      > This happened after a shutdown/power-on.
>      > The running jobs list is also empty, which is weird.
>
>     You can turn on "schedd_job_info" (see sched_conf(5)), and then do a
>     qstat -j to find out why SGE is not scheduling jobs for you.
>
>     Rayson
>
>
>
>      >
>      > for example if i run
>      >
>      > ijorge:/ cristobal$ qrsh -verbose -q all.q hostname
>      > Your job 38 ("hostname") has been submitted
>      > waiting for interactive job to be scheduled ...
>      >
>      > Your "qrsh" request could not be scheduled, try again later.
>      >
>      > This is a test cluster, so I only have 1 node, with qmaster and
>      > execd on the same machine. I repeat that this was working yesterday;
>      > no update was installed, I only shut down the Mac.
>      > I noticed that the qmaster and execd daemons now do not start as
>      > "root". Can that be the problem, and how do I fix it?
>      >
>      > ijorge:/ cristobal$ ps aux | grep sge
>      > cristobal   225   0.2  0.1   610088   3368   ??  Ss   10:35AM   0:00.77 /common/bin/darwin-x86/sge_qmaster
>      > cristobal    55   0.0  0.0   603492   1408   ??  S<s  10:26AM   0:00.31 /common/bin/darwin-x86/sge_execd
>      > cristobal   555   0.0  0.0   599780    456 s000  R+   11:10AM   0:00.00 grep sge
>      >
>      > ijorge:/ cristobal$ ls -la /common/bin/darwin-x86/ | grep sge
>      > -rwxr-xr-x@  1 root  wheel   158584 Mar 31 18:45 sge_coshepherd
>      > -rwxr-xr-x@  1 root  wheel  1697376 Mar 31 18:45 sge_execd
>      > -rwxr-xr-x@  1 root  wheel  2583644 Mar 31 18:45 sge_qmaster
>      > -rwxr-xr-x@  1 root  wheel  1398848 Mar 31 18:45 sge_shadowd
>      > -rwxr-xr-x@  1 root  wheel  2499752 Mar 31 18:45 sge_shepherd
>      > -rwxr-xr-x@  1 root  wheel     2115 Mar 31 18:45 sgeinspect
>      > -r-s--x--x   1 root  wheel   975652 Mar 31 18:45 sgepasswd
>      >
>      >
>      > ijorge:/ cristobal$ ls -la /Library/LaunchDaemons/ | grep sge
>      > -rw-r--r--   1 root  wheel   782 Mar 31 23:58 net.sunsource.gridengine.sgeexecd.plist
>      > -rw-r--r--   1 root  wheel   786 Mar 31 23:58 net.sunsource.gridengine.sgeqmaster.plist
>      >
>      >
>      > more info from messages
>      >
>      > any help is welcome, thanks in advance!
>      >
>      > Cristobal
>      >
>      >
>      >
>
>     ------------------------------------------------------
>     http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252141
>
>     To unsubscribe from this discussion, e-mail:
>     [users-unsubscribe at gridengine.sunsource.net].
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252150
