[GE users] Jobs Pending and qrsh fails

neoideo axischire at gmail.com
Fri Apr 2 19:19:07 BST 2010



Maybe this information will help some beginners like me.
From what I understood:

The OpenMPI problem made the queue fail and set its state to E (error).
According to the documentation, as a safety measure, each time an unexpected error occurs the queue is put into the E state, so it cannot receive more jobs until it is manually cleared.
In this state the scheduler reports messages like "dropped because of overload or full", which can confuse the administrator a lot.
To clear the error state and activate the queue again, use this command:
$ qmod -c queue.q

Also kill the problematic job, which may be left in a zombie state.
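
For anyone landing here with the same symptom, a minimal sketch of how the error-state check could be scripted. It assumes the usual `qstat -f` layout, where a queue instance line carries its state letters as the sixth whitespace-separated field; `error_queues` is a hypothetical helper name, not part of Grid Engine.

```shell
#!/bin/sh
# error_queues: hypothetical helper, not part of Grid Engine.
# Reads `qstat -f` output on stdin and prints every queue instance
# (first field contains "@") whose state column includes E (error).
# Assumption: the state letters are the 6th whitespace-separated field,
# and that field is simply absent when the queue instance has no state.
error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Typical use on a machine with Grid Engine installed:
#   qstat -f -explain E        # show why a queue instance is in state E
#   qstat -f | error_queues    # list the affected queue instances
#   qmod -cq all.q             # clear the error state once the cause is fixed
```

As far as I can tell from the qmod man page, `-cq` is the queue-specific form of the `-c` switch used above; either one should clear the E state.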

Cristobal




On Fri, Apr 2, 2010 at 3:13 PM, Cristobal Navarro <axischire at gmail.com> wrote:
Guys, I think I'm getting your replies with some lag.

Even though it has been fixed,
I'll answer your questions:

a) yes queues were enabled

b) qhost is okay too, load below 1.75
ijorge:~ cristobal$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
ijorge.local            darwin-x86      2  0.12    4.0G    1.1G     0.0     0.0
localhost               darwin-x86      2     -    4.0G       -     0.0       -
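
The 1.75 figure is the load threshold the queue is compared against. A rough sketch of that check, assuming the threshold applies to load divided by CPU count (np_load_avg) and that a `-` in the LOAD column means no load report has arrived from that host; `check_load` is a hypothetical helper, not a Grid Engine command.

```shell
#!/bin/sh
# check_load: hypothetical helper, not part of Grid Engine.
# Reads `qhost` output on stdin and classifies each execution host.
# Optional first argument: threshold for LOAD / NCPU (assumed default 1.75).
check_load() {
    awk -v thr="${1:-1.75}" '
        NF < 4                    { next }   # blank or malformed lines
        NR <= 2 || $1 == "global" { next }   # header, separator, global row
        $3 == "-" || $4 == "-"    { print $1, "no-load-report"; next }
        ($4 / $3) > thr           { print $1, "overloaded"; next }
                                  { print $1, "ok" }
    '
}

# Typical use on a machine with Grid Engine installed:
#   qhost | check_load
```

A host reported as `no-load-report` is worth checking first, since the scheduler has no load values to compare for it.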


regards,
Cristobal







On Fri, Apr 2, 2010 at 2:31 PM, rayson <rayrayson at gmail.com> wrote:
On 4/2/10, neoideo <axischire at gmail.com> wrote:
> scheduling info:            queue instance "all.q@ijorge.local" dropped because it is temporarily not available
>                             All queues dropped because of overload or full

The line above is the hint.

What is the output of qhost?? Is the ijorge machine overloaded??

Rayson
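
A small sketch for pulling just those hint lines out of `qstat -j`, in case the full job listing is long. It assumes the layout quoted above: a line starting with `scheduling info:` followed by indented continuation lines. `sched_info` is a hypothetical helper name, and the output is only populated when `schedd_job_info` is turned on in the scheduler configuration.

```shell
#!/bin/sh
# sched_info: hypothetical helper, not part of Grid Engine.
# Reads `qstat -j <job_id>` output on stdin and prints only the
# scheduler's reasons, one per line, with leading padding removed.
# Assumption: continuation lines of the block start with whitespace.
sched_info() {
    awk '
        /^scheduling info:/ {
            show = 1
            sub(/^scheduling info:[ \t]*/, "")
            print
            next
        }
        show && /^[ \t]/ { sub(/^[ \t]+/, ""); print; next }
        { show = 0 }   # any other line ends the block
    '
}

# Typical use on a machine with Grid Engine installed:
#   qconf -msconf            # set schedd_job_info to true (see sched_conf(5))
#   qstat -j 38 | sched_info
```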



>
>
>
>
> Cristobal
>
>
>
>
>
> On Fri, Apr 2, 2010 at 12:45 PM, rayson <rayrayson at gmail.com> wrote:
> >
> >
> > On 4/2/10, neoideo <axischire at gmail.com> wrote:
> > > yesterday i was able to submit jobs with qsub or with qrsh without problems,
> > > everything worked fine.
> > > now all my jobs just appear pending when i check them with qstat -f. this
> > > happened after shutdown/power on
> > > also running jobs list is empty, so its weird.
> >
> > You can turn on "schedd_job_info" (see sched_conf(5)), and then do a
> > qstat -j to find out why SGE is not scheduling jobs for you.
> >
> > Rayson
> >
> >
> >
> >
> >
> >
> > >
> > > for example if i run
> > >
> > > ijorge:/ cristobal$ qrsh -verbose -q all.q hostname
> > > Your job 38 ("hostname") has been submitted
> > > waiting for interactive job to be scheduled ...
> > >
> > > Your "qrsh" request could not be scheduled, try again later.
> > >
> > > this is a test cluster that i have, so i only have 1 node with qmaster and
> > > exec in the same machine. i repeat that this was working yesterday, no
> > > update was installed, just shutdown the mac.
> > > i noticed that now the qmaster daemon and execd daemon do not start as
> > > "root". can that be the problem, and how do i fix it?
> > >
> > > ijorge:/ cristobal$ ps aux | grep sge
> > > cristobal   225   0.2  0.1   610088   3368   ??  Ss   10:35AM   0:00.77 /common/bin/darwin-x86/sge_qmaster
> > > cristobal    55   0.0  0.0   603492   1408   ??  S<s  10:26AM   0:00.31 /common/bin/darwin-x86/sge_execd
> > > cristobal   555   0.0  0.0   599780    456 s000  R+   11:10AM   0:00.00 grep sge
> > >
> > > ijorge:/ cristobal$ ls -la /common/bin/darwin-x86/ | grep sge
> > > -rwxr-xr-x@  1 root  wheel   158584 Mar 31 18:45 sge_coshepherd
> > > -rwxr-xr-x@  1 root  wheel  1697376 Mar 31 18:45 sge_execd
> > > -rwxr-xr-x@  1 root  wheel  2583644 Mar 31 18:45 sge_qmaster
> > > -rwxr-xr-x@  1 root  wheel  1398848 Mar 31 18:45 sge_shadowd
> > > -rwxr-xr-x@  1 root  wheel  2499752 Mar 31 18:45 sge_shepherd
> > > -rwxr-xr-x@  1 root  wheel     2115 Mar 31 18:45 sgeinspect
> > > -r-s--x--x   1 root  wheel   975652 Mar 31 18:45 sgepasswd
> > >
> > >
> > > ijorge:/ cristobal$ ls -la /Library/LaunchDaemons/ | grep sge
> > > -rw-r--r--   1 root  wheel   782 Mar 31 23:58 net.sunsource.gridengine.sgeexecd.plist
> > > -rw-r--r--   1 root  wheel   786 Mar 31 23:58 net.sunsource.gridengine.sgeqmaster.plist
> > >
> > >
> > > more info from messages
> > >
> > > any help is welcome, thanks in advance!
> > >
> > > Cristobal
> > >
> > >
> > >
> >
> > ------------------------------------------------------
> >
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252141
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> >
>
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252151





