[GE users] Your "qrsh" request could not be scheduled, try again later.

Kirk Patton kpatton at transmeta.com
Wed Oct 6 19:27:30 BST 2004


That message generally appears when the cluster is full and no machines are
available for immediate dispatch.

You can wait for a machine by including '-now no' on the qrsh command line.
From 'man qsub':
       Qrsh jobs can only run in INTERACTIVE queues unless the option -now  no
       is used (see below).  They can also only be run, if the sge_execd(8) is
       running under the root account.
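
As a rough sketch, here is the command from the original post with '-now no'
added. The qrsh_wait helper name is hypothetical and only for illustration;
the queue name bladei.q and the other flags are taken from the quoted message
below. The helper prints the command and runs it only if qrsh is found on the
PATH.

```shell
# Hypothetical helper (not part of SGE): prepend "-now no" to a qrsh call so
# the request waits for a free slot instead of failing right away.
qrsh_wait() {
    # Show the full command first so it can be reviewed.
    echo "qrsh -now no $*"
    # Only submit when qrsh is actually available on this host.
    if command -v qrsh >/dev/null 2>&1; then
        qrsh -now no "$@"
    fi
}

# Same request as in the original post, with "-now no" added.
qrsh_wait -verbose -V -q bladei.q pwd
```

If the request then sits waiting instead of returning the error, the cluster
was most likely just full at the time.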

Kirk

On Wed, Oct 06, 2004 at 07:12:22PM +0100, Mohammed Iqbal wrote:
> 
> Hi All,
> 
> I've recently set up SGE on a small group of Sun Solaris workstations. After some setup hurdles it ran fine until I decided to add another submission host. God knows what has happened, but now all I get when submitting to any host is:
> 
> qrsh -verbose -V -q bladei.q pwd
> waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
> Your "qrsh" request could not be scheduled, try again later.
> 
> All the queues, SGE daemons, /etc/service etc. should be set up OK, as it has been running fine until now...
> 
> USER   PID %CPU   SZ  RSS  VSZ        TIME TT      COMMAND
> root 28417  0.1  409 2336 3272        3:04 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_commd
> root 28419  0.1 1223 6720 9784        3:01 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_qmaster
> root 28425  0.3  615 2576 4920       31:58 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_execd
> root 28422  0.1  674 2912 5392        2:41 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_schedd
> 
> Please help: what could be the cause, and how can I clear it up and get it running again?
> 
> To add a new machine do I execute qconf -ah bladei and then install_execd on the machine?
> 
> Thanks
> Mohammed.
> 
> System Display Group, Sharp Labs of Europe
> Edmund Halley Road, Oxford Science Park
> Oxford, OX4 4GB. United Kingdom.
> Tel: +44-1865 334299 / Fax: +44-1865 747717
> Mohammed.Iqbal at sharp.co.uk / www.sle.sharp.co.uk
> 
> 
> 
> -----Original Message-----
> From: Ranga Srinivasan [mailto:ranga at bizrate.com]
> Sent: 06 October 2004 17:46
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] getting failed before writing
> exit_status:shepherd exited with exit status 19 --- Additional info
> 
> 
> Hi
> 
> I am resending the email as I did not get any answer to my problem.
> Some additional info:
> 
> 1. /grid/gridware01/default/spool/cruncher01/job_scripts/1 is created as
> -rw-r--r--    1 gridadm  gridadm       992 Oct  6 09:25 1
> 
> 2. All the directories are 777 from /grid onwards.
> 
> 3. If I try to run the process on the gridmaster machine itself I get the
> same error messages.
> 
> I am totally confused: after doing all that, I still get the error email even
> though the process completes the execution of the script successfully.
> 
> Questions
> 
> 1. Which directory should I check to see whether it has the right permissions?
> 2. Is there a better way to debug this problem?
> 3. Are there any docs on all the error messages that come out of the
> shepherd process?
> 4. What does "failed before writing exit_status:shepherd exited with exit
> status 19" mean?
> 
> 
> Can someone help me resolve this issue?
> 
> Thanks in Advance
> 
> Ranga
> 
> 
> -----Original Message-----
> From: Ranga Srinivasan [mailto:ranga at bizrate.com]
> Sent: Monday, October 04, 2004 4:02 PM
> To: users at gridengine.sunsource.net
> Subject: [GE users] getting failed before writing exit_status:shepherd
> exited with exit status 19
> 
> 
> Hi
> 
> After recreating the whole Grid setup to use one NFS mount shared across the
> gridmaster and the execution host, I am getting the same errors. What does
> "failed before writing exit_status:shepherd exited with exit status 19" mean?
> I did chmod -R 777 on the gridware directory. So I am assuming it should be
> a file permission issue.
> 
> It completes the simple.sh script, but sends the error email.
> 
> Is there something else I need to do to make it work w/o sending me error
> messages?
> 
> Any help or pointer would be really helpful. I am totally confused as to why
> I am getting an error message after ensuring the gridmaster and the execution
> host share the same NFS-mounted device.
> 
> Thanks again
> 
> Ranga
> 
> -----Original Message-----
> From: root [mailto:root at cruncher01.bizrate.com]
> Sent: Monday, October 04, 2004 3:54 PM
> To: ranga at bizrate.com
> Subject: SGE 6.0u1: Job 1 failed
> 
> 
> Job 1 caused action: none
>  User        = gridadm
>  Queue       = all.q at cruncher01.bizrate.com
>  Host        = cruncher01.bizrate.com
>  Start Time  = 10/04/2004 15:53:18
>  End Time    = 10/04/2004 15:53:38
> failed before writing exit_status:shepherd exited with exit status 19
> Shepherd trace:
> 10/04/2004 15:53:18 [10461:4462]: shepherd called with uid = 0, euid = 10461
> 10/04/2004 15:53:18 [10461:4462]: starting up 6.0u1
> 10/04/2004 15:53:18 [10461:4466]: closing all filedescriptors
> 10/04/2004 15:53:18 [10461:4466]: further messages are in "error" and "trace"
> 10/04/2004 15:53:18 [10461:4466]: using stdout as stderr
> 10/04/2004 15:53:18 [10461:4466]: execvp(/bin/bash, "bash" "/grid/gridware01/default/spool/cruncher01/job_scripts/1")
> 
> Shepherd pe_hostfile:
> cruncher01.bizrate.com 1 all.q at cruncher01.bizrate.com UNDEFINED
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
> 
> 

-- 
Kirk Patton
Unix Administrator
Transmeta Inc.
Tel. 408 919-3055
