[GE users] Your "qrsh" request could not be scheduled, try again later.

Mohammed Iqbal mohammed.iqbal at sharp.co.uk
Wed Oct 6 19:12:22 BST 2004




Hi All,

I've recently set up SGE on a small group of Sun Solaris workstations. After some setup hurdles it had been running fine until I decided to add another submission host. God knows what has happened, but now all I get when submitting to any host is:

qrsh -verbose -V -q bladei.q pwd
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 5
Your "qrsh" request could not be scheduled, try again later.

All the queues, SGE daemons, /etc/services entries, etc. should be set up OK, as it has been running fine until now...

    USER   PID %CPU   SZ  RSS  VSZ        TIME TT      COMMAND
    root 28417  0.1  409 2336 3272        3:04 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_commd
    root 28419  0.1 1223 6720 9784        3:01 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_qmaster
    root 28425  0.3  615 2576 4920       31:58 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_execd
    root 28422  0.1  674 2912 5392        2:41 ?       /home/cadsw/gnu/grid/bin/solaris64/sge_schedd
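
A process listing like the one above can also be checked mechanically. What follows is only a minimal sketch, assuming a POSIX shell. The function name missing_sge_daemons is hypothetical, and the four daemon names are simply the ones shown in the listing. It reads ps output on stdin and prints each daemon that does not appear in it.

```shell
# Hypothetical helper (not part of SGE): reads `ps` output on stdin and
# prints the name of each expected daemon that is absent from it.
missing_sge_daemons() {
    ps_output=$(cat)
    for daemon in sge_commd sge_qmaster sge_execd sge_schedd; do
        # Plain substring match against the whole captured listing.
        case "$ps_output" in
            *"$daemon"*) ;;              # found, nothing to report
            *) echo "$daemon" ;;         # not found, report it
        esac
    done
}
```

It could be fed with something like

    ps -e -o user,pid,args | missing_sge_daemons

where empty output would mean all four names were found. This only shows that the processes exist, not that they can talk to each other.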

Please help: what could be the cause, and how can I clear it up and get it running again?

To add a new machine, do I execute qconf -ah bladei and then run install_execd on that machine?

Thanks
Mohammed.

System Display Group, Sharp Labs of Europe
Edmund Halley Road, Oxford Science Park
Oxford, OX4 4GB. United Kingdom.
Tel: +44-1865 334299 / Fax: +44-1865 747717
Mohammed.Iqbal at sharp.co.uk / www.sle.sharp.co.uk



-----Original Message-----
From: Ranga Srinivasan [mailto:ranga at bizrate.com]
Sent: 06 October 2004 17:46
To: users at gridengine.sunsource.net
Subject: RE: [GE users] getting failed before writing
exit_status:shepherd exited with exit status 19 --- Additional info


Hi

I am resending the email as I did not get any answer to my problem.
Some additional info:

1. /grid/gridware01/default/spool/cruncher01/job_scripts/1 is created as
-rw-r--r--    1 gridadm  gridadm       992 Oct  6 09:25 1

2. All the directories are 777 from /grid onwards.

3. If I try to run the process on the gridmaster machine itself I get the
same error messages.
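
One way to double-check the permission points above is to list every directory on the way down to the job script. This is a minimal sketch assuming a POSIX shell; the function name show_path_perms is hypothetical and not part of SGE.

```shell
# Hypothetical helper: prints `ls -ld` for a path and for each of its
# parent directories, so a missing read or execute bit is easy to spot.
show_path_perms() {
    path=$1
    while :; do
        ls -ld "$path"
        # Stop once the filesystem root has been listed.
        [ "$path" = "/" ] && break
        parent=$(dirname "$path")
        # For relative paths dirname settles on "."; stop there as well.
        [ "$parent" = "$path" ] && break
        path=$parent
    done
}
```

Run as the job owner, for example

    show_path_perms /grid/gridware01/default/spool/cruncher01/job_scripts/1

it would show the mode of the script and of each directory above it, as seen from that host.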

I am totally confused: after doing all that, I still get the error email, even
though the process completes the execution of the script successfully.

Questions

1. Which directory should I be looking at to check whether it has the right
permissions?
2. Is there a better way to debug this problem?
3. Are there any docs on all the error messages that come out of the
shepherd process?
4. What does "failed before writing exit_status:shepherd exited with exit
status 19" mean?


Can someone help me to resolve this issue?

Thanks in Advance

Ranga


-----Original Message-----
From: Ranga Srinivasan [mailto:ranga at bizrate.com]
Sent: Monday, October 04, 2004 4:02 PM
To: users at gridengine.sunsource.net
Subject: [GE users] getting failed before writing exit_status:shepherd
exited with exit status 19


Hi

After recreating the whole Grid setup to use one NFS mount shared across the
gridmaster and the execution host, I am getting the same errors. What does
"failed before writing exit_status:shepherd exited with exit status 19" mean?
I did chmod -R 777 on the gridware directory. So I am assuming it should be
a file permission issue.

It completes the simple.sh script, but sends the error email.

Is there something else I need to do to make it work without sending me error
messages?

Any help or pointers would be really helpful. I am totally confused as to why
I am getting an error message after ensuring the gridmaster and the execution
host share the same NFS-mounted device.

Thanks again

Ranga

-----Original Message-----
From: root [mailto:root at cruncher01.bizrate.com]
Sent: Monday, October 04, 2004 3:54 PM
To: ranga at bizrate.com
Subject: SGE 6.0u1: Job 1 failed


Job 1 caused action: none
 User        = gridadm
 Queue       = all.q at cruncher01.bizrate.com
 Host        = cruncher01.bizrate.com
 Start Time  = 10/04/2004 15:53:18
 End Time    = 10/04/2004 15:53:38
failed before writing exit_status:shepherd exited with exit status 19
Shepherd trace:
10/04/2004 15:53:18 [10461:4462]: shepherd called with uid = 0, euid = 10461
10/04/2004 15:53:18 [10461:4462]: starting up 6.0u1
10/04/2004 15:53:18 [10461:4466]: closing all filedescriptors
10/04/2004 15:53:18 [10461:4466]: further messages are in "error" and "trace"
10/04/2004 15:53:18 [10461:4466]: using stdout as stderr
10/04/2004 15:53:18 [10461:4466]: execvp(/bin/bash, "bash" "/grid/gridware01/default/spool/cruncher01/job_scripts/1")

Shepherd pe_hostfile:
cruncher01.bizrate.com 1 all.q at cruncher01.bizrate.com UNDEFINED
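
If several of these failure mails pile up, the status number can be pulled out of the saved text for comparison. This is a minimal sketch assuming a POSIX shell and sed; the function name shepherd_exit_status is hypothetical, and the pattern is taken only from the message wording shown above.

```shell
# Hypothetical helper: reads mail text on stdin and prints the number from
# every "shepherd exited with exit status N" line it finds.
shepherd_exit_status() {
    sed -n 's/.*shepherd exited with exit status \([0-9][0-9]*\).*/\1/p'
}
```

For the report above it would print 19. It only extracts the number and says nothing about what that status means.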



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net










