[GE users] trouble installing Windows hosts, shepherd exit status 127

Eric Wu ewu at bbn.com
Fri May 23 13:21:55 BST 2008


In case anybody runs into this, we figured it out.

The cause was a permissions problem on the spool directory.  We changed 
the permissions on /var/spool/sge to 777 and it worked.  It seems the 
'sgeadmin' user and COMPUTERNAME+sgeadmin are treated as different users, 
or something along those lines.
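
For anyone checking the same thing, here is a minimal Python sketch that 
reports the mode bits of a spool directory and whether the user running 
the script can write to it.  The path /var/spool/sge is the one mentioned 
above; the helper name spool_report is made up for illustration and is 
not part of Grid Engine.  Run it as the account that execd uses, since 
that is the account whose access matters.

```python
import os
import stat


def spool_report(path):
    """Return (octal_mode, world_writable, writable_by_me) for a directory."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)          # permission bits only
    world_writable = bool(mode & stat.S_IWOTH)
    # Creating files in a directory needs both write and search permission.
    writable_by_me = os.access(path, os.W_OK | os.X_OK)
    return format(mode, "03o"), world_writable, writable_by_me


if __name__ == "__main__":
    target = "/var/spool/sge"  # spool directory from the post; adjust as needed
    try:
        octal_mode, world, mine = spool_report(target)
    except OSError as exc:
        print(f"cannot stat {target}: {exc}")
    else:
        print(f"{target}: mode={octal_mode} "
              f"world_writable={world} writable_by_current_user={mine}")
```

Mode 777 is a blunt fix.  If the report shows the directory is writable 
only because of the world-write bit, the owner or group of the directory 
probably does not match the account execd runs as.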

We have a different problem now: sgeexecd on Windows works if you kill 
and restart sge_execd, but not at boot.  Running the debugger, it gives an 
error like

can't load libdl.so.3.5

in /var/adm/log/init.log

but it works interactively.  Still hunting it down.  Has anybody else had 
this problem?
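
One thing worth ruling out when a daemon loads its libraries 
interactively but not at boot is a difference in environment variables 
between the two contexts.  That is only a guess here, not a confirmed 
cause.  The Python sketch below compares two saved environment dumps 
(one NAME=value pair per line, as printed by `env`) and lists the 
variables that differ.  The function names are hypothetical helpers, and 
how the boot-time dump gets captured is left to the reader.

```python
def parse_env(text):
    """Parse `env`-style output (NAME=value per line) into a dict."""
    result = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:          # skip lines without a NAME= prefix
            result[name] = value
    return result


def diff_env(boot, interactive, keys=None):
    """Return {name: (boot_value, interactive_value)} for differing variables.

    A value of None means the variable is unset in that context.
    Pass `keys` to restrict the comparison to specific variables.
    """
    names = (set(boot) | set(interactive)) if keys is None else set(keys)
    return {
        name: (boot.get(name), interactive.get(name))
        for name in sorted(names)
        if boot.get(name) != interactive.get(name)
    }
```

Typical use would be to call parse_env on the contents of each dump file 
and then diff_env(boot, interactive, keys=["PATH", "LD_LIBRARY_PATH"]) 
to look at the library search settings first.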


Eric Wu wrote:

> Any help here is appreciated.
>
> I am trying to install a Windows execution host.
>
> I tried to follow Beat Rubischon's instructions here
>
> http://www.0x1b.ch/misc/papers/sge/sgeonwindows.pdf
>
>
> My server is a RHEL 5 clone.  I tried installing both the "supported" 
> and the "free" packages (that is, I tried one, then the other).
>
> My client is Windows 2003 R2 with SUA.
>
> I am able to get the Windows client to show load (see below)
> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> -------------------------------------------------------------------------------
> global                  -               -     -       -       -       -       -
> f207                    lx24-amd64      1  0.14  372.7M  152.9M  768.7M  104.0K
>   all.q                BIP   0/1
> winsge202               win32-EM64T     2  0.00   16.0G  850.3M   17.4G  235.8M
>   all.q                BIP   0/2
>
>
> but when I submit a job I get this error in the server messages file:
>
> 05/11/2008 15:04:36|qmaster|f207|W|job 22.1 failed on host winsge202.bbn.com invalid execution state because: shepherd exited with exit status 127
> 05/11/2008 15:04:36|qmaster|f207|W|job 24.1 failed on host winsge202.bbn.com invalid execution state because: shepherd exited with exit status 127
>
>
> The client messages file shows the following:
>
> 05/12/2008 16:57:46|execd|winsge202|E|abnormal termination of shepherd for job 26.1: no "exit_status" file
> 05/12/2008 16:57:46|execd|winsge202|E|can't open file active_jobs/26.1/error: No such file or directory
> 05/12/2008 16:57:46|execd|winsge202|E|can't open pid file "active_jobs/26.1/pid" for job 26.1
> 05/12/2008 16:57:46|execd|winsge202|E|can't open usage file "active_jobs/26.1/usage" for job 26.1: No such file or directory
> 05/12/2008 16:57:46|execd|winsge202|E|shepherd exited with exit status 127
> 05/12/2008 16:58:09|execd|winsge202|E|shepherd of job 27.1 exited with exit status = 127
> 05/12/2008 16:58:09|execd|winsge202|W|reaping job "27" ptf complains: Job does not exist
> 05/12/2008 16:58:09|execd|winsge202|E|abnormal termination of shepherd for job 27.1: no "exit_status" file
> 05/12/2008 16:58:09|execd|winsge202|E|can't open file active_jobs/27.1/error: No such file or directory
> 05/12/2008 16:58:09|execd|winsge202|E|can't open pid file "active_jobs/27.1/pid" for job 27.1
> 05/12/2008 16:58:09|execd|winsge202|E|can't open usage file "active_jobs/27.1/usage" for job 27.1: No such file or directory
> 05/12/2008 16:58:09|execd|winsge202|E|shepherd exited with exit status 127
> 05/12/2008 16:58:09|execd|winsge202|E|shepherd of job 28.1 exited with exit status = 127
> 05/12/2008 16:58:09|execd|winsge202|W|reaping job "28" ptf complains: Job does not exist
> 05/12/2008 16:58:09|execd|winsge202|E|abnormal termination of shepherd for job 28.1: no "exit_status" file
> 05/12/2008 16:58:09|execd|winsge202|E|can't open file active_jobs/28.1/error: No such file or directory
> 05/12/2008 16:58:09|execd|winsge202|E|can't open pid file "active_jobs/28.1/pid" for job 28.1
> 05/12/2008 16:58:09|execd|winsge202|E|can't open usage file "active_jobs/28.1/usage" for job 28.1: No such file or directory
> 05/12/2008 16:58:09|execd|winsge202|E|shepherd exited with exit status 127
> 05/12/2008 16:58:32|execd|winsge202|E|shepherd of job 29.1 exited with exit status = 127
> 05/12/2008 16:58:32|execd|winsge202|W|reaping job "29" ptf complains: Job does not exist
> 05/12/2008 16:58:32|execd|winsge202|E|abnormal termination of shepherd for job 29.1: no "exit_status" file
> 05/12/2008 16:58:32|execd|winsge202|E|can't open file active_jobs/29.1/error: No such file or directory
> 05/12/2008 16:58:32|execd|winsge202|E|can't open pid file "active_jobs/29.1/pid" for job 29.1
> 05/12/2008 16:58:32|execd|winsge202|E|can't open usage file "active_jobs/29.1/usage" for job 29.1: No such file or directory
> 05/12/2008 16:58:32|execd|winsge202|E|shepherd exited with exit status 127
>
>
> I have gotten it to work before; sometimes it works and sometimes it 
> does not.
>
> The other strange thing is that the installer does not always ask me 
> whether I want to install the rc scripts when I install the Windows 
> client.
>
>
> I appreciate any pointers or troubleshooting steps.
>
> Eric Wu
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
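
The qmaster and execd messages quoted in this thread use a 
pipe-delimited layout: timestamp, daemon, host, severity letter, then 
free text.  When a messages file gets long, a small script can summarise 
it.  Below is a minimal Python sketch that splits such lines and counts 
them by severity.  It assumes each record sits on a single line, as it 
does in the messages file itself, and the function names are 
hypothetical helpers, not part of Grid Engine.

```python
from collections import Counter


def parse_message(line):
    """Split one messages-file line into its five pipe-delimited fields.

    Returns None for lines that do not have the expected layout.
    """
    # maxsplit=4 keeps any '|' inside the free-text part intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, text = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "text": text,
    }


def count_levels(lines):
    """Count well-formed records by severity letter (for example E or W)."""
    counts = Counter()
    for line in lines:
        record = parse_message(line)
        if record is not None:
            counts[record["level"]] += 1
    return counts
```

Passing an open messages file to count_levels gives a quick picture of 
how many errors and warnings a host has logged.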


