[GE users] Help reg SGE 6.0 Globus 3.2 integration

Shuja Parvez shshgs01 at fht-esslingen.de
Tue Sep 28 15:21:53 BST 2004



Hello.
Thank you for the reply.
> I helped someone with the exact same symptoms last week.  His problem
> was that the sge_execd was not running as the right user.  Did you check
> that node2's execd is run as root or as the $SGE_ROOT directory owner?
sge_execd runs as user sgeadmin
sgeadmin  4768     1  0 15:48 ?      00:00:00 /home/sgeadmin/bin/lx26-x86/sge_execd

$SGE_ROOT is /home/sgeadmin, which is NFS-mounted from node1.
On node2 it looks like this:
drwxrwxrwx  20   1003 globus 4096 2004-09-28 16:33 sgeadmin

I have the user sgeadmin on both machines, and sgeadmin belongs to the
group sgeadmin. But on the NFS mount the group is shown as globus, which
also confuses me.

This is how /home/sgeadmin looks on node1:
drwxrwxrwx  20 sgeadmin sgeadmin 4096 2004-09-28 16:33 sgeadmin
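
The numeric owner 1003 in the node2 listing suggests that the uid stored on
the export has no matching passwd entry on node2, and that the gid maps to a
different group name there (NFSv3 with the default AUTH_SYS security passes
numeric ids, not names). A minimal Python sketch for comparing this on both
nodes; the function name describe_owner is made up for illustration, and the
path to check is whatever $SGE_ROOT points to on each host:

```python
import grp
import os
import pwd


def describe_owner(path):
    """Return (uid, user_name, gid, group_name) for path.

    A name is None when the numeric id has no entry in the local
    passwd/group database, which is what `ls -l` shows as a bare number.
    """
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = None
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = None
    return st.st_uid, user, st.st_gid, group


# Example use (run on node1 and node2 and compare the results):
#   print(describe_owner("/home/sgeadmin"))
```

If the numeric uid and gid match on both nodes but the names differ, the
user and group ids for sgeadmin are probably not the same on the two machines.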

> Another thought would be to check that the SGE user on node2 has
> permission to write to the $SGE_ROOT over NFS.
It does. I confirmed that.
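
One way such a check can be repeated is sketched below in Python. This is only
an illustration, not necessarily how it was checked here: the function name
can_write is hypothetical, and the directory to test (for example the execd
spool directory that contains active_jobs) has to be filled in for the local
setup. It should be run on node2 as the same user that owns the sge_execd
process.

```python
import tempfile


def can_write(directory):
    """Return True if a scratch file can be created (and removed) in directory."""
    try:
        # The file is deleted automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="sge_write_test_"):
            pass
    except OSError:
        # Covers permission denied, missing directory, read-only mount, etc.
        return False
    return True


# Example use:
#   print(can_write("/home/sgeadmin"))
```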

> If SGE is running as
> root, this can be an issue since root turns into nobody when it crosses
> NFS boundaries.
>
> Daniel
>
> Shuja Parvez wrote:
>
>>Hi
>>I have 2 nodes:
>>node1: SGE master, which is also the Globus gatekeeper and an
>>execution host.
>>node2: execution host.
>>I installed SGE successfully and the jobs run on queues on both machines.
>>But when I submit a job from Globus to the Sun Grid Engine, the job goes
>>into the error state and I have the following message in the spooler:
>>===8X message on node2===
>>09/28/2004 15:57:23|execd|node2|E|shepherd of job 172.1 exited with exit status = 26
>>09/28/2004 15:57:23|execd|node2|E|can't open usage file "active_jobs/172.1/usage" for job 172.1: No such file or directory
>>09/28/2004 15:57:23|execd|node2|E|"can't read usage file for job 172.1
>>===8X ===
>>===8X message on qmaster ===
>>09/28/2004 16:01:44|qmaster|node1|W|job 172.1 failed on host node2.cfd1.honda-ri.de general opening input/output file because: can't read usage file for job 172.1
>>09/28/2004 16:01:44|qmaster|node1|W|rescheduling job 172.1
>>===8X ===
>>The jobs always go into the error state, and when I clear the error
>>through qmon, the jobs are rescheduled on node1 and then they continue.
>>
>>Could anyone please help me with this?
>>Regards
>>
>>
>>
>>
>
> --
> *******************************************************
> *          Daniel Templeton   ERGB01 x60220           *
> *         Staff Engineer, Sun N1 Grid Engine          *
> *******************************************************
> *    "Camera one closes in, the soundtrack starts,    *
> *     The scene begins.  You're playing you now."     *
> *                -Josh Joplin Group, "Camera One"     *
> *******************************************************
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>


-- 
Shuja Parvez
Msc IT and Automation Systems (2003-2005),
FH Esslingen.
Residence Address: AschaffenburgerStrasse 120,
D-63073, Offenbach am Main, Germany
Email: shshgs01 at fht-esslingen.de
Handy: +49 176 700 395 00
