[GE users] Help reg SGE 6.0 Globus 3.2 integration

Shuja Parvez shshgs01 at fht-esslingen.de
Wed Sep 29 11:14:37 BST 2004


I took an Ethereal dump of the messages going from node2 to node1 and
found something interesting. When the job fails to run on node2, it
returns the "can't open usage file" error, but just before that in the
dump the queue name looks very strange:
shuja@node2.cfd1.honda-ri.de.sgeadmin.sgeadmin <- ".sgeadmin.sgeadmin" is
extra!
My hostname is node2.cfd1.honda-ri.de and the queue name is shuja.
You can see this in the message dump below. Could this be the problem?
Any help would be greatly appreciated :)

0000  00 30 48 2a 94 80 00 30  48 2a a0 32 08 00 45 00   .0H*...0 H*.2..E.
0010  03 74 82 7e 40 00 40 06  8a c7 0a 0a 0b 16 0a 0a   .t.~@.@. ........
0020  0b 15 80 3c 1a 0a cb c4  c3 f0 c5 e0 59 78 80 18   ...<.... ....Yx..
0030  7c 70 22 df 00 00 01 01  08 0a 71 0a 90 b8 47 e5   |p"..... ..q...G.
0040  32 86 00 00 00 02 65 78  65 63 64 20 63 6f 6e 66   2.....ex ecd conf
0050  69 67 20 6c 69 73 74 20  63 6f 70 79 00 00 00 00   ig list  copy....
0060  01 00 00 00 03 00 00 06  72 00 00 27 0c 00 00 06   ........ r..'....
0070  73 00 00 20 03 00 00 06  74 00 00 20 09 00 00 00   s.. .... t.. ....
0080  02 00 00 00 03 04 67 6c  6f 62 61 6c 00 00 00 00   ......gl obal....
0090  03 00 00 00 00 00 00 00  02 00 00 00 03 04 6e 6f   ........ ......no
00a0  64 65 32 2e 63 66 64 31  2e 68 6f 6e 64 61 2d 72   de2.cfd1 .honda-r
00b0  69 2e 64 65 00 00 00 00  01 00 00 00 00 10 00 0f   i.de.... ........
00c0  ff 00 00 07 28 00 00 00  02 00 00 00 05 1f 00 00   ....(... ........
00d0  00 04 6e 6f 64 65 32 2e  63 66 64 31 2e 68 6f 6e   ..node2. cfd1.hon
00e0  64 61 2d 72 69 2e 64 65  00 00 00 00 01 00 00 00   da-ri.de ........
00f0  01 6c 69 63 65 6e 73 65  20 72 65 70 6f 72 74 20   .license  report
0100  6c 69 73 74 00 00 00 00  01 00 00 00 02 00 02 0f   list.... ........
0110  8a 00 00 00 03 00 02 0f  8b 00 00 00 08 00 00 00   ........ ........
0120  02 00 00 00 02 03 00 00  00 02 6c 78 32 34 2d 78   ........ ..lx24-x
0130  38 36 00 10 00 0f ff 00  00 07 28 00 00 00 02 00   86...... ..(.....
0140  00 00 05 1f 00 00 00 05  6e 6f 64 65 32 2e 63 66   ........ node2.cf
0150  64 31 2e 68 6f 6e 64 61  2d 72 69 2e 64 65 00 00   d1.honda -ri.de..
0160  00 00 01 00 00 00 01 6a  72 5f 6c 69 73 74 00 00   .......j r_list..
0170  00 00 01 00 00 00 10 00  02 0f 58 00 00 00 03 00   ........ ..X.....
0180  02 0f 59 00 00 00 03 00  02 0f 5a 00 00 00 08 00   ..Y..... ..Z.....
0190  02 0f 5b 00 00 00 0c 00  02 0f 5c 00 00 00 08 00   ..[..... ..\.....
01a0  02 0f 5d 00 00 00 08 00  02 0f 5e 00 00 00 03 00   ..]..... ..^.....
01b0  02 0f 5f 00 00 00 03 00  02 0f 60 00 00 00 03 00   .._..... ..`.....
01c0  02 0f 61 00 00 00 08 00  02 0f 62 00 00 00 09 00   ..a..... ..b.....
01d0  02 0f 63 00 00 00 03 00  02 0f 64 00 00 00 03 00   ..c..... ..d.....
01e0  02 0f 65 00 00 00 08 00  02 0f 66 00 00 00 08 00   ..e..... ..f.....
01f0  02 0f 67 00 00 00 03 00  00 00 02 00 00 00 10 ff   ..g..... ........
0200  57 00 00 00 d2 00 00 00  01 73 68 75 6a 61 40 6e   W....... .shuja@n
0210  6f 64 65 32 2e 63 66 64  31 2e 68 6f 6e 64 61 2d   ode2.cfd 1.honda-
0220  72 69 2e 64 65 00 6e 6f  64 65 32 2e 63 66 64 31   ri.de.no de2.cfd1
0230  2e 68 6f 6e 64 61 2d 72  69 2e 64 65 00 73 67 65   .honda-r i.de.sge
0240  61 64 6d 69 6e 00 73 67  65 61 64 6d 69 6e 00 00   admin.sg eadmin..
0250  00 10 00 00 00 00 1a 00  00 00 04 63 61 6e 27 74   ........ ...can't
0260  20 72 65 61 64 20 75 73  61 67 65 20 66 69 6c 65    read us age file
0270  20 66 6f 72 20 6a 6f 62  20 32 31 30 2e 31 0a 00    for job  210.1..
0280  00 00 00 01 00 00 00 08  75 73 61 67 65 6c 69 73   ........ usagelis
0290  74 00 00 00 00 01 00 00  00 02 00 00 09 60 00 01   t....... .....`..
02a0  07 08 00 00 09 61 00 01  00 02 00 00 00 02 00 00   .....a.. ........
02b0  00 02 01 69 6f 00 00 00  00 00 00 00 00 00 00 00   ...io... ........
02c0  00 02 00 00 00 02 01 69  6f 77 00 00 00 00 00 00   .......i ow......
02d0  00 00 00 00 00 00 02 00  00 00 02 01 6d 65 6d 00   ........ ....mem.
02e0  00 00 00 00 00 00 00 00  00 00 00 02 00 00 00 02   ........ ........
02f0  01 63 70 75 00 00 00 00  00 00 00 00 00 00 00 00   .cpu.... ........
0300  02 00 00 00 02 01 76 6d  65 6d 00 00 00 00 00 00   ......vm em......
0310  00 00 00 00 00 00 02 00  00 00 02 01 6d 61 78 76   ........ ....maxv
0320  6d 65 6d 00 00 00 00 00  00 00 00 00 00 00 00 02   mem..... ........
0330  00 00 00 02 03 73 75 62  6d 69 73 73 69 6f 6e 5f   .....sub mission_
0340  74 69 6d 65 00 41 d0 56  a3 48 00 00 00 00 00 00   time.A.V .H......
0350  02 00 00 00 02 01 70 72  69 6f 72 69 74 79 00 00   ......pr iority..
0360  00 00 00 00 00 00 00 00  00 00 00 00 00 00 01 00   ........ ........
0370  32 30 30 34 32 00 00 00  00 00 10 00 0f ff 00 00   20042... ........
0380  07 28                                              .(
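
One way to check whether ".sgeadmin.sgeadmin" is really part of the queue name is to look at the raw bytes instead of the ASCII column, because the dump prints every non-printable byte, including NUL (00), as a dot. The Python sketch below takes the bytes at offsets 0x209 to 0x24f, copied from the dump above, and splits them on NUL. The helper name ascii_column is made up for illustration, and the dump itself does not say what each field means in the SGE wire format.

```python
# Bytes 0x209-0x24f copied from the dump above (the leading 01 at 0x208 is left out).
HEX_ROWS = """
73 68 75 6a 61 40 6e
6f 64 65 32 2e 63 66 64 31 2e 68 6f 6e 64 61 2d
72 69 2e 64 65 00 6e 6f 64 65 32 2e 63 66 64 31
2e 68 6f 6e 64 61 2d 72 69 2e 64 65 00 73 67 65
61 64 6d 69 6e 00 73 67 65 61 64 6d 69 6e 00 00
"""

payload = bytes.fromhex("".join(HEX_ROWS.split()))


def ascii_column(data):
    """Render bytes the way a hex dump's ASCII column does: non-printables become '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in data)


# Split on NUL and drop the empty pieces left by the trailing 00 bytes.
fields = [part.decode("ascii") for part in payload.split(b"\x00") if part]

print(ascii_column(payload))
for field in fields:
    print(repr(field))
```

The split gives four separate strings: 'shuja@node2.cfd1.honda-ri.de', 'node2.cfd1.honda-ri.de', 'sgeadmin' and 'sgeadmin'. The queue name therefore ends at the first 00 byte, so the "extra" suffix may just be neighbouring fields run together by the display and not a corrupted queue name.
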
> Hmmm... I'm fresh out of clever ideas.  I see that you're on a 2.6
> kernel.  Are you using the symbolic link fix?
>
> Daniel
>
> Shuja Parvez wrote:
>
>>Yes, I can. I tried them. All worked. :(
>>
>>>If you log in as sgeadmin on node2 and source the settings file, can you
>>>execute the following successfully?
>>>
>>>% mkdir $SGE_ROOT/default/spool/node2/active_jobs/mytest
>>>% touch $SGE_ROOT/default/spool/node2/active_jobs/mytest/test
>>>% ls -R $SGE_ROOT/default/spool/node2/active_jobs/mytest
>>>
>>>Daniel
>>>
>>>Shuja Parvez wrote:
>>>
>>>>Hello.
>>>>Thank you for the reply.
>>>>
>>>>>I helped someone with the exact same symptoms last week.  His problem
>>>>>was that the sge_execd was not running as the right user.  Did you check
>>>>>that node2's execd is run as root or as the $SGE_ROOT directory owner?
>>>>>
>>>>sge_execd runs as user sgeadmin:
>>>>sgeadmin  4768     1  0 15:48 ?      00:00:00 /home/sgeadmin/bin/lx26-x86/sge_execd
>>>>
>>>>$SGE_ROOT is /home/sgeadmin, which is NFS-mounted from node1.
>>>>On node2 it looks like this:
>>>>drwxrwxrwx  20   1003 globus 4096 2004-09-28 16:33 sgeadmin
>>>>
>>>>I have the user sgeadmin on both machines, and sgeadmin belongs to the
>>>>group sgeadmin. But over NFS the group is shown as globus, which
>>>>confuses me too.
>>>>
>>>>This is how /home/sgeadmin looks on node1:
>>>>drwxrwxrwx  20 sgeadmin sgeadmin 4096 2004-09-28 16:33 sgeadmin
>>>>
>>>>>Another thought would be to check that the SGE user on node2 has
>>>>>permission to write to the $SGE_ROOT over NFS.
>>>>>
>>>>It does. I confirmed that.
>>>>
>>>>>If SGE is running as
>>>>>root, this can be an issue since root turns into nobody when it crosses
>>>>>NFS boundaries.
>>>>>
>>>>>Daniel
>>>>>
>>>>>Shuja Parvez wrote:
>>>>>
>>>>>>Hi,
>>>>>>I have 2 nodes:
>>>>>>node1: SGE master, which is also the Globus gatekeeper and an
>>>>>>execution host.
>>>>>>node2: execution host.
>>>>>>I installed SGE successfully and jobs run on queues on both machines.
>>>>>>But when I submit a job from Globus to Sun Grid Engine, the job goes
>>>>>>into the error state and I get the following messages in the spooler:
>>>>>>===8X message on node2===
>>>>>>09/28/2004 15:57:23|execd|node2|E|shepherd of job 172.1 exited with exit status = 26
>>>>>>09/28/2004 15:57:23|execd|node2|E|can't open usage file "active_jobs/172.1/usage" for job 172.1: No such file or directory
>>>>>>09/28/2004 15:57:23|execd|node2|E|"can't read usage file for job 172.1
>>>>>>===8X ===
>>>>>>===8X message on qmaster ===
>>>>>>09/28/2004 16:01:44|qmaster|node1|W|job 172.1 failed on host node2.cfd1.honda-ri.de general opening input/output file because: can't read usage file for job 172.1
>>>>>>09/28/2004 16:01:44|qmaster|node1|W|rescheduling job 172.1
>>>>>>===8X ===
>>>>>>The jobs always go into the error state, and when I clear the error
>>>>>>through qmon, the jobs are rescheduled on node1 and then continue.
>>>>>>
>>>>>>Could anyone please help me out with this?
>>>>>>Regards
>>>>>>
>>>>>--
>>>>>*******************************************************
>>>>>*          Daniel Templeton   ERGB01 x60220           *
>>>>>*         Staff Engineer, Sun N1 Grid Engine          *
>>>>>*******************************************************
>>>>>*    "Camera one closes in, the soundtrack starts,    *
>>>>>*     The scene begins.  You're playing you now."     *
>>>>>*                -Josh Joplin Group, "Camera One"     *
>>>>>*******************************************************
>>>>>
>>>>>
>>>>>
>>>>>---------------------------------------------------------------------
>>>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>
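
The two listings of /home/sgeadmin quoted above (1003/globus on node2, sgeadmin/sgeadmin on node1) would be consistent with the numeric UID and GID of sgeadmin differing between the two machines, since NFS with the traditional AUTH_SYS scheme passes numeric IDs and each host maps them to names from its own account database. The Python sketch below, run on each node, prints both the numeric IDs and the names they resolve to locally. The function name describe_owner is hypothetical, and falling back to the current directory when SGE_ROOT is unset is only an assumption to keep the example runnable.

```python
import grp
import os
import pwd


def describe_owner(path):
    """Return the numeric owner/group of path and the names they map to on this host."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        user = None  # no local account has this UID, so ls would print the number
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = None  # no local group has this GID
    return {"uid": st.st_uid, "user": user, "gid": st.st_gid, "group": group}


if __name__ == "__main__":
    # $SGE_ROOT is /home/sgeadmin in this thread; fall back to the current directory.
    target = os.environ.get("SGE_ROOT", os.getcwd())
    print(target, describe_owner(target))
```

If the uid and gid values printed on node1 and node2 are the same but the names differ, the account databases on the two hosts disagree. Comparing the output with `id sgeadmin` on each node would show whether that is the case here.
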


-- 
Shuja Parvez
Msc IT and Automation Systems (2003-2005),
FH Esslingen.
Residence Address: AschaffenburgerStrasse 120,
D-63073, Offenbach am Main, Germany
Email: shshgs01 at fht-esslingen.de
Handy: +49 176 700 395 00
