[GE users] SGE 6.2: jobs queued indefinitely

Bart Willems b-willems at northwestern.edu
Fri Sep 26 17:19:02 BST 2008



Hi Lubos,

Sorry for the delay; new GPU nodes for the cluster have been a
distraction :-)

I managed to get the execd running on a node, but qmaster fails to start
on the frontend. I also tried redoing the upgrade after renaming all the
lx26-amd64 directories (renaming should be sufficient, right?), but then
the install fails when it tries to start qmaster.

This is the output of env | grep -i sge:

SGE_CELL=default
SGE_ARCH=lx24-amd64
SGE_EXECD_PORT=537
SGE_QMASTER_PORT=536
SGE_ROOT=/opt/gridengine
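
Since the issue revolves around which architecture directories are present, here is a minimal sketch of a check that the bin and utilbin subdirectories for a given architecture exist under an SGE root. The helper name check_arch_dirs is hypothetical (it is not part of Grid Engine), and the layout SGE_ROOT/bin/ARCH and SGE_ROOT/utilbin/ARCH is assumed from the discussion quoted below.

```shell
#!/bin/sh
# Sketch only: check_arch_dirs is a hypothetical helper, not an SGE command.
# It reports whether <root>/bin/<arch> and <root>/utilbin/<arch> exist and
# returns non-zero if either directory is missing.
check_arch_dirs() {
    root=$1
    arch=$2
    status=0
    for sub in bin utilbin; do
        if [ -d "$root/$sub/$arch" ]; then
            echo "found   $root/$sub/$arch"
        else
            echo "missing $root/$sub/$arch"
            status=1
        fi
    done
    return $status
}

# Example call with the values shown above (run on each host to compare):
# check_arch_dirs "${SGE_ROOT:-/opt/gridengine}" "${SGE_ARCH:-lx24-amd64}"
```

Running it once with lx24-amd64 and once with lx26-amd64 on the frontend and on a compute node would show which hosts still carry only the old directories.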

I am also attaching the relevant part of
/opt/gridengine/default/spool/qmaster/messages

Any suggestions?

Thanks,
Bart

> On 09/23/08 19:45, Bart Willems wrote:
>> Hi Lubos,
>>
>> one more question. Our compute nodes still only have the lx26-amd64
>> directory, not lx24-amd64. Does this mean I need to install SGE 6.2 on
>> all nodes separately?
>>
> Yes. This means that you most likely use local binaries. These were
> unfortunately not overwritten by the upgrade procedure. I'll check and
> improve documentation.
>
> What you need to do:
> Shut down all execds. Remove the lx26-amd64 architecture directory and
> copy the new lx24-amd64 one to each execd host. This most likely applies
> to both the bin and utilbin directories.
> Depending on your setup (what is local and what is shared):
> Copy the whole SGE_ROOT, except for the SGE_CELL directory if SGE_CELL
> is shared.
> If even SGE_CELL is local, then the whole SGE_ROOT is needed, as the
> new bootstrap file exists only in SGE_CELL/common/bootstrap on the
> master host.
>
> Once all hosts have lx24-amd64 and can access the new bootstrap file,
> you may start the execds and all commands will now work.
>
> What exactly happened:
> Your environment uses only shared SGE_CELL.
> Upgrade started on master host. This upgraded only the master host's
> SGE_ROOT.
> (MISSING): You should've copied the new SGE_ROOT to each execd host
> Started the cluster.
>
> At that point qmaster was 6.2, but all execds and all clients on
> non-master hosts were still 6.1u4.
>
> As I already said, I'll improve the documentation regarding this case.
>
> Please let me know if your cluster finally works.
>
> Lubos.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>





