[GE users] SGE 6.2: jobs queued indefinitely

Bart Willems b-willems at northwestern.edu
Fri Sep 26 22:15:31 BST 2008



Hi Lubos,

I manually removed all running jobs from the spool directories and redid
the upgrade. That solved the problem. My queues are all listed as invalid,
but that's all fixable.

Thanks a lot for your help!

Cheers,
Bart

> Hi Lubos,
>
> Sorry for the delay, new GPU nodes for the cluster have been a distraction
> :-)
>
> I managed to get the execd running on a node, but qmaster fails to start
> on the frontend. I also tried to redo the upgrade after renaming all the
> lx26-amd64 directories (renaming should be sufficient, right?), but then
> the install fails when it tries to start qmaster.
>
> This is the output of env | grep -i sge:
>
> SGE_CELL=default
> SGE_ARCH=lx24-amd64
> SGE_EXECD_PORT=537
> SGE_QMASTER_PORT=536
> SGE_ROOT=/opt/gridengine
>
> I am also attaching the relevant part of
> /opt/gridengine/default/spool/qmaster/messages
>
> Any suggestions?
>
> Thanks,
> Bart
>
>> On 09/23/08 19:45, Bart Willems wrote:
>>> Hi Lubos,
>>>
>>> One more question. Our compute nodes still only have the lx26-amd64
>>> directory, not lx24-amd64. Does this mean I need to install SGE 6.2 on
>>> all nodes separately?
>>>
>> Yes. This means that you most likely use local binaries. These were
>> unfortunately not overwritten by the upgrade procedure. I'll check and
>> improve documentation.
>>
>> What you need to do:
>> Shut down all execds. Remove the old lx26-amd64 architecture directories
>> and copy the new lx24-amd64 ones to each execd host. This most likely
>> applies to both the bin and utilbin directories.
>> Depending on your setup (what is local and what is shared):
>> Copy the whole SGE_ROOT, except for the SGE_CELL directory, if SGE_CELL
>> is shared.
>> If even SGE_CELL is local, then the whole SGE_ROOT is needed, as the
>> new bootstrap file exists only in SGE_CELL/common/bootstrap on the
>> master host.
>>
>> Once all hosts have lx24-amd64 and can access the new bootstrap file,
>> you may start the execds and all commands will now work.
>>
>> What exactly happened:
>> Your environment uses only shared SGE_CELL.
>> Upgrade started on master host. This upgraded only the master host's
>> SGE_ROOT.
>> (MISSING): You should've copied the new SGE_ROOT to each execd host
>> Started the cluster.
>>
>> Now qmaster was 6.2, but all execds and all clients on non-master hosts
>> were still 6.1u4.
>>
>> As I already said, I'll improve the documentation regarding this case.
>>
>> Please let me know if your cluster finally works.
>>
>> Lubos.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
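
For anyone hitting the same mixed-version state: the check implied by the
steps quoted above (every host should have only the new architecture
directory under both bin and utilbin) can be sketched as a small shell
helper. This is only a rough sketch, not part of Grid Engine. The function
names sge_arch_dirs and sge_check_arch are made up for illustration, and it
assumes that bin/ and utilbin/ contain nothing but architecture
subdirectories.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not shipped with Grid Engine.
# Assumption: every subdirectory of bin/ and utilbin/ is an
# architecture directory such as lx24-amd64 or lx26-amd64.

# Print "<subdir>/<arch>" for every architecture directory found
# under bin/ and utilbin/ of the given SGE root.
sge_arch_dirs() {
    root=$1
    for sub in bin utilbin; do
        for dir in "$root/$sub"/*/; do
            # An unmatched glob stays literal; skip it.
            [ -d "$dir" ] || continue
            printf '%s/%s\n' "$sub" "$(basename "$dir")"
        done
    done
}

# Return 0 if bin/ and utilbin/ both contain the expected
# architecture and no other one; otherwise print what is missing
# or stale and return 1.
sge_check_arch() {
    root=$1
    expected=$2
    status=0
    for sub in bin utilbin; do
        if [ ! -d "$root/$sub/$expected" ]; then
            echo "missing: $sub/$expected"
            status=1
        fi
    done
    for entry in $(sge_arch_dirs "$root"); do
        case $entry in
            */"$expected") ;;
            *) echo "stale: $entry"; status=1 ;;
        esac
    done
    return $status
}
```

On an execd host this would be called as
sge_check_arch "$SGE_ROOT" "$SGE_ARCH" with the environment shown earlier
in the thread; any "stale:" line points at a leftover directory such as
bin/lx26-amd64 that the upgrade did not replace.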


