[GE users] Can't get SGE 6.1u5 to work on Linux/PPC64

Nick Tan nick at wehi.EDU.AU
Fri Sep 26 03:22:34 BST 2008


Thanks for the heads up.  I'll recompile sge_execd as 32-bit first, though 
probably not today.

qstat on the PPC64 nodes gives the same output as the qmaster and the 
x86_64 nodes.

Nick

Ron Chen wrote:
> SGE is supposed to be able to handle clusters with different architectures, with different operating systems. So a 32-bit ppc sge_execd should work fine, and same for client commands: a 32-bit qsub should be able to submit jobs to 64-bit qmaster machines.
> 
> Just a note: if you only want to test, compile a 32-bit execd and then replace the existing 64-bit binary. A fresh install of the client side is a bit too much work.
> 
> On the other hand, a 32-bit qstat should give you the status of the cluster too. (BTW, when you run qstat on the PPC64 nodes, can you get any output?)
> 
>  -Ron
> 
> 
> --- On Fri, 9/26/08, Nick Tan <nick at wehi.EDU.AU> wrote:
>> I will try compiling 32-bit binaries and see how it goes.  Will it
>> matter if the PPC64 nodes use 32-bit binaries and the x86_64 nodes use
>> 64-bit binaries?
>>
>> I might also try using wireshark to sniff the traffic between the node
>> and the qmaster to try and figure out if there's something not right
>> there too.
>>
>> I don't know of any public PPC64 servers online, sorry.
>>
>> Nick
>>
>> Ron Chen wrote:
>>> Then there are 2 possibilities:
>>>
>>> 1.) There really is a communication error (usually due to setup of hostname resolution) from the execution hosts to the qmaster.
>>>
>>> 2.) There is still a bug in the 64-bit code, as 32-bit worked fine before:
>>> http://gridengine.sunsource.net/servlets/BrowseList?list=dev&by=thread&from=2151
>>>
>>> As a hack, you can change the arch script to make it think that it's executing on a 32-bit machine. Then, aimk will compile SGE in pure 32-bit.
>>>
>>> BTW, do you know if there are any public PPC64 compile farms or servers available online? If I have time, I may be able to test the PPC64 Linux port.
>>>
>>>  -Ron
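Possibility 1.) above (hostname resolution) can be checked outside SGE with a small round-trip test: resolve the name to an address, resolve the address back to a name, and see whether the two agree. The following is only a minimal sketch in C under stated assumptions. The function names demo_same_short_name and demo_roundtrip_ok are made up for illustration and are not part of SGE, and comparing only the part before the first dot is a simplification that may not match how SGE itself compares host names.

```c
#define _POSIX_C_SOURCE 200112L
#include <ctype.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: compare two host names case-insensitively up to the
   first dot, so "bionode34" and "bionode34.biocluster" count as the same. */
int demo_same_short_name(const char *a, const char *b)
{
    size_t la = strcspn(a, ".");
    size_t lb = strcspn(b, ".");
    size_t i;

    if (la != lb)
        return 0;
    for (i = 0; i < la; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical helper: forward-resolve name, reverse-resolve the first
   address returned, and compare the result with the original name.
   Returns 1 if they agree, 0 if they differ, -1 if either lookup fails. */
int demo_roundtrip_ok(const char *name)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    char back[1025];
    int rc;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(name, NULL, &hints, &res) != 0 || res == NULL)
        return -1;

    /* NI_NAMEREQD makes the reverse lookup fail instead of returning a
       numeric address when no name is registered for it. */
    rc = getnameinfo(res->ai_addr, res->ai_addrlen, back, sizeof back,
                     NULL, 0, NI_NAMEREQD);
    freeaddrinfo(res);
    if (rc != 0)
        return -1;

    return demo_same_short_name(name, back);
}
```

Calling demo_roundtrip_ok() with the qmaster's name on an execution host, and with the execution host's name on the qmaster, would show whether both sides see consistent names. A result of 0 or -1 points at the resolver setup rather than at SGE.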
>>>
>>>
>>> --- On Fri, 9/26/08, Nick Tan <nick at wehi.EDU.AU> wrote:
>>>> I've done as you suggested and recompiled but I am seeing the same
>>>> behaviour as before.
>>>>
>>>> Nick
>>>>
>>>> Ron Chen wrote:
>>>>> Then it really looks like a communication problem. qhost is really basic (with no complex settings or other kinds of setup needed).
>>>>>
>>>>> As you mentioned that TARGET_64BIT is defined, I grepped the source and found that there is a case for the LINUXAMD64 macro but not TARGET_64BIT. I am wondering if it is right or not, as AMD64 is also 64-bit?
>>>>>
>>>>> So, one last thing that I can think of right now is in common/basis_types.h:
>>>>>
>>>>> #if defined(FREEBSD) || defined(NETBSD) || defined(LINUXAMD64)
>>>>> #  define sge_U32CFormat "%u"
>>>>> #  define sge_U32CLetter "u"
>>>>> #  define sge_u32c(x)  (unsigned int)(x)
>>>>>
>>>>> #  define sge_X32CFormat "%x"
>>>>> #  define sge_x32c(x)  (unsigned int)(x)
>>>>> #else
>>>>> ...
>>>>> ...
>>>>>
>>>>> In the code, add a case for "TARGET_64BIT", like:
>>>>>
>>>>> #if defined(FREEBSD) || defined(NETBSD) || defined(LINUXAMD64) || defined(TARGET_64BIT)
>>>>>
>>>>> Do an "aimk clean" (since it is a header file, the dependency may not be able to detect that) and recompile everything.
>>>>>  -Ron
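The idea behind pairing a "%u" format with an (unsigned int) cast, as in the basis_types.h fragment quoted above, is that the argument handed to printf always has the type the format expects, whatever the width of the caller's integer type. The following is a minimal sketch of that pattern only. DEMO_U32_FORMAT, demo_u32c and demo_format_u32 are hypothetical stand-ins and not SGE's definitions, and the sketch assumes unsigned int is 32 bits wide. Whether a format mismatch is what actually breaks the ppc64 build is not established in this thread.

```c
#include <stdio.h>

/* Hypothetical stand-ins for a format/cast macro pair; illustrative names,
   not taken from the SGE source. */
#define DEMO_U32_FORMAT "%u"
#define demo_u32c(x) ((unsigned int)(x))

/* Format a 32-bit counter held in an unsigned long into buf and return the
   number of characters written, as snprintf reports it. The cast makes the
   argument type match "%u" on both 32-bit and 64-bit targets; passing the
   unsigned long directly to "%u" would be a type mismatch where unsigned
   long is 64 bits wide. */
int demo_format_u32(char *buf, size_t len, unsigned long value)
{
    return snprintf(buf, len, DEMO_U32_FORMAT, demo_u32c(value));
}
```

With this pattern the formatted output is the same on a 32-bit and a 64-bit build, which is the property the extra defined(TARGET_64BIT) case is meant to secure.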
>>>>>
>>>>>
>>>>> --- On Fri, 9/26/08, Nick Tan <nick at wehi.EDU.AU> wrote:
>>>>>> doing qhost shows:
>>>>>>
>>>>>> bionode01               lx24-amd64      8  0.00    7.8G  122.9M    2.0G     0.0
>>>>>> bionode34               -               -     -       -       -       -       -
>>>>>>
>>>>>> where bionode01 is an x86_64 node which is working and bionode34 is
>>>>>> a ppc64 node which isn't working.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> Rayson Ho wrote:
>>>>>>> On 9/25/08, Nick Tan <nick at wehi.edu.au> wrote:
>>>>>>>> It looks like it can collect the data so would that indicate a
>>>>>>>> communication error then?
>>>>>>> What does qhost show??
>>>>>>>
>>>>>>> Rayson
>>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>>
>>>>>>>> Chris Dagdigian wrote:
>>>>>>>>> Hi Nick,
>>>>>>>>>
>>>>>>>>> I'm guessing that maybe the PDC part of SGE on your ppc systems is unable
>>>>>>>>> to poll the apple nodes to get load and state status.
>>>>>>>>>
>>>>>>>>> Can you try the following?
>>>>>>>>>
>>>>>>>>> Run the utilbin/loadcheck program on your PPC systems and see what comes
>>>>>>>>> back?
>>>>>>>>>
>>>>>>>>> Running it on my OS X intel macbook pro returns:
>>>>>>>>>
>>>>>>>>>> $ /opt/sge/utilbin/darwin-x86/loadcheck
>>>>>>>>>> arch            darwin-x86
>>>>>>>>>> num_proc        2
>>>>>>>>>> load_short      1.35
>>>>>>>>>> load_medium     1.37
>>>>>>>>>> load_long       1.39
>>>>>>>>>> mem_free        2044.082031M
>>>>>>>>>> swap_free       0.000000M
>>>>>>>>>> virtual_free    2044.082031M
>>>>>>>>>> mem_total       4096.000000M
>>>>>>>>>> swap_total      0.000000M
>>>>>>>>>> virtual_total   4096.000000M
>>>>>>>>>> mem_used        2051.917969M
>>>>>>>>>> swap_used       0.000000M
>>>>>>>>>> virtual_used    2051.917969M
>>>>>>>>>> cpu             45.5%
>>>>>>>>>>
>>>>>>>>> If you can't find the equiv for your PPC/Linux setup then I think that may
>>>>>>>>> be the issue (SGE is running but can't collect local performance data)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Chris
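The loadcheck output above can be turned into the value that the "np_load_avg" threshold refers to. As far as I know, np_load_avg is the medium-term load average divided by the number of processors, but treat that definition as an assumption here. The following is a minimal sketch in C that parses "name value" lines in the loadcheck layout shown above; demo_lookup and demo_np_load_avg are hypothetical names and not part of SGE.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan "name value" lines for one name and parse its
   value as a double. Returns 1 on success, 0 if the name is missing or the
   value is not numeric (for example "-"). */
int demo_lookup(const char *text, const char *name, double *out)
{
    size_t nlen = strlen(name);
    const char *p = text;

    while (*p != '\0') {
        size_t linelen = strcspn(p, "\n");

        /* Match only a whole field name followed by whitespace. */
        if (linelen > nlen && strncmp(p, name, nlen) == 0 &&
            (p[nlen] == ' ' || p[nlen] == '\t')) {
            return sscanf(p + nlen, "%lf", out) == 1;
        }
        p += linelen;
        if (*p == '\n')
            p++;
    }
    return 0;
}

/* Hypothetical helper: normalised load, assumed here to be load_medium
   divided by num_proc. Returns 1 and stores the result if both fields are
   present and num_proc is positive, otherwise returns 0. */
int demo_np_load_avg(const char *text, double *out)
{
    double load;
    double nproc;

    if (!demo_lookup(text, "load_medium", &load) ||
        !demo_lookup(text, "num_proc", &nproc) || nproc <= 0.0)
        return 0;

    *out = load / nproc;
    return 1;
}
```

Feeding it the loadcheck output from the ppc64 node would show whether the two input fields are present and numeric there. If they are, the node can collect its load data locally and the problem is more likely in getting the report to the qmaster.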
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sep 25, 2008, at 2:26 AM, Nick Tan wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I am setting up a cluster with 33 nodes running Linux on x86_64 (SunFire
>>>>>>>>>> X2100) and 40 nodes running Linux on ppc64 (Apple Xserve G5 cluster node).
>>>>>>>>>>
>>>>>>>>>> I am using the precompiled SGE binaries for the x86_64 nodes which are
>>>>>>>>>> working fine.  I have compiled SGE for the PPC64 nodes.  The x86_64 nodes
>>>>>>>>>> are running CentOS 5.2 and the PPC64 nodes are running Fedora 9.
>>>>>>>>>>
>>>>>>>>>> sge_execd starts on the ppc64 node but I get this in the "qstat -f
>>>>>>>>>> -explain a" output
>>>>>>>>>>
>>>>>>>>>> all.q at bionode34.biocluster     BIP   0/1       -NA-     -NA-          a
>>>>>>>>>>       error: no complex attribute for threshold np_load_avg
>>>>>>>>>>
>>>>>>>>>> What can I do to fix this?  I've searched the mailing list archives but
>>>>>>>>>> couldn't find anything so I'm hoping someone will be able to help.
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>>> --
>>>>>>>> Nick Tan
>>>>>>>> Unix Systems Manager
>>>>>>>> The Walter and Eliza Hall Institute
>>>>>>>> nick at wehi.edu.au
> 

-- 
Nick Tan
Unix Systems Manager
The Walter and Eliza Hall Institute
nick at wehi.edu.au
