[GE users] Can't get SGE 6.1u5 to work on Linux/PPC64

Ron Chen ron_chen_123 at yahoo.com
Fri Sep 26 03:16:59 BST 2008


SGE is designed to handle clusters that mix architectures and operating systems. So a 32-bit PPC sge_execd should work fine, and the same goes for the client commands: a 32-bit qsub should be able to submit jobs to a 64-bit qmaster machine.
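
One general reason 32-bit and 64-bit binaries can talk to each other is that values sent over the network are encoded with a fixed width and byte order instead of the host's native word size. The following is a minimal C sketch of that technique, assuming a POSIX system. pack_u32 and unpack_u32 are hypothetical helper names used only for illustration; they are not SGE functions and this is not taken from SGE's own packing code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */

/* Hypothetical helper (not an SGE function): store a 32-bit value in buf
 * in network byte order. It always writes exactly 4 bytes, whether the
 * program was compiled as 32-bit or 64-bit. Returns the bytes written. */
static size_t pack_u32(unsigned char *buf, uint32_t value)
{
    uint32_t wire = htonl(value);      /* host order -> big-endian */
    memcpy(buf, &wire, sizeof wire);   /* avoids alignment assumptions */
    return sizeof wire;
}

/* Hypothetical helper: read back a value written by pack_u32. */
static uint32_t unpack_u32(const unsigned char *buf)
{
    uint32_t wire;
    memcpy(&wire, buf, sizeof wire);
    return ntohl(wire);                /* big-endian -> host order */
}
```

Because both sides agree on 4 bytes in big-endian order, a value packed by a 32-bit client unpacks to the same number on a 64-bit server.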

A note: if you only want to test, compile a 32-bit execd and replace the existing 64-bit binary with it. Starting over with a fresh install of the client side is more work than the test needs.

On the other hand, a 32-bit qstat should give you the status of the cluster too. (BTW, when you run qstat on the PPC64 nodes, can you get any output?)

 -Ron


--- On Fri, 9/26/08, Nick Tan <nick at wehi.EDU.AU> wrote:
> I will try compiling 32-bit binaries and see how it goes.  Will it
> matter if the PPC64 nodes use 32-bit binaries and the x86_64 nodes use
> 64-bit binaries?
> 
> I might also try using wireshark to sniff the traffic between the node
> and the qmaster to try and figure out if there's something not right
> there too.
> 
> I don't know of any public PPC64 servers online, sorry.
> 
> Nick
> 
> Ron Chen wrote:
> > Then there are 2 possibilities:
> > 
> > 1.) There really is a communication error (usually due to setup of
> > hostname resolution) from the execution hosts to the qmaster.
> > 
> > 2.) There is still a bug in the 64-bit code, as 32-bit worked fine before:
> > http://gridengine.sunsource.net/servlets/BrowseList?list=dev&by=thread&from=2151
> > 
> > As a hack, you can change the arch script to make it think that it's
> > executing on a 32-bit machine. Then, aimk will compile SGE in pure 32-bit.
> > 
> > BTW, do you know if there are any public PPC64 compile farms or servers
> > available online? If I have time, I may be able to test the PPC64 Linux port.
> > 
> >  -Ron
> > 
> > 
> > --- On Fri, 9/26/08, Nick Tan <nick at wehi.EDU.AU> wrote:
> >> I've done as you suggested and recompiled but I am seeing the same
> >> behaviour as before.
> >>
> >> Nick
> >>
> >> Ron Chen wrote:
> >>> Then it really looks like a communication problem. qhost is really
> >>> basic (with no complex settings or other kinds of setup needed).
> >>>
> >>> As you mentioned that TARGET_64BIT is defined, I grepped the source and
> >>> found that there is a case for the LINUXAMD64 macro but not
> >>> TARGET_64BIT. I am wondering if it is right or not, as AMD64 is also
> >>> 64-bit?
> >>>
> >>> So, one last thing that I can think of right now is in
> >>> common/basis_types.h:
> >>>
> >>> #if defined(FREEBSD) || defined(NETBSD) || defined(LINUXAMD64)
> >>> #  define sge_U32CFormat "%u"
> >>> #  define sge_U32CLetter "u"
> >>> #  define sge_u32c(x)  (unsigned int)(x)
> >>>
> >>> #  define sge_X32CFormat "%x"
> >>> #  define sge_x32c(x)  (unsigned int)(x)
> >>> #else
> >>> ...
> >>> ...
> >>>
> >>> In the code, add a case for "TARGET_64BIT", like:
> >>>
> >>> #if defined(FREEBSD) || defined(NETBSD) || defined(LINUXAMD64) || defined(TARGET_64BIT)
> >>>
> >>> Do an "aimk clean" (since it is a header file, the dependency may not
> >>> be able to detect that) and recompile everything.
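
The point of pairing a format macro with a cast macro, as in the basis_types.h excerpt, is that the printf conversion must match the width of the argument actually passed. The following is a minimal sketch of that idea, not SGE code: demo_u32, DEMO_U32_FMT, demo_u32c and format_u32 are hypothetical names, the GCC/Clang predefined macro __LP64__ stands in for the build's 64-bit switch, and the 32-bit branch is an assumption for illustration because the header's #else branch is elided in the quoted excerpt.

```c
#include <stdio.h>
#include <stddef.h>

/* Pick a 32-bit-capable type plus a matching printf format and cast.
 * On LP64 targets unsigned long is 64 bits wide, so this sketch uses
 * unsigned int with "%u" there. The #else branch is an assumption made
 * for illustration only. */
#if defined(__LP64__)
typedef unsigned int demo_u32;
#  define DEMO_U32_FMT "%u"
#  define demo_u32c(x) ((unsigned int)(x))
#else
typedef unsigned long demo_u32;
#  define DEMO_U32_FMT "%lu"
#  define demo_u32c(x) ((unsigned long)(x))
#endif

/* Format value into out; returns what snprintf returns. The cast macro
 * guarantees the argument type agrees with the format string. */
static int format_u32(char *out, size_t size, demo_u32 value)
{
    return snprintf(out, size, DEMO_U32_FMT, demo_u32c(value));
}
```

If the format and the argument width disagree on some platform, the behaviour of printf-style calls is undefined, which is one way a header like this could produce wrong values only in a 64-bit build.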
> >>>  -Ron
> >>>
> >>>
> >>> --- On Fri, 9/26/08, Nick Tan <nick at wehi.EDU.AU> wrote:
> >>>> doing qhost shows:
> >>>>
> >>>> bionode01               lx24-amd64      8  0.00    7.8G  122.9M    2.0G      0.0
> >>>> bionode34               -               -     -       -       -       -        -
> >>>>
> >>>> where bionode01 is an x86_64 node which is working and bionode34 is
> >>>> a ppc64 node which isn't working.
> >>>>
> >>>> Nick
> >>>>
> >>>> Rayson Ho wrote:
> >>>>> On 9/25/08, Nick Tan <nick at wehi.edu.au> wrote:
> >>>>>> It looks like it can collect the data so would that indicate a
> >>>>>> communication error then?
> >>>>> What does qhost show??
> >>>>>
> >>>>> Rayson
> >>>>>
> >>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Nick
> >>>>>>
> >>>>>>
> >>>>>> Chris Dagdigian wrote:
> >>>>>>> Hi Nick,
> >>>>>>>
> >>>>>>> I'm guessing that maybe the PDC part of SGE on your ppc systems is
> >>>>>>> unable to poll the apple nodes to get load and state status.
> >>>>>>>
> >>>>>>> Can you try the following?
> >>>>>>>
> >>>>>>> Run the utilbin/loadcheck program on your PPC systems and see what
> >>>>>>> comes back?
> >>>>>>>
> >>>>>>> Running it on my OS X intel macbook pro returns:
> >>>>>>>
> >>>>>>>> $ /opt/sge/utilbin/darwin-x86/loadcheck
> >>>>>>>> arch            darwin-x86
> >>>>>>>> num_proc        2
> >>>>>>>> load_short      1.35
> >>>>>>>> load_medium     1.37
> >>>>>>>> load_long       1.39
> >>>>>>>> mem_free        2044.082031M
> >>>>>>>> swap_free       0.000000M
> >>>>>>>> virtual_free    2044.082031M
> >>>>>>>> mem_total       4096.000000M
> >>>>>>>> swap_total      0.000000M
> >>>>>>>> virtual_total   4096.000000M
> >>>>>>>> mem_used        2051.917969M
> >>>>>>>> swap_used       0.000000M
> >>>>>>>> virtual_used    2051.917969M
> >>>>>>>> cpu             45.5%
> >>>>>>>>
> >>>>>>> If you can't find the equiv for your PPC/Linux setup then I think
> >>>>>>> that may be the issue (SGE is running but can't collect local
> >>>>>>> performance data)
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Chris
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sep 25, 2008, at 2:26 AM, Nick Tan wrote:
> >>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> I am setting up a cluster with 33 nodes running Linux on x86_64
> >>>>>>>> (SunFire X2100) and 40 nodes running Linux on ppc64 (Apple Xserve
> >>>>>>>> G5 cluster node).
> >>>>>>>>
> >>>>>>>> I am using the precompiled SGE binaries for the x86_64 nodes which
> >>>>>>>> are working fine.  I have compiled SGE for the PPC64 nodes.  The
> >>>>>>>> x86_64 nodes are running CentOS 5.2 and the PPC64 nodes are running
> >>>>>>>> Fedora 9.
> >>>>>>>>
> >>>>>>>> sge_execd starts on the ppc64 node but I get this in the
> >>>>>>>> "qstat -f -explain a" output
> >>>>>>>>
> >>>>>>>> all.q at bionode34.biocluster     BIP   0/1       -NA-     -NA-          a
> >>>>>>>>       error: no complex attribute for threshold np_load_avg
> >>>>>>>>
> >>>>>>>> What can I do to fix this?  I've searched the mailing list archives
> >>>>>>>> but couldn't find anything so I'm hoping someone will be able to help.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Nick
> >>>>>>>>
> >>>>>>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>>>> --
> >>>>>> Nick Tan
> >>>>>> Unix Systems Manager
> >>>>>> The Walter and Eliza Hall
> Institute
> >>>>>> nick at wehi.edu.au
> >>>>>>
> >>>>>>
> >>>>>>
> >>>> -- 
> >>>> Nick Tan
> >>>> Unix Systems Manager
> >>>> The Walter and Eliza Hall Institute
> >>>> nick at wehi.edu.au
> >>>>
> >>>>
> >>>
> >>>       
> >>>
> >>>
> >> -- 
> >> Nick Tan
> >> Unix Systems Manager
> >> The Walter and Eliza Hall Institute
> >> nick at wehi.edu.au
> >>
> >>
> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail:
> >> users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail:
> >> users-help at gridengine.sunsource.net
> > 
> > 
> >       
> > 
> > 
> 
> -- 
> Nick Tan
> Unix Systems Manager
> The Walter and Eliza Hall Institute
> nick at wehi.edu.au
> 


      




