[GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being submitted, going into 't' state, then disappearing")

Richard Hobbs richard.hobbs at crl.toshiba.co.uk
Wed Dec 5 09:41:08 GMT 2007


Hello,

Let's combine these two email threads, as Neil Baker and I are working
on the same grid, and our separate problems have now merged into one!

Basically, the jobs enter the queue ("qw"), then go into state "t" as
you would expect. The jobs then get executed, kind of.

For example, if I submit "sleep 100" it works perfectly. The job runs,
and it sits there in state "r" for around 100 seconds. I still get no
stderr and stdout files though.

However, if I submit a script I've got that generates fractal planet
images, the job disappears immediately after state "t".

I have a thought though - we are seeing no stdout and stderr files being
generated, and apart from qacct showing the job existing, nothing else
gets logged.

I have also run "find $SGE_ROOT -type f | xargs grep <jobID>" and it
only returns the "accounting" file.
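
Since the accounting file is the only place the job shows up, its "failed"
and "exit_status" fields are worth reading directly. Below is a minimal
sketch, assuming the colon-separated field order described in the
accounting(5) man page (qname, hostname, group, owner, job_name, job_number,
account, priority, submission_time, start_time, end_time, failed,
exit_status, ...); please check that against the man page for your release.
The function name find_job_records and the sample line are made up for
illustration.

```python
# Field order assumed from accounting(5); verify it for your SGE release.
FIELDS = ["qname", "hostname", "group", "owner", "job_name", "job_number",
          "account", "priority", "submission_time", "start_time", "end_time",
          "failed", "exit_status"]


def find_job_records(lines, job_id):
    """Return one dict per accounting record whose job_number matches job_id."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(":")
        if len(parts) < len(FIELDS):
            continue  # too short to be a record in the assumed layout
        record = dict(zip(FIELDS, parts))
        if record["job_number"] == str(job_id):
            records.append(record)
    return records


# Made-up sample line, not real output from this cluster.
sample = ["node01.q:node01:users:richard:planet.sh:1234:sge:0:"
          "1196847000:1196847005:1196847006:0:1:1"]

for rec in find_job_records(sample, 1234):
    print(rec["hostname"], "failed =", rec["failed"],
          "exit_status =", rec["exit_status"])
```

With the real file, pass an open file handle for the accounting file that
the find command located instead of the sample list. A non-zero "failed" or
"exit_status" value would at least say whether the job was started at all.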

So, onto my question:

Do the stderr and stdout files get generated by the exec host, or the
qmaster? If it's the qmaster then that may explain the problem - we have
not put our qmaster into the same automount setup as the rest of the
network, and as a result it cannot see any of our network drives,
including people's home directories etc... Therefore, the qmaster itself
is unable to write any stderr and stdout files to the locations it needs to.

Could this be the cause?
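
One way to test the theory from either side is to run the same small check
on the qmaster and on an exec host and compare the results. This is only a
sketch, assuming Python is available on both machines; can_write_to is a
made-up helper name, and the directories listed are placeholders for
wherever the job's output is supposed to go.

```python
import os
import socket
import tempfile


def can_write_to(directory):
    """Return (True, "") if a scratch file can be created in directory,
    otherwise (False, reason)."""
    if not os.path.isdir(directory):
        return False, "not a directory or not mounted"
    try:
        # Create a scratch file; it is removed again when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
    except OSError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    # Placeholder list: substitute the job's real output locations.
    for path in [os.path.expanduser("~"), tempfile.gettempdir()]:
        ok, reason = can_write_to(path)
        status = "writable" if ok else "NOT writable: " + reason
        print(socket.gethostname(), path, status)
```

It should be run as the user who submits the jobs, since permissions matter
as much as mounts here.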

The planet generation script basically generates an image (which takes
around 90 seconds), writes it to a network location, prints some stdout,
and then does it 4 more times.

We are not seeing the stdout, stderr *or* the generated planet images,
but I suspect the planets aren't even being generated because the script
is being stopped before that point due to the lack of a stdout/stderr
channel. Does anyone else agree?

Thanks again, people!

Richard.


Neil Baker wrote:
> Just the qmaster.  What we've actually tried today is to copy the $SGE_ROOT
> directory from the Redhat 8 box over to Solaris 10 and install just the
> Solaris 10 binaries rather than the common files first.  This way the
> $SGE_ROOT directory contains binaries for Solaris 10 (qmaster) and Linux x86
> (Exec hosts).
> 
> After a day's work we've got it starting up and even have an exec host
> starting up using the same $SGE_ROOT (over nfs) using the Linux x86
> binaries.  qstat works against this exec host and the old Redhat 8 exec
> hosts that were left over in the original configuration files.
> 
> However, when we submit jobs to the new exec host (using the current
> $SGE_ROOT over nfs) it transfers the jobs, but the jobs don't appear to run
> properly and we don't get any of the stdout or stderr recorded as we
> normally would.  Has anyone experienced this problem, or are we experiencing
> the Solaris 10 incompatibility?
> 
> Regards
> 
> Neil
> 
> -----Original Message-----
> From: Rayson Ho [mailto:rayrayson at gmail.com] 
> Sent: 04 December 2007 15:10
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] SGE 5.3p7 on Solaris 10
> 
> On Dec 4, 2007 6:05 AM, Neil Baker <neil.baker at crl.toshiba.co.uk> wrote:
>> We're migrating our Grid Engine from an unstable Redhat 8 / SGE 5.3p6 setup
>> to a Solaris Sparc setup to provide better stability.
> 
> The whole cluster or just the qmaster host??
> 
> 
>> As we have other Solaris machines already installed with Solaris 10
>> (06/2006) we're ideally looking to install SGE 5.3p7 (as we can't find SGE
>> 5.3p6) onto this platform so that we can have spare machines to run it on if
>> the main server dies.  We're doing this instead of migrating to SGE 6
>> because we're hoping the configuration files are compatible between p6 and
>> p7 and the Linux and Solaris binaries.
> 
> The configuration files between patch and update releases are compatible.
> 
> 
>> I've read in the "Bugs fixed in SGE 5.3p7 since release 5.3p6":
>> http://gridengine.sunsource.net/project/gridengine/53patches.txt that one of
>> the bugs that has been fixed is bug 4822799 "cannot install on Solaris 10",
>> however the download page says it is for Solaris 7, 8 or 9 64-bit and
>> doesn't mention Solaris 10 64-bit.
> 
> 4822799 was fixed by 5.3p4, you should be able to go up a bit and find
> the line "Bugs fixed in SGE 5.3p4 since release 5.3p3".
> 
> Rayson
> 
> 
>> Does anyone have any experience of running this version on Solaris 10?
>>
>> Regards
>>
>> Neil
>>
>>
>>
>> ______________________________________________________________________
>> This email has been scanned by the MessageLabs Email Security System.
>> For more information please visit http://www.messagelabs.com/email
>> ______________________________________________________________________
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
> 
> No virus found in this incoming message.
> Checked by AVG Free Edition. 
> Version: 7.5.488 / Virus Database: 269.16.13/1165 - Release Date: 02/12/2007
> 20:34

-- 
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Cambridge Research Laboratory
Email: richard.hobbs at crl.toshiba.co.uk
Web: http://www.toshiba-europe.com/research/
Tel: +44 1223 436999        Mobile: +44 7811 803377
