[GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being submitted, going into 't' state, then disappearing")

Andy Schwierskott andy.schwierskott at sun.com
Wed Dec 5 09:53:28 GMT 2007


Richard,

> Hello,
>
> Let's combine these two email problems, as Neil Baker and I are working
> on the same grid, and our separate problems have combined into one now!
>
> Basically, the jobs enter the queue ("qw"), then go into state "t" as
> you would expect. The jobs then get executed, kind of.
>
> For example, if I submit "sleep 100" it works perfectly. The job runs,
> and it sits there in state "r" for around 100 seconds. I still get no
> stderr and stdout files though.
>
> However, if I submit a script I've got that generates fractal planet
> images, immediately after state "t", the job disappears.
>
> I have a thought though - we are seeing no stdout and stderr files being
> generated, and apart from qacct showing the job existing, nothing else
> gets logged.
>
> I have also run "find $SGE_ROOT -type f | xargs grep <jobID>" and it
> only returns the "accounting" file.
>
> So, onto my question:
>
> Do the stderr and stdout files get generated by the exec host, or the
> qmaster? If it's the qmaster then that may explain the problem - we have
> not put our qmaster into the same automount setup as the rest of the
> network, and as a result it cannot see any of our network drives,
> including people's home directories etc... Therefore, the qmaster itself
> is unable to write any stderr and stdout files to the locations it needs to.

The stdout/stderr files are created by the shepherd process (the child of the
execd) on the exec host *before* the job is started. By default they are
written to the user's home directory unless otherwise redirected (or set to
/dev/null).

You could configure KEEP_ACTIVE=true in the "execd_params" section for a
specific execd host ("qconf -mconf <host_name>"). After the job has
disappeared you can look in the <execd_spool_dir>/active_jobs/<job_id>
directory, specifically at the contents of the "trace" and "error"
files.

This will very likely give you some hints about what went wrong.
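
Collecting those files can be scripted. The following is only a minimal
sketch in Python: the function name read_shepherd_files is hypothetical, and
it assumes the job directory under active_jobs is named either <job_id> or
<job_id>.<task_id>, so check the layout on your own exec host first.

```python
from pathlib import Path


def read_shepherd_files(execd_spool_dir, job_id, names=("trace", "error")):
    """Return {path: text} for the shepherd files kept for one job.

    Assumption: job directories live in <execd_spool_dir>/active_jobs and
    are named "<job_id>" or "<job_id>.<task_id>".
    """
    active = Path(execd_spool_dir) / "active_jobs"
    wanted = str(job_id)
    found = {}
    for job_dir in sorted(active.glob(wanted + "*")):
        # Skip unrelated jobs that merely share a prefix (420 vs. 42).
        if job_dir.name != wanted and not job_dir.name.startswith(wanted + "."):
            continue
        for name in names:
            path = job_dir / name
            if path.is_file():
                found[str(path)] = path.read_text(errors="replace")
    return found
```

Call it on the exec host with the execd spool directory and the job id, then
print each returned path and its text. It only finds something if
KEEP_ACTIVE=true was set before the job ran.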

What values do the "failed" and "exit_status" fields contain in the
"qacct -j <jobid>" output?

In addition I recommend setting "loglevel" to "log_info" in the global
cluster config ("qconf -mconf") if not done yet.
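
For the "failed" and "exit_status" fields asked about above, a small helper
can pull them out of the accounting output. This is a hedged sketch in
Python, not part of SGE: parse_qacct is a hypothetical name, the sample text
is made up for illustration, and the parser assumes each record starts with
a line of "=" characters followed by "name value" lines.

```python
def parse_qacct(text):
    """Split `qacct -j` style output into one dict per accounting record."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):
            # A separator line starts a new record.
            if current:
                records.append(current)
                current = {}
            continue
        parts = line.split(None, 1)
        if parts:
            current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        records.append(current)
    return records


# Hypothetical sample, NOT output from the cluster discussed in this thread.
SAMPLE = """\
==============================================================
qname        all.q
hostname     node01
jobnumber    1234
failed       0
exit_status  1
"""

for rec in parse_qacct(SAMPLE):
    print(rec["jobnumber"], rec.get("failed"), rec.get("exit_status"))
```

To use it on real data, feed it the text printed by "qacct -j <jobid>", for
example captured with subprocess.run(["qacct", "-j", jobid],
capture_output=True, text=True).stdout.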

> Could this be the cause?
>
> The planet generation script basically generates an image (which takes
> around 90 seconds), writes it to a network location, prints some stdout,
> and then does it 4 more times.
>
> We are not seeing the stdout, stderr *or* the generated planet images,
> but I suspect the planets aren't even being generated because the script
> is being stopped before that point due to the lack of a stdout/stderr
> channel. Does anyone else agree?
>
> Thanks again, people!
>
> Richard.
>
>
> Neil Baker wrote:
>> Just the qmaster.  What we've actually tried today is to copy the $SGE_ROOT
>> directory from the Redhat 8 box over to Solaris 10 and install just the
>> Solaris 10 binaries rather than the common files first.  This way the
>> $SGE_ROOT directory contains binaries for Solaris 10 (qmaster) and Linux x86
>> (Exec hosts).
>>
>> After a day's work we've got it starting up and even have an exec host
>> starting up using the same $SGE_ROOT (over nfs) using the Linux x86
>> binaries.  qstat works against this exec host and the old Redhat 8 exec
>> hosts that were left over in the original configuration files.
>>
>> However, when we submit jobs to the new exec host (using the correct
>> $SGE_ROOT over nfs) it transfers the jobs, but the jobs don't appear to run
>> properly and we don't get any of the stdout or stderr recorded as we
>> normally would.  Has anyone experienced this problem, or are we experiencing
>> the Solaris 10 incompatibility?
>>
>> Regards
>>
>> Neil
>>
>> -----Original Message-----
>> From: Rayson Ho [mailto:rayrayson at gmail.com]
>> Sent: 04 December 2007 15:10
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] SGE 5.3p7 on Solaris 10
>>
>> On Dec 4, 2007 6:05 AM, Neil Baker <neil.baker at crl.toshiba.co.uk> wrote:
>>> We're migrating our Grid Engine from an unstable Redhat 8 / SGE 5.3p6 setup
>>> to a Solaris Sparc setup to provide better stability.
>>
>> The whole cluster or just the qmaster host??
>>
>>
>>> As we have other Solaris machines already installed with Solaris 10
>>> (06/2006) we're ideally looking to install SGE 5.3p7 (as we can't find SGE
>>> 5.3p6) onto this platform so that we can have spare machines to run it on if
>>> the main server dies.  We're doing this instead of migrating to SGE 6
>>> because we're hoping the configuration files are compatible between p6 and
>>> p7 and the Linux and Solaris binaries.
>>
>> The configuration files between patch and update releases are compatible.
>>
>>
>>> I've read in the "Bugs fixed in SGE 5.3p7 since release 5.3p6":
>>> http://gridengine.sunsource.net/project/gridengine/53patches.txt that one of
>>> the bugs that has been fixed is bug 4822799 "cannot install on Solaris 10",
>>> however the download page says it is for Solaris 7, 8 or 9 64-bit and
>>> doesn't mention Solaris 10 64-bit.
>>
>> 4822799 was fixed by 5.3p4, you should be able to go up a bit and find
>> the line "Bugs fixed in SGE 5.3p4 since release 5.3p3".
>>
>> Rayson
>>
>>
>>> Does anyone have any experience of running this version on Solaris 10?
>>>
>>> Regards
>>>
>>> Neil
>>>
>>>
>>>
>>> ______________________________________________________________________
>>> This email has been scanned by the MessageLabs Email Security System.
>>> For more information please visit http://www.messagelabs.com/email
>>> ______________________________________________________________________
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>
>> No virus found in this incoming message.
>> Checked by AVG Free Edition.
>> Version: 7.5.488 / Virus Database: 269.16.13/1165 - Release Date: 02/12/2007 20:34
>>
>
>




