[GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being submitted, going into 't' state, then disappearing")

Neil Baker neil.baker at crl.toshiba.co.uk
Thu Dec 6 15:44:36 GMT 2007


Hi Andy,

Many thanks for your help.  We feel that we're finally getting somewhere.

Here is the information logged when running the jobs, taken from the file:
/export/sge/default/spool/qmaster/messages

==============================================================
Thu Dec  6 13:08:21 2007|qmaster|stg-sun3|I|job 7269776.1 finished on host
stg-dell19.crl.toshiba.co.uk
==============================================================

Here is the output from the "qacct -j" command:

==============================================================
bash-3.00# qacct -j 7269776
==============================================================
qname        dell19L1
hostname     stg-dell19.crl.toshiba.co.uk
group        stg
owner        rhobbs
jobname      makeEarth.sh
jobnumber    7269776
taskid       undefined
account      sge
priority     0
qsub_time    Thu Dec  6 13:08:13 2007
start_time   Thu Dec  6 13:07:25 2007
end_time     Thu Dec  6 13:07:28 2007
granted_pe   none
slots        1
failed       0
exit_status  1
ru_wallclock 3
ru_utime     0
ru_stime     0
ru_maxrss    0
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    3933
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     2474
ru_nivcsw    132
cpu          0
mem          0.000
io           0.000
iow          0.000
maxvmem      2.34M
bash-3.00#
==============================================================

After setting KEEP_ACTIVE=TRUE, this is what we get in the trace file for a job:

==============================================================
bash-3.00# cat trace
12/06/2007 13:37:25 [700:11093]: shepherd called with uid = 0, euid = 700
12/06/2007 13:37:25 [700:11093]: sigaction for signal 32 failed: Invalid
argument
12/06/2007 13:37:25 [700:11093]: sigaction for signal 33 failed: Invalid
argument
12/06/2007 13:37:25 [700:11093]: starting up 5.3p6
12/06/2007 13:37:25 [700:11093]: setpgid(11093, 11093) returned 0
12/06/2007 13:37:25 [700:11093]: no prolog script to start
12/06/2007 13:37:25 [700:11094]: pid=11094 pgrp=11094 sid=11094 old
pgrp=11093 getlogin()=<no login set>
12/06/2007 13:37:25 [700:11094]: setosjobid: uid = 0, euid = 700
12/06/2007 13:37:25 [700:11093]: forked "job" with pid 11094
12/06/2007 13:37:25 [700:11094]: RLIMIT_CPU setting: (soft 604800 hard
604800) resulting: (soft 604800 hard 604800)
12/06/2007 13:37:25 [700:11094]: RLIMIT_FSIZE setting: (soft -1 hard -1)
resulting: (soft -1 hard -1)
12/06/2007 13:37:25 [700:11094]: RLIMIT_DATA setting: (soft -1 hard -1)
resulting: (soft -1 hard -1)
12/06/2007 13:37:25 [700:11094]: RLIMIT_STACK setting: (soft -1 hard -1)
resulting: (soft -1 hard -1)
12/06/2007 13:37:25 [700:11094]: RLIMIT_CORE setting: (soft -1 hard -1)
resulting: (soft -1 hard -1)
12/06/2007 13:37:25 [700:11094]: RLIMIT_VMEM/RLIMIT_AS setting: (soft -1
hard -1) resulting: (soft -1 hard -1)
12/06/2007 13:37:25 [700:11094]: RLIMIT_RSS setting: (soft -1 hard -1)
resulting: (soft -1 hard -1)
12/06/2007 13:37:25 [700:11093]: child: job - pid: 11094
12/06/2007 13:37:25 [721:11094]: closing all filedescriptors
12/06/2007 13:37:25 [721:11094]: further messages are in "error" and "trace"
12/06/2007 13:37:25 [721:11094]: execvp(/bin/csh, -csh
/rmt/sge/default/spool/stg-dell19/job_scripts/7269791)
12/06/2007 13:37:26 [700:11093]: wait3 returned 11094 (status: 256;
WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 1)
12/06/2007 13:37:26 [700:11093]: job exited with exit status 1
12/06/2007 13:37:26 [700:11093]: reaped "job" with pid 11094
12/06/2007 13:37:26 [700:11093]: job exited not due to signal
12/06/2007 13:37:26 [700:11093]: now sending signal 9 to pid -11094
12/06/2007 13:37:26 [700:11093]: job exited with status 1
12/06/2007 13:37:26 [700:11093]: writing usage file to "usage"
12/06/2007 13:37:26 [700:11093]: no tasker to notify
12/06/2007 13:37:26 [700:11093]: no epilog script to start
bash-3.00#
==============================================================

And the error file is empty.
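
If it helps, we could also try running the spooled job script by hand under
the same shell the shepherd used (the path is taken from the execvp line in
the trace above; the spooled copy may already have been cleaned up, in which
case we would substitute the original makeEarth.sh):

   su - rhobbs -c '/bin/csh /rmt/sge/default/spool/stg-dell19/job_scripts/7269791'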

One promising bit of progress: looking at the config file, it appears to be
writing stderr and stdout to Richard's home directory instead of the place it
should be.  Could this be because of the following:

[user at submithost scripts]$ cat script.sh.o.7269808
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
[user at submithost scripts]$
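
If that warning points at csh start-up being the problem, one thing we could
try (assuming the jobs really are started under /bin/csh, as the execvp line
in the trace shows) is to guard any interactive-only commands in the user's
~/.cshrc so that batch shells skip them -- the warning itself is printed by
csh when there is no controlling tty and is usually harmless:

   # ~/.cshrc (sketch) -- only do tty/interactive setup when interactive
   if ( $?prompt ) then
       # stty settings, prompts, aliases that need a tty, etc.
   endif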

Any ideas?

Regards

Neil

-----Original Message-----
From: Andy Schwierskott [mailto:andy.schwierskott at sun.com] 
Sent: 05 December 2007 09:53
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being
submitted, going into 't' state, then disappearing")

Richard,

> Hello,
>
> Let's combine these two email problems, as Neil Baker and I are working
> on the same grid, and our separate problems have combined into one now!
>
> Basically, the jobs enter the queue ("qw"), then go into state "t" as
> you would expect. The jobs then get executed, kind of.
>
> For example, if I submit "sleep 100" it works perfectly. The job runs,
> and it sits there in state "r" for around 100 seconds. I still get no
> stderr and stdout files though.
>
> However, if I submit a script I've got that generates fractal planet
> images, immediately after state "t", the job disappears.
>
> I have a thought though - we are seeing no stdout and stderr files being
> generated, and apart from qacct showing the job existing, nothing else
> gets logged.
>
> I have also run "find $SGE_ROOT -type f | xargs grep <jobID>" and it
> only returns the "accounting" file.
>
> So, onto my question:
>
> Do the stderr and stdout files get generated by the exec host, or the
> qmaster? If it's the qmaster then that may explain the problem - we have
> not put our qmaster into the same automount setup as the rest of the
> network, and as a result it cannot see any of our network drives,
> including people's home directories etc... Therefore, the qmaster itself
> is unable to write any stderr and stdout files to the locations it needs to.

The stdout/errfiles are created by the shepherd process (the child of the
execd) *before* the job is started. By default it's the user's home
directory unless otherwise redirected (or set to /dev/null).
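
If you want to rule out problems with reaching the home directory from the
execution host, you could also redirect the files explicitly at submit time;
for example (the target directory below is just a placeholder, any path that
is writable from the exec host will do):

   qsub -o /some/shared/dir -e /some/shared/dir makeEarth.sh

or send them to /dev/null while testing.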

You could configure KEEP_ACTIVE=true in the "execd_params" section for a
specific execd host ("qconf -mconf <host_name>"). After the job has
disappeared you can look in the <execd_spool_dir>/active_jobs/<job_id>
directory and specifically inspect the contents of the "trace" and "error"
files.
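
A minimal sketch of what that looks like (host name and job id are
placeholders):

   # qconf -mconf <host_name>   -> add or extend the line:
   execd_params             KEEP_ACTIVE=true

   # after the job has disappeared, on that execution host:
   ls  <execd_spool_dir>/active_jobs/<job_id>
   cat <execd_spool_dir>/active_jobs/<job_id>/trace
   cat <execd_spool_dir>/active_jobs/<job_id>/error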

This will very likely give you some hints about what is going wrong.

What values do the "failed" and "exit_status" fields contain in the
"qacct -j <jobid>" output?

In addition, I recommend setting "loglevel" to "log_info" in the global
cluster configuration ("qconf -mconf") if you haven't done so already.
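
In the global configuration that is just a single line, e.g. as it would
appear in the "qconf -mconf" editor:

   loglevel                 log_info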






> Could this be the cause?
>
> The planet generation script basically generates an image (which takes
> around 90 seconds), writes it to a network location, prints some stdout,
> and then does it 4 more times.
>
> We are not seeing the stdout, stderr *or* the generated planet images,
> but I suspect the planets aren't even being generated because the script
> is being stopped before that point due to the lack of a stdout/stderr
> channel. Does anyone else agree?
>
> Thanks again, people!
>
> Richard.
>
>
> Neil Baker wrote:
>> Just the qmaster.  What we've actually tried today is to copy the $SGE_ROOT
>> directory from the Redhat 8 box over to Solaris 10 and install just the
>> Solaris 10 binaries rather than the common files first.  This way the
>> $SGE_ROOT directory contains binaries for Solaris 10 (qmaster) and Linux x86
>> (Exec hosts).
>>
>> After a day's work we've got it starting up and even have a exec host
>> starting up using the same $SGE_ROOT (over nfs) using the Linux x86
>> binaries.  qstat works against this exec host and the old Redhat 8 exec
>> hosts that were left over in the original configuration files.
>>
>> However, when we submit jobs to the new exec host (using the correct
>> $SGE_ROOT over nfs) it transfers the jobs, but the jobs don't appear to run
>> properly and we don't get any of the stdout or stderr recorded as we
>> normally would.  Has anyone experienced this problem, or are we experiencing
>> the Solaris 10 incompatibility?
>>
>> Regards
>>
>> Neil
>>
>> -----Original Message-----
>> From: Rayson Ho [mailto:rayrayson at gmail.com]
>> Sent: 04 December 2007 15:10
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] SGE 5.3p7 on Solaris 10
>>
>> On Dec 4, 2007 6:05 AM, Neil Baker <neil.baker at crl.toshiba.co.uk> wrote:
>>> We're migrating our Grid Engine from an unstable Redhat 8 / SGE 5.3p6
>> setup
>>> to a Solaris Sparc setup to provide better stability.
>>
>> The whole cluster or just the qmaster host??
>>
>>
>>> As we have other Solaris machines already installed with Solaris 10
>>> (06/2006) we're ideally looking to install SGE 5.3p7 (as we can't find SGE
>>> 5.3p6) onto this platform so that we can have spare machines to run it on if
>>> the main server dies.  We're doing this instead of migrating to SGE 6
>>> because we're hoping the configuration files are compatible between p6 and
>>> p7 and the Linux and Solaris binaries.
>>
>> The configuration files between patch and update releases are compatible.
>>
>>
>>> I've read in the "Bugs fixed in SGE 5.3p7 since release 5.3p6":
>>> http://gridengine.sunsource.net/project/gridengine/53patches.txt that one of
>>> the bugs that has been fixed is bug 4822799 "cannot install on Solaris 10",
>>> however the download page says it is for Solaris 7, 8 or 9 64-bit and
>>> doesn't mention Solaris 10 64-bit.
>>
>> 4822799 was fixed by 5.3p4, you should be able to go up a bit and find
>> the line "Bugs fixed in SGE 5.3p4 since release 5.3p3".
>>
>> Rayson
>>
>>
>>> Does anyone have any experience of running this version on Solaris 10?
>>>
>>> Regards
>>>
>>> Neil
>>>
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list