[GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being submitted, going into 't' state, then disappearing")

Rayson Ho rayrayson at gmail.com
Thu Dec 13 14:37:05 GMT 2007


Did you have the -cwd switch in the "sge_request" file in your
original setup?

If not, the stdout and stderr files are placed in the job owner's
home directory by default.
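
A rough way to check for that default is sketched below. It assumes the cluster-wide request file lives at $SGE_ROOT/default/common/sge_request (the cell name "default" is taken from the spool paths quoted further down), and the function name has_cwd_default is made up for illustration. It only looks for an uncommented -cwd option in the file you point it at.

```shell
# has_cwd_default FILE   (hypothetical helper, not part of Grid Engine)
# Succeeds (exit 0) if FILE has an uncommented line containing the -cwd option.
has_cwd_default() {
    # Drop comment lines first, then look for -cwd as a whole word.
    grep -v '^[[:space:]]*#' "$1" 2>/dev/null |
        grep -E -q -e '(^|[[:space:]])-cwd([[:space:]]|$)'
}

# Example call (the path is an assumption, adjust it to your cell name):
# has_cwd_default "$SGE_ROOT/default/common/sge_request" && echo "-cwd is a default"
```

If the function fails on the new qmaster's copy of the file but succeeds on the old one, that would be consistent with the output files moving to the home directory.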

Rayson



On Dec 13, 2007 9:30 AM, Richard Hobbs <richard.hobbs at crl.toshiba.co.uk> wrote:
> Hello,
>
> To summarise the seemingly unpopular question below:
>
> ======================================================================
> Does anyone know where the setting is that defines the default location
> for the stdout and stderr files for jobs?
> ======================================================================
>
> Basically, we've moved our entire 5.3p6 configuration from Linux to
> Solaris, and since doing so, all of the stdout and stderr files are
> placed in the user's home directory, whereas previously they were placed
> in the directory from which they submitted their jobs.
>
> While we realise this can be changed on a per-job basis, this is not
> acceptable unfortunately, and we need to change the default setting back
> to what it was when we had the Linux qmaster.
>
> Thanks in advance to anyone who can help! :-)
>
> Richard.
>
>
>
> Neil Baker wrote:
> > Hi Andy,
> >
> > Many thanks for your help.  We feel that we're finally getting somewhere.
> >
> > Here is the information logged when running the jobs from the file:
> > /export/sge/default/spool/qmaster/messages
> >
> > ==============================================================
> > Thu Dec  6 13:08:21 2007|qmaster|stg-sun3|I|job 7269776.1 finished on host stg-dell19.crl.toshiba.co.uk
> > ==============================================================
> >
> > Here is the output from the "qacct -j" command:
> >
> > ==============================================================
> > bash-3.00# qacct -j 7269776
> > ==============================================================
> > qname        dell19L1
> > hostname     stg-dell19.crl.toshiba.co.uk
> > group        stg
> > owner        rhobbs
> > jobname      makeEarth.sh
> > jobnumber    7269776
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Thu Dec  6 13:08:13 2007
> > start_time   Thu Dec  6 13:07:25 2007
> > end_time     Thu Dec  6 13:07:28 2007
> > granted_pe   none
> > slots        1
> > failed       0
> > exit_status  1
> > ru_wallclock 3
> > ru_utime     0
> > ru_stime     0
> > ru_maxrss    0
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    3933
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   0
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     2474
> > ru_nivcsw    132
> > cpu          0
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      2.34M
> > bash-3.00#
> > ==============================================================
> >
> > After setting KEEP_ACTIVE=TRUE we get in the trace file for a job:
> >
> > ==============================================================
> > bash-3.00# cat trace
> > 12/06/2007 13:37:25 [700:11093]: shepherd called with uid = 0, euid = 700
> > 12/06/2007 13:37:25 [700:11093]: sigaction for signal 32 failed: Invalid argument
> > 12/06/2007 13:37:25 [700:11093]: sigaction for signal 33 failed: Invalid argument
> > 12/06/2007 13:37:25 [700:11093]: starting up 5.3p6
> > 12/06/2007 13:37:25 [700:11093]: setpgid(11093, 11093) returned 0
> > 12/06/2007 13:37:25 [700:11093]: no prolog script to start
> > 12/06/2007 13:37:25 [700:11094]: pid=11094 pgrp=11094 sid=11094 old pgrp=11093 getlogin()=<no login set>
> > 12/06/2007 13:37:25 [700:11094]: setosjobid: uid = 0, euid = 700
> > 12/06/2007 13:37:25 [700:11093]: forked "job" with pid 11094
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_CPU setting: (soft 604800 hard 604800) resulting: (soft 604800 hard 604800)
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_FSIZE setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_DATA setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_STACK setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_CORE setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_VMEM/RLIMIT_AS setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
> > 12/06/2007 13:37:25 [700:11094]: RLIMIT_RSS setting: (soft -1 hard -1) resulting: (soft -1 hard -1)
> > 12/06/2007 13:37:25 [700:11093]: child: job - pid: 11094
> > 12/06/2007 13:37:25 [721:11094]: closing all filedescriptors
> > 12/06/2007 13:37:25 [721:11094]: further messages are in "error" and "trace"
> > 12/06/2007 13:37:25 [721:11094]: execvp(/bin/csh, -csh /rmt/sge/default/spool/stg-dell19/job_scripts/7269791)
> > 12/06/2007 13:37:26 [700:11093]: wait3 returned 11094 (status: 256; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 1)
> > 12/06/2007 13:37:26 [700:11093]: job exited with exit status 1
> > 12/06/2007 13:37:26 [700:11093]: reaped "job" with pid 11094
> > 12/06/2007 13:37:26 [700:11093]: job exited not due to signal
> > 12/06/2007 13:37:26 [700:11093]: now sending signal 9 to pid -11094
> > 12/06/2007 13:37:26 [700:11093]: job exited with status 1
> > 12/06/2007 13:37:26 [700:11093]: writing usage file to "usage"
> > 12/06/2007 13:37:26 [700:11093]: no tasker to notify
> > 12/06/2007 13:37:26 [700:11093]: no epilog script to start
> > bash-3.00#
> > ==============================================================
> >
> > And the error file is empty.
> >
> > One promising bit of progress is that looking at the config file it appears
> > to be writing stderr and stdout to Richard's home directory instead of the
> > place it should be.  Could this be because of the following:
> >
> > [user at submithost scripts]$ cat script.sh.o.7269808
> > Warning: no access to tty (Bad file descriptor).
> > Thus no job control in this shell.
> > [user at submithost scripts]$
> >
> > Any ideas?
> >
> > Regards
> >
> > Neil
> >
> > -----Original Message-----
> > From: Andy Schwierskott [mailto:andy.schwierskott at sun.com]
> > Sent: 05 December 2007 09:53
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being
> > submitted, going into 't' state, then disappearing")
> >
> > Richard,
> >
> >> Hello,
> >>
> >> Let's combine these two email problems, as Neil Baker and I are working
> >> on the same grid, and our separate problems have combined into one now!
> >>
> >> Basically, the jobs enter the queue ("qw"), then go into state "t" as
> >> you would expect. The jobs then get executed, kind of.
> >>
> >> For example, if I submit "sleep 100" it works perfectly. The job runs,
> >> and it sits there in state "r" for around 100 seconds. I still get no
> >> stderr and stdout files though.
> >>
> >> However, if I submit a script I've got that generates fractal planet
> >> images, immediately after state "t", the job disappears.
> >>
> >> I have a thought though - we are seeing no stdout and stderr files being
> >> generated, and apart from qacct showing the job existing, nothing else
> >> gets logged.
> >>
> >> I have also run "find $SGE_ROOT -type f | xargs grep <jobID>" and it
> >> only returns the "accounting" file.
> >>
> >> So, onto my question:
> >>
> >> Do the stderr and stdout files get generated by the exec host, or the
> >> qmaster? If it's the qmaster then that may explain the problem - we have
> >> not put our qmaster into the same automount setup as the rest of the
> >> network, and as a result it cannot see any of our network drives,
> >> including people's home directories etc... Therefore, the qmaster itself
> >> is unable to write any stderr and stdout files to the locations it needs
> >> to.
> >
> > The stdout/stderr files are created by the shepherd process (the child of
> > the execd) *before* the job is started. By default they are placed in the
> > user's home directory unless otherwise redirected (or set to /dev/null).
> >
> > You could configure KEEP_ACTIVE=true in the "execd_params" section for a
> > specific execd host ("qconf -mconf <host_name>"). After the job has
> > disappeared you can look in the <execd_spool_dir>/active_jobs/<job_id>
> > directory, specifically at the contents of the "trace" and "error" files.
> >
> > This will very likely give you some hints about what went wrong.
> >
> > What values do the "failed" and "exit_status" fields contain in the
> > "qacct -j <jobid>" output?
> >
> > In addition, I recommend setting "loglevel" to "log_info" in the global
> > cluster config ("qconf -mconf") if you have not done so yet.
> >
> >
> >
> >
> >
> >
> >> Could this be the cause?
> >>
> >> The planet generation script basically generates an image (which takes
> >> around 90 seconds), writes it to a network location, prints some stdout,
> >> and then does it 4 more times.
> >>
> >> We are not seeing the stdout, stderr *or* the generated planet images,
> >> but I suspect the planets aren't even being generated because the script
> >> is being stopped before that point due to the lack of a stdout/stderr
> >> channel. Does anyone else agree?
> >>
> >> Thanks again, people!
> >>
> >> Richard.
> >>
> >>
> >> Neil Baker wrote:
> >>> Just the qmaster.  What we've actually tried today is to copy the $SGE_ROOT
> >>> directory from the Redhat 8 box over to Solaris 10 and install just the
> >>> Solaris 10 binaries rather than the common files first.  This way the
> >>> $SGE_ROOT directory contains binaries for Solaris 10 (qmaster) and Linux x86
> >>> (Exec hosts).
> >>>
> >>> After a day's work we've got it starting up and even have an exec host
> >>> starting up using the same $SGE_ROOT (over nfs) using the Linux x86
> >>> binaries.  qstat works against this exec host and the old Redhat 8 exec
> >>> hosts that were left over in the original configuration files.
> >>>
> >>> However, when we submit jobs to the new exec host (using the correct
> >>> $SGE_ROOT over nfs) it transfers the jobs, but the jobs don't appear to run
> >>> properly and we don't get any of the stdout or stderr recorded as we
> >>> normally would.  Has anyone experienced this problem, or are we experiencing
> >>> the Solaris 10 incompatibility?
> >>>
> >>> Regards
> >>>
> >>> Neil
> >>>
> >>> -----Original Message-----
> >>> From: Rayson Ho [mailto:rayrayson at gmail.com]
> >>> Sent: 04 December 2007 15:10
> >>> To: users at gridengine.sunsource.net
> >>> Subject: Re: [GE users] SGE 5.3p7 on Solaris 10
> >>>
> >>> On Dec 4, 2007 6:05 AM, Neil Baker <neil.baker at crl.toshiba.co.uk> wrote:
> >>>> We're migrating our Grid Engine from an unstable Redhat 8 / SGE 5.3p6 setup
> >>>> to a Solaris Sparc setup to provide better stability.
> >>> The whole cluster or just the qmaster host??
> >>>
> >>>
> >>>> As we have other Solaris machines already installed with Solaris 10
> >>>> (06/2006) we're ideally looking to install SGE 5.3p7 (as we can't find SGE
> >>>> 5.3p6) onto this platform so that we can have spare machines to run it on if
> >>>> the main server dies.  We're doing this instead of migrating to SGE 6
> >>>> because we're hoping the configuration files are compatible between p6 and
> >>>> p7 and the Linux and Solaris binaries.
> >>> The configuration files between patch and update releases are compatible.
> >>>
> >>>
> >>>> I've read in the "Bugs fixed in SGE 5.3p7 since release 5.3p6":
> >>>> http://gridengine.sunsource.net/project/gridengine/53patches.txt that one of
> >>>> the bugs that has been fixed is bug 4822799 "cannot install on Solaris 10",
> >>>> however the download page says it is for Solaris 7, 8 or 9 64-bit and
> >>>> doesn't mention Solaris 10 64-bit.
> >>> 4822799 was fixed by 5.3p4, you should be able to go up a bit and find
> >>> the line "Bugs fixed in SGE 5.3p4 since release 5.3p3".
> >>>
> >>> Rayson
> >>>
> >>>
> >>>> Does anyone have any experience of running this version on Solaris 10?
> >>>>
> >>>> Regards
> >>>>
> >>>> Neil
> >>>>
> >>>>
> >>>>
> >>>> ______________________________________________________________________
> >>>> This email has been scanned by the MessageLabs Email Security System.
> >>>> For more information please visit http://www.messagelabs.com/email
> >>>> ______________________________________________________________________
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>>>
> >>>>
> >>> No virus found in this incoming message.
> >>> Checked by AVG Free Edition.
> >>> Version: 7.5.488 / Virus Database: 269.16.13/1165 - Release Date: 02/12/2007 20:34
> >>>
> >>
> >
>
> --
> Richard Hobbs (Systems Administrator)
> Toshiba Research Europe Ltd. - Cambridge Research Laboratory
> Email: richard.hobbs at crl.toshiba.co.uk
> Web: http://www.toshiba-europe.com/research/
> Tel: +44 1223 436999        Mobile: +44 7811 803377
>
>
>
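
Following Andy's suggestion quoted above to look at the "failed" and "exit_status" fields, here is a small sketch that pulls just those two values out of "qacct -j" output. The function name job_result is hypothetical, and it only assumes the "name value" layout shown in Neil's listing.

```shell
# job_result   (hypothetical helper, not part of Grid Engine)
# Reads "qacct -j <jobid>" output on stdin and prints the "failed" and
# "exit_status" fields on one line.
job_result() {
    awk '$1 == "failed"      { failed = $2 }
         $1 == "exit_status" { status = $2 }
         END { print "failed=" failed " exit_status=" status }'
}

# Example call on a host where qacct is available:
# qacct -j 7269776 | job_result
```

For the listing quoted above (failed 0, exit_status 1) this prints "failed=0 exit_status=1", which suggests Grid Engine started the job without a failure of its own and the script itself exited with status 1.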

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



