[GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being submitted, going into 't' state, then disappearing")

Richard Hobbs richard.hobbs at crl.toshiba.co.uk
Tue Dec 18 16:24:08 GMT 2007



Hello,

Perfect! Thank you Rayson (and everyone else who got us this far, of
course)! :-)

Richard.
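For the archives: the mechanism Rayson describes below (a cluster-wide default in the "sge_request" file) can be sketched as follows. Only the path layout is the stock 5.3 default; the demo root is hypothetical, not a real installation:

```shell
# Demo of the cluster-wide default request file, sge_request.
# Real location: $SGE_ROOT/<cell>/common/sge_request; every qsub
# option placed in it becomes a cluster-wide default.
SGE_ROOT=/tmp/sge_demo                     # hypothetical scratch root
mkdir -p "$SGE_ROOT/default/common"
# -cwd makes jobs write stdout/stderr to the submit directory
# instead of the job owner's home directory.
echo "-cwd" >> "$SGE_ROOT/default/common/sge_request"
cat "$SGE_ROOT/default/common/sge_request"
```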


Rayson Ho wrote:
> Did you have the -cwd switch in the "sge_request" file in your
> original setup?
> 
> Otherwise, by default the stdout and stderr files are placed in the
> job owner's home directory.
> 
> Rayson
> 
> 
> 
> On Dec 13, 2007 9:30 AM, Richard Hobbs <richard.hobbs at crl.toshiba.co.uk> wrote:
>> Hello,
>>
>> To summarise the seemingly unpopular question below:
>>
>> ======================================================================
>> Does anyone know where the setting is that defines the default location
>> for the stdout and stderr files for jobs?
>> ======================================================================
>>
>> Basically, we've moved our entire 5.3p6 configuration from Linux to
>> Solaris, and since doing so, all of the stdout and stderr files are
>> placed in the user's home directory, whereas previously they were
>> placed in the directory from which the jobs were submitted.
>>
>> While we realise this can be changed on a per-job basis, that is
>> unfortunately not acceptable, and we need to change the default
>> setting back to what it was when we had the Linux qmaster.
>>
>> Thanks in advance to anyone who can help! :-)
>>
>> Richard.
>>
>>
>>
>> Neil Baker wrote:
>>> Hi Andy,
>>>
>>> Many thanks for your help.  We feel that we're finally getting somewhere.
>>>
>>> Here is the information logged when running the jobs from the file:
>>> /export/sge/default/spool/qmaster/messages
>>>
>>> ==============================================================
>>> Thu Dec  6 13:08:21 2007|qmaster|stg-sun3|I|job 7269776.1 finished on host
>>> stg-dell19.crl.toshiba.co.uk
>>> ==============================================================
>>>
>>> Here is the output from the "qacct -j" command:
>>>
>>> ==============================================================
>>> bash-3.00# qacct -j 7269776
>>> ==============================================================
>>> qname        dell19L1
>>> hostname     stg-dell19.crl.toshiba.co.uk
>>> group        stg
>>> owner        rhobbs
>>> jobname      makeEarth.sh
>>> jobnumber    7269776
>>> taskid       undefined
>>> account      sge
>>> priority     0
>>> qsub_time    Thu Dec  6 13:08:13 2007
>>> start_time   Thu Dec  6 13:07:25 2007
>>> end_time     Thu Dec  6 13:07:28 2007
>>> granted_pe   none
>>> slots        1
>>> failed       0
>>> exit_status  1
>>> ru_wallclock 3
>>> ru_utime     0
>>> ru_stime     0
>>> ru_maxrss    0
>>> ru_ixrss     0
>>> ru_ismrss    0
>>> ru_idrss     0
>>> ru_isrss     0
>>> ru_minflt    3933
>>> ru_majflt    0
>>> ru_nswap     0
>>> ru_inblock   0
>>> ru_oublock   0
>>> ru_msgsnd    0
>>> ru_msgrcv    0
>>> ru_nsignals  0
>>> ru_nvcsw     2474
>>> ru_nivcsw    132
>>> cpu          0
>>> mem          0.000
>>> io           0.000
>>> iow          0.000
>>> maxvmem      2.34M
>>> bash-3.00#
>>> ==============================================================
>>>
>>> After setting KEEP_ACTIVE=TRUE we get in the trace file for a job:
>>>
>>> ==============================================================
>>> bash-3.00# cat trace
>>> 12/06/2007 13:37:25 [700:11093]: shepherd called with uid = 0, euid = 700
>>> 12/06/2007 13:37:25 [700:11093]: sigaction for signal 32 failed: Invalid
>>> argument
>>> 12/06/2007 13:37:25 [700:11093]: sigaction for signal 33 failed: Invalid
>>> argument
>>> 12/06/2007 13:37:25 [700:11093]: starting up 5.3p6
>>> 12/06/2007 13:37:25 [700:11093]: setpgid(11093, 11093) returned 0
>>> 12/06/2007 13:37:25 [700:11093]: no prolog script to start
>>> 12/06/2007 13:37:25 [700:11094]: pid=11094 pgrp=11094 sid=11094 old
>>> pgrp=11093 getlogin()=<no login set>
>>> 12/06/2007 13:37:25 [700:11094]: setosjobid: uid = 0, euid = 700
>>> 12/06/2007 13:37:25 [700:11093]: forked "job" with pid 11094
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_CPU setting: (soft 604800 hard
>>> 604800) resulting: (soft 604800 hard 604800)
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_FSIZE setting: (soft -1 hard -1)
>>> resulting: (soft -1 hard -1)
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_DATA setting: (soft -1 hard -1)
>>> resulting: (soft -1 hard -1)
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_STACK setting: (soft -1 hard -1)
>>> resulting: (soft -1 hard -1)
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_CORE setting: (soft -1 hard -1)
>>> resulting: (soft -1 hard -1)
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_VMEM/RLIMIT_AS setting: (soft -1
>>> hard -1) resulting: (soft -1 hard -1)
>>> 12/06/2007 13:37:25 [700:11094]: RLIMIT_RSS setting: (soft -1 hard -1)
>>> resulting: (soft -1 hard -1)
>>> 12/06/2007 13:37:25 [700:11093]: child: job - pid: 11094
>>> 12/06/2007 13:37:25 [721:11094]: closing all filedescriptors
>>> 12/06/2007 13:37:25 [721:11094]: further messages are in "error" and "trace"
>>> 12/06/2007 13:37:25 [721:11094]: execvp(/bin/csh, -csh
>>> /rmt/sge/default/spool/stg-dell19/job_scripts/7269791)
>>> 12/06/2007 13:37:26 [700:11093]: wait3 returned 11094 (status: 256;
>>> WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 1)
>>> 12/06/2007 13:37:26 [700:11093]: job exited with exit status 1
>>> 12/06/2007 13:37:26 [700:11093]: reaped "job" with pid 11094
>>> 12/06/2007 13:37:26 [700:11093]: job exited not due to signal
>>> 12/06/2007 13:37:26 [700:11093]: now sending signal 9 to pid -11094
>>> 12/06/2007 13:37:26 [700:11093]: job exited with status 1
>>> 12/06/2007 13:37:26 [700:11093]: writing usage file to "usage"
>>> 12/06/2007 13:37:26 [700:11093]: no tasker to notify
>>> 12/06/2007 13:37:26 [700:11093]: no epilog script to start
>>> bash-3.00#
>>> ==============================================================
>>>
>>> And the error file is empty.
>>>
>>> One promising bit of progress: looking at the config file, it appears
>>> to be writing stderr and stdout to Richard's home directory instead
>>> of the place it should be.  Could this be because of the following:
>>>
>>> [user at submithost scripts]$ cat script.sh.o.7269808
>>> Warning: no access to tty (Bad file descriptor).
>>> Thus no job control in this shell.
>>> [user at submithost scripts]$
>>>
>>> Any ideas?
>>>
>>> Regards
>>>
>>> Neil
>>>
>>> -----Original Message-----
>>> From: Andy Schwierskott [mailto:andy.schwierskott at sun.com]
>>> Sent: 05 December 2007 09:53
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] SGE 5.3p7 on Solaris 10 (and "SGE 5.3p6 - jobs being
>>> submitted, going into 't' state, then disappearing")
>>>
>>> Richard,
>>>
>>>> Hello,
>>>>
>>>> Let's combine these two email problems, as Neil Baker and I are working
>>>> on the same grid, and our separate problems have combined into one now!
>>>>
>>>> Basically, the jobs enter the queue ("qw"), then go into state "t" as
>>>> you would expect. The jobs then get executed, kind of.
>>>>
>>>> For example, if I submit "sleep 100" it works perfectly. The job
>>>> runs, and it sits there in state "r" for around 100 seconds. I still
>>>> get no stderr and stdout files though.
>>>>
>>>> However, if I submit a script I've got that generates fractal planet
>>>> images, the job disappears immediately after state "t".
>>>>
>>>> I have a thought though - we are seeing no stdout and stderr files being
>>>> generated, and apart from qacct showing the job existing, nothing else
>>>> gets logged.
>>>>
>>>> I have also run "find $SGE_ROOT -type f | xargs grep <jobID>" and it
>>>> only returns the "accounting" file.
>>>>
>>>> So, onto my question:
>>>>
>>>> Do the stderr and stdout files get generated by the exec host, or the
>>>> qmaster? If it's the qmaster then that may explain the problem - we have
>>>> not put our qmaster into the same automount setup as the rest of the
>>>> network, and as a result it cannot see any of our network drives,
>>>> including people's home directories etc... Therefore, the qmaster itself
>>>> is unable to write any stderr and stdout files to the locations it
>>>> needs to.
>>>
>>> The stdout/errfiles are created by the shepherd process (the child of the
>>> execd) *before* the job is started. By default it's the user's home
>>> directory unless otherwise redirected (or set to /dev/null).
>>>
>>> You could configure KEEP_ACTIVE=true in the "execd_params" section
>>> for a specific execd host ("qconf -mconf <host_name>"). After the job
>>> has disappeared you can look in the
>>> <execd_spool_dir>/active_jobs/<job_id> directory and inspect the
>>> contents of the "trace" and "error" files.
>>>
>>> This will very likely give you some hints about what went wrong.
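Andy's procedure above, written out as a command sequence (a sketch only: the host name and spool path are taken from logs elsewhere in this thread, and the active_jobs task suffix may differ per job):

```shell
# Preserve a vanished job's spool files for post-mortem inspection.
qconf -mconf stg-dell19.crl.toshiba.co.uk   # opens the host config in $EDITOR
#   in the editor, add (or extend) the line:
#   execd_params   KEEP_ACTIVE=TRUE
# Resubmit the failing job; after it disappears, its shepherd files
# remain under the execd spool directory:
cd /rmt/sge/default/spool/stg-dell19/active_jobs/<job_id>.1
cat trace    # the shepherd's step-by-step log
cat error    # any error text (may be empty if the job itself failed)
```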
>>>
>>> What values do the "failed" and "exit_status" fields contain in the
>>> "qacct -j <jobid>" output?
>>>
>>> In addition, I recommend setting "loglevel" to "log_info" in the
>>> global cluster configuration ("qconf -mconf"), if not done already.
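The loglevel change Andy suggests, again only as a sketch using the standard admin commands (the spool path is the one quoted earlier in the thread):

```shell
# Raise logging verbosity cluster-wide so the qmaster records more
# detail about job dispatch and failures.
qconf -mconf                # opens the global cluster configuration
#   in the editor, set:  loglevel   log_info
# Then follow the qmaster messages file while resubmitting the job:
tail -f /export/sge/default/spool/qmaster/messages
```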
>>>
>>>
>>>
>>>
>>>
>>>
>>>> Could this be the cause?
>>>>
>>>> The planet generation script basically generates an image (which takes
>>>> around 90 seconds), writes it to a network location, prints some stdout,
>>>> and then does it 4 more times.
>>>>
>>>> We are not seeing the stdout, stderr *or* the generated planet images,
>>>> but i suspect the planets aren't even being generated because the script
>>>> is being stopped before that point due to the lack of a stdout/stderr
>>>> channel. Does anyone else agree?
>>>>
>>>> Thanks again, people!
>>>>
>>>> Richard.
>>>>
>>>>
>>>> Neil Baker wrote:
>>>>> Just the qmaster.  What we've actually tried today is to copy the
>>>>> $SGE_ROOT directory from the Redhat 8 box over to Solaris 10 and
>>>>> install just the Solaris 10 binaries rather than the common files
>>>>> first.  This way the $SGE_ROOT directory contains binaries for
>>>>> Solaris 10 (qmaster) and Linux x86 (exec hosts).
>>>>>
>>>>> After a day's work we've got it starting up, and we even have an
>>>>> exec host starting up using the same $SGE_ROOT (over NFS) with the
>>>>> Linux x86 binaries.  qstat works against this exec host and the old
>>>>> Redhat 8 exec hosts that were left over in the original
>>>>> configuration files.
>>>>>
>>>>> However, when we submit jobs to the new exec host (using the
>>>>> correct $SGE_ROOT over NFS) it transfers the jobs, but the jobs
>>>>> don't appear to run properly and we don't get any of the stdout or
>>>>> stderr recorded as we normally would.  Has anyone experienced this
>>>>> problem, or are we experiencing the Solaris 10 incompatibility?
>>>>>
>>>>> Regards
>>>>>
>>>>> Neil
>>>>>
>>>>> -----Original Message-----
>>>>> From: Rayson Ho [mailto:rayrayson at gmail.com]
>>>>> Sent: 04 December 2007 15:10
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] SGE 5.3p7 on Solaris 10
>>>>>
>>>>> On Dec 4, 2007 6:05 AM, Neil Baker <neil.baker at crl.toshiba.co.uk> wrote:
>>>>>> We're migrating our Grid Engine from an unstable Redhat 8 / SGE
>>>>>> 5.3p6 setup to a Solaris Sparc setup to provide better stability.
>>>>> The whole cluster or just the qmaster host??
>>>>>
>>>>>
>>>>>> As we have other Solaris machines already installed with Solaris
>>>>>> 10 (06/2006) we're ideally looking to install SGE 5.3p7 (as we
>>>>>> can't find SGE 5.3p6) onto this platform so that we can have spare
>>>>>> machines to run it on if the main server dies.  We're doing this
>>>>>> instead of migrating to SGE 6 because we're hoping the
>>>>>> configuration files are compatible between p6 and p7 and between
>>>>>> the Linux and Solaris binaries.
>>>>> The configuration files between patch and update releases are compatible.
>>>>>
>>>>>
>>>>>> I've read in the "Bugs fixed in SGE 5.3p7 since release 5.3p6"
>>>>>> list (http://gridengine.sunsource.net/project/gridengine/53patches.txt)
>>>>>> that one of the bugs that has been fixed is bug 4822799 "cannot
>>>>>> install on Solaris 10"; however, the download page says it is for
>>>>>> Solaris 7, 8 or 9 64-bit and doesn't mention Solaris 10 64-bit.
>>>>> 4822799 was fixed in 5.3p4; scroll up a bit in that file and you
>>>>> should find the line "Bugs fixed in SGE 5.3p4 since release 5.3p3".
>>>>>
>>>>> Rayson
>>>>>
>>>>>
>>>>>> Does anyone have any experience of running this version on Solaris 10?
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Neil
>>>>>>
>>>>>>
>>>>>>
>>>>>> ______________________________________________________________________
>>>>>> This email has been scanned by the MessageLabs Email Security System.
>>>>>> For more information please visit http://www.messagelabs.com/email
>>>>>> ______________________________________________________________________
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>>
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>>>
>>>>>
>>>>> No virus found in this incoming message.
>>>>> Checked by AVG Free Edition.
>>>>> Version: 7.5.488 / Virus Database: 269.16.13/1165 - Release Date:
>>> 02/12/2007
>>>>> 20:34
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>>
>>>
>>>
>> --
>> Richard Hobbs (Systems Administrator)
>> Toshiba Research Europe Ltd. - Cambridge Research Laboratory
>> Email: richard.hobbs at crl.toshiba.co.uk
>> Web: http://www.toshiba-europe.com/research/
>> Tel: +44 1223 436999        Mobile: +44 7811 803377
>>
>> ---------------------------------------------------------------------
>>
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 
> 
> 

-- 
Richard Hobbs (Systems Administrator)
Toshiba Research Europe Ltd. - Cambridge Research Laboratory
Email: richard.hobbs at crl.toshiba.co.uk
Web: http://www.toshiba-europe.com/research/
Tel: +44 1223 436999        Mobile: +44 7811 803377




