[GE users] SEGFAULT on sge_qmaster 6.0u1

Ron Chen ron_chen_123 at yahoo.com
Fri Apr 22 14:45:09 BST 2005


Mark, it looks like some of the files are corrupted!

If you still have a copy of the jobs/ and job_scripts/
directories, then maybe you can find out which file is
causing the problem by doing a binary search. Then we
can see whether we should add some checks to make sure
qmaster can handle invalid fields in those files.
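
A minimal sketch of that binary search, in Python. It assumes
exactly one bad file and that qmaster crashes whenever that file
is present. find_culprit and the crashes callback are made-up
names, not part of Grid Engine; how you test a subset (for
example, restoring only those files into a scratch copy of the
spool directory and trying to start qmaster there) is left to
the callback.

```python
from typing import Callable, Sequence


def find_culprit(files: Sequence[str],
                 crashes: Callable[[Sequence[str]], bool]) -> str:
    """Return the one file that makes `crashes` return True.

    Assumption: exactly one file is bad, and a subset crashes
    if and only if it contains that file.
    """
    if not files:
        raise ValueError("no files to search")
    if not crashes(files):
        raise ValueError("the full set does not reproduce the crash")
    candidates = list(files)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first = candidates[:half]
        # Keep whichever half still reproduces the crash.
        candidates = first if crashes(first) else candidates[half:]
    return candidates[0]


# Stand-in check for illustration only: pretend job file "17" is bad.
spooled = [str(n) for n in range(1, 33)]
print(find_culprit(spooled, lambda subset: "17" in subset))
```

With 32 spooled files this takes five restarts after the initial
check, rather than trying each file in turn. If more than one
file is bad, the search still ends on a single bad file, but it
has to be repeated after that file is removed.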

 -Ron

--- "Olesen, Mark" wrote:
> I found a solution to the problem, but don't understand the cause.
> I removed the jobs/ and job_scripts/ dirs and now the qmaster starts
> without a problem.
> 
> Seems very strange!
> 
> 
> Dr. Mark Olesen
> Principal Engineer Thermofluids Analysis
> ArvinMeritor Light Vehicle Systems
> ArvinMeritor Emissions Technologies GmbH
> Biberbachstr. 9
> D-86154 Augsburg, GERMANY
> tel: +49 (821) 4103 - 862
> fax: +49 (821) 4103 - 7862
> Mark.Olesen at ArvinMeritor.com
> 
> > -----Original Message-----
> > From: Olesen, Mark [mailto:Mark.Olesen at arvinmeritor.com]
> > Sent: Friday, April 22, 2005 12:36 PM
> > To: 'users at gridengine.sunsource.net'
> > Subject: RE: [GE users] SEGFAULT on sge_qmaster 6.0u1
> > 
> > Using 'strace -f .../sge_qmaster' it would appear that the parent
> > process has the problem:
> > 
> > [pid 11376] gettimeofday({1114165748, 386248}, {4294967176, 0}) = 0
> > [pid 11376] write(6, "04/22/2005 12:29:08|qmaster|deal"..., 85) = 85
> > [pid 11376] close(6)                    = 0
> > [pid 11376] brk(0)                      = 0x8235000
> > [pid 11376] brk(0x8236000)              = 0x8236000
> > [pid 11376] brk(0)                      = 0x8236000
> > [pid 11376] brk(0x8237000)              = 0x8237000
> > [pid 11376] brk(0)                      = 0x8237000
> > [pid 11376] brk(0x8238000)              = 0x8238000
> > [pid 11376] brk(0)                      = 0x8238000
> > [pid 11376] brk(0x8239000)              = 0x8239000
> > [pid 11376] brk(0)                      = 0x8239000
> > [pid 11376] brk(0x823a000)              = 0x823a000
> > [pid 11376] brk(0)                      = 0x823a000
> > [pid 11376] brk(0x823b000)              = 0x823b000
> > [pid 11376] brk(0)                      = 0x823b000
> > [pid 11376] brk(0x823c000)              = 0x823c000
> > [pid 11376] brk(0)                      = 0x823c000
> > [pid 11376] brk(0x823d000)              = 0x823d000
> > [pid 11376] brk(0)                      = 0x823d000
> > [pid 11376] brk(0x823e000)              = 0x823e000
> > [pid 11376] gettimeofday({1114165748, 392706}, {4294967176, 0}) = 0
> > [pid 11376] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
> > upeek: ptrace(PTRACE_PEEKUSER,11378,44,0): Operation not permitted
> > detach: ptrace(PTRACE_DETACH, ...): Operation not permitted
> > upeek: ptrace(PTRACE_PEEKUSER,11380,44,0): Operation not permitted
> > detach: ptrace(PTRACE_DETACH, ...): Operation not permitted
> > upeek: ptrace(PTRACE_PEEKUSER,11381,44,0): Operation not permitted
> > detach: ptrace(PTRACE_DETACH, ...): Operation not permitted
> > upeek: ptrace(PTRACE_PEEKUSER,11379,44,0): Operation not permitted
> > detach: ptrace(PTRACE_DETACH, ...): Operation not permitted
> > upeek: ptrace(PTRACE_PEEKUSER,11377,44,0): Operation not permitted
> > detach: ptrace(PTRACE_DETACH, ...): Operation not permitted
> > upeek: ptrace(PTRACE_PEEKUSER,11382,44,0): Operation not permitted
> > detach: ptrace(PTRACE_DETACH, ...): Operation not permitted
> > 
> > 
> > BTW: I am using classic spooling
> > 
> > 
> > > -----Original Message-----
> > > From: Olesen, Mark [mailto:Mark.Olesen at arvinmeritor.com]
> > > Sent: Friday, April 22, 2005 12:05 PM
> > > To: GridEngine
> > > Subject: [GE users] SEGFAULT on sge_qmaster 6.0u1
> > >
> > > After restarting, the qmaster daemon fails to start (lx24-x86) -
> > > actually it forks and then fails.
> > >
> > >
> > > AFAIK I haven't changed anything significant on the configuration
> > > (Admin email address, complexes, load-sensor) within the last while
> > > that should affect sge_qmaster.  Some time ago I did have a problem
> > > with spaces within a complex string preventing the files from being
> > > re-read, but I've since removed the problem.
> > >
> > > The message file displays the following:
> > >
> > > 04/22/2005 11:48:47|qmaster|dealog01|W|local configuration dealog01.zeunastaerker.de not defined - using global configuration
> > > 04/22/2005 11:48:48|qmaster|dealog01|I|read job database with 5 entries in 0 seconds
> > >
> > >
> > > Using debug level 'dl 1' I receive the following info:
> > >
> > >    889  10995 16384     TSTSOS: 1 slots used (limit 1) -> suspended
> > >    890  10995 16384     qinstance "(null)" suspended on subordinate
> > >    891  10995 16384     Due to other suspend states signal will NOT be delivered
> > >    892  10995 16384     QUEUE (null): queued signal STOP (retry after 60 seconds) host dealc02.zeunastaerker.de
> > >    893  10995 16384     te_delete_event: (t:5 u1:0 u2:0 s:(null))
> > >    894  10995 16384     te_add_event: (t:5 w:1114163876 m:1 s:(null))
> > > Segmentation fault
> > >
> > > With debug level 'dl 2' I receive the following info:
> > >
> > >
> > >   7228  11006 16384 <-- te_add_event() ../daemons/qmaster/sge_qmaster_timed_event.c 345 }
> > >   7229  11006 16384 --> te_free_event() {
> > >   7230  11006 16384 <-- te_free_event() ../daemons/qmaster/sge_qmaster_timed_event.c 259 }
> > >   7231  11006 16384 --> signal_slave_jobs_in_queue() {
> > > Segmentation fault
> > >
> > >
> > > Based on these messages, where should I start looking to sort out
> > > the problem?
> > >
> > >
> > >
> > >
> 
=== message truncated ===

