[GE users] crazy and frustrating SGE 6.0u1 problems with Mac OS X Server

Stephan Grell stephan.grell at sun.com
Tue Jan 4 10:30:05 GMT 2005


Chris,

another thought about your first problem:

You might be affected by issue 1316.

Did you delete the hosts that are named in the error messages?
If so, they might still be referenced in one of your queues, and you 
will then see this error message.
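
A rough way to check for this is to compare the hosts each queue references against the hosts that still exist. The Python sketch below assumes the text comes from `qconf -sq <queue>` (queue configuration) and `qconf -sel` (execution host list), both of which need a running qmaster. The function names and the sample data are made up for illustration, and host groups (entries starting with `@`) are skipped rather than expanded.

```python
def parse_hostlist(queue_config):
    """Return the entries of the `hostlist` attribute from `qconf -sq` style text."""
    # Long values are wrapped with a trailing backslash; join them first.
    joined = queue_config.replace("\\\n", " ")
    for line in joined.splitlines():
        parts = line.split()
        if parts and parts[0] == "hostlist":
            # Entries may be separated by spaces or commas.
            entries = " ".join(parts[1:]).replace(",", " ").split()
            return [h for h in entries if h != "NONE"]
    return []


def stale_hosts(queue_config, known_hosts):
    """List hosts named in the queue's hostlist that are not in known_hosts."""
    known = {h.lower() for h in known_hosts}
    return [
        h for h in parse_hostlist(queue_config)
        if not h.startswith("@") and h.lower() not in known
    ]


# Hypothetical sample data, not taken from the cluster in this thread.
sample = """qname                 all.q
hostlist              node01.example.org node02.example.org \\
                      node03.example.org
seq_no                0
"""

print(stale_hosts(sample, ["node01.example.org", "node03.example.org"]))
```

Any host this prints is still named in the queue but no longer defined as an execution host, which is the situation described above.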

Just an idea.

Stephan

Chris Dagdigian wrote:

>
> Just my luck!
>
> 2 of the weirdest, most unreproducible SGE errors I've seen with SGE 
> 6.0 on Mac OS X systems have just popped up on the same cluster at the 
> same time leaving me unable to reinstall or fix a previously working 
> cluster configuration.
>
> Any debug tips appreciated.  My biggest problem right now is that this 
> is a production cluster with a waitlist of users so #1 priority is to 
> get SGE back up. I'll have to try to debug/reproduce on a different 
> cluster system....
>
>
> Issue #1 -- the "! lGetHost(): got NULL element for EH_name !" error
> ====================================================================
>
> I posted about this one before
>
> Thread is here: 
> http://gridengine.sunsource.net/servlets/ReadMsg?msgId=21337&listName=users
>
>
> In that message I was inclined to blame the pre-release version of the 
> Apple XSAN software that we had been running. Now I'm more inclined to 
> think the problem is more related to berkeleydb spooling as it just 
> popped up again on a different cluster with a very straightforward 
> configuration (20 dual G5 Xserves connected via GigE)
>
> Problem Summary:
>
>  o In *most* cases (but not every time) where there is an unexpected 
> power loss, system crash or cluster reboot, SGE 6 will fail to restart 
> and will fill its spool logs with errors that look like:
>
>> 12/29/2004 15:26:10|qmaster|cat|C|!!!!!!!!!! lGetHost(): got NULL 
>> element for EH_name !!!!!!!!!!
>> 12/29/2004 15:30:13|qmaster|cat|W|local configuration xxx.xxx.edu not 
>> defined - using global configuration
>> 12/29/2004 15:30:13|qmaster|cat|I|read job database with 53 entries 
>> in 0 seconds
>> 12/29/2004 15:30:13|qmaster|cat|C|!!!!!!!!!! lGetHost(): got NULL 
>> element for EH_name !!!!!!!!!!
>> 12/29/2004 15:30:23|qmaster|cat|W|local configuration xxx.xxx not 
>> defined - using global configuration
>
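
To see how often each critical message occurs, the qmaster messages file can be summarized with a few lines of Python. This is a sketch that assumes each unwrapped log line has the form `date time|daemon|host|level|message`, as in the excerpt above, and that the fourth field is a severity letter (`C`, `W` and `I` appear in the excerpt). The function names are illustrative, and the sample lines are the excerpt with the mail line wrapping undone.

```python
from collections import Counter


def parse_message_line(line):
    """Split one log line into its five fields, or return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, daemon, host, level, message = parts
    return {"timestamp": timestamp, "daemon": daemon, "host": host,
            "level": level, "message": message.strip()}


def summarize(lines, level="C"):
    """Count distinct messages logged at the given severity letter."""
    counts = Counter()
    for line in lines:
        entry = parse_message_line(line)
        if entry is not None and entry["level"] == level:
            counts[entry["message"]] += 1
    return counts


sample_lines = [
    "12/29/2004 15:26:10|qmaster|cat|C|!!!!!!!!!! lGetHost(): got NULL element for EH_name !!!!!!!!!!",
    "12/29/2004 15:30:13|qmaster|cat|I|read job database with 53 entries in 0 seconds",
    "12/29/2004 15:30:13|qmaster|cat|C|!!!!!!!!!! lGetHost(): got NULL element for EH_name !!!!!!!!!!",
]

for message, count in summarize(sample_lines).items():
    print(count, message)
```

On a real system the lines would come from the messages file in the qmaster spool directory instead of `sample_lines`.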
>
> It appears that something catastrophic happens to the SGE 
> configuration data. Every time I've seen this problem (2 clusters now) 
> the only viable fix was to completely blow away the SGE_CELL directory 
> and reinstall from scratch. The only clusters where I have seen this are 
> doing berkeley db spooling to a local filesystem that is exported to 
> other nodes via NFS (spooling to local disk; should be ok, right?)
>
> I encountered issue #2 (which I have also seen before) when trying to 
> manually install SGE again with classic spooling enabled...
>
>
> Issue #2 -- sge_qmaster crashes on Mac OS X Server
> ===========================================================
>
> In attempting to reinstall SGE to work around issue #1 I ran into this:
>
> sge_qmaster crashes:
>
> In the 2 clusters where I have seen this happen it appears that the 
> crash may be correlated to some sort of Apple OS X update. I saw this 
> crash behavior happen a month or two ago to a cluster where the admin 
> had just updated his system. He ended up going to SGE 5.3p6 rather 
> than take the time to debug the issue.
>
>  > cat:/tmp root# tail /var/log/system.log
>
>> Dec 29 16:10:03 cat crashdump: Finished writing crash report to: 
>> /Library/Logs/CrashReporter/sge_qmaster.crash.log
>> Dec 29 16:13:54 cat crashdump: Unable to determine CPSProcessSerNum 
>> pid: 2631 name: sge_qmaster
>> Dec 29 16:13:54 cat crashdump: Started writing crash report to: 
>> /Library/Logs/CrashReporter/sge_qmaster.crash.log
>> Dec 29 16:13:54 cat crashdump: Finished writing crash report to: 
>> /Library/Logs/CrashReporter/sge_qmaster.crash.log
>
>
> The crash log looks like this:
>
> One of the more interesting things in the crash log is that the server 
> FQDN hostname is incorrect.  Nowhere in /etc/hosts do we use the name 
> "portal2net.cluster.private" and here is the output of the SGE utilbin 
> commands:
>
>> cat:/common/sge/utilbin/darwin root# ./gethostname
>> Hostname: xx.xx.edu
>> Aliases:  cat
>> Host Address(es): xx.xx.xx.142
>> cat:/common/sge/utilbin/darwin root#
>> cat:/common/sge/utilbin/darwin root# ./gethostbyaddr xx.xx.xx.142
>> Hostname: cat.xx.xx
>> Aliases:  cat
>> Host Address(es): xx.xx.xx.142
>> cat:/common/sge/utilbin/darwin root#
>
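
The same forward and reverse lookup comparison can be scripted so it is easy to repeat on every node. The Python sketch below uses the standard `socket` module; it is not the SGE `utilbin` tools and may resolve names differently from them. The function name, the stub resolvers and the example host names and address are assumptions for illustration only.

```python
import socket


def check_name_consistency(gethostname=socket.gethostname,
                           gethostbyname_ex=socket.gethostbyname_ex,
                           gethostbyaddr=socket.gethostbyaddr):
    """Resolve the local host name forward, then each address in reverse."""
    name = gethostname()
    forward_name, _aliases, addresses = gethostbyname_ex(name)
    reverse = {}
    for addr in addresses:
        reverse_name, _rev_aliases, _rev_addrs = gethostbyaddr(addr)
        reverse[addr] = reverse_name
    consistent = all(r.lower() == forward_name.lower() for r in reverse.values())
    return {"hostname": name, "forward": forward_name,
            "reverse": reverse, "consistent": consistent}


# Stub resolvers with made-up names, standing in for a host whose
# reverse lookup returns a different name than its forward lookup.
def fake_gethostname():
    return "cat"


def fake_gethostbyname_ex(name):
    return ("cat.example.edu", ["cat"], ["192.0.2.142"])


def fake_gethostbyaddr(addr):
    return ("cat.cluster.example", ["cat"], [addr])


report = check_name_consistency(fake_gethostname,
                                fake_gethostbyname_ex,
                                fake_gethostbyaddr)
print(report)
```

Calling `check_name_consistency()` with no arguments uses the real system resolver; a `consistent` value of `False` means the forward and reverse names disagree.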
>
> I'm going to check the various server and Rendezvous names etc. to see 
> if some other sort of DNS or machine name issue could be causing this.
>
>
>> **********
>>
>> Host Name:      portal2net.cluster.private
>> Date/Time:      2004-12-29 16:17:46 -0600
>> OS Version:     10.3.6 (Build 7R28)
>> Report Version: 2
>>
>> Command: sge_qmaster
>> Path:    /common/sge/bin/darwin/sge_qmaster
>> Version: ??? (???)
>> PID:     3288
>> Thread:  0
>>
>> Exception:  EXC_BAD_ACCESS (0x0001)
>> Codes:      KERN_INVALID_ADDRESS (0x0001) at 0x54485244
>>
>> Thread 0 Crashed:
>> 0   libspoolc.dylib     0x0059ab2c sge_set_admin_username + 0x90
>> 1   libspoolc.dylib     0x0050c3f4 read_all_configurations + 0x158
>> 2   libspoolc.dylib     0x005069d8 spool_classic_default_list_func + 
>> 0x1f4
>> 3   sge_qmaster         0x00049804 spool_read_list + 0x244
>> 4   sge_qmaster         0x000329e4 sge_read_configuration + 0x7c
>> 5   sge_qmaster         0x0003856c setup_qmaster + 0x128
>> 6   sge_qmaster         0x00037e70 sge_setup_qmaster + 0x114
>> 7   sge_qmaster         0x00002910 main + 0x190
>> 8   sge_qmaster         0x000024b0 _start + 0x188 (crt.c:267)
>> 9   sge_qmaster         0x00002324 start + 0x30
>>
>> Thread 1:
>> 0   libSystem.B.dylib   0x90018be8 semaphore_timedwait_signal_trap + 0x8
>> 1   libSystem.B.dylib   0x9000e788 _pthread_cond_wait + 0x268
>> 2   sge_qmaster         0x000e0ed0 
>> cl_thread_wait_for_thread_condition + 0x144
>> 3   sge_qmaster         0x000e175c cl_thread_wait_for_event + 0x78
>> 4   sge_qmaster         0x000cb16c cl_com_trigger_thread + 0x1fc
>> 5   libSystem.B.dylib   0x900246e8 _pthread_body + 0x28
>>
>> Thread 2:
>> 0   libSystem.B.dylib   0x90018be8 semaphore_timedwait_signal_trap + 0x8
>> 1   libSystem.B.dylib   0x9000e788 _pthread_cond_wait + 0x268
>> 2   sge_qmaster         0x000e0ed0 
>> cl_thread_wait_for_thread_condition + 0x144
>> 3   sge_qmaster         0x000e175c cl_thread_wait_for_event + 0x78
>> 4   sge_qmaster         0x000cb458 cl_com_handle_service_thread + 0x20c
>> 5   libSystem.B.dylib   0x900246e8 _pthread_body + 0x28
>>
>> Thread 3:
>> 0   libSystem.B.dylib   0x90018be8 semaphore_timedwait_signal_trap + 0x8
>> 1   libSystem.B.dylib   0x9000e788 _pthread_cond_wait + 0x268
>> 2   sge_qmaster         0x000e0ed0 
>> cl_thread_wait_for_thread_condition + 0x144
>> 3   sge_qmaster         0x000e175c cl_thread_wait_for_event + 0x78
>> 4   sge_qmaster         0x000cbed4 cl_com_handle_read_thread + 0x994
>> 5   libSystem.B.dylib   0x900246e8 _pthread_body + 0x28
>>
>> Thread 4:
>> 0   libSystem.B.dylib   0x90018be8 semaphore_timedwait_signal_trap + 0x8
>> 1   libSystem.B.dylib   0x9000e788 _pthread_cond_wait + 0x268
>> 2   sge_qmaster         0x000e0ed0 
>> cl_thread_wait_for_thread_condition + 0x144
>> 3   sge_qmaster         0x000e175c cl_thread_wait_for_event + 0x78
>> 4   sge_qmaster         0x000cc6f8 cl_com_handle_write_thread + 0x73c
>> 5   libSystem.B.dylib   0x900246e8 _pthread_body + 0x28
>>
>> Thread 5:
>> 0   libSystem.B.dylib   0x90018be8 semaphore_timedwait_signal_trap + 0x8
>> 1   libSystem.B.dylib   0x9000e788 _pthread_cond_wait + 0x268
>> 2   sge_qmaster         0x0003a198 deliver_events + 0x17c
>> 3   libSystem.B.dylib   0x900246e8 _pthread_body + 0x28
>>
>> PPC Thread State:
>>   srr0: 0x0059ab2c srr1: 0x0200f030                vrsave: 0x00000000
>>     cr: 0x22000444  xer: 0x00000000   lr: 0x0059aae8  ctr: 0x005a2e74
>>     r0: 0x00000001   r1: 0xbfffb560   r2: 0x0059c060   r3: 0x00000003
>>     r4: 0x00000001   r5: 0x00000000   r6: 0x005d2e7c   r7: 0x005d0068
>>     r8: 0x005cbfb4   r9: 0x00000000  r10: 0x005a2e7c  r11: 0x005d0800
>>    r12: 0x005a2e74  r13: 0x00000000  r14: 0x00000000  r15: 0x00000000
>>    r16: 0x00000000  r17: 0x00000000  r18: 0x00000000  r19: 0x00000000
>>    r20: 0x00000000  r21: 0x0013844c  r22: 0x00000000  r23: 0x005dc2a4
>>    r24: 0x00129498  r25: 0xbfffe780  r26: 0xbfffe380  r27: 0x00405280
>>    r28: 0xbfffeb80  r29: 0x54485244  r30: 0xbfffb700  r31: 0x0059aaa4
>>
>> Binary Images Description:
>>     0x1000 -   0x128fff sge_qmaster     
>> /common/sge/bin/darwin/sge_qmaster
>>   0x505000 -   0x5c7fff libspoolc.dylib         
>> /common/sge/lib/darwin/libspoolc.dylib
>> 0x8fe00000 - 0x8fe4ffff dyld    /usr/lib/dyld
>> 0x90000000 - 0x90122fff libSystem.B.dylib       
>> /usr/lib/libSystem.B.dylib
>> 0x939d0000 - 0x939d4fff libmathCommon.A.dylib   
>> /usr/lib/system/libmathCommon.A.dylib
>
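
One detail in the crash log that may be worth a look: the faulting address 0x54485244 is also the value in r29, and all four of its bytes are printable ASCII. That pattern can mean a pointer was overwritten with string data, although this is only a guess from the log and not a diagnosis. A small Python sketch to decode such a value, assuming big-endian byte order as on the PowerPC G5 (the function name is illustrative):

```python
def address_as_ascii(address, width=4):
    """Render an address as text, one character per byte, '.' if unprintable."""
    raw = address.to_bytes(width, "big")  # big-endian, as on PowerPC
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


print(address_as_ascii(0x54485244))  # prints THRD
print(address_as_ascii(0xbfffb560))  # a stack address from r1, mostly unprintable
```

The first call prints `THRD`, so it may be worth checking which data read during `read_all_configurations` or `sge_set_admin_username` could contain that text.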
>
>
>
> Regards,
> Chris
>
>
>
>





