[GE users] problem with migrating to shadow master

rumpelkeks tina.friedrich at diamond.ac.uk
Fri Nov 26 11:43:12 GMT 2010


I may have found something. On my (64bit) qmaster, SGE segfaults when 
you try to stop it. It happens every time, so it is quite reproducible.

Nov 26 11:33:36 cs04r-sc-serv-17 kernel: sge_qmaster[32443]: segfault at 
000000001d100000 rip 00000000005ba107 rsp 00007fff3ef
Nov 26 11:36:33 cs04r-sc-serv-17 kernel: sge_qmaster[380]: segfault at 
0000000019e00000 rip 00000000005ba107 rsp 00007fffc66d6230 error 4
Nov 26 11:37:50 cs04r-sc-serv-17 kernel: sge_qmaster[1328]: segfault at 
0000000019400000 rip 00000000005ba107 rsp 00007fff47cfe160 error 4
Nov 26 11:39:04 cs04r-sc-serv-17 kernel: sge_qmaster[1823]: segfault at 
000000001f100000 rip 00000000005ba107 rsp 00007fff2ca03970 error 4

...probably when it tries to write the lock file, as it doesn't segfault 
if a lock file already exists. I'll try to trace it to see exactly where. 
My other machine doesn't segfault like this.
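
To compare the crashes, the kernel lines above can be pulled apart with a 
small script. This is only a sketch: the regular expression and the helper 
names (parse_segfault, describe_error) are made up for illustration and are 
not part of SGE, and the error-code decoding assumes the usual x86 
page-fault bits (bit 0 = protection violation, bit 1 = write, bit 2 = user 
mode). The first log line is cut off in the paste, so it parses without an 
error code.

```python
import re
from collections import Counter

# Kernel segfault lines as pasted above, unwrapped (the first one is truncated).
LOG = """\
Nov 26 11:33:36 cs04r-sc-serv-17 kernel: sge_qmaster[32443]: segfault at 000000001d100000 rip 00000000005ba107 rsp 00007fff3ef
Nov 26 11:36:33 cs04r-sc-serv-17 kernel: sge_qmaster[380]: segfault at 0000000019e00000 rip 00000000005ba107 rsp 00007fffc66d6230 error 4
Nov 26 11:37:50 cs04r-sc-serv-17 kernel: sge_qmaster[1328]: segfault at 0000000019400000 rip 00000000005ba107 rsp 00007fff47cfe160 error 4
Nov 26 11:39:04 cs04r-sc-serv-17 kernel: sge_qmaster[1823]: segfault at 000000001f100000 rip 00000000005ba107 rsp 00007fff2ca03970 error 4
"""

# Hypothetical pattern for this log format; the error field is optional
# because the first line is cut off before it.
PATTERN = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+)(?: error (?P<error>\d+))?"
)


def parse_segfault(line):
    """Return a dict of fields for one kernel segfault line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["addr"] = int(rec["addr"], 16)
    rec["rip"] = int(rec["rip"], 16)
    rec["error"] = int(rec["error"]) if rec["error"] is not None else None
    return rec


def describe_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    cause = "protection violation" if code & 1 else "page not present"
    access = "write" if code & 2 else "read"
    mode = "user" if code & 4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


records = [r for r in map(parse_segfault, LOG.splitlines()) if r]
rips = Counter(hex(r["rip"]) for r in records)

print(rips)               # one instruction pointer shared by all four crashes
print(describe_error(4))  # user-mode read, page not present
```

All four crashes share the same rip (0x5ba107), so it looks like the same 
instruction faulting each time, and "error 4" would be a user-mode read of 
an unmapped page. That address is what I'd look up in the sge_qmaster 
binary when tracing.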

Tina

On 26/11/10 09:20, rumpelkeks wrote:
> Hello,
>
>>>> <snip>
>>>> It looks like the lock file is written to confirm a successful shutdown (the opposite of what I was used to), and it then prevents a shadowd from acting on an unchanged heartbeat file, as -migrate will first shut down the current master and then start its own.
>>>>
>>>> Do you want to have two qmaster hosts, each of which can start up when the other is missing, so that you have a shadowd running on both of them?
>>>
>>> That seems to be the theory, only in my case it doesn't seem to work
>>> very well (and I'm trying to find out why - it might be a timing issue,
>>> my $SGE_ROOT/$SGE_CELL/spool etc is on an NFS share).
>>
>> Do you use classic spooling? The common directory is also on the share?
>
> Yes, yes,
>
>> Both machines can also write to these shares?
>
> and yes:
>
> -bash-3.2$ touch /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
> -bash-3.2$ ls -l /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
> -rw-r--r-- 1 sgeadmin sgeadmin 0 Nov 26 09:14 /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
>
> (this was as sgeadmin on the current master, into the spool directory)
>
> Accounting & reporting files, and logs etc, are all being written; the
> heartbeat file is updated; and, as I said, if I manually create a lock
> file prior to calling migrate it is very certainly removed.
>
> <snip>
>
>> If it's a timing issue, you should at least see the lock file on the machine where it was created, as it should already be in its cache. Only the NFS share might get the final write later.
>
> Good point. It should not be the master going down that removes it but
> the new one starting up; so it should turn up eventually (even if too
> late for the migration), which it doesn't. It looks rather like one of my
> two shadow hosts can create it and the other can't - but I can write to
> the share from both machines (as sgeadmin), and the way it got installed
> is the same. The only difference is that one of the machines is 64bit and
> the other 32bit (of the test ones, that is; my 'real' qmasters are both
> 64bit).
>
>> Maybe running SGE in debug mode will show more, as the creation of the lock file should show up there when it happens, if I read the source correctly.
>
> I'll see if I can try that on my test cluster cell.
>
> Tina
>
>
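
For the timing question in the quoted thread, one way to check whether the 
lock file ever becomes visible on a given host is to poll for it while the 
migrate runs. A minimal sketch only: wait_for_lock and its timeout and 
interval parameters are hypothetical helpers, not SGE tools, and the demo 
uses a temporary directory rather than the real spool directory.

```python
import os
import tempfile
import time


def wait_for_lock(path, timeout=60.0, interval=0.5):
    """Poll until `path` exists.

    Returns the number of seconds waited, or None if the file did not
    appear within `timeout` seconds.
    """
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)


# Demo against a scratch directory. On a shadow host one would instead pass
# the lock path from the touch test above, e.g.
# /dls_sw/apps/sge/SGE6.2/TEST/spool/qmaster/lock
scratch = tempfile.mkdtemp()
demo_lock = os.path.join(scratch, "lock")

print(wait_for_lock(demo_lock, timeout=0.2, interval=0.05))  # None: no file yet

open(demo_lock, "w").close()  # simulate the lock file being written
print(wait_for_lock(demo_lock, timeout=0.2, interval=0.05) is not None)  # True
```

Running this on both shadow hosts during a migrate would show whether the 
file never appears or only appears late. NFS client caching can delay when 
a new file becomes visible, so a late appearance on the remote host would 
not by itself say when it was written.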


-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

