[GE users] RE: cannot submit job because of error bug?

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Fri Aug 31 14:35:26 BST 2007


Hi Dmitry,

On Fri, 31 Aug 2007, Dmitry Zhukovski wrote:

> Hi all,
>
>  I got a bit further. I found issue #2249 regarding group entries that
> are too long. The fix replaces a hard-coded constant with a call to
> sysconf(_SC_GETGR_R_SIZE_MAX). That's fine, but in my case it returns
> 1024, which is less than required for my groups to be resolved.
> Whenever I increase this value (and the allocated buffer) by, let's say,
> 10 times, qstat works!

What a mess! sysconf(_SC_GETGR_R_SIZE_MAX) should have returned -1
instead of 1024; then our 20 KB buffer size would have been in effect:

    int get_group_buffer_size(void)
    {
       enum { buf_size = 20480 };  /* default is 20 KB */

       int sz = buf_size;

    #ifdef _SC_GETGR_R_SIZE_MAX
       if ((sz = (int)sysconf(_SC_GETGR_R_SIZE_MAX)) == -1) {
          sz = buf_size;
       }
    #endif

       return sz;
    }

According to POSIX, sysconf(_SC_GETGR_R_SIZE_MAX) is supposed to return the
"maximum size needed for this buffer"!
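
One way to avoid depending on that value would be to treat it only as a
hint and to retry the lookup with a larger buffer whenever getgrgid_r()
reports ERANGE, which is the error POSIX specifies for a buffer that is
too small. The following is only a sketch under assumptions, not the
actual Grid Engine fix: the names initial_group_buffer_size() and
lookup_group_retry() and the 16 MB cap are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

enum { DEFAULT_GROUP_BUF = 20480 };    /* same 20 KB default as above */
enum { MAX_GROUP_BUF = 1 << 24 };      /* assumed cap of 16 MB, arbitrary */

/* Use the sysconf() value as a hint only: never start below the default. */
size_t initial_group_buffer_size(void)
{
   long hint = -1;

#ifdef _SC_GETGR_R_SIZE_MAX
   hint = sysconf(_SC_GETGR_R_SIZE_MAX);
#endif

   return (hint > DEFAULT_GROUP_BUF) ? (size_t)hint : (size_t)DEFAULT_GROUP_BUF;
}

/*
 * Hypothetical helper: resolve a gid, doubling the buffer while
 * getgrgid_r() fails with ERANGE.
 * Returns 0 on success, -1 if the group does not exist, or an errno
 * value on any other failure. On success *buf_out holds the storage
 * that the strings in *grp point into; the caller must free() it.
 */
int lookup_group_retry(gid_t gid, struct group *grp, char **buf_out)
{
   size_t size = initial_group_buffer_size();
   char *buf = NULL;

   assert(grp != NULL && buf_out != NULL);
   *buf_out = NULL;

   for (;;) {
      struct group *result = NULL;
      char *tmp = realloc(buf, size);
      int rc;

      if (tmp == NULL) {
         free(buf);
         return ENOMEM;
      }
      buf = tmp;

      rc = getgrgid_r(gid, grp, buf, size, &result);
      if (rc == 0 && result != NULL) {
         *buf_out = buf;               /* found: hand the buffer to the caller */
         return 0;
      }
      if (rc != ERANGE || size >= MAX_GROUP_BUF) {
         free(buf);
         return (rc == 0) ? -1 : rc;   /* not found, or a real error */
      }
      size *= 2;                       /* buffer too small: grow and retry */
   }
}
```

A caller would pass the gid in question together with a struct group and
a char pointer, then free() the returned buffer once it is done with the
group entry. With this approach a group line longer than 1024 bytes
costs at most a few extra lookups instead of a failed resolution.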

>
>  Is it possible to increase that system value by sysctl or anything
> else? Or do I have to compile my own version of SGE?

My expectation is that no manual tuning should be required for this, but
maybe I'm wrong.

Actually what OS is this? Wasn't it Red Hat?

Regards,
Andreas

>
> Br,
> dmitry
>
> -----Original Message-----
> From: Dmitry Zhukovski [mailto:DZH at maerskoil.com]
> Sent: 30. august 2007 13:22
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] RE: cannot submit job because of error bug?
>
> Hi all,
>
>  Henk gave me a good idea: find the limit on the number of users per
> group. An hour of adding and removing users from a test group gave me
> the number 120! qstat and qdel no longer complain about an unresolved
> group.
>
>  If there are any developers here: why is there such a limitation, and
> is it possible to increase it? One of my primary groups contains more
> than 200 users.
>
> Br,
> dmitry
>
> -----Original Message-----
> From: SLIM H.A. [mailto:h.a.slim at durham.ac.uk]
> Sent: 29. august 2007 17:14
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] RE: cannot submit job because of error bug?
>
>
> In my case there are actually two users (who both have a primary group)
> with this problem.
>
> Indeed the first user is in a secondary group that also contains users
> without primary group which our systems people should fix.
>
> The second user's primary group is also a secondary group but all users
> listed for that secondary group can be identified with the id command
> and all have a primary group.
>
> So I don't think users without a primary group listed in a secondary
> group are necessarily the problem.
> Also, this problem did not occur in version 6.0u7, from which I upgraded
> to 6.1.
>
>
> Of course I don't want to ask the standard question "has anything been
> changed?" that every sysadmin is badgered with when something is
> suddenly not working anymore but if I shorten the list of users in the
> secondary group, the commands do work again.
>
> I compared the output from qstat v6.1 with that of v6.0u7 for level 10.
> There are 3 lines from qstat v6.1 that signal an error:
>    63  17496 47463950073600 --> sge_log() {
>    64  17496 47463950073600     sge_log: ctx is NULL
>    65  17496 47463950073600     ../libs/sgeobj/sge_answer.c 937 can't resolve group
>
> whereas v6.0u7 finishes with
>
> error: can't unpack gdi request
> error: error unpacking gdi request: bad argument
> failed receiving gdi request
>
> and with debug level 10 it prints the uid and gid of the user.
>
> Best wishes
>
> Henk
>
>> -----Original Message-----
>> From: Dmitry Zhukovski [mailto:DZH at maerskoil.com]
>> Sent: 29 August 2007 14:46
>> To: users at gridengine.sunsource.net
>> Subject: RE: [GE users] RE: cannot submit job because of error bug?
>>
>> Hi all,
>>
>>   I have exactly the same output for one of my users: qstat,
>> qdel, qsub and others give 'can't resolve group'.
>>
>>   A little bit of googling turned up this issue:
>> http://gridengine.sunsource.net/issues/show_bug.cgi?id=1256 .
>> I checked and found that the ID of the user's primary group was not
>> listed in the LDAP set of groups.
>> So a search on that user gave me a list of all the secondary groups he
>> belongs to, but not the primary one.
>>
>>   I added the primary group but still get the 'can't resolve group'
>> message. Question: can it be cached somewhere?
>>
>> Br,
>> dmitry
>>
>> -----Original Message-----
>> From: SLIM H.A. [mailto:h.a.slim at durham.ac.uk]
>> Sent: 29. august 2007 11:40
>> To: users at gridengine.sunsource.net
>> Subject: RE: [GE users] RE: cannot submit job because of error bug?
>>
>> Dear Daniel
>>
>> I have set dl 4 and attached the output. I had a look at the
>> source; is it possible to build qstat by itself for debugging purposes?
>>
>> Thanks
>>
>> Henk
>>
>>> -----Original Message-----
>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>> Sent: 28 August 2007 19:07
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] RE: cannot submit job because of error bug?
>>>
>>> The debug levels aren't monotonic.  10 is actually less information
>>> than some lower levels.  4 might give you more info.  See:
>>>
>>> http://blogs.sun.com/templedf/entry/using_debugging_output
>>>
>>> Daniel
>>>
>>> SLIM H.A. wrote:
>>>> Further information to the failure of the sge commands for some unix
>>>> groups of users.
>>>>
>>>> Setting the debug level to 10 and running the qstat command gives for
>>>> the last few lines of stdout:
>>>>
>>>>     63  15359 47241863851776 --> sge_log() {
>>>>     64  15359 47241863851776     sge_log: ctx is NULL
>>>>     65  15359 47241863851776     ../libs/sgeobj/sge_answer.c 937 can't resolve group
>>>>
>>>> I attached the full debug output.
>>>>
>>>> Thanks
>>>>
>>>> Henk
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: SLIM H.A.
>>>>> Sent: 28 August 2007 16:46
>>>>> To: SLIM H.A.
>>>>> Subject: cannot submit job because of error bug?
>>>>>
>>>>>
>>>>>
>>>>> Some users are unable to submit jobs under sge 6.1. The error message
>>>>> is this:
>>>>>
>>>>> % qsub
>>>>> Unable to initialize environment because of error: can't resolve
>>>>> group
>>>>> Exiting.
>>>>>
>>>>>
>>>>> It appears that a limit is hit by the grid engine commands when
>>>>> reading one of the secondary group entries in the /etc/group file. It
>>>>> seems the commands cannot process lines that have more than some
>>>>> small number of characters, probably 512.
>>>>> Any userid that has that particular offending secondary group as its
>>>>> primary group cannot submit jobs.
>>>>>
>>>>> When the number of userids for the offending secondary group is
>>>>> reduced, the userid is able to submit again.
>>>>>
>>>>> Is this a bug as 6.0u7 did not have this problem?
>>>>>
>>>>> Thanks for any advice
>>>>>
>>>>>
>>>>> Henk
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: SLIM H.A.
>>>>>> Sent: 28 August 2007 11:34
>>>>>> To: 'users at gridengine.sunsource.net'
>>>>>> Subject: RE: [GE users] 6.1: critical error: can't resolve group
>>>>>>
>>>>>> Chris,
>>>>>>
>>>>>> I tried this, it seems to be ok:
>>>>>>
>>>>>> # grpck
>>>>>> Checking `/etc/group'
>>>>>>
>>>>>> is the only response I get
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Henk
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: chris.harwell at novartis.com
>>>>>>> [mailto:chris.harwell at novartis.com]
>>>>>>> Sent: 28 August 2007 11:03
>>>>>>> To: users
>>>>>>> Subject: Re: [GE users] 6.1: critical error: can't resolve group
>>>>>>>
>>>>>>> Try running grpck as root.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>> From: "SLIM H.A." [h.a.slim at durham.ac.uk]
>>>>>>> Sent: 08/28/2007 04:56 AM
>>>>>>> To: <users at gridengine.sunsource.net>
>>>>>>> Subject: [GE users] 6.1: critical error: can't resolve group
>>>>>>>
>>>>>>>
>>>>>>> I just upgraded from 6.0u7 to 6.1 and have come across a problem. The
>>>>>>> Grid Engine commands now give for some users an error, for example
>>>>>>>
>>>>>>> %qstat
>>>>>>> critical error: can't resolve group
>>>>>>>
>>>>>>> Has anyone seen this before or have an idea why this now shows up?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Henk
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>> **********************************************************************
>> This e-mail and any files transmitted with it are
>> confidential and intended solely for the use of the
>> individual or entity to which they are addressed. If you have
>> received this e-mail in error please notify the system
>> manager at helpdesk at maerskoil.com.
>>
>> This e-mail and its contents do not constitute and shall not
>> be considered as a financial commitment of Maersk Olie og Gas
>> AS and its affiliates.
>> Maersk Olie og Gas AS expressly disclaims any responsibility
>> as to the accuracy and use of this e-mail and its contents.
>> **********************************************************************

http://gridengine.info/

Registered office: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Munich District Court: HRB 161028
Managing directors: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
Chairman of the Supervisory Board: Martin Haering
