[GE users] SDM GE adapter sve got in trouble

easymf michal.bachorik at sun.com
Mon Aug 24 17:53:11 BST 2009


Chansup,

There seems to be a problem on the SGE side (something is causing JGDI,
which is used as the interface between SDM and SGE, to malfunction), so
trying to solve it on the SDM side alone will not help. Andre should
return from vacation tomorrow; he knows JGDI best, so we will try to
solve it then.
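
In the meantime, a small sketch like the one below could summarize which
field the jgdi error complains about. It assumes the pipe-delimited log
layout seen in the lines quoted below (date time|thread|source|level|message);
the function name jgdi_errors is made up for illustration and is not part
of SDM or SGE.

```shell
#!/bin/sh
# Hypothetical helper (not part of SDM/SGE): summarize jgdi errors found in
# SDM log text read from stdin.
# Assumed line layout, taken from the log lines quoted in this thread:
#   date time|thread|source|level|message
jgdi_errors() {
    awk -F'|' '
        # Only error-level lines whose message mentions a jgdi error
        $4 == "E" && $5 ~ /jgdi error/ {
            field = "unknown"
            # Extract <name> from "content field <name> not found"
            if (match($5, /content field [A-Za-z_]+ not found/)) {
                # 14 = length of "content field ", 10 = length of " not found"
                field = substr($5, RSTART + 14, RLENGTH - 24)
            }
            print $1 " source=" $3 " field=" field
        }
    '
}
```

It reads the log on standard input, for example: jgdi_errors < sdm.log
(sdm.log is a placeholder for wherever your SDM log actually lives).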

Regards,

M.

cbyun wrote:
> Hi Michal,
>
> I tried the following steps, but they failed to clear the issue.
> Any suggestions for clearing it?
>
> - Remove GE adapter service
>         sdmadm rs -s gesvc -force
>
> - Shutdown/Startup SDM master service
>         sdmadm sdj -all -h localhost
>         sdmadm suj
>
> - Add GE adapter service
>         sdmadm ags -h localhost -j cs_vm -s gesvc -f \
>                   <path>/ge_adapter_svc_config.xml
>
> - Startup GE component
>         sdmadm suc -c gesvc -h localhost
>
> I'm still getting the same error:
>
> # 08/24/2009 12:20:08|16|e.impl.ge.GEServiceAdapterImpl.doStartService|I|Service gesvc: Starting Grid Engine service
> 08/24/2009 12:20:08|16|rm.service.impl.AbstractServiceAdapter$1.call|E|Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
>                                                                       |set_object_attribute: set_list of property reportVariables failed
>                                                                       |
> 08/24/2009 12:20:08|17|rm.impl.AbstractComponent$3.performTransition|W|Componentgesvc: Error in startup procedure: Service gesvc: Unexpected error in state transition UnknownStateHandler[UNKNOWN] -> StartingStateHandler[STARTING]: Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
>                                                                       |set_object_attribute: set_list of property reportVariables failed
>                                                                       |
>
>
> Thanks,
> - Chansup
>
>
>
>> -----Original Message-----
>> From: cbyun [mailto:cbyun at ll.mit.edu]
>> Sent: Monday, August 24, 2009 9:51 AM
>> To: users at gridengine.sunsource.net
>> Subject: RE: [GE users] SDM GE adapter sve got in trouble
>>
>> Michal,
>>
>>
>>> -----Original Message-----
>>> From: Michal.Bachorik at sun.com [mailto:Michal.Bachorik at sun.com]
>>> Sent: Monday, August 24, 2009 9:23 AM
>>> To: users at gridengine.sunsource.net
>>> Subject: Re: [GE users] SDM GE adapter sve got in trouble
>>>
>>> Chansup,
>>>
>>> Actually, is something called "STU_name" part of your changes? The error
>>> says "STU_name" is not part of the descriptor, so I'd like to know
>>> whether it is something you introduced or touched.
>>>
>>>
>> No, I have nothing called "STU_name" in my configuration.
>> Here is my configuration:
>>
>> # qconf -sconf
>> #global:
>> execd_spool_dir              /var/spool/sge
>> mailer                       /bin/mail
>> xterm                        /usr/bin/X11/xterm
>> load_sensor                  none
>> prolog                       none
>> epilog                       none
>> shell_start_mode             posix_compliant
>> login_shells                 sh,ksh,csh,tcsh
>> min_uid                      0
>> min_gid                      0
>> user_lists                   none
>> xuser_lists                  none
>> projects                     none
>> xprojects                    none
>> enforce_project              false
>> enforce_user                 auto
>> load_report_time             00:00:40
>> max_unheard                  00:05:00
>> reschedule_unknown           00:00:00
>> loglevel                     log_warning
>> administrator_mail           none
>> set_token_cmd                none
>> pag_cmd                      none
>> token_extend_time            none
>> shepherd_cmd                 none
>> qmaster_params               none
>> execd_params                 none
>> reporting_params             accounting=true reporting=true \
>>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
>> finished_jobs                100
>> gid_range                    20000-20100
>> qlogin_command               builtin
>> qlogin_daemon                builtin
>> rlogin_command               builtin
>> rlogin_daemon                builtin
>> rsh_command                  builtin
>> rsh_daemon                   builtin
>> max_aj_instances             2000
>> max_aj_tasks                 75000
>> max_u_jobs                   0
>> max_jobs                     0
>> max_advance_reservations     0
>> auto_user_oticket            0
>> auto_user_fshare             0
>> auto_user_default_project    none
>> auto_user_delete_time        86400
>> delegated_file_staging       false
>> reprioritize                 0
>> jsv_url                      none
>> libjvm_path                  /usr/java/latest/jre/lib/amd64/server/libjvm.so
>> additional_jvm_args          -Xmx256m
>> jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w
>>
>>
>> # qconf -ssconf
>> algorithm                         default
>> schedule_interval                 0:2:0
>> maxujobs                          0
>> queue_sort_method                 load
>> job_load_adjustments              NONE
>> load_adjustment_decay_time        0:0:0
>> load_formula                      np_load_avg
>> schedd_job_info                   true
>> flush_submit_sec                  2
>> flush_finish_sec                  2
>> params                            none
>> reprioritize_interval             0:0:0
>> halftime                          168
>> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
>> compensation_factor               5.000000
>> weight_user                       0.250000
>> weight_project                    0.250000
>> weight_department                 0.250000
>> weight_job                        0.250000
>> weight_tickets_functional         0
>> weight_tickets_share              0
>> share_override_tickets            TRUE
>> share_functional_shares           TRUE
>> max_functional_jobs_to_schedule   200
>> report_pjob_tickets               FALSE
>> max_pending_tasks_per_job         50
>> halflife_decay_list               none
>> policy_hierarchy                  OFS
>> weight_ticket                     0.010000
>> weight_waiting_time               0.000000
>> weight_deadline                   3600000.000000
>> weight_urgency                    0.100000
>> weight_priority                   1.000000
>> max_reservation                   0
>> default_duration                  INFINITY
>>
>>
>> # qconf -srqs
>> {
>>    name         host_slot_limit
>>    description  Limit total number of slots per hosts (assume uniform machines)
>>    enabled      TRUE
>>    limit        hosts {@allhosts} to slots=2
>> }
>> {
>>    name         max_u_jobs
>>    description  max jobs per user
>>    enabled      TRUE
>>    limit        users {*} to slots=256
>> }
>>
>>
>> # for i in `qconf -sql`; do echo " ";echo Qname: $i; qconf -sq $i; done
>>
>> Qname: all.q
>> qname                 all.q
>> hostlist              @nohosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               make
>> rerun                 FALSE
>> slots                 2
>> tmpdir                /tmp
>> shell                 /bin/csh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>> Qname: normal
>> qname                 normal
>> hostlist              @allhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               make
>> rerun                 FALSE
>> slots                 2
>> tmpdir                /tmp
>> shell                 /bin/csh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>> Qname: pmatlab
>> qname                 pmatlab
>> hostlist              @allhosts
>> seq_no                0
>> load_thresholds       np_load_avg=1.75
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               make
>> rerun                 FALSE
>> slots                 2
>> tmpdir                /tmp
>> shell                 /bin/csh
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            NONE
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  INFINITY
>> h_rt                  INFINITY
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 INFINITY
>> h_rss                 INFINITY
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>>
>> Currently no hosts are assigned to either host group, since none of
>> the hosts are being used by the SGE adapter service:
>>
>> # qconf -shgrp @allhosts
>> group_name @allhosts
>> hostlist NONE
>>
>> # qconf -shgrp @nohosts
>> group_name @nohosts
>> hostlist NONE
>>
>> I also turned on the exclusive mode:
>>
>> # qconf -sc
>> #name               shortcut   type        relop   requestable consumable default  urgency
>> #-----------------------------------------------------------------------------------------
>> ...
>> exclusive           excl       BOOL        EXCL    YES         YES        0        1000
>>
>>
>> Thanks,
>> - Chansup
>>
>>
>>
>>
>>
>>> M.
>>>
>>>
>>> cbyun wrote:
>>>
>>>> Michal,
>>>>
>>>> Yes, I made a few changes in the SGE configuration.
>>>>
>>>> I added a couple of RQS rules, a couple of new cluster queues, and a new
>>>> hostgroup, @nohosts.
>>>> Then, to keep all.q from being used, I assigned the @nohosts group
>>>> to all.q.
>>>>
>>>> I believe the issue appeared after these customizations.
>>>>
>>>> Thanks,
>>>> - Chansup
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Michal.Bachorik at sun.com [mailto:Michal.Bachorik at sun.com]
>>>>> Sent: Monday, August 24, 2009 3:28 AM
>>>>> To: users at gridengine.sunsource.net
>>>>> Subject: Re: [GE users] SDM GE adapter sve got in trouble
>>>>>
>>>>> Chansup,
>>>>>
>>>>> Did your SGE change in any way? There is an error coming from jgdi
>>>>> (from the SGE side). Did you do some kind of "upgrade/downgrade" of
>>>>> jgdi.jar? I have not seen such an error before, so I will need to dig
>>>>> into it - I will let you know once I find something.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Michal
>>>>>
>>>>> cbyun wrote:
>>>>>
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Somehow my ge adapter service got in trouble and I couldn't start it
>>>>>> any more:
>>>>>>
>>>>>> # sdmadm suc -c gesvc2
>>>>>> comp   host            message
>>>>>> ----------------------------------------
>>>>>> gesvc2 llgriddev.local startup triggered
>>>>>>
>>>>>>
>>>>>> 08/21/2009 16:25:11|20|e.impl.ge.GEServiceAdapterImpl.doStartService|I|Service gesvc2: Starting Grid Engine service
>>>>>> 08/21/2009 16:25:11|20|rm.service.impl.AbstractServiceAdapter$1.call|E|Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
>>>>>>                                                                       |set_object_attribute: set_list of property reportVariables failed
>>>>>>                                                                       |
>>>>>> 08/21/2009 16:25:11|21|rm.impl.AbstractComponent$3.performTransition|W|Componentgesvc2: Error in startup procedure: Service gesvc2: Unexpected error in state transition UnknownStateHandler[UNKNOWN] -> StartingStateHandler[STARTING]: Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
>>>>>>                                                                       |set_object_attribute: set_list of property reportVariables failed
>>>>>>                                                                       |
>>>>>>
>>>>>> Is there any way to clear up this error?
>>>>>>
>>>>>> Thanks,
>>>>>> - Chansup
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213536
>>>>>>
>>>>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>>>>>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213994

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


