[GE users] SDM GE adapter sve got in trouble

andre Andre.Alefeld at sun.com
Wed Aug 26 14:12:42 BST 2009


Hi Chansup,

Can you try adding the following permission to the first grant block in $SGE_ROOT/default/common/jmx/java.policy (and in $SGE_ROOT/util/java.policy.template)?

grant codeBase "file:${com.sun.grid.jgdi.sgeRoot}/lib/jgdi.jar" {
....
   permission java.lang.RuntimePermission "modifyThread";
....
}

Then the exception should no longer appear.
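To confirm that the permission actually landed in both files, a quick textual check can help. This is a minimal POSIX sh sketch, assuming the cell is named "default" as in the paths above. The function name check_modify_thread is hypothetical, and the function only greps for the permission line, so it does not prove the line sits inside the jgdi.jar grant block.

```sh
# Hypothetical helper (not part of SGE): textual check that both JGDI
# policy files grant java.lang.RuntimePermission "modifyThread".
# Assumes the cell is named "default", as in the paths quoted above.
# It only greps; it does not verify which grant block the line is in.
check_modify_thread() {
    sge_root=${1%/}
    status=0
    for f in "$sge_root/default/common/jmx/java.policy" \
             "$sge_root/util/java.policy.template"; do
        if [ ! -f "$f" ]; then
            echo "MISSING FILE: $f"
            status=1
        elif grep -q 'RuntimePermission[[:space:]]*"modifyThread"' "$f"; then
            echo "OK: $f"
        else
            echo "NO modifyThread: $f"
            status=1
        fi
    done
    return "$status"
}
```

Usage: `check_modify_thread "$SGE_ROOT"` prints one line per file and returns non-zero if either file lacks the permission.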

I guess this is not the cause of the STU_name problem, though.
Maybe there is some mismatch between the jgdi.jar versions. Did you rebuild the Grid Engine code yourself, or did you use everything from the distribution?
Could you try copying the *.jar files from $SGE_ROOT/lib/ to $SGE_ROOT/sgeinspect/sgeinspect/modules/ext and check whether the problem still occurs?
I will try the ARCo variable setup you sent in your last email and check whether I can reproduce the problem in my setup.
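The jar comparison and copy suggested above could be scripted roughly as follows. This is a sketch, not an SDM or SGE tool: sync_sge_jars is a hypothetical helper, the two directories are the ones named in this thread and are passed in explicitly, and it backs up the destination directory first so the copy can be undone.

```sh
# Hypothetical helper: report how each jar in SRC compares with its copy
# in DST, then copy it over. DST is backed up first so this can be undone.
sync_sge_jars() {
    src=${1%/}   # e.g. "$SGE_ROOT/lib"
    dst=${2%/}   # e.g. "$SGE_ROOT/sgeinspect/sgeinspect/modules/ext"
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "usage: sync_sge_jars SRC_DIR DST_DIR" >&2
        return 2
    fi

    # Timestamped copy of the destination directory, made before any change.
    backup="$dst.bak.$(date +%Y%m%d%H%M%S)"
    cp -Rp "$dst" "$backup" || return 1

    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue     # glob matched nothing
        name=$(basename "$jar")
        if [ ! -e "$dst/$name" ]; then
            echo "new:     $name"
        elif cmp -s "$jar" "$dst/$name"; then
            echo "same:    $name"
        else
            echo "differs: $name"
        fi
        cp -p "$jar" "$dst/$name" || return 1
    done
    echo "backup:  $backup"
}
```

Usage: `sync_sge_jars "$SGE_ROOT/lib" "$SGE_ROOT/sgeinspect/sgeinspect/modules/ext"`. A "differs: jgdi.jar" line would be consistent with the version mismatch suspected above.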

Andre


cbyun wrote:

I got some more information.

When I started the GE adapter service, the JGDI log showed the following access-denied error.

# sdmadm suc -c gesvc
comp  host            message
---------------------------------------
gesvc llgriddev.local startup triggered

# tail -f default/spool/qmaster/jgdi0.log
...
24/08/2009 14:46:54|11|jgdi.jni.EventClientImpl.close|W|Close of event client failed
                              java.security.AccessControlException: access denied (java.lang.RuntimePermission modifyThread)
                                java.security.AccessControlContext.checkPermission(AccessControlContext.java:323)
                                java.security.AccessController.checkPermission(AccessController.java:546)
                                java.lang.SecurityManager.checkPermission(SecurityManager.java:532)
                                java.util.concurrent.ThreadPoolExecutor.shutdown(ThreadPoolExecutor.java:1094)
                                java.util.concurrent.Executors$DelegatedExecutorService.shutdown(Executors.java:591)
                                com.sun.grid.jgdi.jni.EventClientImpl.close(EventClientImpl.java:157)
                                com.sun.grid.jgdi.management.NotificationBridge.close(NotificationBridge.java:182)
                                com.sun.grid.jgdi.management.JGDISession.close(JGDISession.java:139)
                                com.sun.grid.jgdi.management.JGDISession.closeSession(JGDISession.java:219)
                                com.sun.grid.jgdi.management.JGDIAgent$MyNotificationListener.handleNotification(JGDIAgent.java:403)
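On a busy qmaster, entries like the one above are easy to miss in jgdi0.log. A minimal awk-based sketch for pulling out only warning (W) and error (E) entries together with their indented continuation lines, assuming the "date time|thread|location|level|message" layout visible in the excerpt above (jgdi_problems is a hypothetical name):

```sh
# Hypothetical helper: print W and E entries of a JGDI log, plus the
# indented continuation lines (e.g. stack traces) that follow them.
# Assumes each entry starts with "dd/mm/yyyy hh:mm:ss|" as shown above.
jgdi_problems() {
    awk -F'|' '
        # a new entry starts with a date; decide whether to keep it
        /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] / {
            keep = ($4 == "W" || $4 == "E")
        }
        # continuation lines inherit the decision of their entry
        keep { print }
    ' "$@"
}
```

Usage, from $SGE_ROOT as in the tail command above: `jgdi_problems default/spool/qmaster/jgdi0.log`.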

Thanks,
- Chansup




-----Original Message-----
From: Ryszard.Macidlowski at sun.com [mailto:Ryszard.Macidlowski at sun.com]
Sent: Monday, August 24, 2009 1:09 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SDM GE adapter sve got in trouble

Hi Chansup,

From the error that I see, I don't think it's an SDM adapter problem.
During startup the adapter connects to the qmaster (JMX thread) using
JGDI and retrieves data over this JGDI connection. What I can see in
your stack trace is that you get an IllegalStateException from JGDI; as
long as JGDI throws this exception, the service will not start and
nothing can be done on the SDM side. I suppose you didn't make any
modifications to ge_adapter_svc_config.xml. So either there is a problem
with the qmaster (try restarting the qmaster and see whether you can
start gesvc; I suppose you already tried this and it is not the
problem), or there is a problem/possible bug in JGDI. You may have
customized SGE in a way that JGDI cannot report or process the
customized values.

So, to clear the error, I suppose you could "undo" your customizations.
I would suggest doing it step by step: after each step, try to start the
SDM gesvc service to see whether it starts, and so track down the
problematic change.

The second approach would be to install a new SGE cell with the default
setup and add your customizations step by step, again trying to start
the SDM gesvc service after each step to track down the problematic
change.
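Either approach is easier if each step leaves a record that can be compared. A sketch, assuming only the qconf show options used elsewhere in this thread plus -shgrpl (list host group names); snapshot_sge_config and the QCONF override are hypothetical conveniences, not part of SGE:

```sh
# Hypothetical helper: dump the SGE objects shown in this thread into a
# directory, one file per object, so two snapshots can be compared.
# QCONF may point at another qconf binary (defaults to "qconf" on PATH).
snapshot_sge_config() {
    out=$1
    q=${QCONF:-qconf}
    mkdir -p "$out/queues" "$out/hostgroups" || return 1
    "$q" -sconf  > "$out/global.conf"      # global cluster configuration
    "$q" -ssconf > "$out/scheduler.conf"   # scheduler configuration
    "$q" -srqs   > "$out/rqs.conf"         # resource quota sets
    "$q" -sc     > "$out/complex.conf"     # complex attributes
    for qn in $("$q" -sql); do             # every cluster queue
        "$q" -sq "$qn" > "$out/queues/$qn"
    done
    for hg in $("$q" -shgrpl); do          # every host group
        "$q" -shgrp "$hg" > "$out/hostgroups/$hg"
    done
}
```

Usage: run `snapshot_sge_config /tmp/step0`, make one change, run `snapshot_sge_config /tmp/step1`, then `diff -r /tmp/step0 /tmp/step1` shows exactly what that step changed before retrying `sdmadm suc -c gesvc`.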

Of course the third approach would be to debug the jgdi :)

BTW, have you checked the JGDI logs for any information?

Rys

cbyun wrote:


Hi Michal,

I tried the following steps, but they did not clear the issue.
Any suggestions on how to clear it?

- Remove GE adapter service
        sdmadm rs -s gesvc -force

- Shut down/start up SDM master service
        sdmadm sdj -all -h localhost
        sdmadm suj

- Add GE adapter service
        sdmadm ags -h localhost -j cs_vm -s gesvc -f \
                  <path>/ge_adapter_svc_config.xml

- Startup GE component
        sdmadm suc -c gesvc -h localhost

I'm still getting the same error:

08/24/2009 12:20:08|16|e.impl.ge.GEServiceAdapterImpl.doStartService|I|Service gesvc: Starting Grid Engine service
08/24/2009 12:20:08|16|rm.service.impl.AbstractServiceAdapter$1.call|E|Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
|set_object_attribute: set_list of property reportVariables failed
|
08/24/2009 12:20:08|17|rm.impl.AbstractComponent$3.performTransition|W|Component gesvc: Error in startup procedure: Service gesvc: Unexpected error in state transition UnknownStateHandler[UNKNOWN] -> StartingStateHandler[STARTING]: Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
|set_object_attribute: set_list of property reportVariables failed
|


Thanks,
- Chansup





-----Original Message-----
From: cbyun [mailto:cbyun at ll.mit.edu]
Sent: Monday, August 24, 2009 9:51 AM
To: users at gridengine.sunsource.net
Subject: RE: [GE users] SDM GE adapter sve got in trouble

Michal,




-----Original Message-----
From: Michal.Bachorik at sun.com [mailto:Michal.Bachorik at sun.com]
Sent: Monday, August 24, 2009 9:23 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SDM GE adapter sve got in trouble

Chansup,

Actually, is something called "STU_name" part of your changes? The error
says that "STU_name" is not part of the descriptor, so I'd like to know
whether it is something you introduced or touched.





No, I have nothing called "STU_name" in my configuration.
Here is my configuration:

# qconf -sconf
#global:
execd_spool_dir              /var/spool/sge
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
min_uid                      0
min_gid                      0
user_lists                   none
xuser_lists                  none
projects                     none
xprojects                    none
enforce_project              false
enforce_user                 auto
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:00:00
loglevel                     log_warning
administrator_mail           none
set_token_cmd                none
pag_cmd                      none
token_extend_time            none
shepherd_cmd                 none
qmaster_params               none
execd_params                 none
reporting_params             accounting=true reporting=true \
                             flush_time=00:00:15 joblog=true \
                             sharelog=00:00:00
finished_jobs                100
gid_range                    20000-20100
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
max_aj_instances             2000
max_aj_tasks                 75000
max_u_jobs                   0
max_jobs                     0
max_advance_reservations     0
auto_user_oticket            0
auto_user_fshare             0
auto_user_default_project    none
auto_user_delete_time        86400
delegated_file_staging       false
reprioritize                 0
jsv_url                      none
libjvm_path                  /usr/java/latest/jre/lib/amd64/server/libjvm.so
additional_jvm_args          -Xmx256m
jsv_allowed_mod              ac,h,i,e,o,j,M,N,p,w


# qconf -ssconf
algorithm                         default
schedule_interval                 0:2:0
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              NONE
load_adjustment_decay_time        0:0:0
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  2
flush_finish_sec                  2
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               FALSE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  INFINITY


# qconf -srqs
{
   name         host_slot_limit
   description  Limit total number of slots per hosts (assume uniform machines)
   enabled      TRUE
   limit        hosts {@allhosts} to slots=2
}
{
   name         max_u_jobs
   description  max jobs per user
   enabled      TRUE
   limit        users {*} to slots=256
}


# for i in `qconf -sql`; do echo " ";echo Qname: $i; qconf -sq $i; done

Qname: all.q
qname                 all.q
hostlist              @nohosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 2
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

Qname: normal
qname                 normal
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 2
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

Qname: pmatlab
qname                 pmatlab
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 2
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY


Currently no hosts are assigned to either host group, since none of the
hosts is being used by the SGE adapter service:

# qconf -shgrp @allhosts
group_name @allhosts
hostlist NONE

# qconf -shgrp @nohosts
group_name @nohosts
hostlist NONE

I also turned on the exclusive mode:

# qconf -sc
#name               shortcut   type        relop   requestable consumable default  urgency
#----------------------------------------------------------------------------------------
...
exclusive           excl       BOOL        EXCL    YES         YES        0        1000


Thanks,
- Chansup







M.


cbyun wrote:



Michal,

Yes, I made a few changes in the SGE configuration.

I added a couple of RQS rules, a couple of new cluster queues and a new
hostgroup, @nohosts.

Then, to keep all.q from being used, I assigned the @nohosts group to
all.q.

I believe the issue appeared after these customizations.

Thanks,
- Chansup






-----Original Message-----
From: Michal.Bachorik at sun.com [mailto:Michal.Bachorik at sun.com]
Sent: Monday, August 24, 2009 3:28 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] SDM GE adapter sve got in trouble

Chansup,

Did your SGE change in any way? The error is coming from JGDI (from the
SGE side). Did you do some kind of "upgrade/downgrade" of jgdi.jar? I
have not seen such an error before, so I will need to dig into it; I
will let you know once I find something.

Regards,

Michal

cbyun wrote:




Hi,

Somehow my GE adapter service got into trouble and I can't start it
anymore:




# sdmadm suc -c gesvc2
comp   host            message
----------------------------------------
gesvc2 llgriddev.local startup triggered


08/21/2009 16:25:11|20|e.impl.ge.GEServiceAdapterImpl.doStartService|I|Service gesvc2: Starting Grid Engine service
08/21/2009 16:25:11|20|rm.service.impl.AbstractServiceAdapter$1.call|E|Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
|set_object_attribute: set_list of property reportVariables failed
|
08/21/2009 16:25:11|21|rm.impl.AbstractComponent$3.performTransition|W|Component gesvc2: Error in startup procedure: Service gesvc2: Unexpected error in state transition UnknownStateHandler[UNKNOWN] -> StartingStateHandler[STARTING]: Service startup failed: jgdi error: java.lang.IllegalStateException: content field STU_name not found in descriptor
|set_object_attribute: set_list of property reportVariables failed
|



Is there any way to clear up this error?

Thanks,
- Chansup

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=213536

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].




--
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Andre Alefeld                Phone: ++49 (0)941 3075-255
Software Engineering         Fax:   ++49 (0)941 3075-222
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7        mailto: andre.alefeld at sun.com
D-93049 Regensburg           http://www.sun.com/gridware



