[GE users] Reg: windows domain users access to grid

manju a manju.kudu at gmail.com
Thu Nov 1 13:06:11 GMT 2007



Hi Harald,

Yes, I still get the same message: "load sensor died through
signal = 11".

thanks
manjunath A.



On Nov 1, 2007 4:07 PM, Harald Pollinger <Harald.Pollinger at sun.com> wrote:
> Hi Manju,
>
> with the static_load.sh, do you still get the "load sensor died through
> signal = 11" messages in the execd messages file?
>
>
> Regards,
> Harald
>
> manju a wrote:
> > Hi Harald,
> >
> > I changed the load sensor to static_load.sh with "qconf -mconf
> > <win_host_name>" and restarted the execution host, but sge_execd still
> > does not appear in the process list. I also noticed that
> > qloadsensor.exe is running on the working machine (where sge_execd
> > runs), but I could not find it on the failing machine. I reinstalled
> > SFU and applied the patch, and yes, of course I disabled DEP. No luck
> > yet. I am seeing this problem on 5 or 6 machines, all of which have two
> > drives with the user profiles on the D: drive. I checked the OS patch
> > level and it looks the same on all the machines.
> >
> > thanks
> > Manjunath
> >
> > On Oct 31, 2007 10:40 PM, Harald Pollinger <Harald.Pollinger at sun.com> wrote:
> >> Manju,
> >>
> >> signal 11 means "segmentation fault", i.e. the load sensor tries to
> >> access memory that is not allocated. As the load sensor works on all other
> >> systems, I still assume that something is wrong with your Windows version...
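
As a small illustration of the signal number mentioned above, the following Python sketch pulls the number out of an execd message and looks up its symbolic name. The function name `signal_from_message` is made up for this example, and the lookup uses the numbering of the machine running the script, which is assumed to match the execution host.

```python
import re
import signal


def signal_from_message(message):
    """Return (number, name) for a 'signal = N' message, or None if absent."""
    match = re.search(r"signal = (\d+)", message)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # Looks the number up in the local platform's signal table.
        name = signal.Signals(number).name
    except ValueError:
        name = "unknown"
    return number, name


print(signal_from_message("load sensor died through signal = 11"))
```

On common platforms signal 11 maps to SIGSEGV, which matches the "segmentation fault" reading given above.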
> >>
> >> As a workaround, you could configure a different load sensor, at first
> >> only a dummy load sensor. "qconf -sconf <windows_host_name>" shows you
> >> what load sensor is currently configured, it should be
> >> "$SGE_ROOT/util/resources/loadsensors/interix-loadsensor.sh".
> >>
> >> Just configure a different load sensor script, e.g. one that only
> >> returns static data like the static_load.sh I've attached to this E-Mail.
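
For readers without the attachment, here is a minimal sketch of what a static load sensor does. It is not the attached static_load.sh. It is written in Python purely for illustration and assumes the usual Grid Engine load sensor protocol: the execd writes a line to the sensor's stdin, the sensor answers with "begin", one "host:name:value" line per load value, and "end", and it exits when it reads "quit". The load values used below are placeholders.

```python
import socket
import sys


def format_report(host, values):
    """Build one report: 'begin', one 'host:name:value' line each, 'end'."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_sensor(stdin, stdout, host, values):
    """Answer every request line with a report until 'quit' or end of input."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(host, values))
        stdout.flush()  # execd waits for the complete report


# Placeholder values, assumed for this sketch only.
STATIC_VALUES = {"load_avg": "0.00", "num_proc": "1"}
print(format_report("example_host", STATIC_VALUES), end="")
```

To act as a real sensor, the script would end with `run_sensor(sys.stdin, sys.stdout, socket.gethostname(), STATIC_VALUES)` instead of the `print` call. Because it never probes the system, a crash with such a sensor configured would point away from the load-gathering code itself.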
> >>
> >> If the execd works with this load sensor, we should dig deeper into the
> >> load sensor problem.
> >>
> >> Regards,
> >> Harald
> >>
> >>
> >>
> >>
> >> manju a wrote:
> >>> Hi Harald,
> >>>
> >>> Please find below the logs from the Windows execution host:
> >>>
> >>> 10/31/2007 06:23:29|execd|execd_host|I|starting up SGE 6.1u2 (win32-x86)
> >>> 10/31/2007 06:23:31|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:23:33|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:23:35|execd|execd_host|I|controlled shutdown 6.1u2
> >>> 10/31/2007 06:24:04|execd|execd_host|I|starting up SGE 6.1u2 (win32-x86)
> >>> 10/31/2007 06:24:05|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:24:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:24:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:25:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:26:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:26:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:27:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:28:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:28:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:29:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:30:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:30:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:31:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:32:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:32:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:33:27|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:34:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:34:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:35:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:36:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:36:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:37:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:38:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:38:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:39:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:40:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:40:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:41:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:42:06|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:42:46|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:43:26|execd|execd_host|E|load sensor died through signal = 11
> >>> 10/31/2007 06:43:29|execd|execd_host|I|controlled shutdown 6.1u2
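
The log above is easier to read when summarized. The sketch below assumes the pipe-separated layout seen in these lines (timestamp, daemon, host, level, message) and counts the error messages and the gaps between them. The helper names are made up for this example, and the sample is a short excerpt of the log above.

```python
from collections import Counter
from datetime import datetime


def parse_line(line):
    """Split one 'timestamp|daemon|host|level|message' line into a tuple."""
    stamp, daemon, host, level, message = line.rstrip("\n").split("|", 4)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return when, daemon, host, level, message


def summarize(lines):
    """Count error messages and list the gaps in seconds between them."""
    parsed = [parse_line(line) for line in lines if line.strip()]
    errors = [entry for entry in parsed if entry[3] == "E"]
    counts = Counter(message for *_, message in errors)
    times = [when for when, *_ in errors]
    gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
    return counts, gaps


sample = [
    "10/31/2007 06:24:04|execd|execd_host|I|starting up SGE 6.1u2 (win32-x86)",
    "10/31/2007 06:24:05|execd|execd_host|E|load sensor died through signal = 11",
    "10/31/2007 06:24:06|execd|execd_host|E|load sensor died through signal = 11",
    "10/31/2007 06:24:46|execd|execd_host|E|load sensor died through signal = 11",
    "10/31/2007 06:25:26|execd|execd_host|E|load sensor died through signal = 11",
]
counts, gaps = summarize(sample)
print(counts.most_common(1), gaps)
```

In the full log the failures settle into a gap of roughly 40 seconds after the first two, which suggests the sensor is restarted on a fixed schedule and crashes again each time.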
> >>>
> >>> If I start $SGE_ROOT/bin/win32-x86/sge_execd directly after switching
> >>> to "dl 5" mode, it gets stuck at "starting sge_execd". During that
> >>> time I can see the sge_execd process in the process list, but it
> >>> never finishes starting.
> >>>
> >>> Please let me know if these logs tell you anything.
> >>>
> >>> thanks
> >>> Manjunath A.
> >>>
> >>>
> >>>
> >>> On Oct 31, 2007 7:00 PM, manju a <manju.kudu at gmail.com> wrote:
> >>>> Yes, I did that as well, but no luck. If I run
> >>>> "/etc/init.d/sgeexecd start", the process appears and then
> >>>> disappears from the process list within a second.
> >>>>
> >>>>
> >>>> On Oct 31, 2007 6:52 PM, Harald Pollinger <Harald.Pollinger at sun.com> wrote:
> >>>>> manju a wrote:
> >>>>>> Hi Harald,
> >>>>>>
> >>>>>> I already applied these hotfixes, but not in the order suggested
> >>>>>> on the page. Does the order make any difference?
> >>>>> The page says: Yes, it makes a difference, apply them only in the
> >>>>> suggested order to make sure all of them work.
> >>>>>
> >>>>> Regards,
> >>>>> Harald
> >>>>>
> >>>>>
> >>>>>> thanks
> >>>>>> Manju
> >>>>>>
> >>>>>> On Oct 31, 2007 4:56 PM, Harald Pollinger <Harald.Pollinger at sun.com> wrote:
> >>>>>>> manju a wrote:
> >>>>>>>> Hi Harald,
> >>>>>>>>
> >>>>>>>> I can see the messages below from the qmaster in
> >>>>>>>> $SGE_ROOT/farm/spool/qmaster/messages:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 10/30/2007 23:00:07|qmaster|testqmaster|E|commlib error: got read
> >>>>>>>> error (closing "execd_host.abc.com/qstat/85")
> >>>>>>>> 10/30/2007 23:28:44|qmaster|testqmaster|E|commlib error: got read
> >>>>>>>> error (closing "execd_host.abc.com/qconf/121")
> >>>>>>>> 10/30/2007 23:32:03|qmaster|testqmaster|E|commlib error: got read
> >>>>>>>> error (closing "execd_host.abc.com/qconf/125")
> >>>>>>>> 10/30/2007 23:32:44|qmaster|testqmaster|E|commlib error: got read
> >>>>>>>> error (closing "execd_host.abc.com/execd/1")
> >>>>>>>> 10/30/2007 23:33:28|qmaster|testqmaster|E|commlib error: got read
> >>>>>>>> error (closing "execd_host.abc.com/execd/1")
> >>>>>>>>
> >>>>>>>> What does this commlib error mean?
> >>>>>>> It means that the qmaster is about to read, or is already reading,
> >>>>>>> some data from the execd, but then the execd suddenly 'vanishes'.
> >>>>>>>
> >>>>>>>
> >>>>>>>> testqmaster ----> qmaster server name
> >>>>>>>>
> >>>>>>>> execd_host -----> execution host name (the machine where
> >>>>>>>> sge_execd is not running)
> >>>>>>>>
> >>>>>>>> Please let me know if these logs tell you anything. I am still
> >>>>>>>> trying to get the sge_execd process running, but no luck. Please
> >>>>>>>> help me out with this.
> >>>>>>> I guess there is simply something wrong with your Windows/SFU host. You
> >>>>>>> could try to apply all Hotfixes suggested on this page:
> >>>>>>> http://www.duh.org/interix/hotfixes.php
> >>>>>>>
> >>>>>>> This is just a collection of all official Microsoft Hotfixes for SFU.
> >>>>>>> It's very difficult to find the Hotfixes and their dependencies on the
> >>>>>>> Microsoft pages, so just apply them in the order suggested here for your
> >>>>>>> Windows version.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Harald
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> thanks
> >>>>>>>> Manjunath A,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Oct 30, 2007 4:45 PM, Harald Pollinger <Harald.Pollinger at sun.com> wrote:
> >>>>>>>>> manju a wrote:
> >>>>>>>>>> Hi Harald,
> >>>>>>>>>>
> >>>>>>>>>> I would like to give some more information.
> >>>>>>>>>>
> >>>>>>>>>> On all the working machines, the Windows profiles point to
> >>>>>>>>>> "C:\Documents and Settings". Do you think this matters?
> >>>>>>>>>>
> >>>>>>>>>> On the non-working machines (sge_execd not running), the Windows
> >>>>>>>>>> profiles point to "D:\Users".
> >>>>>>>>>>
> >>>>>>>>>> This is the only difference I found between the working and non-working machines.
> >>>>>>>>>>
> >>>>>>>>>> No errors were reported while installing the machine as an execution
> >>>>>>>>>> host; I followed the same installation process as on the working machines.
> >>>>>>>>> This might matter for job execution, but it doesn't matter for the Execd
> >>>>>>>>> startup.
> >>>>>>>>>
> >>>>>>>>> I found this:
> >>>>>>>>> The "TERMINATE dispatcher because j == 1" line gets printed when the
> >>>>>>>>> Execd either receives a SIGTERM or a SIGINT signal or it receives a
> >>>>>>>>> shutdown request from the QMaster.
> >>>>>>>>>
> >>>>>>>>> As I can't see any request from the QMaster in the trace file snippet
> >>>>>>>>> you sent me, I assume the Execd received a SIGTERM or a SIGINT signal.
> >>>>>>>>> I can't tell you where this signal came from.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Harald
>
>
> --
> Sun Microsystems GmbH         Harald Pollinger
> Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
> D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
> Germany                       Fax: +49 (0)941 3075-222  (x60222)
> http://www.sun.com/gridware
> mailto:harald.pollinger at sun.com
> Sitz der Gesellschaft: Sun Microsystems GmbH, Sonnenallee 1,
> D-85551 Kirchheim-Heimstetten
> Amtsgericht Muenchen: HRB 161028
> Geschaeftsfuehrer: Wolfgang Engels, Dr. Roland Boemer
> Vorsitzender des Aufsichtsrates: Martin Haering
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>




