[GE users] JSV scripts running unreliably

ernst Ernst.Bablick at sun.com
Wed Jun 10 12:32:47 BST 2009



Hi Andreas,

Your JSV scripts are restarted for two reasons:

1) The message "JSV modification time in ..." indicates that the 
modification timestamp of your JSV script has changed. Within GE, a 
worker thread detects this and restarts the corresponding JSV process 
the next time an incoming job needs to be verified.

2) There is a protocol error between a JSV process and the 
corresponding thread in the master. I assume that your JSV script is 
not implemented correctly: the first job verified by a JSV process is 
handled correctly, but the second results in a protocol error.
To debug your JSV script, you can set the "logging_enabled" and 
"log_file" variables in the file that is included in your JSV script 
(e.g. JSV.pm, jsv_include.tcl or jsv_include.sh). After enabling this, 
you can find the data exchanged between the master and the JSV process 
in the log_file.
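
To illustrate the second point: a JSV process stays alive and verifies
several jobs in a row, so any per-job state has to be reset before the
next job arrives. The following is only a rough sketch of that idea, not
the shipped jsv_include.sh. The function names (to_mb, jsv_loop) are
hypothetical, the protocol keywords (START, PARAM, BEGIN, QUIT, STARTED,
RESULT STATE) are written from memory of jsv(1) and should be checked
against your version, and the sketch assumes a single "h_vmem=<n>M|G"
request in l_hard rather than a comma-separated list.

```shell
#!/bin/sh
# Toy JSV loop: a sketch under stated assumptions, not the real include file.

MIN_VMEM_MB=256   # threshold taken from the original post

# Convert a size such as 128M or 1G to whole megabytes.
# Only integer values with an upper-case M or G suffix are handled;
# anything else is reported as 0 in this sketch.
to_mb() {
    num=${1%[MG]}
    case "$num" in
        ''|*[!0-9]*) echo 0; return ;;
    esac
    case "$1" in
        *G) echo $(( $num * 1024 )) ;;
        *M) echo "$num" ;;
        *)  echo 0 ;;
    esac
}

# Read protocol lines from stdin and answer on stdout until QUIT or EOF.
jsv_loop() {
    h_vmem=""
    while read -r cmd name value; do
        case "$cmd" in
            START)
                # A new job begins: forget everything from the previous job.
                h_vmem=""
                echo "STARTED"
                ;;
            PARAM)
                # Assumed form: PARAM l_hard h_vmem=128M
                if [ "$name" = "l_hard" ]; then
                    case "$value" in
                        h_vmem=*) h_vmem=${value#h_vmem=} ;;
                    esac
                fi
                ;;
            BEGIN)
                # Exactly one RESULT line per job; jobs that do not request
                # h_vmem are accepted here (a design choice of this sketch).
                if [ -n "$h_vmem" ] && [ "$(to_mb "$h_vmem")" -lt "$MIN_VMEM_MB" ]; then
                    echo "RESULT STATE REJECT Do not require less than ${MIN_VMEM_MB}M for h_vmem."
                else
                    echo "RESULT STATE ACCEPT"
                fi
                ;;
            QUIT)
                return 0
                ;;
        esac
    done
}
```

You can feed such a loop two jobs by hand, for example with
printf 'START\nPARAM l_hard h_vmem=128M\nBEGIN\nSTART\nBEGIN\nQUIT\n' | jsv_loop,
and compare the answers with what appears in the log_file. A real script
would normally rely on the helper functions of the included file instead
of parsing the protocol itself.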

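For the first point, it can help to confirm whether the timestamp of the
script really changes between submissions. A minimal sketch, assuming
GNU stat (with a BSD stat fallback); the helper names file_mtime and
mtime_changed are hypothetical and not part of GE:

```shell
#!/bin/sh
# Print the modification time of a file in seconds since the epoch.
# Tries GNU stat first, then the BSD variant.
file_mtime() {
    stat -c %Y "$1" 2>/dev/null || stat -f %m "$1"
}

# Succeed (exit status 0) if the mtime of file $1 differs from the
# previously recorded value $2.
mtime_changed() {
    [ "$(file_mtime "$1")" != "$2" ]
}
```

Record the value once with before=$(file_mtime /usr/gridengine/util/job_verifier),
submit a few jobs, and then run mtime_changed with the same path and
"$before". If it succeeds, something has touched the script in the meantime.
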
Cheers,

Ernst

ah_sunsource wrote:
> Hi,
>
> I'm experimenting a bit with the new JSV feature in SGE 6.2u2. I've
> written a server-side JSV that checks whether the user requests at least
> 256M for h_vmem (below that, the prolog script might die for lack of
> memory and leave the queue in an error state).
>
> Unfortunately, the JSV feature is not reliable:
>
> [oreade38] ~ % for i in {1..5}; do               
> echo hostname | qsub -l h_vmem=128M              
> done
> Unable to run job: Do not require less than 256M for h_vmem.
> Exiting.
> Unable to run job: Do not require less than 256M for h_vmem.
> Exiting.
> Unable to run job: master got unknown command from JSV: "ERROR".
> Exiting.
> Unable to run job: master got unknown command from JSV: "ERROR".
> Exiting.
> Unable to run job: Do not require less than 256M for h_vmem.
> Exiting.
>
> On the server logs I see messages like this:
>
> 06/10/2009 11:30:35|worker|lolek-vm1|I|JSV modification time in "worker001" has changed
> 06/10/2009 11:30:36|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
> 06/10/2009 11:30:36|worker|lolek-vm1|I|JSV modification time in "worker001" has changed
> 06/10/2009 11:30:36|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" rejected job 921
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV modification time in "worker000" has changed
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV modification time in "worker000" has changed
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker000" rejected job 922
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" rejected job 923
> 06/10/2009 11:30:37|worker|lolek-vm1|I|JSV "worker001" will be restarted.
> 06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
> 06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "worker000" rejected job 924
> 06/10/2009 11:30:38|worker|lolek-vm1|I|JSV "worker000" will be restarted.
> 06/10/2009 11:30:39|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been stopped
> 06/10/2009 11:30:39|worker|lolek-vm1|I|JSV "/usr/gridengine/util/job_verifier" has been started
> 06/10/2009 11:30:40|worker|lolek-vm1|I|JSV "worker001" rejected job 925
>
> It looks like the success of the script is oscillating. Could this be a bug?
>
> Cheers,
> Andreas
>   


-- 
Sun Microsystems GmbH             Ernst Bablick
Dr.-Leo-Ritter-Str. 7             Software Engineer
D-93049 Regensburg                Phone: +49 (0)941 3075 135
Germany                           Fax:   +49 (0)941 3075 222
http://www.sun.de                 mailto: ernst.bablick at sun.com

Sitz der Gesellschaft:
Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
Amtsgericht München: HRB 161028
Geschäftsführer: Thomas Schröder, Wolfgang Engels, Wolf Frenkel
Vorsitzender des Aufsichtsrates: Martin Häring

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=201411
