[GE globus] [GE users] Problem sending jobs with globusrun-ws: Current job state: Unsubmitted

R. Jeff Porter rjporter at lbl.gov
Thu Dec 6 00:18:14 GMT 2007



Hi Esteban,

I checked the reporting file syntax and that was fine as expected.  
One thing to look at is whether the SGE job-id matches what globus
thinks it is.  You should check for the job-id in the globus log file
(our setup has it in $GLOBUS_LOCATION/var/container-real.log) via a
message like:

2007-12-05 15:30:20,491 INFO  exec.StateMachine
[RunQueueThread_9,logJobSubmitted:3525] Job 0aafcce0-a38a-11dc-aa2c-
a94fc0e89ad8 submitted with local job ID '12345'

In the above example, 12345 is the SGE job-id. 
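
A quick way to pull that mapping out of the log (a sketch; the paths are from
our setup, adjust for yours):

  grep 'submitted with local job ID' $GLOBUS_LOCATION/var/container-real.log | tail -5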

On my testbed I actually hard-coded a bogus job-id into my sge.pm file
and reproduced your symptoms, as seen from my submit client:

Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:0a1700f0-a38a-11dc-8a02-00304889ddce
Termination time: 12/06/2007 23:30 GMT
Current job state: Unsubmitted

So having a job-id mismatch (globus log vs the reporting file) is one
way to experience what you observe. 
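
So a simple check on your side is whether the ID from the container log ever
shows up in the reporting file at all, e.g. (with 12345 standing in for your
real ID; the reporting file is colon-delimited, so this matches the job-id
field):

  grep ':12345:' $SGE_ROOT/default/common/reporting | head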

Jeff


On Wed, 2007-12-05 at 18:38 +0100, Esteban Freire Garcia wrote:
> Hi Jeff,
> 
> Ok. Thanks for your help. Looking at the file
> $GLOBUS_LOCATION/etc/globus-sge.conf, it looks fine.
> I don't know what else to look at.
> 
> [globus at svgd GRAM]$ cat $GLOBUS_LOCATION/etc/globus-sge.conf
> log_path=/opt/cesga/sge60/default/common/reporting
> 
> Thanks,
> Esteban
> 
> Jeff Porter wrote:
> > Hi Esteban,
> > 
> > By eye these two reporting file dumps look fine.  I don't know the details about sge variations - I'm using 6.0u10 - but the gt4 submission is clearly working.   
> > 
> > The gt4 code looks for the reporting file by checking a globus config file. Specifically,
> > 
> > $GLOBUS_LOCATION/etc/globus-sge.conf
> > 
> > It should have the line:  
> > 
> > logfile=/actual-path-to-sge/default/common/reporting
> > 
> > Whether that config file comes out right may depend on whether you had
> > $SGE_ROOT and $SGE_CELL defined in your shell during your globus install.
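> > 
> > A quick sanity check (a sketch, assuming the usual layout and that the
> > container runs as the "globus" user):
> > 
> >   echo "$SGE_ROOT $SGE_CELL"                     # both should be set
> >   cat $GLOBUS_LOCATION/etc/globus-sge.conf       # should point at the reporting file
> >   head -1 $SGE_ROOT/$SGE_CELL/common/reporting   # is it readable by that user?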
> > 
> > Jeff
> > 
> > 
> >   
> > > Hi Jeff,
> > > 
> > > Thanks for your answer. Ok, I have the file
> > > $SGE_ROOT/default/common/reporting.
> > > No, we are not using ARCO. The only thing that I think may be happening
> > > is that globus cannot read this file, but I tried reading this file as
> > > user "globus" and as the user who sent the job, and I could read it
> > > without any problem. Is there any place where I can tell globus where
> > > to read this file from?
> > > 
> > > I put below the output of the 'reporting' file, after sending a job
> > > using globus and sending a job with qsub.
> > > 
> > > tail -f $SGE_ROOT/default/common/reporting
> > > --------------------------------------------------------------------
> > > --------------------------------------------------------------------
> > > --------------------------------------------------------------------
> > > ------
> > > 1196873426:new_job:1196873426:1417619:-
> > > 1:NONE:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:10241196873426:job_log:1196873426:pending:1417619:-1:NONE::cyteduser:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:new 
> > > job
> > > 1196873437:job_log:1196873437:sent:1417619:0:NONE:t:master:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:sent 
> > > to execd
> > > 1196873437:host_consumable:compute-1-
> > > 12.local:1196873437:X:num_proc=1.000000=1.000000,s_vmem=524288000.000000=1.300G1196873437:queue_consumable:pro_cytedgrid:compute-1-12.local:1196873437::num_proc=1.000000=1.000000,s_vmem=524288000.000000=1.000G,slots=1.000000=1.000000
> > > 1196873437:job_log:1196873437:delivered:1417619:0:NONE:r:master:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job 
> > > received by execd
> > > 1196873437:acct:pro_cytedgrid:compute-1-
> > > 12.local:cesga:cyteduser:sge_job_script.28406:1417619:sge:0:1196873426:1196873390:1196873391:0:0:1:0:0:0.000000:0:0:0:0:5330:0:0:0.000000:0:0:0:0:258:45:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:-U 
> > > pro_cytedgrid -l 
> > > arch=i386,h_fsize=1G,h_stack=16M,num_proc=1,s_rt=3600,s_vmem=500M:0.000000:NONE:0.000000
> > > 1196873437:job_log:1196873437:finished:1417619:0:NONE:r:execution 
> > > daemon:compute-1-
> > > 12.local:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job 
> > > exited
> > > 1196873437:job_log:1196873437:finished:1417619:0:NONE:r:master:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job 
> > > waits for schedds deletion
> > > 1196873437:host_consumable:compute-1-
> > > 12.local:1196873437:X:num_proc=0.000000=1.000000,s_vmem=0.000000=1.300G1196873437:queue_consumable:pro_cytedgrid:compute-1-12.local:1196873437::num_proc=0.000000=1.000000,s_vmem=0.000000=1.000G,slots=0.000000=1.000000
> > > 1196873448:job_log:1196873448:deleted:1417619:0:NONE:T:scheduler:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job 
> > > deleted by schedd
> > > 
> > > 1196873742:new_job:1196873742:1417621:-
> > > 1:NONE:test.sh:esfreire:cesga::defaultdepartment:sge:10241196873742:job_log:1196873742:pending:1417621:-1:NONE::esfreire:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:new 
> > > job
> > > 1196873753:job_log:1196873753:sent:1417621:0:NONE:t:master:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:sent 
> > > to execd
> > > 1196873753:host_consumable:compute-1-
> > > 14.local:1196873753:X:num_proc=1.000000=1.000000,s_vmem=1073741824.000000=1.300G1196873753:queue_consumable:GRID:compute-1-14.local:1196873753::num_proc=1.000000=1.000000,s_vmem=1073741824.000000=2.000G,slots=1.000000=1.000000
> > > 1196873753:job_log:1196873753:delivered:1417621:0:NONE:r:master:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job 
> > > received by execd
> > > 1196873754:acct:GRID:compute-1-
> > > 14.local:cesga:esfreire:test.sh:1417621:sge:0:1196873742:1196873658:1196873658:0:0:0:0:0:0.000000:0:0:0:0:689:0:0:0.000000:0:0:0:0:202:2:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:-U 
> > > paralelo-gigabit,jmourino,esfreire,blades_dell -l 
> > > arch=i386,h_fsize=1G,h_stack=16M,network=gigabit,num_proc=1,s_rt=3600,s_vmem=1G:0.000000:NONE:0.000000
> > > 1196873754:job_log:1196873754:finished:1417621:0:NONE:r:execution 
> > > daemon:compute-1-
> > > 14.local:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job 
> > > exited
> > > 1196873754:job_log:1196873754:finished:1417621:0:NONE:r:master:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job 
> > > waits for schedds deletion
> > > 1196873754:host_consumable:compute-1-
> > > 14.local:1196873754:X:num_proc=0.000000=1.000000,s_vmem=0.000000=1.300G1196873754:queue_consumable:GRID:compute-1-14.local:1196873754::num_proc=0.000000=1.000000,s_vmem=0.000000=2.000G,slots=0.000000=1.000000
> > > 1196873764:job_log:1196873764:deleted:1417621:0:NONE:T:scheduler:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job 
> > > deleted by schedd
> > > 
> > > --------------------------------------------------------------------
> > > --------------------------------------------------------------------
> > > --------------------------------------------------------------------
> > > ------
> > > 
> > > On the other hand, we are using SGE 6.0u6
> > > 
> > > Thanks,
> > > Esteban
> > > 
> > > Jeff Porter wrote:
> > > > Hi Esteban,
> > > > 
> > > > the logfile noted in the docs is the 'reporting' file:
> > > > $SGE_ROOT/default/common/reporting.  The gt4 c-code reads that file
> > > > for job state information instead of calling qsub from sge.pm as is
> > > > done for gt2.  I wouldn't spend much time on the sge.pm file, as its
> > > > use in gt4 is essentially just for submission.  And the patch you say
> > > > you applied before is directed at fixing gt2-specific details that
> > > > break gt4 submissions.
> > > > 
> > > > One other issue: if you are running ARCO you may have this problem.
> > > > I understand the dbwriter code deletes the reporting file with each
> > > > read as its mechanism for checkpointing.  Thus gt4 will never see the
> > > > change in state through this file.
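> > > > 
> > > > (A quick check, if you suspect that: watch the file size with something
> > > > like "ls -l $SGE_ROOT/default/common/reporting" a few times over a couple
> > > > of minutes; if it keeps getting truncated, something is consuming it
> > > > before gt4 can read the state change.)
> > > > 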
> > > > Thanks, Jeff
> > > > 
> > > >   
> > > >       
> > > > > Hi Melvin,
> > > > > 
> > > > > Thanks for your answer. I have "reporting=true" but I had
> > > > > "joblog=false"; I have now changed this to "joblog=true". After that,
> > > > > I reinstalled the packages from the "London e-Science Centre" and ran
> > > > > gpt-postinstall again, but unfortunately it still does not get past
> > > > > the "Unsubmitted" state:
> > > > > ---------------------------------------------------------------------
> > > > > [esfreire at svgd ~]$ globusrun-ws -submit -pft -T 10000 -s -S -factory svgd.cesga.es -Ft SGE -c /bin/hostname
> > > > > Delegating user credentials...Done.
> > > > > Submitting job...Done.
> > > > > Job ID: uuid:1fe5c0d2-a31d-11dc-a78b-000423ac0723
> > > > > Termination time: 12/06/2007 10:30 GMT
> > > > > Current job state: Unsubmitted
> > > > > ---------------------------------------------------------------------
> > > > > One thing that I don't understand: the "London e-Science Centre" page
> > > > > says, "Your SGE installation must also be configured with support for
> > > > > the reporting logfile enabled, and that logfile must be accessible from
> > > > > the server on which you are installing GT4". I don't know which file
> > > > > this "logfile" is; I suppose it is
> > > > > "$SGE_ROOT/default/spool/qmaster/messages".
> > > > > 
> > > > > Another thing that I think indicates something is going wrong is
> > > > > that the job only runs for about 1 second.
> > > > > 
> > > > > ---------------------------------------------------------------------
> > > > > [globus at svgd JobManager]$ qacct -j 1417415
> > > > > ==============================================================
> > > > > qname        pro_cytedgrid      
> > > > > hostname     compute-1-12.local 
> > > > > group        cesga              
> > > > > owner        cyteduser          
> > > > > project      NONE               
> > > > > department   defaultdepartment  
> > > > > jobname      sge_job_script.1784
> > > > > jobnumber    1417415            
> > > > > taskid       undefined
> > > > > account      sge                
> > > > > priority     0                  
> > > > > qsub_time    Wed Dec  5 11:18:41 2007
> > > > > start_time   Wed Dec  5 11:18:05 2007
> > > > > end_time     Wed Dec  5 11:18:06 2007
> > > > > granted_pe   NONE               
> > > > > slots        1                  
> > > > > failed       0   
> > > > > exit_status  0                  
> > > > > ru_wallclock 1           
> > > > > ru_utime     0           
> > > > > ru_stime     0           
> > > > > ru_maxrss    0                  
> > > > > ru_ixrss     0                  
> > > > > ru_ismrss    0                  
> > > > > ru_idrss     0                  
> > > > > ru_isrss     0                  
> > > > > ru_minflt    5328               
> > > > > ru_majflt    0                  
> > > > > ru_nswap     0                  
> > > > > ru_inblock   0                  
> > > > > ru_oublock   0                  
> > > > > ru_msgsnd    0                  
> > > > > ru_msgrcv    0                  
> > > > > ru_nsignals  0                  
> > > > > ru_nvcsw     262                
> > > > > ru_nivcsw    44                 
> > > > > cpu          0           
> > > > > mem          0.000            
> > > > > io           0.000            
> > > > > iow          0.000            
> > > > > maxvmem      0.000
> > > > > ---------------------------------------------------------------------
> > > > > I don't know what else to change.
> > > > > 
> > > > > 
> > > > > Thank you very much,
> > > > > Esteban
> > > > > 
> > > > > Melvin Koh wrote:
> > > > > 
> > > > > > Have you enabled "reporting=true" and "joblog=true" in "qconf -mconf"?
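> > > > > > (To check the current values: "qconf -sconf" shows the global config;
> > > > > > the reporting_params line should include reporting=true and joblog=true.)
> > > > > > 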
> > > > > > On Fri, 23 Nov 2007, Esteban Freire Garcia wrote:
> > > > > > 
> > > > > >   
> > > > > >       
> > > > > >           
> > > > > > > Hi,
> > > > > > > 
> > > > > > > First of all, thanks for answering me. We installed the patch
> > > > > > > yesterday; unfortunately, we continue with the same problem. We will
> > > > > > > try looking at the jobmanager, because I think that for some reason
> > > > > > > the jobmanager (sge.pm) is not seeing the status of the job correctly,
> > > > > > > and it doesn't know when the job has finished.
> > > > > > > 
> > > > > > > ---------------------------------------------------------------------
> > > > > > > [esfreire at svgd ~]$ globusrun-ws -submit -pft -s -S -F https://svgd.cesga.es:8443/wsrf/services/ManagedJobFactoryService -Ft SGE -c /bin/hostname
> > > > > > > Delegating user credentials...Done.
> > > > > > > Submitting job...Done.
> > > > > > > Job ID: uuid:580a49d2-9923-11dc-9646-000423ac0723
> > > > > > > Termination time: 11/23/2007 17:49 GMT
> > > > > > > Current job state: Unsubmitted
> > > > > > > 
> > > > > > > globusrun-ws: Error querying job state
> > > > > > > ---------------------------------------------------------------------
> > > > > > > Thank you very much,
> > > > > > > Esteban
> > > > > > > 
> > > > > > > Otheus (aka Timothy J. Shelling) wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > On Nov 20, 2007 9:13 AM, Esteban Freire Garcia <esfreire at cesga.es
> > > > > > > > <mailto:esfreire at cesga.es>> wrote:
> > > > > > > > 
> > > > > > > >     Hi,
> > > > > > > > 
> > > > > > > >     We have installed 'gt4.0.5-x86_64_rhas_4-installer' on "Red Hat
> > > > > > > >     Enterprise Linux ES release 4 (Nahant)".  ...
> > > > > > > >     Now, we are trying to integrate Globus with SGE 6.0u6,
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I don't know if this will help or not. I had to patch gt4.0.2 to work
> > > > > > > > with SGE 6.0u4 as follows:
> > > > > > > > 
> > 
> >   
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: globus-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: globus-help at gridengine.sunsource.net



