[GE globus] [GE users] Problem sending jobs with globusrun-ws: Current job state: Unsubmitted

Otheus (aka Timothy J. Shelling) otheus at gmail.com
Thu Dec 6 08:14:36 GMT 2007



The sge.pm module is not properly coded: it should invoke the SUPER
method of the JobHandler class so that Globus logging is enabled within
the module, but it does not. If it did, we could see what is happening
inside the SGE module.

I provided Esteban a patch to hard-code the value of the log file, to make
sure something got logged.

Esteban?

On Dec 6, 2007 1:18 AM, R. Jeff Porter <rjporter at lbl.gov> wrote:

> Hi Esteban,
>
> I checked the reporting file syntax and that was fine as expected.
> One thing to look at is whether the SGE job-id matches what globus
> thinks it is.  You should check for the job-id in the globus log file
> (our setup has it in $GLOBUS_LOCATION/var/container-real.log) via a
> message like:
>
> 2007-12-05 15:30:20,491 INFO  exec.StateMachine [RunQueueThread_9,logJobSubmitted:3525]
> Job 0aafcce0-a38a-11dc-aa2c-a94fc0e89ad8 submitted with local job ID '12345'
>
> In the above example, 12345 is the SGE job-id.
>
> On my testbed I actually hard-coded a bogus job-id into my sge.pm file
> and reproduced your symptoms as seen from my submit client:
>
> Delegating user credentials...Done.
> Submitting job...Done.
> Job ID: uuid:0a1700f0-a38a-11dc-8a02-00304889ddce
> Termination time: 12/06/2007 23:30 GMT
> Current job state: Unsubmitted
>
> So having a job-id mismatch (globus log vs the reporting file) is one
> way to experience what you observe.
>
> Jeff
>
>
> On Wed, 2007-12-05 at 18:38 +0100, Esteban Freire Garcia wrote:
> > Hi Jeff,
> >
> > Ok, thanks for your help. I looked at the file
> > $GLOBUS_LOCATION/etc/globus-sge.conf and it looks fine.
> > I don't know what else to look at.
> >
> > [globus at svgd GRAM]$ cat $GLOBUS_LOCATION/etc/globus-sge.conf
> > log_path=/opt/cesga/sge60/default/common/reporting
> >
> > Thanks,
> > Esteban
> >
> > Jeff Porter wrote:
> > > Hi Esteban,
> > >
> > > By eye these two reporting file dumps look fine.  I don't know the
> > > details about sge variations - I'm using 6.0u10 - but the gt4
> > > submission is clearly working.
> > >
> > > The gt4 code looks for the reporting file by checking a globus
> > > config file. Specifically,
> > >
> > > $GLOBUS_LOCATION/etc/globus-sge.conf
> > >
> > > It should have the line:
> > >
> > > logfile=/actual-path-to-sge/default/common/reporting
> > >
> > > Having that config file correct may depend on whether you had
> > > $SGE_ROOT and $SGE_CELL defined in your shell during your globus
> > > install.
> > >
> > > Jeff
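[Editorial aside: that config file can be sanity-checked after an install. A hypothetical sketch follows; note it accepts both the "logfile" and "log_path" key spellings, since the messages in this thread disagree on the key name.]

```python
import os

def check_sge_conf(conf_path):
    """Read globus-sge.conf and report whether the configured reporting
    file exists and is readable; returns (key, path, ok)."""
    with open(conf_path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            key, sep, value = line.partition("=")
            # The thread shows both 'logfile' and 'log_path' as the key
            # name, so accept either here.
            if sep and key in ("logfile", "log_path"):
                ok = os.path.isfile(value) and os.access(value, os.R_OK)
                return key, value, ok
    return None, None, False
```

A `(key, path, False)` result would point at exactly the failure mode Jeff suspects: the config names a reporting file the container cannot read.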
> > >
> > >
> > >
> > > > Hi Jeff,
> > > >
> > > > Thanks for your answer. Ok, I have the file
> > > > $SGE_ROOT/default/common/reporting.
> > > > No, we are not using ARCO.  The only thing that I think may be
> > > > happening is that globus cannot read this file, but I tried
> > > > reading it as the user "globus" and as the user who sent the
> > > > job, and I could read it without any problem. Is there any place
> > > > where I can tell globus to read this file?
> > > >
> > > > Below is the output of the 'reporting' file after sending a job
> > > > through globus and sending a job with qsub.
> > > >
> > > > tail -f $SGE_ROOT/default/common/reporting
> > > > --------------------------------------------------------------------------
> > > > 1196873426:new_job:1196873426:1417619:-1:NONE:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:1024
> > > > 1196873426:job_log:1196873426:pending:1417619:-1:NONE::cyteduser:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:new job
> > > > 1196873437:job_log:1196873437:sent:1417619:0:NONE:t:master:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:sent to execd
> > > > 1196873437:host_consumable:compute-1-12.local:1196873437:X:num_proc=1.000000=1.000000,s_vmem=524288000.000000=1.300G
> > > > 1196873437:queue_consumable:pro_cytedgrid:compute-1-12.local:1196873437::num_proc=1.000000=1.000000,s_vmem=524288000.000000=1.000G,slots=1.000000=1.000000
> > > > 1196873437:job_log:1196873437:delivered:1417619:0:NONE:r:master:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job received by execd
> > > > 1196873437:acct:pro_cytedgrid:compute-1-12.local:cesga:cyteduser:sge_job_script.28406:1417619:sge:0:1196873426:1196873390:1196873391:0:0:1:0:0:0.000000:0:0:0:0:5330:0:0:0.000000:0:0:0:0:258:45:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:-U pro_cytedgrid -l arch=i386,h_fsize=1G,h_stack=16M,num_proc=1,s_rt=3600,s_vmem=500M:0.000000:NONE:0.000000
> > > > 1196873437:job_log:1196873437:finished:1417619:0:NONE:r:execution daemon:compute-1-12.local:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job exited
> > > > 1196873437:job_log:1196873437:finished:1417619:0:NONE:r:master:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job waits for schedds deletion
> > > > 1196873437:host_consumable:compute-1-12.local:1196873437:X:num_proc=0.000000=1.000000,s_vmem=0.000000=1.300G
> > > > 1196873437:queue_consumable:pro_cytedgrid:compute-1-12.local:1196873437::num_proc=0.000000=1.000000,s_vmem=0.000000=1.000G,slots=0.000000=1.000000
> > > > 1196873448:job_log:1196873448:deleted:1417619:0:NONE:T:scheduler:svgd.cesga.es:0:1024:1196873426:sge_job_script.28406:cyteduser:cesga::defaultdepartment:sge:job deleted by schedd
> > > >
> > > > 1196873742:new_job:1196873742:1417621:-1:NONE:test.sh:esfreire:cesga::defaultdepartment:sge:1024
> > > > 1196873742:job_log:1196873742:pending:1417621:-1:NONE::esfreire:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:new job
> > > > 1196873753:job_log:1196873753:sent:1417621:0:NONE:t:master:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:sent to execd
> > > > 1196873753:host_consumable:compute-1-14.local:1196873753:X:num_proc=1.000000=1.000000,s_vmem=1073741824.000000=1.300G
> > > > 1196873753:queue_consumable:GRID:compute-1-14.local:1196873753::num_proc=1.000000=1.000000,s_vmem=1073741824.000000=2.000G,slots=1.000000=1.000000
> > > > 1196873753:job_log:1196873753:delivered:1417621:0:NONE:r:master:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job received by execd
> > > > 1196873754:acct:GRID:compute-1-14.local:cesga:esfreire:test.sh:1417621:sge:0:1196873742:1196873658:1196873658:0:0:0:0:0:0.000000:0:0:0:0:689:0:0:0.000000:0:0:0:0:202:2:NONE:defaultdepartment:NONE:1:0:0.000000:0.000000:0.000000:-U paralelo-gigabit,jmourino,esfreire,blades_dell -l arch=i386,h_fsize=1G,h_stack=16M,network=gigabit,num_proc=1,s_rt=3600,s_vmem=1G:0.000000:NONE:0.000000
> > > > 1196873754:job_log:1196873754:finished:1417621:0:NONE:r:execution daemon:compute-1-14.local:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job exited
> > > > 1196873754:job_log:1196873754:finished:1417621:0:NONE:r:master:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job waits for schedds deletion
> > > > 1196873754:host_consumable:compute-1-14.local:1196873754:X:num_proc=0.000000=1.000000,s_vmem=0.000000=1.300G
> > > > 1196873754:queue_consumable:GRID:compute-1-14.local:1196873754::num_proc=0.000000=1.000000,s_vmem=0.000000=2.000G,slots=0.000000=1.000000
> > > > 1196873764:job_log:1196873764:deleted:1417621:0:NONE:T:scheduler:svgd.cesga.es:0:1024:1196873742:test.sh:esfreire:cesga::defaultdepartment:sge:job deleted by schedd
> > > > --------------------------------------------------------------------------
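[Editorial aside: the job_log records in the dump above are colon-delimited, so the state transitions for one job can be pulled out with a few lines of Python. A sketch; the field positions are inferred from the dump itself, not from SGE documentation.]

```python
def job_states(reporting_text, job_number):
    """Return the sequence of (timestamp, state) pairs for one SGE job,
    taken from job_log records in a reporting file dump."""
    states = []
    for line in reporting_text.splitlines():
        fields = line.split(":")
        # job_log layout seen above: time:job_log:event_time:state:job_number:...
        if len(fields) > 4 and fields[1] == "job_log" and fields[4] == str(job_number):
            states.append((int(fields[2]), fields[3]))
    return states
```

For job 1417621 above this yields pending, sent, delivered, finished (twice), and deleted, which is why the dumps themselves "look fine": SGE is recording a full lifecycle even though gt4 never sees it.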
> > > >
> > > > By the way, we are using SGE 6.0u6.
> > > >
> > > > Thanks,
> > > > Esteban
> > > >
> > > > Jeff Porter wrote:
> > > > > Hi Esteban,
> > > > >
> > > > > The logfile noted in the docs is the 'reporting' file:
> > > > > $SGE_ROOT/default/common/reporting.  The gt4 C code reads that
> > > > > file for job state information instead of calling qsub from
> > > > > sge.pm as is done for gt2.  I wouldn't spend much time on the
> > > > > sge.pm file, as its use in gt4 is essentially just for
> > > > > submission.  And the patch you say you applied before is
> > > > > directed at fixing gt2-specific details that break gt4
> > > > > submissions.
> > > > >
> > > > > One other issue: if you are running ARCO, you may have this
> > > > > problem. I understand the dbwriter code deletes the reporting
> > > > > file with each read as its mechanism for checkpointing. Thus
> > > > > gt4 will never see the change in state through this file.
> > > > >
> > > > > Thanks, Jeff
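[Editorial aside: if ARCO's dbwriter were consuming the reporting file as Jeff describes, the file would shrink, be replaced, or disappear between polls. A hypothetical sketch of that check, built on two `os.stat` snapshots:]

```python
import os

def snapshot(path):
    """(inode, size) for the reporting file, or None if it is gone."""
    try:
        st = os.stat(path)
        return (st.st_ino, st.st_size)
    except OSError:
        return None

def reporting_changed(prev, cur):
    """True if the file shrank, was replaced (new inode), or disappeared
    between two snapshots -- signs that another consumer (e.g. ARCO's
    dbwriter) is truncating or deleting it."""
    if prev is None:
        return False
    if cur is None:
        return True
    return cur[0] != prev[0] or cur[1] < prev[1]
```

Taking a snapshot every few seconds while a job runs and testing `reporting_changed()` between them would show whether some other process is consuming the file before gt4 reads it.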
> > > > >
> > > > >
> > > > >
> > > > > > Hi Melvin,
> > > > > >
> > > > > > Thanks for your answer. I have "reporting=true" but I had
> > > > > > "joblog=false"; I have now changed that to "joblog=true",
> > > > > > reinstalled the "London e-Science Centre" packages, and run
> > > > > > gpt-postinstall again, but unfortunately the job still never
> > > > > > leaves the "Unsubmitted" state:
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > > ----------------------------------------------
> > > > > > [esfreire at svgd ~]$ globusrun-ws -submit -pft -T 10000 -s -S -factory svgd.cesga.es -Ft SGE -c /bin/hostname
> > > > > > Delegating user credentials...Done.
> > > > > > Submitting job...Done.
> > > > > > Job ID: uuid:1fe5c0d2-a31d-11dc-a78b-000423ac0723
> > > > > > Termination time: 12/06/2007 10:30 GMT
> > > > > > Current job state: Unsubmitted
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > > ----------------------------------------------
> > > > > > One thing that I don't understand is that the "London
> > > > > > e-Science Centre" page says, "Your SGE installation must
> > > > > > also be configured with support for the reporting logfile
> > > > > > enabled, and that logfile must be accessible from the server
> > > > > > on which you are installing GT4". I don't know which file
> > > > > > this "logfile" is; I suppose it is
> > > > > > "$SGE_ROOT/default/spool/qmaster/messages".
> > > > > >
> > > > > > Another thing that I think indicates something is going
> > > > > > wrong is that the job only runs for about 1 second.
> > > > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > > ----------------------------------------------
> > > > > > [globus at svgd JobManager]$ qacct -j 1417415
> > > > > > ==============================================================
> > > > > > qname        pro_cytedgrid
> > > > > > hostname     compute-1-12.local
> > > > > > group        cesga
> > > > > > owner        cyteduser
> > > > > > project      NONE
> > > > > > department   defaultdepartment
> > > > > > jobname      sge_job_script.1784
> > > > > > jobnumber    1417415
> > > > > > taskid       undefined
> > > > > > account      sge
> > > > > > priority     0
> > > > > > qsub_time    Wed Dec  5 11:18:41 2007
> > > > > > start_time   Wed Dec  5 11:18:05 2007
> > > > > > end_time     Wed Dec  5 11:18:06 2007
> > > > > > granted_pe   NONE
> > > > > > slots        1
> > > > > > failed       0
> > > > > > exit_status  0
> > > > > > ru_wallclock 1
> > > > > > ru_utime     0
> > > > > > ru_stime     0
> > > > > > ru_maxrss    0
> > > > > > ru_ixrss     0
> > > > > > ru_ismrss    0
> > > > > > ru_idrss     0
> > > > > > ru_isrss     0
> > > > > > ru_minflt    5328
> > > > > > ru_majflt    0
> > > > > > ru_nswap     0
> > > > > > ru_inblock   0
> > > > > > ru_oublock   0
> > > > > > ru_msgsnd    0
> > > > > > ru_msgrcv    0
> > > > > > ru_nsignals  0
> > > > > > ru_nvcsw     262
> > > > > > ru_nivcsw    44
> > > > > > cpu          0
> > > > > > mem          0.000
> > > > > > io           0.000
> > > > > > iow          0.000
> > > > > > maxvmem      0.000
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > ---
> > > >
> > > > > > ----------------------------------------------
> > > > > > I don't know what else to change.
> > > > > >
> > > > > >
> > > > > > Thank you very much,
> > > > > > Esteban
> > > > > >
> > > > > > Melvin Koh wrote:
> > > > > >
> > > > > >
> > > > > > > Have you enabled "reporting=true" and "joblog=true" in
> > > > > > > "qconf -mconf"?
> > > > > >
> > > > > >
> > > > > > > On Fri, 23 Nov 2007, Esteban Freire Garcia wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > First of all, thanks for answering me. We installed the
> > > > > > > > patch yesterday; unfortunately, we still have the same
> > > > > > > > problem. We will try looking at the jobmanager, because
> > > > > > > > I think that for some reason the jobmanager (sge.pm) is
> > > > > > > > not seeing the status of the job correctly, and it
> > > > > > > > doesn't know when the job has finished.
> > > > > > > >
> > > > > > > >
> ---------------------------------------------------------------
> > > > > > > >
> > > > --
> > > >
> > > > > > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > --
> > > >
> > > > > >
> > > > > > > > [esfreire at svgd ~]$ globusrun-ws -submit -pft -s -S -F https://svgd.cesga.es:8443/wsrf/services/ManagedJobFactoryService -Ft SGE -c /bin/hostname
> > > > > > > > Delegating user credentials...Done.
> > > > > > > > Submitting job...Done.
> > > > > > > > Job ID: uuid:580a49d2-9923-11dc-9646-000423ac0723
> > > > > > > > Termination time: 11/23/2007 17:49 GMT
> > > > > > > > Current job state: Unsubmitted
> > > > > > > >
> > > > > > > > globusrun-ws: Error querying job state
> > > > > > > >
> ---------------------------------------------------------------
> > > > > > > >
> > > > --
> > > >
> > > > > > > >
> > > > > >
> -----------------------------------------------------------------
> > > > > >
> > > > --
> > > >
> > > > > >
> > > > > > > > Thank you very much,
> > > > > > > > Esteban
> > > > > > > >
> > > > > > > > Otheus (aka Timothy J. Shelling) wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > On Nov 20, 2007 9:13 AM, Esteban Freire Garcia
> > > > > > > > > <esfreire at cesga.es> wrote:
> > > > > > > > >
> > > > > > > > >     Hi,
> > > > > > > > >
> > > > > > > > >     We have installed 'gt4.0.5-x86_64_rhas_4-installer'
> > > > > > > > >     on "Red Hat Enterprise Linux ES release 4
> > > > > > > > >     (Nahant)".  ...
> > > > > > > > >     Now, we are trying to integrate Globus with SGE
> > > > > > > > >     6.0u6,
> > > > > > > > >
> > > > > > > > > I don't know if this will help or not. I had to patch
> > > > > > > > > gt4.0.2 to work with SGE 6.0u4 as follows:
> > > > > > >
> > > > > > >
> > > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: globus-unsubscribe at gridengine.sunsource.net
> > > > > For additional commands, e-mail: globus-help at gridengine.sunsource.net
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>


-- 
Otheus
otheus at gmail.com
+43.699.1049.7813



More information about the gridengine-users mailing list