[GE users] schedd dies - NULL element for JAT_status

Andy Schwierskott andy.schwierskott at sun.com
Tue Jun 29 12:15:40 BST 2004


Hi,

OK, so it looks like this is a qmon-submit related bug.

Could you please file an issue in Issuezilla (subcomponent "gui")?

Andy

> Andy,
>
> I'm trying to learn how array jobs work with the executable
> 'array_test_script':
> #!/bin/csh
> set input1="test string "
> echo $SGE_TASK_ID $input1 >> array_output
> sleep 1
>
> Here's the process I go through:
> There are no jobs pending.
> I run the command 'ps -ef | grep sge' and see the master host is not running
> sge_schedd:
>     root 21372     1  0   Apr 22 ?       44:44 [$SGE_ROOT]/bin/solaris64/sge_commd
>      awe 21374     1  0   Apr 22 ?       24:05 [$SGE_ROOT]/bin/solaris64/sge_qmaster
>      awe 21583     1  1   Apr 22 ?       329:58 [$SGE_ROOT]/bin/solaris64/sge_execd
>
> Then I use 'rcsge' to restart sge_schedd:
> prompt> $SGE_ROOT/default/common/rcsge
>    starting sge_qmaster
> critical error: found running qmaster with pid 21374 - not starting
>    starting sge_schedd
>    starting sge_execd
> critical error: execd is already running
>
> Ok, sge_schedd is now up.
> I watch it using 'top' and rerun 'ps -ef | grep sge' to verify it's there.
> I submit the array job of the echo $SGE_TASK_ID script above:
> prompt> qsub -t 2-10:2 array_test_script
> your job-array 42.2-10:2 ("array_test_script") has been submitted
>
> Now, if I submit the job from the command prompt:
> qsub -t 2-10:2 array_test_script
>
> it works. But if I open qmon and submit the job with the following options,
> sge_schedd dies:
> Job Script: [directory]/array_test_script
> Job Tasks: 2-10:2
> Job Name: array_test_script
> Merge Output: checked
> Start Job Immediately: checked
>
> I think it has to do with me checking Start Job Immediately.
>
> prompt> qstat -j 47
> job_number:                 47
> exec_file:                  job_scripts/47
> submission_time:            Mon Jun 28 13:54:08 2004
> owner:                      [removed]
> uid:                        50002
> group:                      staff
> gid:                        10
> sge_o_home:                 [removed]
> sge_o_log_name:             [removed]
> sge_o_path:                 [removed]/sge/bin/solaris64:/usr/local/bin:/usr/bin:/usr/openwin/bin:/opt/SUNWspro/bin:.
> sge_o_mail:                 /var/mail/[removed]
> sge_o_shell:                /bin/tcsh
> sge_o_tz:                   EST5EDT
> sge_o_workdir:              [removed]
> sge_o_host:                 pace122
> account:                    sge
> cell:                       default
> directive_prefix:           #$
> merge:                      y
> mail_options:               n
> mail_list:                  [removed]
> notify:                     FALSE
> job_name:                   array_test_script
> script_file:                [removed]/array_test_script
> job-array tasks:            2-10:2
> scheduling info:            queue "pace123.q" dropped because it is temporarily not available
>
> prompt> qstat -j 47 output
> Following jobs do not exist: output
>
> I'll submit through the command prompt or without Start... checked. I guess this
> might be a qmon/sge_schedd bug.
> Thanks for your help.
>
> -Sam
>
> Quoting Andy Schwierskott <andy.schwierskott at sun.com>:
>
> > Sam,
> >
> > Have there been any running jobs in the system?
> >
> > What was the exact command line for submitting the job? Please send the
> >
> >    qstat -r
> >    qstat -j <jobid> output
> >
> > (of course you may change any confidential data (if any) in this output).
> >
> > Then:
> >
> >    please delete the job
> >    submit it again
> >
> > Does the scheduler die again?
> >
> > Andy
> >
> > > I haven't played around with my SGE5.3p6 installation or the 2 Solaris 8
> > > boxes I'm testing it on for 3 weeks. I'm back, and submit an array job that
> > > only prints $SGE_TASK_ID and the job never executes. Schedd is not running.
> > > I restart it, but it immediately dies:
> > > (schedd message file)
> > > Fri Jun 25 17:32:11 2004|schedd|pace122|I|starting up 5.3p6 (sge)
> > > Fri Jun 25 17:32:11 2004|schedd|pace122|C|!!!!!!!!!! lGetUlong(): got NULL element for JAT_status !!!!!!!!!!
> > >
> > > No system changes/upgrades since last time it worked.
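
For readers unfamiliar with the '-t 2-10:2' range used in the qsub call quoted above: it asks for tasks numbered from 2 to 10 in steps of 2, so each task sees one of those values in $SGE_TASK_ID. The following is only a rough sketch of that expansion in plain sh. The function name expand_task_range is made up for illustration and is not part of SGE.

```shell
#!/bin/sh
# expand_task_range is a hypothetical helper, not an SGE command.
# It expands a range written as "first-last:step" (the form passed to
# qsub -t in this thread) into one task ID per line.
expand_task_range() {
    range=$1
    first=${range%%-*}          # text before the first "-"
    rest=${range#*-}            # text after the first "-"
    last=${rest%%:*}            # upper bound, before any ":"
    case $rest in
        *:*) step=${rest#*:} ;; # explicit step after ":"
        *)   step=1 ;;          # assume a step of 1 when none is given
    esac
    id=$first
    while [ "$id" -le "$last" ]; do
        printf '%s\n' "$id"
        id=$((id + step))
    done
}
```

Calling 'expand_task_range 2-10:2' lists the IDs 2, 4, 6, 8 and 10, i.e. the five tasks the test script would append to array_output.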

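Since the symptom in this thread is sge_schedd silently disappearing, it may help to check for it before and after each submit instead of reading the 'ps -ef | grep sge' output by eye. Below is a minimal sketch, assuming the four SGE 5.3 daemon names that appear in this thread. The function name missing_sge_daemons is hypothetical and not an SGE tool.

```shell
#!/bin/sh
# missing_sge_daemons is a hypothetical helper, not part of SGE.
# It reads `ps -ef`-style text on stdin and prints the names of the
# daemons mentioned in this thread that do not show up in it.
missing_sge_daemons() {
    ps_text=$(cat)
    missing=""
    for daemon in sge_commd sge_qmaster sge_schedd sge_execd; do
        # Match the name at the start of a line or after "/" or a space,
        # followed by a space or end of line.
        if ! printf '%s\n' "$ps_text" | grep -Eq "(^|[/ ])${daemon}( |\$)"; then
            missing="${missing}${missing:+ }${daemon}"
        fi
    done
    printf '%s\n' "$missing"
}
```

A possible use is 'ps -ef | missing_sge_daemons' on the master host: an empty line means all four were found, and a line containing sge_schedd means the scheduler has gone away again.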