[GE users] schedd dies - NULL element for JAT_status
andy.schwierskott at sun.com
Tue Jun 29 12:15:40 BST 2004
OK, so it looks like this is a qmon-submit related bug.
Could you please file an issue in Issuezilla (subcomponent "gui")?
> I'm trying to learn how array jobs work with the executable
> set input1="test string "
> echo $SGE_TASK_ID $input1 >> array_output
> sleep 1
> Here's the process I go through:
> There are no jobs pending.
> I run the command 'ps -ef | grep sge' and see that sge_schedd is not running on the master host:
> root 21372 1 0 Apr 22 ? 44:44
> awe 21374 1 0 Apr 22 ? 24:05
> awe 21583 1 1 Apr 22 ? 329:58
> Then I use 'rcsge' to restart sge_schedd:
> prompt> $SGE_ROOT/default/common/rcsge
> starting sge_qmaster
> critical error: found running qmaster with pid 21374 - not starting
> starting sge_schedd
> starting sge_execd
> critical error: execd is already running
> Ok, sge_schedd is now up.
> I watch it using 'top' and rerun 'ps -ef | grep sge' to verify it's there.
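
The manual 'ps -ef | grep sge' check quoted above could be scripted roughly as follows. This is only a sketch in Bourne shell: the function name daemon_listed is made up for illustration, and it assumes the daemon shows up in the listing under its plain name (sge_schedd), possibly with a leading path.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): succeed if the `ps -ef` listing on
# stdin contains a field whose basename equals the given daemon name.
daemon_listed() {
    awk -v name="$1" '
        {
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, "/")      # strip any leading path
                if (parts[n] == name) found = 1
            }
        }
        END { exit !found }                    # exit 0 only if a match was seen
    '
}

# Check the live process table for the scheduler daemon.
if ps -ef 2>/dev/null | daemon_listed sge_schedd; then
    echo "sge_schedd is running"
else
    echo "sge_schedd is not running"
fi
```

Matching on whole fields rather than on the substring "sge" avoids counting the grep process itself or the other SGE daemons.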
> I submit the array job of the echo $SGE_TASK_ID script above:
> prompt> qsub -t 2-10:2 array_test_script
> your job-array 42.2-10:2 ("array_test_script") has been submitted
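
For reference, the range given to -t has the form first-last:step, and each task of the array sees one of those values in $SGE_TASK_ID. A small sketch that lists the IDs such a range covers (written in Bourne shell rather than the tcsh of the test script; expand_tasks is a made-up helper, not an SGE command):

```shell
#!/bin/sh
# Hypothetical helper (not an SGE command): print the task IDs covered by a
# range written as "first-last" or "first-last:step".
expand_tasks() {
    range=$1
    first=${range%%-*}          # text before the first "-"
    rest=${range#*-}            # text after the first "-"
    last=${rest%%:*}            # upper bound, without any ":step" suffix
    case $rest in
        *:*) step=${rest#*:} ;; # explicit step size
        *)   step=1 ;;          # assume a step of 1 when none is given
    esac
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s\n' "$i"
        i=$((i + step))
    done
}

# The range from the submission above: prints 2, 4, 6, 8 and 10, one per line.
expand_tasks 2-10:2
```

So the test script should run five times and append five lines to array_output, one per task ID.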
> Now, if I submit the job from the command prompt:
> qsub -t 2-10:2 array_test_script
> it works. But if I open qmon and submit the job with the following options
> sge_schedd dies:
> Job Script: [directory]/array_test_script
> Job Tasks: 2-10:2
> Job Name: array_test_script
> Merge Output: checked
> Start Job Immediately: checked
> I think it has to do with me checking Start Job Immediately.
> prompt> qstat -j 47
> job_number: 47
> exec_file: job_scripts/47
> submission_time: Mon Jun 28 13:54:08 2004
> owner: [removed]
> uid: 50002
> group: staff
> gid: 10
> sge_o_home: [removed]
> sge_o_log_name: [removed]
> sge_o_mail: /var/mail/[removed]
> sge_o_shell: /bin/tcsh
> sge_o_tz: EST5EDT
> sge_o_workdir: [removed]
> sge_o_host: pace122
> account: sge
> cell: default
> directive_prefix: #$
> merge: y
> mail_options: n
> mail_list: [removed]
> notify: FALSE
> job_name: array_test_script
> script_file: [removed]/array_test_script
> job-array tasks: 2-10:2
> scheduling info: queue "pace123.q" dropped because it is temporarily not available
> prompt> qstat -j 47 output
> Following jobs do not exist: output
> I'll submit through the command prompt or without Start... checked. I guess this
> might be a qmon/sge_schedd bug.
> Thanks for your help.
> Quoting Andy Schwierskott <andy.schwierskott at sun.com>:
> > Sam,
> > Have there been any running jobs in the system?
> > What was the exact command line for submitting the job? Please send the
> > qstat -r
> > qstat -j <jobid> output
> > (of course you may change any confidential data (if any) in this output).
> > Then:
> > please delete the job
> > submit it again
> > Does scheduler die again?
> > Andy
> > > I haven't played around with my SGE5.3p6 installation or the 2 Solaris 8 boxes
> > > I'm testing it on for 3 weeks. I'm back, and submit an array job that only
> > > prints $SGE_TASK_ID and the job never executes. Schedd is not running. I
> > > restart it, but it immediately dies:
> > > (schedd message file)
> > > Fri Jun 25 17:32:11 2004|schedd|pace122|I|starting up 5.3p6 (sge)
> > > Fri Jun 25 17:32:11 2004|schedd|pace122|C|!!!!!!!!!! lGetUlong(): got NULL element for JAT_status !!!!!!!!!!
> > >
> > > No system changes/upgrades since last time it worked.
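
The entries in the schedd message file quoted above are "|"-separated (timestamp, daemon, host, level, text), with "C" marking the critical entry and "I" the informational one. Assuming that layout holds for the whole file, the critical entries could be pulled out with something like the sketch below; critical_entries is a made-up name, and the two sample lines are the ones from the quoted message.

```shell
#!/bin/sh
# Hypothetical helper: print "timestamp: text" for every entry whose level
# field (the 4th "|"-separated field) is "C".
critical_entries() {
    awk -F'|' '
        $4 == "C" {
            msg = $5
            # Re-join the text in case it contains "|" itself.
            for (i = 6; i <= NF; i++) msg = msg "|" $i
            print $1 ": " msg
        }
    '
}

# Demo on the two entries quoted in this thread; only the "C" line is printed.
critical_entries <<'EOF'
Fri Jun 25 17:32:11 2004|schedd|pace122|I|starting up 5.3p6 (sge)
Fri Jun 25 17:32:11 2004|schedd|pace122|C|!!!!!!!!!! lGetUlong(): got NULL element for JAT_status !!!!!!!!!!
EOF
```

To scan a real log, redirect the scheduler's message file into the function instead of the here-document.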