[GE users] schedd dies - NULL element for JAT_status

Jmail jmail at valleyserve.com
Mon Jun 28 19:08:53 BST 2004



Andy,

I'm trying to learn how array jobs work with the executable
'array_test_script':
#!/bin/csh
set input1="test string "
echo $SGE_TASK_ID $input1 >> array_output
sleep 1

Here's the process I go through:
There are no jobs pending.
I run the command 'ps -ef | grep sge' and see the master host is not running
sge_schedd:
    root 21372     1  0   Apr 22 ?       44:44 [$SGE_ROOT]/bin/solaris64/sge_commd
     awe 21374     1  0   Apr 22 ?       24:05 [$SGE_ROOT]/bin/solaris64/sge_qmaster
     awe 21583     1  1   Apr 22 ?       329:58 [$SGE_ROOT]/bin/solaris64/sge_execd
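
The same check can be scripted rather than read by eye. What follows is only a sketch, assuming a POSIX shell and that each daemon shows up in 'ps -ef' with a path ending in its name; missing_daemons is a made-up helper name, not an SGE tool.

```shell
#!/bin/sh
# missing_daemons is a hypothetical helper (not part of SGE): it reads
# `ps -ef` output on stdin and prints the SGE 5.3 daemon names that do
# not appear in it, separated by spaces.
missing_daemons() {
    ps_output=$(cat)
    missing=""
    for daemon in sge_commd sge_qmaster sge_schedd sge_execd; do
        # Match "/<daemon>" so that a "grep sge" process in the listing
        # is not mistaken for a running daemon.
        if ! printf '%s\n' "$ps_output" | grep "/$daemon" >/dev/null; then
            missing="$missing${missing:+ }$daemon"
        fi
    done
    echo "$missing"
}

# Sample input: the three process lines from the listing above,
# one process per line.
missing_daemons <<'EOF'
    root 21372     1  0   Apr 22 ?       44:44 [$SGE_ROOT]/bin/solaris64/sge_commd
     awe 21374     1  0   Apr 22 ?       24:05 [$SGE_ROOT]/bin/solaris64/sge_qmaster
     awe 21583     1  1   Apr 22 ?       329:58 [$SGE_ROOT]/bin/solaris64/sge_execd
EOF
```

On a live master host the function would be fed with 'ps -ef | missing_daemons'; an empty result would mean all four daemons were found.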

Then I use 'rcsge' to restart sge_schedd:
prompt> $SGE_ROOT/default/common/rcsge
   starting sge_qmaster
critical error: found running qmaster with pid 21374 - not starting
   starting sge_schedd
   starting sge_execd
critical error: execd is already running

Ok, sge_schedd is now up.
I watch it using 'top' and rerun 'ps -ef | grep sge' to verify it's there.
I submit the array job of the echo $SGE_TASK_ID script above:
prompt> qsub -t 2-10:2 array_test_script
your job-array 42.2-10:2 ("array_test_script") has been submitted
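
As a sanity check on the range syntax, here is a small sketch of how a 'first-last:step' task range such as 2-10:2 expands into task IDs. It assumes a POSIX shell; expand_tasks is a made-up helper for illustration, not something SGE provides.

```shell
#!/bin/sh
# expand_tasks is a hypothetical helper (not part of SGE): it expands a
# task range of the form "first-last:step" (as passed to qsub -t) into
# the task IDs it covers. A missing ":step" is treated as a step of 1.
expand_tasks() {
    range=$1
    first=${range%%-*}      # text before the first "-"
    rest=${range#*-}        # text after the first "-"
    case $rest in
        *:*) last=${rest%%:*}; step=${rest#*:} ;;
        *)   last=$rest; step=1 ;;
    esac
    id=$first
    out=""
    while [ "$id" -le "$last" ]; do
        out="$out${out:+ }$id"
        id=$((id + step))
    done
    echo "$out"
}

expand_tasks 2-10:2    # prints: 2 4 6 8 10
```

So a run of this job array should start five tasks, each appending one line beginning with its own $SGE_TASK_ID to array_output.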

Now, if I submit the job from the command prompt:
qsub -t 2-10:2 array_test_script

it works. But if I open qmon and submit the job with the following options,
sge_schedd dies:
Job Script: [directory]/array_test_script
Job Tasks: 2-10:2
Job Name: array_test_script
Merge Output: checked
Start Job Immediately: checked

I think it has to do with me checking Start Job Immediately.
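
One way to narrow this down would be to submit the same options from the command line and see whether sge_schedd still dies. The sketch below only prints such a command (a dry run). The mapping of the qmon fields to qsub options is an assumption that has not been confirmed in this thread, in particular that Start Job Immediately corresponds to '-now y'.

```shell
#!/bin/sh
# Dry run: print a qsub command line intended to mirror the qmon
# settings listed above. Nothing is submitted.
# Assumed mapping (not confirmed in this thread):
#   Job Tasks             -> -t 2-10:2
#   Job Name              -> -N array_test_script
#   Merge Output          -> -j y
#   Start Job Immediately -> -now y
script=array_test_script
cmd="qsub -t 2-10:2 -N $script -j y -now y $script"
echo "$cmd"
```

If the printed command also takes down sge_schedd when run by hand, the problem would be with the immediate-start option rather than with qmon itself.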

prompt> qstat -j 47
job_number:                 47
exec_file:                  job_scripts/47
submission_time:            Mon Jun 28 13:54:08 2004
owner:                      [removed]
uid:                        50002
group:                      staff
gid:                        10
sge_o_home:                 [removed]
sge_o_log_name:             [removed]
sge_o_path:                 [removed]/sge/bin/solaris64:/usr/local/bin:/usr/bin:/usr/openwin/bin:/opt/SUNWspro/bin:.
sge_o_mail:                 /var/mail/[removed]
sge_o_shell:                /bin/tcsh
sge_o_tz:                   EST5EDT
sge_o_workdir:              [removed]
sge_o_host:                 pace122
account:                    sge
cell:                       default
directive_prefix:           #$
merge:                      y
mail_options:               n   
mail_list:                  [removed]
notify:                     FALSE
job_name:                   array_test_script
script_file:                [removed]/array_test_script
job-array tasks:            2-10:2
scheduling info:            queue "pace123.q" dropped because it is temporarily not available

prompt> qstat -j 47 output
Following jobs do not exist: output

For now I'll submit from the command prompt, or through qmon without Start Job
Immediately checked. I guess this might be a qmon/sge_schedd bug.
Thanks for your help.

-Sam

Quoting Andy Schwierskott <andy.schwierskott at sun.com>:

> Sam,
> 
> Have there been any running jobs in the system?
> 
> What was the exact command line for submitting the job? Please send the
> 
>    qstat -r
>    qstat -j <jobid> output
> 
> (of course you may change any confidential data (if any) in this output).
> 
> Then:
> 
>    please delete the job
>    submit it again
> 
> Does scheduler die again?
> 
> Andy
> 
> > I haven't played around with my SGE5.3p6 installation or the 2 Solaris 8
> > boxes I'm testing it on for 3 weeks. I'm back, and submit an array job
> > that only prints $SGE_TASK_ID and the job never executes. Schedd is not
> > running. I restart it, but it immediately dies:
> > (schedd message file)
> > Fri Jun 25 17:32:11 2004|schedd|pace122|I|starting up 5.3p6 (sge)
> > Fri Jun 25 17:32:11 2004|schedd|pace122|C|!!!!!!!!!! lGetUlong(): got NULL element for JAT_status !!!!!!!!!!
> >
> > No system changes/upgrades since last time it worked.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 
> 









