Custom Query (431 matches)

Results (130 - 132 of 431)

Ticket Resolution Summary Owner Reporter
#569 fixed IZ2716: interactive jobs (qlogin, qrsh without command) don't set the TZ environment variable correctly pollinger
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2716]

        Issue #:      2716             Platform:     All      Reporter: pollinger (pollinger)
       Component:     gridengine          OS:        All
     Subcomponent:    execution        Version:      6.2         CC:    None defined
        Status:       NEW              Priority:     P4
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    pollinger (pollinger)
      QA Contact:     pollinger
          URL:
       * Summary:     interactive jobs (qlogin, qrsh without command) don't set the TZ environment variable correctly
   Status whiteboard:
      Attachments:

     Issue 2716 blocks:
   Votes for issue 2716:


   Opened: Thu Sep 4 06:16:00 -0700 2008 
------------------------


qrsh without a command and qlogin don't set the TZ environment variable correctly,
so "date" prints the time in the wrong format.
This is easy to reproduce on Linux; on Solaris, "date" seems to read the
time zone from somewhere else if TZ is not set.
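
To illustrate why the TZ value seen by the interactive shell matters, here is a minimal C sketch. It is not taken from Grid Engine; the helper name format_epoch_in_tz and the zone strings "UTC" and "EST5" are assumptions chosen only for illustration. It formats the same instant under two different TZ settings, which is the kind of difference "date" would show.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not part of Grid Engine): format the instant t the way
   a process started with TZ=tz would see it. Pass NULL to unset TZ.
   Returns the number of characters written by strftime(), or 0 on failure. */
static size_t format_epoch_in_tz(time_t t, const char *tz, char *buf, size_t len)
{
    struct tm tm_local;

    assert(buf != NULL && len > 0);
    if (tz != NULL) {
        setenv("TZ", tz, 1);
    } else {
        unsetenv("TZ");
    }
    tzset();                          /* re-read TZ from the environment */
    if (localtime_r(&t, &tm_local) == NULL) {
        return 0;
    }
    return strftime(buf, len, "%Y-%m-%d %H:%M:%S %Z", &tm_local);
}
```

Calling the helper with the same time_t value and two different TZ strings yields two different wall-clock strings, so a job environment with a missing or wrong TZ shows up directly in anything that prints local time.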
#577 fixed IZ2740: Parallel jobs should be handled as a group Dave Love <d.love@…> reuti
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2740]

        Issue #:      2740             Platform:     All       Reporter: reuti (reuti)
       Component:     gridengine          OS:        All
     Subcomponent:    kernel           Version:      6.1u5        CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    FEATURE
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
          URL:
       * Summary:     Parallel jobs should be handled as a group
   Status whiteboard:
      Attachments:

     Issue 2740 blocks:
   Votes for issue 2740:


   Opened: Sat Sep 27 13:38:00 -0700 2008 
------------------------


If one process of a parallel job gets suspended for whatever reason, the complete job should be
affected, as a partial parallel job can't continue anyway. For now, the suspension of a queue with a slave
process on it is simply not delivered: neither to the slave process on this node nor, of course, to the
master of this parallel job.

Even if the intention is still that the suspension of a parallel job must be handled by a custom method,
this custom suspend_method should be invoked on the master node of this parallel job (like the
resume_method later on).

From the email discussion: http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26104

I meant something different. You have one parallel job with just 2 slots running on node1 (master) and
node2 (slave) in a queue called parallel.q. On node2 a serial job starts in a superordinated queue
serial.q (or another parallel job with a different node allocation). Although the queue instance
parallel@node2 is flagged as "S" suspended, no signal is sent to the parallel job running there.

This would lead to a further discussion: should the slave-execd talk to master-execd to suspend the
complete job? Most likely it can't run anyway when one of the slaves is suspended.


IMO, it should, since it really doesn't make sense to suspend part of a job any more than it makes sense
to kill part of a job or adjust the priority of part of a job.  A parallel job, whether it is distributed or not,
should be treated as a group of related processes where the entire job is treated as a unit.

   ------- Additional comments from rayson Sun Sep 28 23:27:46 -0700 2008 -------
If I read the code correctly, suspension on subordinate is triggered by qmaster,
not from the local execds. sge_signal_queue() and signal_slave_jobs_in_queue()
should be able to suspend the whole parallel job when any slave tasks get
subordinate suspended.

sge_signal_queue() calls signal_slave_jobs_in_queue(), which has:

  /* search master queue - needed for signalling of a job */

It eventually calls sge_signal_queue() when it finds the right master queue. And
then, we will hit the bug that Ron found when it calls signal_slave_tasks_of_job():

   if (!jep) {/* signalling a queue ? - handle slave jobs in this queue */
      signal_slave_jobs_in_queue(ctx, how, qep, monitor);
   }
   else {/* is this the master queue of this job to signal ? - then decide
whether slave tasks also must get signalled */
      if (!strcmp(lGetString(lFirst(lGetList(jatep,
JAT_granted_destin_identifier_list)),
            JG_qname), lGetString(qep, QU_full_name))) {
         signal_slave_tasks_of_job(ctx, how, jep, jatep, monitor);
      }
   }


I believe we will wait for the answer from Shannon to see what else is needed...

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26106

Rayson


   ------- Additional comments from reuti Mon Sep 29 06:16:02 -0700 2008 -------
But shouldn't all slots of the parallel job then show up as suspended (or, more correctly, subordinated;
please see below)? Maybe there should be another state for the job: "g", meaning suspended because at least
one process of the group was suspended (for whatever reason). By looking at "qstat -f", it would be
easy to spot the reason: all processes in "qstat -g t" are "g" for a job, and on at least one of them the
queue is in state "S" or "s".

This feature should also only be enabled by an additional switch or a qmaster_params
SUSPEND_PARALLEL_GROUP=yes.
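
The proposed group behaviour can be sketched in C as follows. This is only an illustration of the idea in this comment, not Grid Engine code: struct pe_task and both function names are hypothetical, and group_switch stands in for the proposed (not existing) SUSPEND_PARALLEL_GROUP=yes setting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one "qstat -g t" row; not a Grid Engine type. */
struct pe_task {
    unsigned job_id;
    char queue_state;  /* 'S' or 's' if the queue instance is suspended, else ' ' */
    char task_state;   /* 'r' running, 'S' suspended, 'g' group-suspended */
};

/* True if at least one task of job_id sits in a suspended queue instance. */
static bool job_has_suspended_queue(const struct pe_task *tasks, size_t n,
                                    unsigned job_id)
{
    assert(tasks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].job_id == job_id &&
            (tasks[i].queue_state == 'S' || tasks[i].queue_state == 's')) {
            return true;
        }
    }
    return false;
}

/* Mark every task of job_id as 'g' when any of its queues is suspended and
   the (proposed) switch is on. Returns the number of tasks whose state changed. */
static size_t suspend_job_as_group(struct pe_task *tasks, size_t n,
                                   unsigned job_id, bool group_switch)
{
    size_t changed = 0;

    if (!group_switch || !job_has_suspended_queue(tasks, n, job_id)) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].job_id == job_id && tasks[i].task_state != 'g') {
            tasks[i].task_state = 'g';
            changed++;
        }
    }
    return changed;
}
```

With the switch off nothing changes, which keeps the current behaviour as the default; with it on, a single suspended queue instance is enough to mark the whole job.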

============================

Strange observation: during my tests with a 4 node parallel job and one serial job, it happened that
sometimes:

- none of the parallel processes shows "S"
- only one of the parallel processes shows "S"
- two or more (even all) show "S"

$ qstat -g t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
------------------------------------------------------------------------------------------------------------------
    386 0.61000 test1.sh   reuti        r     09/29/2008 14:39:04 parallel@node10                SLAVE
    386 0.61000 test1.sh   reuti        S     09/29/2008 14:39:04 parallel@node12                SLAVE
    386 0.61000 test1.sh   reuti        S     09/29/2008 14:39:04 parallel@node14                MASTER
    386 0.61000 test1.sh   reuti        S     09/29/2008 14:39:04 parallel@node15                SLAVE
    394 0.50125 test.sh    reuti        r     09/29/2008 15:06:32 vast@node12                    MASTER

Why not the process on node10?

But indeed: the processes on node 14 get the STOP, although the serial job runs on node 12. So it's
really designated in some way already.

   ------- Additional comments from svdavidson Mon Sep 29 07:11:32 -0700 2008 -------
By looking at the trace files for the slave tasks, I noticed that when
suspending a queue using qmod -s, the local processes all get sent a SIGSTOP
signal, but the remote processes do not.  However, when unsuspending the queue,
both the local and the remote processes receive a SIGCONT signal.
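
For readers who want to check signal delivery on a single host, the following C sketch shows one way to observe that a process really was stopped and later resumed. It uses only POSIX calls and is unrelated to the Grid Engine code base; the function name stop_and_resume_child is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child, stop it, resume it, and report whether
   both transitions could be observed from the parent. */
static bool stop_and_resume_child(void)
{
    int status = 0;
    bool stopped, resumed;
    pid_t pid = fork();

    if (pid < 0) {
        return false;
    }
    if (pid == 0) {
        for (;;) {
            pause();              /* child: idle until a signal arrives */
        }
    }

    /* SIGSTOP cannot be caught or ignored; WUNTRACED makes waitpid() report it. */
    if (kill(pid, SIGSTOP) != 0 || waitpid(pid, &status, WUNTRACED) != pid) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return false;
    }
    stopped = WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;

    /* SIGTERM stays pending while the child is stopped; it only takes
       effect once SIGCONT resumes the child. */
    kill(pid, SIGTERM);
    kill(pid, SIGCONT);
    if (waitpid(pid, &status, 0) != pid) {
        return false;
    }
    resumed = WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM;

    return stopped && resumed;
}
```

A caller could simply write assert(stop_and_resume_child()); a process that never received the STOP would not be reported as stopped by waitpid().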

   ------- Additional comments from svdavidson Mon Sep 29 11:05:53 -0700 2008 -------
The signal events are being sent by the qmaster to the parallel queues.

 82040  19471 46922316396864     JOB 62: sent signal STOP (retry after 60 seconds) host: prod-0001
 82136  19471 46922316396864     JOB 62: sent signal STOP (retry after 60 seconds) host: prod-0002

When the execution daemon receives the signal event, in
execd_signal_queue.c:signal_job(), the job state is marked as SUSPENDED and
sge_execd_deliver_signal() is called.

         state = lGetUlong(jatep, JAT_state);
         if (!ISSET(state,JSUSPENDED)) {
            suspend_change = 1;
         }
         SETBIT(JSUSPENDED, state);
         CLEARBIT(JRUNNING, state);
         lSetUlong(jatep, JAT_state, state);

         /* if this is a stop signal for a job
            which is in at least ONE queue
            which is already stopped we
            do not deliver the signal */

         getridofjob = sge_execd_deliver_signal(signal, jep, jatep);
checked and if the state is SUSPENDED, no signal is sent.

   /* Simply apply signal to all subtasks of the job
      except in case of SGE_MIGRATE when there is a
      ckpt env with "migrate on suspend" configured */
   queue_already_suspended = (lGetUlong(jatep, JAT_state)&JSUSPENDED);
   if (!(sig == SGE_MIGRATE
         && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
         && !queue_already_suspended) {
      lListElem *petep;
      /* signal each pe task */
      for_each (petep, lGetList(jatep, JAT_task_list)) {
         if (sge_kill((int)lGetUlong(petep, PET_pid), sig,
            lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number),
            lGetString(petep, PET_id))==-2)
            getridofjob = 1;
      }
   }


For SIGSTOP, the state is always suspended, so no STOP signals are ever sent to
slave tasks.

The variable name is queue_already_suspended, but it's actually checking to see
if the task has been suspended. Can anyone explain this check?
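
The ordering problem described here can be reduced to a few lines of C. The sketch below is a simplified model, not the Grid Engine code: the bit values of JRUNNING and JSUSPENDED and all three function names are assumptions for illustration. It shows that once the caller has set the suspended bit, a later "already suspended" check can never let a STOP through.

```c
#include <assert.h>
#include <stdbool.h>

#define JRUNNING   0x1u   /* assumed bit values, for illustration only */
#define JSUSPENDED 0x2u

/* Models what signal_job() does before delivery: mark the task suspended. */
static unsigned mark_suspended(unsigned state)
{
    state |= JSUSPENDED;
    state &= ~JRUNNING;
    assert((state & JSUSPENDED) != 0);
    return state;
}

/* Models the delivery decision for a plain STOP: with the check enabled,
   the signal is only delivered if the task is not yet flagged as suspended. */
static bool would_deliver_stop(unsigned state, bool honor_suspended_check)
{
    bool already_suspended = (state & JSUSPENDED) != 0;
    return !(honor_suspended_check && already_suspended);
}

/* Full sequence: mark first, then decide whether to deliver. */
static bool stop_is_delivered(unsigned state_before, bool honor_suspended_check)
{
    unsigned state_after = mark_suspended(state_before);
    return would_deliver_stop(state_after, honor_suspended_check);
}
```

In this model stop_is_delivered(JRUNNING, true) is false for every input state, while stop_is_delivered(JRUNNING, false) is true, which matches the observation that STOP never reaches the slave tasks while the check is in place.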

Note: This code also has the same problem reported by Ron, where the
CHECKPOINT_SUSPEND is being compared with an OR instead of an AND.
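
The difference between the two operators is easy to demonstrate in isolation. In the C sketch below the value 0x8 for CHECKPOINT_SUSPEND is an assumption (the real constant is defined in the SGE headers), and both function names are hypothetical; the point is only that OR-ing with a non-zero flag is always non-zero, whereas AND-ing tests whether the flag is set.

```c
#include <assert.h>
#include <stdbool.h>

#define CHECKPOINT_SUSPEND 0x8u   /* assumed value, for illustration only */

/* Test as written in the quoted code: true for every attr value. */
static bool ckpt_suspend_with_or(unsigned attr)
{
    return (attr | CHECKPOINT_SUSPEND) != 0;
}

/* Intended test: true only if the CHECKPOINT_SUSPEND bit is set in attr. */
static bool ckpt_suspend_with_and(unsigned attr)
{
    return (attr & CHECKPOINT_SUSPEND) != 0;
}
```

For a job without any checkpoint attribute (attr == 0), ckpt_suspend_with_or() still returns true, so the job is treated as if "migrate on suspend" were configured; ckpt_suspend_with_and() returns false for the same input.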

   ------- Additional comments from svdavidson Mon Sep 29 12:28:36 -0700 2008 -------
The actual problem is that parallel job suspension is broken.  I removed the
check of queue_already_suspended in sge_execd_deliver_signal(), and suspending
of parallel jobs works now. Suspending of queues, jobs, and queue subordination
all started working for parallel jobs. The only remaining question is what is
supposed to be the purpose of the queue_already_suspended check?

   ------- Additional comments from svdavidson Sat Oct 4 13:13:09 -0700 2008 -------
I have been running the patched code for about a week and it is working for
suspending parallel jobs.  The patches I used are included below.  The patches
are based on the SGE 6.1u3 source code.

diff -ru gridengine-V61u3.ORIG/source/daemons/execd/execd_signal_queue.c gridengine-V61u3/source/daemons/execd/execd_signal_queue.c
--- gridengine-V61u3.ORIG/source/daemons/execd/execd_signal_queue.c  2007-05-09 05:36:51.000000000 -0500
+++ gridengine-V61u3/source/daemons/execd/execd_signal_queue.c  2008-10-04 12:56:55.630669364 -0500
@@ -236,7 +236,9 @@
 lListElem *jep,
 lListElem *jatep
 ) {
+#if 0
    int queue_already_suspended;
+#endif
    int getridofjob = 0;

    DENTER(TOP_LAYER, "sge_execd_deliver_signal");
@@ -288,19 +290,27 @@

 /*
    DPRINTF(("(sig==SGE_MIGRATE) = %d (ckpt on suspend) = %d %d\n",
-      (sig == SGE_MIGRATE), lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND,
+      (sig == SGE_MIGRATE), lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND,
          lGetUlong(jep, JB_checkpoint_attr)));
 */
    /* Simply apply signal to all subtasks of the job
       except in case of SGE_MIGRATE when there is a
       ckpt env with "migrate on suspend" configured */
+#if 0
    queue_already_suspended = (lGetUlong(jatep, JAT_state)&JSUSPENDED);
-   if (!(sig == SGE_MIGRATE
-         && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
+   DPRINTF(("queue_already_suspended = %d\n", queue_already_suspended));
+   if (!(sig == SGE_MIGRATE
+         && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
          && !queue_already_suspended) {
+#endif
+
+   if (!(sig == SGE_MIGRATE
+         && (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND))) {
       lListElem *petep;
+      DPRINTF(("signaling each pe task with signal %d\n", sig));
       /* signal each pe task */
       for_each (petep, lGetList(jatep, JAT_task_list)) {
+         DPRINTF(("signaling pe task pid=%d\n", lGetUlong(petep, PET_pid)));
          if (sge_kill((int)lGetUlong(petep, PET_pid), sig,
             lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number),
             lGetString(petep, PET_id))==-2)
@@ -308,6 +318,8 @@
       }
    }

+   DPRINTF(("lGetUlong(jatep, JAT_status) = %d, JSLAVE = %d\n", lGetUlong(jatep, JAT_status), JSLAVE));
+
    if (lGetUlong(jatep, JAT_status)!=JSLAVE)
       if (sge_kill((int)lGetUlong(jatep, JAT_pid), sig, lGetUlong(jep, JB_job_number),
                         lGetUlong(jatep, JAT_task_number), NULL)==-2)
diff -ru gridengine-V61u3.ORIG/source/daemons/qmaster/sge_qmod_qmaster.c gridengine-V61u3/source/daemons/qmaster/sge_qmod_qmaster.c
--- gridengine-V61u3.ORIG/source/daemons/qmaster/sge_qmod_qmaster.c  2007-10-22 05:05:04.000000000 -0500
+++ gridengine-V61u3/source/daemons/qmaster/sge_qmod_qmaster.c  2008-09-26 15:51:43.473894911 -0500
@@ -1302,7 +1302,7 @@
    /* do not signal slave tasks in case of checkpointing jobs with
       STOP/CONT when suspending means migration */
    if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
-      (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND)!=0) {
+      (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND)!=0) {
       DPRINTF(("omit signaling - checkpoint script does action for whole job\n"));
       return;
    }

   ------- Additional comments from reuti Sat Oct 4 14:43:33 -0700 2008 -------
Did you also try with subordination, as in the configuration I posted above? Are all slave tasks always
honored, with none skipped?
#581 fixed IZ2757: man sge_conf lists telnet for rlogin and rsh / qlogin_command uses qlogin_command /usr/bin/telnetd Dave Love <d.love@…> reuti
Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2757]

        Issue #:      2757             Platform:     All      Reporter: reuti (reuti)
       Component:     gridengine          OS:        All
     Subcomponent:    man              Version:      6.2         CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
          URL:
       * Summary:     man sge_conf lists telnet for rlogin and rsh / qlogin_command uses qlogin_command /usr/bin/telnetd
   Status whiteboard:
      Attachments:

     Issue 2757 blocks:
   Votes for issue 2757:


   Opened: Wed Oct 15 10:30:00 -0700 2008 
------------------------


The man page sge_conf states for

rlogin_command
rlogin_daemon
rsh_command
rsh_daemon

to use the qlogin examples, i.e. telnet and in.telnetd, to fall back to the old startup method. I would
expect there:

$SGE_ROOT/utilbin/$ARC/rlogin
/usr/sbin/in.rlogind
$SGE_ROOT/utilbin/$ARC/rsh
$SGE_ROOT/utilbin/$ARC/rshd

to be listed instead. I can put an absolute path there, but it can't be generic like in former times, hence
I need local configurations in a mixed environment now.

   ------- Additional comments from reuti Fri Oct 17 03:09:39 -0700 2008 -------
The complete entries which must be entered to fall back are:

qlogin_command               /usr/bin/telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_command               /usr/sge/utilbin/lx24-x86/rlogin
rlogin_daemon                /usr/sbin/in.rlogind
rsh_command                  /usr/sge/utilbin/lx24-x86/rsh
rsh_daemon                   /usr/sge/utilbin/lx24-x86/rshd -l

Besides this, exact information about how each command will be called is necessary to set up any calls:

qlogin_command will be called with the parameters "$HOST $PORT"

rlogin_command will be called with "-p $PORT $HOST" (for now man sge_conf states the opposite order:
"...started with the target host and port number as parameters.")

rsh_command will be called with "-n -p $PORT $HOST" (for now man sge_conf states the opposite order, as if
it conformed to telnet: "...the target host and port number as parameters like required for telnet(1)
plus..."; that order is of course wrong.)
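
The three calling conventions listed in this comment can be summarised in a short C sketch. It only encodes the argument orders stated above; the enum, the function name build_login_args and the example host and port are hypothetical and do not come from the Grid Engine sources.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum login_kind { QLOGIN, RLOGIN, RSH };

/* Build the argument string for the configured command:
     qlogin_command: "$HOST $PORT"
     rlogin_command: "-p $PORT $HOST"
     rsh_command:    "-n -p $PORT $HOST"
   Returns the snprintf() result (characters needed, excluding the NUL),
   or -1 for an unknown kind. */
static int build_login_args(enum login_kind kind, const char *host, int port,
                            char *buf, size_t len)
{
    assert(host != NULL && buf != NULL);
    switch (kind) {
    case QLOGIN:
        return snprintf(buf, len, "%s %d", host, port);
    case RLOGIN:
        return snprintf(buf, len, "-p %d %s", port, host);
    case RSH:
        return snprintf(buf, len, "-n -p %d %s", port, host);
    }
    return -1;
}
```

For example, build_login_args(RSH, "node12", 1234, buf, sizeof buf) fills buf so that strcmp(buf, "-n -p 1234 node12") == 0, while QLOGIN with the same inputs gives "node12 1234". A wrapper script configured as one of these commands has to parse its arguments in exactly this order.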

   ------- Additional comments from reuti Mon Nov 24 16:02:33 -0700 2008 -------
Addendum: the section for qlogin_command refers to "qlogin_command /usr/bin/telnetd"; it must read
"qlogin_command /usr/bin/telnet".