Custom Query (431 matches)
Results (130 - 132 of 431)
Ticket | Resolution | Summary | Owner | Reporter |
---|---|---|---|---|
#569 | fixed | IZ2716: interactive jobs (qlogin, qrsh without command) don't set the TZ environment variable correctly | pollinger | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2716] Issue #: 2716 Platform: All Reporter: pollinger (pollinger) Component: gridengine OS: All Subcomponent: execution Version: 6.2 CC: None defined Status: NEW Priority: P4 Resolution: Issue type: DEFECT Target milestone: --- Assigned to: pollinger (pollinger) QA Contact: pollinger URL: * Summary: interactive jobs (qlogin, qrsh without command) don't set the TZ environment variable correctly Status whiteboard: Attachments: Issue 2716 blocks: Votes for issue 2716: Opened: Thu Sep 4 06:16:00 -0700 2008 ------------------------ qrsh without command and qlogin don't set the TZ environment variable correctly, so "date" prints its output in the wrong format. This is easy to reproduce on Linux. On Solaris, "date" seems to read the time zone from somewhere else if TZ is not set. |
|||
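To illustrate why a missing or wrong TZ in the session environment of ticket #569 changes what "date" prints, here is a minimal C sketch of how the C library's local-time conversion follows TZ. It assumes a POSIX system; the function name local_hour_with_tz and the zone strings "UTC0" and "EST5" are illustrative choices, not taken from the ticket or from Grid Engine.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: return the local hour (0-23) of timestamp t as seen
 * by a process whose TZ variable has the given value. Passing NULL removes
 * TZ from the environment, similar to a session started without it. */
int local_hour_with_tz(const char *tz, time_t t)
{
    struct tm tm_buf;

    if (tz != NULL)
        setenv("TZ", tz, 1);   /* overwrite any existing value */
    else
        unsetenv("TZ");
    tzset();                   /* make the C library re-read TZ */

    if (localtime_r(&t, &tm_buf) == NULL)
        return -1;
    return tm_buf.tm_hour;
}
```

With TZ set to "UTC0", timestamp 0 falls in hour 0; with "EST5" the same timestamp falls in hour 19 of the previous day. When TZ is unset, the result depends on the system default, which is why the behaviour can differ between platforms.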
#577 | fixed | IZ2740: Parallel jobs should be handled as a group | Dave Love <d.love@…> | reuti |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2740] Issue #: 2740 Platform: All Reporter: reuti (reuti) Component: gridengine OS: All Subcomponent: kernel Version: 6.1u5 CC: None defined Status: NEW Priority: P3 Resolution: Issue type: FEATURE Target milestone: --- Assigned to: andreas (andreas) QA Contact: andreas URL: * Summary: Parallel jobs should be handled as a group Status whiteboard: Attachments: Issue 2740 blocks: Votes for issue 2740: Opened: Sat Sep 27 13:38:00 -0700 2008 ------------------------ If one process of a parallel job gets suspended for whatever reason, the complete job should be affected, as a partial parallel job can't continue anyway. For now, the suspension of a queue with a slave process on it is simply not delivered: neither to the slave process on this node nor, of course, to the master of this parallel job. Even if the intention is still that the suspension of a parallel job must be handled by a custom method, this custom suspend_method should be invoked on the master node of this parallel job (like the resume_method later on). From the email discussion: http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26104 I meant something different. You have one parallel job with just 2 slots running on node1 (master) and node2 (slave) in a queue called parallel.q. On node2 a serial job starts in a superordinate queue serial.q (or another parallel job with a different node allocation). Although the queue instance parallel@node2 is flagged as "S" (suspended), no signal is sent to the parallel job running there. This would lead to a further discussion: should the slave-execd talk to master-execd to suspend the complete job? Most likely it can't run anyway when one of the slaves is suspended. IMO, it should, since it really doesn't make sense to suspend part of a job any more than it makes sense to kill part of a job or adjust the priority of part of a job. 
A parallel job, whether it is distributed or not, should be treated as a group of related processes where the entire job is treated as a unit.

------- Additional comments from rayson Sun Sep 28 23:27:46 -0700 2008 -------
If I read the code correctly, suspension on subordinate is triggered by qmaster, not from the local execds. sge_signal_queue() and signal_slave_jobs_in_queue() should be able to suspend the whole parallel job when any slave tasks get subordinate suspended. sge_signal_queue() calls signal_slave_jobs_in_queue(), which has:

    /* search master queue - needed for signalling of a job */

It eventually calls sge_signal_queue() when it finds the right master queue. And then, we will hit the bug that Ron found when it calls signal_slave_tasks_of_job():

    if (!jep) {
        /* signalling a queue ? - handle slave jobs in this queue */
        signal_slave_jobs_in_queue(ctx, how, qep, monitor);
    } else {
        /* is this the master queue of this job to signal ? - then decide whether slave tasks also must get signalled */
        if (!strcmp(lGetString(lFirst(lGetList(jatep, JAT_granted_destin_identifier_list)), JG_qname), lGetString(qep, QU_full_name))) {
            signal_slave_tasks_of_job(ctx, how, jep, jatep, monitor);
        }
    }

I believe we will wait for the answer from Shannon to see what else is needed... http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=26106

Rayson

------- Additional comments from reuti Mon Sep 29 06:16:02 -0700 2008 -------
But shouldn't all slots of the parallel job then show up as suspended (or more correctly: subordinated)? Please see below. Maybe there should be another state for the job: "g" - suspended because at least one process of the group was suspended (for whatever reason). By looking at "qstat -f", it would be easy to spot the reason: all processes in "qstat -g t" are "g" for a job, and on at least one of them the queue is in state "S" or "s". This feature should also only be enabled by an additional switch or a qmaster_params SUSPEND_PARALLEL_GROUP=yes.
============================
Strange observation: during my tests with a 4 node parallel job and one serial job, it happened that sometimes:
- none of the parallel processes shows "S"
- only one of the parallel processes shows "S"
- two or more (even all) show "S"

$ qstat -g t
job-ID  prior    name      user   state  submit/start at      queue            master  ja-task-ID
--------------------------------------------------------------------------------------------------
386     0.61000  test1.sh  reuti  r      09/29/2008 14:39:04  parallel@node10  SLAVE
386     0.61000  test1.sh  reuti  S      09/29/2008 14:39:04  parallel@node12  SLAVE
386     0.61000  test1.sh  reuti  S      09/29/2008 14:39:04  parallel@node14  MASTER
386     0.61000  test1.sh  reuti  S      09/29/2008 14:39:04  parallel@node15  SLAVE
394     0.50125  test.sh   reuti  r      09/29/2008 15:06:32  vast@node12      MASTER

Why not the process on node10? But indeed: the processes on node 14 get the STOP, although the serial job runs on node 12. So it's really designated in some way already.

------- Additional comments from svdavidson Mon Sep 29 07:11:32 -0700 2008 -------
By looking at the trace files for the slave tasks, I noticed that when suspending a queue using qmod -s, the local processes all get sent a SIGSTOP signal, but the remote processes do not. However, when unsuspending the queue, both the local and the remote processes receive a SIGCONT signal.

------- Additional comments from svdavidson Mon Sep 29 11:05:53 -0700 2008 -------
The signal events are being sent by the qmaster to the parallel queues.

82040 19471 46922316396864 JOB 62: sent signal STOP (retry after 60 seconds) host: prod-0001
82136 19471 46922316396864 JOB 62: sent signal STOP (retry after 60 seconds) host: prod-0002

When the execution daemon receives the signal event, in execd_signal_queue.c:signal_job(), the job state is marked as SUSPENDED and sge_execd_deliver_signal() is called.

    state = lGetUlong(jatep, JAT_state);
    if (!ISSET(state,JSUSPENDED)) {
        suspend_change = 1;
    }
    SETBIT(JSUSPENDED, state);
    CLEARBIT(JRUNNING, state);
    lSetUlong(jatep, JAT_state, state);

    /* if this is a stop signal for a job
       which is in at least ONE queue
       which is already stopped we
       do not deliver the signal */

    getridofjob = sge_execd_deliver_signal(signal, jep, jatep);

In sge_execd_deliver_signal(), in the SIGSTOP handling code, this same state is checked and if the state is SUSPENDED, no signal is sent.

    /* Simply apply signal to all subtasks of the job
       except in case of SGE_MIGRATE when there is a
       ckpt env with "migrate on suspend" configured */
    queue_already_suspended = (lGetUlong(jatep, JAT_state)&JSUSPENDED);
    if (!(sig == SGE_MIGRATE
          && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
        && !queue_already_suspended) {
        lListElem *petep;
        /* signal each pe task */
        for_each (petep, lGetList(jatep, JAT_task_list)) {
            if (sge_kill((int)lGetUlong(petep, PET_pid), sig,
                         lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number),
                         lGetString(petep, PET_id))==-2)
                getridofjob = 1;
        }
    }

For SIGSTOP, the state is always suspended, so no STOP signals are ever sent to slave tasks. The variable name is queue_already_suspended, but it's actually checking to see if the task has been suspended. Can anyone explain this check?

Note: This code also has the same problem reported by Ron, where the CHECKPOINT_SUSPEND is being compared with an OR instead of an AND.

------- Additional comments from svdavidson Mon Sep 29 12:28:36 -0700 2008 -------
The actual problem is that parallel job suspension is broken. I removed the check of queue_already_suspended in sge_execd_deliver_signal(), and suspending of parallel jobs works now. Suspending of queues, jobs, and queue subordination all started working for parallel jobs.
The only remaining question is what is supposed to be the purpose of the queue_already_suspended check?

------- Additional comments from svdavidson Sat Oct 4 13:13:09 -0700 2008 -------
I have been running the patched code for about a week and it is working for suspending parallel jobs. The patches I used are included below. The patches are based on the SGE 6.1u3 source code.

diff -ru gridengine-V61u3.ORIG/source/daemons/execd/execd_signal_queue.c gridengine-V61u3/source/daemons/execd/execd_signal_queue.c
--- gridengine-V61u3.ORIG/source/daemons/execd/execd_signal_queue.c 2007-05-09 05:36:51.000000000 -0500
+++ gridengine-V61u3/source/daemons/execd/execd_signal_queue.c 2008-10-04 12:56:55.630669364 -0500
@@ -236,7 +236,9 @@
 lListElem *jep,
 lListElem *jatep
 ) {
+#if 0
    int queue_already_suspended;
+#endif
    int getridofjob = 0;

    DENTER(TOP_LAYER, "sge_execd_deliver_signal");
@@ -288,19 +290,27 @@
    /*
    DPRINTF(("(sig==SGE_MIGRATE) = %d (ckpt on suspend) = %d %d\n",
-      (sig == SGE_MIGRATE), lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND,
+      (sig == SGE_MIGRATE), lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND,
       lGetUlong(jep, JB_checkpoint_attr)));
    */

    /* Simply apply signal to all subtasks of the job
       except in case of SGE_MIGRATE when there is a
       ckpt env with "migrate on suspend" configured */
+#if 0
    queue_already_suspended = (lGetUlong(jatep, JAT_state)&JSUSPENDED);
-   if (!(sig == SGE_MIGRATE
-       && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
+   DPRINTF(("queue_already_suspended = %d\n", queue_already_suspended));
+   if (!(sig == SGE_MIGRATE
+       && (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND))
        && !queue_already_suspended) {
+#endif
+
+   if (!(sig == SGE_MIGRATE
+       && (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND))) {
       lListElem *petep;
+      DPRINTF(("signaling each pe task with signal %d\n", sig));
       /* signal each pe task */
       for_each (petep, lGetList(jatep, JAT_task_list)) {
+         DPRINTF(("signaling pe task pid=%d\n", lGetUlong(petep, PET_pid)));
          if (sge_kill((int)lGetUlong(petep, PET_pid), sig,
              lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number),
              lGetString(petep, PET_id))==-2)
@@ -308,6 +318,8 @@
       }
    }

+   DPRINTF(("lGetUlong(jatep, JAT_status) = %d, JSLAVE = %d\n", lGetUlong(jatep, JAT_status), JSLAVE));
+
    if (lGetUlong(jatep, JAT_status)!=JSLAVE)
       if (sge_kill((int)lGetUlong(jatep, JAT_pid), sig,
           lGetUlong(jep, JB_job_number), lGetUlong(jatep, JAT_task_number), NULL)==-2)
diff -ru gridengine-V61u3.ORIG/source/daemons/qmaster/sge_qmod_qmaster.c gridengine-V61u3/source/daemons/qmaster/sge_qmod_qmaster.c
--- gridengine-V61u3.ORIG/source/daemons/qmaster/sge_qmod_qmaster.c 2007-10-22 05:05:04.000000000 -0500
+++ gridengine-V61u3/source/daemons/qmaster/sge_qmod_qmaster.c 2008-09-26 15:51:43.473894911 -0500
@@ -1302,7 +1302,7 @@
    /* do not signal slave tasks in case of checkpointing jobs with
       STOP/CONT when suspending means migration */
    if ((how==SGE_SIGCONT || how==SGE_SIGSTOP) &&
-      (lGetUlong(jep, JB_checkpoint_attr)|CHECKPOINT_SUSPEND)!=0) {
+      (lGetUlong(jep, JB_checkpoint_attr)&CHECKPOINT_SUSPEND)!=0) {
       DPRINTF(("omit signaling - checkpoint script does action for whole job\n"));
       return;
    }

------- Additional comments from reuti Sat Oct 4 14:43:33 -0700 2008 -------
Did you also try with subordination like my configuration which I posted above? All slave tasks are always honored and none is skipped? |
|||
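The two problems discussed in ticket #577 (the suspended bit being set before it is checked, and the flag test using OR instead of AND) can be modelled in isolation. The following is a minimal sketch, not Grid Engine code: the bit values, the enum, and all function names are hypothetical stand-ins for JSUSPENDED, CHECKPOINT_SUSPEND, and the condition inside sge_execd_deliver_signal().

```c
#include <stdbool.h>

/* Hypothetical bit values; the real constants are defined in the
 * Grid Engine headers and may differ. */
#define SKETCH_JSUSPENDED         0x1000u
#define SKETCH_CHECKPOINT_SUSPEND 0x0004u

enum sketch_signal { SKETCH_STOP, SKETCH_CONT, SKETCH_MIGRATE };

/* Models signal_job(): the state is marked suspended BEFORE delivery. */
unsigned mark_suspended(unsigned job_state)
{
    return job_state | SKETCH_JSUSPENDED;
}

/* Condition as quoted in the ticket: OR with a non-zero constant is always
 * true, and the suspended bit was already set by the caller. */
bool should_signal_tasks_original(enum sketch_signal sig,
                                  unsigned job_state, unsigned ckpt_attr)
{
    bool already_suspended = (job_state & SKETCH_JSUSPENDED) != 0;

    return !(sig == SKETCH_MIGRATE
             && (ckpt_attr | SKETCH_CHECKPOINT_SUSPEND) != 0)
           && !already_suspended;
}

/* Condition after the patch: AND tests the bit, and the suspended check
 * is dropped. */
bool should_signal_tasks_patched(enum sketch_signal sig, unsigned ckpt_attr)
{
    return !(sig == SKETCH_MIGRATE
             && (ckpt_attr & SKETCH_CHECKPOINT_SUSPEND) != 0);
}
```

In this model a STOP that arrives after mark_suspended() is never forwarded by the original condition, while the patched condition forwards it. A MIGRATE for a job without the "migrate on suspend" attribute is also skipped by the original condition, because the OR test cannot be false.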
#581 | fixed | IZ2757: man sge_conf lists telnet for rlogin and rsh / qlogin_command uses qlogin_command /usr/bin/telnetd | Dave Love <d.love@…> | reuti |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2757] Issue #: 2757 Platform: All Reporter: reuti (reuti) Component: gridengine OS: All Subcomponent: man Version: 6.2 CC: None defined Status: NEW Priority: P3 Resolution: Issue type: DEFECT Target milestone: --- Assigned to: andreas (andreas) QA Contact: andreas URL: * Summary: man sge_conf lists telnet for rlogin and rsh / qlogin_command uses qlogin_command /usr/bin/telnetd Status whiteboard: Attachments: Issue 2757 blocks: Votes for issue 2757: Opened: Wed Oct 15 10:30:00 -0700 2008 ------------------------
For rlogin_command, rlogin_daemon, rsh_command and rsh_daemon, the man page sge_conf says to use the qlogin examples, i.e. telnet and in.telnetd, to fall back to the old startup method. I would expect these to be listed there instead:

    $SGE_ROOT/utilbin/$ARC/rlogin
    /usr/sbin/in.rlogind
    $SGE_ROOT/utilbin/$ARC/rsh
    $SGE_ROOT/utilbin/$ARC/rshd

I can put an absolute path there, but it can't be generic like in former times, hence I need local configurations in a mixed environment now.

------- Additional comments from reuti Fri Oct 17 03:09:39 -0700 2008 -------
The complete entries which must be entered to fall back are:

    qlogin_command  /usr/bin/telnet
    qlogin_daemon   /usr/sbin/in.telnetd
    rlogin_command  /usr/sge/utilbin/lx24-x86/rlogin
    rlogin_daemon   /usr/sbin/in.rlogind
    rsh_command     /usr/sge/utilbin/lx24-x86/rsh
    rsh_daemon      /usr/sge/utilbin/lx24-x86/rshd -l

Besides this, exact information about how each command will be called is necessary to set up any calls:
- qlogin_command will be called with the parameters "$HOST $PORT"
- rlogin_command will be called with "-p $PORT $HOST" (for now man sge_conf states the opposite order: "...started with the target host and port number as parameters.")
- rsh_command will be called with "-n -p $PORT $HOST" (for now man sge_conf states the opposite order, which would conform to telnet: "...the target host and port number as parameters like required for telnet(1) plus..." The order is of course wrong.)

------- Additional comments from reuti Mon Nov 24 16:02:33 -0700 2008 -------
Add on: the section for qlogin_command refers to "qlogin_command /usr/bin/telnetd", it must read "qlogin_command /usr/bin/telnet". |
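The argument orders reported in ticket #581 can be summarised in a small C sketch. It only formats the parameter string for each client as described in the comments above; the enum, the function name format_client_args, and the buffer handling are hypothetical and are not how Grid Engine itself builds the command line.

```c
#include <stddef.h>
#include <stdio.h>

enum login_kind { KIND_QLOGIN, KIND_RLOGIN, KIND_RSH };

/* Hypothetical helper: write the parameters for the given client into buf.
 * Returns 0 on success, -1 on an unknown kind or a buffer that is too small. */
int format_client_args(enum login_kind kind, const char *host, int port,
                       char *buf, size_t len)
{
    int n;

    switch (kind) {
    case KIND_QLOGIN:            /* "$HOST $PORT" */
        n = snprintf(buf, len, "%s %d", host, port);
        break;
    case KIND_RLOGIN:            /* "-p $PORT $HOST" */
        n = snprintf(buf, len, "-p %d %s", port, host);
        break;
    case KIND_RSH:               /* "-n -p $PORT $HOST" */
        n = snprintf(buf, len, "-n -p %d %s", port, host);
        break;
    default:
        return -1;
    }
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A wrapper script configured as rlogin_command or rsh_command would need to expect the port before the host, which is the opposite of the order used for qlogin_command.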