Custom Query (431 matches)

Results (82 - 84 of 431)

Ticket Resolution Summary Owner Reporter
#1467 fixed [SoGE 8.1.3] Bug: builtin method qlogin/qrsh failing Dave Love <d.love@…> t.mainka@…
Description

Hello,

we experienced a problem on RHEL/CentOS 6 machines with qlogin/qrsh via the builtin starter. The job seems to be scheduled and started fine, but for some reason the shell at the end won't start and the job ends with a commlib error:

$ qlogin -verbose -q queue@host
Your job 998 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 998 has been successfully scheduled.
Establishing builtin session to host exechost.f.q.d.n ...
error: commlib error: got read error (closing "exechost.f.q.d.n/shepherd_ijs/2")

Tracing through the execd on the destination machine showed that the execle() call for the shell failed with EFAULT:

write(4, "07/09/2013 08:30:44 [50449:30912]: execle(/bin/bash, -bash, NULL, env)\n", 71) = 71
execve("/bin/bash", ["-bash"], ["SHELL=/bin/bash", "HOME=/home/username", "TERM=xterm", "LOGNAME=username", "PATH=/bin:/usr/bin", 0x7fffffffffff]) = -1 EFAULT

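After some digging, it looks like the environment array that the function start_qlogin_job() generates is no longer properly terminated with a NULL pointer (as it was in the SGE 6.2u5 source). The trailing 0x7fffffffffff entry in the execve() trace above is consistent with this: the kernel reads past the last real variable into an invalid pointer and returns EFAULT.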
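To illustrate the failure mode, here is a minimal sketch of building an environment array with the terminating NULL that execle() and execve() require. This is not the attached patch and not the SGE code: build_env(), env_length() and free_env() are hypothetical helpers introduced only for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of SGE: release an array made by build_env(). */
void free_env(char **env)
{
    if (env == NULL)
        return;
    for (size_t i = 0; env[i] != NULL; i++)
        free(env[i]);
    free(env);
}

/* Hypothetical helper, not part of SGE: copy `count` "NAME=value" strings
 * into a freshly allocated array suitable as the envp argument of exec*(). */
char **build_env(const char *const *vars, size_t count)
{
    /* One extra slot for the terminating NULL; calloc zeroes every slot,
     * so the array is NULL-terminated even while partially filled. */
    char **env = calloc(count + 1, sizeof *env);
    if (env == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(vars[i]) + 1;
        env[i] = malloc(len);
        if (env[i] == NULL) {
            free_env(env);
            return NULL;
        }
        memcpy(env[i], vars[i], len);
    }

    /* Explicit terminator: without it exec*() walks off the end of the array. */
    env[count] = NULL;
    return env;
}

/* Count entries up to, but not including, the terminating NULL. */
size_t env_length(char *const *env)
{
    size_t n = 0;
    while (env[n] != NULL)
        n++;
    return n;
}
```

An array built this way could then be passed as the last argument of a call such as execle("/bin/bash", "-bash", (char *)NULL, env).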

The attached trivial patch fixed our problems.

Regards, Thomas Mainka

--
Thomas Mainka, System Administration
science+computing ag, Hagellocher Weg 73, 72070 Tuebingen, Germany
mail: t.mainka@…, tel.: +49 7071 9457 472, www.science-computing.de
--
Vorstandsvorsitzender/Chairman of the board of management: Gerd-Lothar Leonhart
Vorstand/Board of Management: Dr. Bernd Finkbeiner, Michael Heinrichs, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: Philippe Miltin
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196

sge-builtin_starter.patch

#1459 duplicate USE_CGROUPS sets host in error state mikaelb
Description

I have been testing the USE_CGROUPS option that is available to execd. With USE_CGROUPS enabled, submitting a single job to a queue instance on an execution node works fine. However, if a second job is submitted to the same queue instance, it fails and puts the queue instance into the error state because the shepherd exited with return code 7. The shepherd trace gives this:

Shepherd trace:
03/13/2013 22:39:47 [0:17310]: shepherd called with uid = 0, euid = 0
03/13/2013 22:39:47 [400:17310]: starting up 8.1.3
03/13/2013 22:39:47 [400:17310]: can't open file pid: Permission denied

Jobs that start successfully have job spool directories owned by the gridadmin administrative user (the user SGE runs as), while the spool directories of the failed jobs are still owned by root. If I turn off USE_CGROUPS, everything works. Initially I thought this was a race condition triggered when jobs are started in rapid succession, but further testing showed that it happens whenever a second job is started on the same execution host.

#1458 wontfix urgency_slots should apply to parallel jobs requesting a fixed number of slots wish
Description

rrcontr is multiplied by the urgency_slots figure where a slot range is requested, but when pe_min=pe_max it is multiplied by the actual number of slots requested. This does not appear to be very useful behavior. Having jobs where pe_min=pe_max also use the urgency_slots figure would lead to more consistent behavior.
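The difference between the two behaviors can be sketched as follows. This is only a model of what the ticket describes, not the scheduler's code: the enum, the function names and the integer averaging are assumptions made for illustration, loosely following the min, max, avg and integer settings of urgency_slots.

```c
/* Hypothetical model of the urgency_slots setting; names are illustrative. */
enum urgency_mode { URGENCY_MIN, URGENCY_MAX, URGENCY_AVG, URGENCY_FIXED };

/* Slot count assumed for a pending job under the given urgency_slots mode.
 * `fixed` is only used for URGENCY_FIXED (an integer setting). */
int urgency_slot_estimate(int pe_min, int pe_max,
                          enum urgency_mode mode, int fixed)
{
    switch (mode) {
    case URGENCY_MIN:   return pe_min;
    case URGENCY_MAX:   return pe_max;
    case URGENCY_AVG:   return (pe_min + pe_max) / 2; /* assumed integer average */
    case URGENCY_FIXED: return fixed;
    }
    return pe_min; /* not reached for valid modes */
}

/* Behavior described in the ticket: a fixed request (pe_min == pe_max)
 * bypasses urgency_slots and uses the requested slot count directly. */
int slots_current(int pe_min, int pe_max, enum urgency_mode mode, int fixed)
{
    if (pe_min == pe_max)
        return pe_min;
    return urgency_slot_estimate(pe_min, pe_max, mode, fixed);
}

/* Behavior requested in the ticket: always apply urgency_slots. */
int slots_proposed(int pe_min, int pe_max, enum urgency_mode mode, int fixed)
{
    return urgency_slot_estimate(pe_min, pe_max, mode, fixed);
}
```

Under this model the two variants only differ for an integer setting, since min, max and avg of a degenerate range all equal the requested slot count.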
