[GE users] a 't' status blocking a node

Alexis Salzman alexis.salzman at medysys.com
Thu Mar 15 11:06:11 GMT 2007



Hi Reuti,
there is nothing in /var/log/messages on this node for the period
before the reboot:
.
.
.
Mar 14 14:31:29 homer01 pam_rhosts_auth[3405]: allowed to idvu at homer01.medysys.fr as idvu
Mar 14 14:31:29 homer01 rsh(pam_unix)[3405]: session opened for user idvu by (uid=0)
Mar 14 14:31:29 homer01 rsh(pam_unix)[3405]: session closed for user idvu
Mar 14 15:59:02 homer01 login(pam_unix)[16553]: session opened for user root by LOGIN(uid=0)
.
.
.
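Since the excerpt above is only a few lines, here is a minimal sketch of how the whole file could be scanned for signs of the OOM killer that Reuti mentions. The search patterns are an assumption: as far as I know, 2.4 kernels log something like "Out of Memory: Killed process ..." and later kernels mention "oom-killer", but the exact wording should be checked against the kernel in use. The function names are hypothetical.

```python
import re

# Assumed patterns, not verified against this exact kernel (2.4.21-27.ELsmp):
# older kernels log "Out of Memory: Killed process ...",
# newer ones mention "oom-killer". Adjust as needed.
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the syslog lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_file(path="/var/log/messages"):
    """Scan one syslog file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_oom_lines(handle)
```

Calling `scan_file()` as root on the node and printing the result would list any matching lines; an empty list only means that none of the assumed patterns matched, not that memory pressure can be ruled out.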
Oh, I forgot to say that this job was a 'PE' job (2 procs on the same node),
in case that has any impact.

Thank you anyway for your reply.

A.S.
 
Reuti wrote:
> Hi,
>
> On 14.03.2007 at 18:13, Alexis Salzman wrote:
>
>> Hi all,
>> with SGE 6.0u6 on AMD Opteron under Red Hat (Linux 2.4.21-27.ELsmp), a
>> strange behaviour sometimes occurs.
>> When the job starts, it changes to the 't' state. Nothing happens. After
>> a while the user tries to delete his job with qdel, without success:
>> the state goes to 'dt' and nothing happens. As admin I tried to delete
>> it with qmon, with no more success.
>> Connecting to the execution node via ssh became impossible.
>> Directly on the execution node:
>>  netstat on the node shows a CLOSE_WAIT for sgeexecd.
>>  /etc/init.d/sgeexecd stop blocks
>>  kill -9 <pid sgeexecd> fails
>>  reboot blocks while trying to shut down sgeexecd => hard reboot needed
>
> did you also check /var/log/messages on this machine? I saw such
> behavior when the machine was running out of memory and therefore
> randomly killed some of the processes running there (OOM killer).
>
> -- Reuti
>
>
>> In the spool messages file of the execution node I find nothing before
>> the reboot (16:30) about the blocked job (17285), which started and
>> blocked around 15:30:
>> .
>> .
>> .
>> 03/12/2007 11:28:01|execd|homer01|E|shepherd of job 17156.1 exited with exit status = 25
>> 03/14/2007 16:30:56|execd|homer01|I|starting up 6.0u6
>> 03/14/2007 16:31:12|execd|homer01|E|acknowledge for unknown job 17285.1/master
>> 03/14/2007 16:31:12|execd|homer01|E|incorrect config file for job 17285.1
>> 03/14/2007 16:31:12|execd|homer01|E|ERROR: unlinking "jobs/00/0001/7285.1": No such file or directory
>> 03/14/2007 16:31:12|execd|homer01|E|can not remove file job spool file: jobs/00/0001/7285.1
>> 03/14/2007 16:31:12|execd|homer01|E|can't remove directory "active_jobs/17285.1": opendir(active_jobs/17285.1) failed: No such file or directory
>>
>> (a short time later the job is deleted from SGE)
>> .
>> .
>> .
>>
>> I didn't touch the qmaster, as I was afraid of propagating this
>> blocking behaviour. I'm not a guru.
>> Any ideas?
>> Is there some kind of parameter or command to avoid the reboot?
>> Is there some network latency parameter?
>> Thanks in advance for any help.
>> A.S.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
>
