[GE users] drmaa return value for getJobProgramStatus

Daniel Templeton Dan.Templeton at Sun.COM
Tue Jul 3 21:03:21 BST 2007


Just as an update, I have closed 1485.  I believe the original intention 
of 1485 was to be able to retrieve the status of *finished* jobs from a 
previous session.  Since, however, the reconnectable sessions 
enhancement has already set the expectation that jobs that exited before 
the current session was initialized are treated as non-existent, I feel 
comfortable asserting that the current behavior of Grid Engine, i.e. 
that getJobProgramStatus() works on all jobs from the active session and 
active jobs not from the current session, is acceptable.  An application 
which is tracking a job's status through getJobProgramStatus() from a 
different session must make the assumption that an error indicating that 
the job does not exist means that the job has entered either the ERROR 
or FINISHED state.  For more information than that, you have to be in 
the same session.
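
As a rough sketch of that assumption (not Grid Engine code): the types below are hypothetical stand-ins for org.ggf.drmaa.InvalidJobException and for the single Session call involved, so that the example compiles without the DRMAA jar.  The names CrossSessionPoller, StatusSource and Outcome are made up for illustration.

```java
// Sketch only: a poller in a different session treats "job does not exist"
// as "the job has ended, outcome unknown".
class CrossSessionPoller {
    // Everything a poller outside the submitting session can conclude.
    enum Outcome { STILL_ACTIVE, ENDED_OUTCOME_UNKNOWN }

    static Outcome check(StatusSource session, String jobId) {
        try {
            // Any successful answer means the qmaster still knows the job.
            session.getJobProgramStatus(jobId);
            return Outcome.STILL_ACTIVE;
        } catch (InvalidJobException e) {
            // The job finished or failed; telling the two apart requires
            // being in the session that submitted it.
            return Outcome.ENDED_OUTCOME_UNKNOWN;
        }
    }

    public static void main(String[] args) {
        // 0x20 is the RUNNING code (32) reported earlier in this thread.
        StatusSource running = jobId -> 0x20;
        StatusSource gone = jobId -> {
            throw new InvalidJobException("job " + jobId + " does not exist");
        };
        System.out.println(check(running, "42"));
        System.out.println(check(gone, "42"));
    }
}

// Hypothetical stand-in for org.ggf.drmaa.InvalidJobException.
class InvalidJobException extends Exception {
    InvalidJobException(String message) { super(message); }
}

// Hypothetical stand-in exposing only the status call discussed above.
interface StatusSource {
    int getJobProgramStatus(String jobId) throws InvalidJobException;
}
```

With the real binding, the same try/catch would wrap Session.getJobProgramStatus() directly.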

While writing this, it occurred to me that with the reconnectable 
sessions enhancement in place, it's possible to do a further 
enhancement: to pass in a special identifier to Session.init() that 
causes there to be no session.  The DRMAA program would then be dealing 
directly with all jobs in the system, unconstrained by sessions.  If 
someone thinks that's the answer to all his/her DRMAA problems, submit 
it as an RFE, and I'll add my 2 cents to the issue.

Daniel

Ryan Golhar wrote:
> Thanks Dan.  That was very helpful!
>
> Ryan
>
>
> -----Original Message-----
> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
> Sent: Tuesday, July 03, 2007 2:28 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] drmaa return value for getJobProgramStatus
>
>
> Ryan,
>
> OK.  I see where your problem is.  Sometimes the function of the DRMAA 
> session can be a little confusing.  According to the DRMAA standard, it 
> is only possible to wait() for a job within the same session in which it 
> was submitted.  In your code, you submit the job, exit the session, then 
> try to wait().  Exiting the session unregisters the DRMAA client for job 
> events from that session.  The wait() call is dependent on those events, 
> so unregistering the client prevents it from working.  Same for 
> synchronize().
>
> getJobProgramStatus() and control(), on the other hand, are allowed by 
> the DRMAA standard to function on jobs outside of the scope of the 
> session.  Because they are synchronous RPC calls, they can work without 
> the need for an event client.  (Actually, Issue 1485 says that 
> getJobProgramStatus() does *not* work with jobs outside the session, but 
> your results and my review of the source code say otherwise.  I'll have 
> to see if we can figure out when 1485 was addressed and close it.)
>
> So, there are two solutions to your problem.  The first is to just not 
> exit() the session.  The second is to use the new reconnectable sessions 
> feature to reconnect to the session that you just closed.
>
> Educational reading:
>
> http://blogs.sun.com/templedf/entry/drmaa_internals1
> http://blogs.sun.com/templedf/entry/good_drmaa_news
>
> Daniel
>
> Ryan Golhar wrote:
>> Dan,
>>
>> I'm still stuck.  I've stripped down my code to the basics to test 
>> this:
>>
>> 1.  I initialize the grid engine
>> 2.  I submit the job and get the jobid
>> 3.  I poll the status of the jobid using getJobProgramStatus.  The 
>> first time I call getJobProgramStatus, it returns QUEUED_ACTIVE.
>>
>> If I call Session.wait(..) immediately after, I get an Exception that 
>> the job does not exist.  However qstat reports the job is queued.
>>
>> If I wait for a few seconds, I see the job start running (using 
>> qstat).  The program calls getJobProgramStatus, which returns RUNNING.  
>> I then call Session.wait and get the same exception.
>>
>> I'm not sure what is wrong.  If the program status is RUNNING, then I 
>> would expect Session.wait to work.  I've attached the java code I'm 
>> using.  Any help would be appreciated.  Thanks,
>>
>> Ryan
>>
>>
>> -----Original Message-----
>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>> Sent: Saturday, June 30, 2007 4:21 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] drmaa return value for getJobProgramStatus
>>
>>
>> Actually, I guess there is a third option.  In the DRMAA Hands-on Lab 
>> I did for JavaOne, I just had the polling thread stop asking about a 
>> job once it got the InvalidJobException, with the assumption that the 
>> job was now in the hands of the wait thread.
>>
>> Daniel
>>
>> Daniel Templeton wrote:
>>> Gah!  My apologies.  Please ignore my previous two emails.  That's 
>>> what I get for writing emails at midnight.  I reread my emails this 
>>> morning, and I don't know what I was thinking. :(
>>>
>>> The implementation *does* already do what I was suggesting.  It uses 
>>> the local job info cache to return the job state if the qmaster 
>>> doesn't remember the job.  The reason why you're seeing the 
>>> InvalidJobException is that you also have a thread doing a wait(ANY) 
>>> call in a loop.  (Am I right?)  Once a wait() call has succeeded for 
>>> a job, that job no longer exists.  Period.
>>>
>>> There are two ways to deal with the problem.  Either have the wait 
>>> thread notify the polling thread once a job has ended, or build the 
>>> wait() call into the polling thread after a job's state is FINISHED 
>>> or ERROR.
>>>
>>> Sorry for the confusion.
>>>
>>> Daniel
>>>
>>> Ryan Golhar wrote:
>>>> Thanks Daniel.  It seems a bit odd that there is a Session.DONE but 
>>>> it will never get used.  If the DRMAA implementation does have 
>>>> information, will it always work or is it just because of the 
>>>> session instance?  If the DRMAA implementation will always have 
>>>> information on completed jobs, then it makes sense to use that 
>>>> information, but if it's not guaranteed, then I don't know if that 
>>>> is the best solution (in my opinion).  In either case, I think it 
>>>> would be good to file it as an RFE.  How do I do that?
>>>>
>>>> Ryan
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM]
>>>> Sent: Saturday, June 30, 2007 3:11 AM
>>>> To: users at gridengine.sunsource.net
>>>> Subject: Re: [GE users] drmaa return value for getJobProgramStatus
>>>>
>>>>
>>>> Ryan,
>>>>
>>>> The exception happens because the qmaster disavows all knowledge of 
>>>> finished jobs.  (Not exactly, but close enough for this discussion.)
>>>> Since the DRMAA implementation actually does have the information 
>>>> about the job on hand, though, it really would make sense for the 
>>>> getJobProgramStatus() method to use that information in the case of 
>>>> finished jobs instead of only relying on the qmaster.  If you'd like 
>>>> to file that as an RFE, that would be helpful.
>>>>
>>>> Thanks,
>>>> Daniel
>>>>
>>>> Daniel Templeton wrote:
>>>>> Ryan,
>>>>>
>>>>> That is indeed how the implementation works.  To confirm that the
>>>>> InvalidJobException from getJobProgramStatus() means that the job 
>>>>> has ended, wait() for the job with the timeout set to 
>>>>> Session.TIMEOUT_NO_WAIT.  If the job has finished, the wait() call 
>>>>> will return its exit info, including why/how it exited.  If the job 
>>>>> simply doesn't exist for some reason, you'll get another 
>>>>> InvalidJobException.
>>>>>
>>>>> Daniel
>>>>>
>>>>> Ryan Golhar wrote:
>>>>>> I'm able to successfully submit a job through DRMAA to the 
>>>>>> appropriate queue and set other settings.  If the job is running 
>>>>>> and I call getJobProgramStatus (Java), I get a return value of 
>>>>>> Session.RUNNING (32), which is correct.  Once the job completes 
>>>>>> and I call getJobProgramStatus, I get an exception about the job 
>>>>>> id not being valid:
>>>>>>
>>>>>> org.ggf.drmaa.InvalidJobException: The job specified by the 'jobid' does not exist.
>>>>>>         at com.sun.grid.drmaa.SessionImpl.nativeGetJobProgramStatus(Native Method)
>>>>>>         at com.sun.grid.drmaa.SessionImpl.getJobProgramStatus(SessionImpl.java:213)
>>>>>>         at org.umdnj.JBLAST.LocalSGEBLAST.exeGet(LocalSGEBLAST.java:82)
>>>>>>         at org.umdnj.JBLAST.BlastResultThread.run(BlastResultThread.java:62)
>>>>>>
>>>>>> I can interpret this exception as meaning the job has completed; 
>>>>>> however, I don't think this is the correct way of doing things, as 
>>>>>> I can't tell if the job completed successfully or if something 
>>>>>> else happened.  Am I missing something?
>>>>>>
>>>>>> Ryan
>   
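
The confirmation step described earlier in the thread (after an InvalidJobException from getJobProgramStatus(), call wait() with Session.TIMEOUT_NO_WAIT) could look roughly like the sketch below.  JobSession, JobInfo, FakeSession and both exception classes are hypothetical stand-ins for the org.ggf.drmaa types, trimmed down so the example compiles without the DRMAA jar.  The timeout constant's value is an assumption made for the sketch.

```java
// Sketch only: ask for the status first, and if the job is reported as
// nonexistent, try a non-blocking wait() to pick up its exit info.
class EndedJobCheck {
    static String describe(JobSession session, String jobId) {
        try {
            int state = session.getJobProgramStatus(jobId);
            return "active, state code " + state;
        } catch (InvalidJobException statusGone) {
            // The qmaster no longer knows the job; confirm with wait().
        }
        try {
            JobInfo info = session.wait(jobId, JobSession.TIMEOUT_NO_WAIT);
            return info.hasExited()
                    ? "exited with status " + info.getExitStatus()
                    : "ended without a normal exit";
        } catch (InvalidJobException neverKnown) {
            return "unknown job";        // the session has no record either
        } catch (ExitTimeoutException notDone) {
            return "not finished yet";   // wait() timed out immediately
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new FakeSession(0x20, null), "42"));
        System.out.println(describe(new FakeSession(null, 0), "42"));
        System.out.println(describe(new FakeSession(null, null), "42"));
    }
}

// Hypothetical stand-ins for the DRMAA exception and info types.
class InvalidJobException extends Exception {
    InvalidJobException(String message) { super(message); }
}

class ExitTimeoutException extends Exception {
    ExitTimeoutException(String message) { super(message); }
}

interface JobInfo {
    boolean hasExited();
    int getExitStatus();
}

// Hypothetical stand-in for the two Session calls used above.
interface JobSession {
    long TIMEOUT_NO_WAIT = 0L;  // assumed value for the sketch

    int getJobProgramStatus(String jobId) throws InvalidJobException;

    JobInfo wait(String jobId, long timeout)
            throws InvalidJobException, ExitTimeoutException;
}

// Demo-only fake: the job is active, finished with exit info, or unknown.
class FakeSession implements JobSession {
    private final Integer activeState;  // null once the job has ended
    private final Integer exitStatus;   // null if no exit info is held

    FakeSession(Integer activeState, Integer exitStatus) {
        this.activeState = activeState;
        this.exitStatus = exitStatus;
    }

    public int getJobProgramStatus(String jobId) throws InvalidJobException {
        if (activeState == null) {
            throw new InvalidJobException("job " + jobId + " does not exist");
        }
        return activeState;
    }

    public JobInfo wait(String jobId, long timeout)
            throws InvalidJobException, ExitTimeoutException {
        if (activeState != null) {
            throw new ExitTimeoutException("job " + jobId + " still active");
        }
        if (exitStatus == null) {
            throw new InvalidJobException("job " + jobId + " does not exist");
        }
        final int status = exitStatus;
        return new JobInfo() {
            public boolean hasExited() { return true; }
            public int getExitStatus() { return status; }
        };
    }
}
```

Note that, as pointed out above, a job no longer exists once a wait() call has succeeded for it, so this check consumes the exit info and should be made only once per job.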

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



