Java Doc for AdaptiveRevisitHostQueue.java in Web Crawler » heritrix » org.archive.crawler.frontier



java.lang.Object
   org.archive.crawler.frontier.AdaptiveRevisitHostQueue

AdaptiveRevisitHostQueue
public class AdaptiveRevisitHostQueue implements AdaptiveRevisitAttributeConstants, FrontierGroup
A priority-based queue of CrawlURIs. Each queue should represent one host (although this is not enforced in this class). Items are ordered by the scheduling directive and time of next processing (in that order) and are also indexed by the URI.
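
The ordering described above can be illustrated with plain Java collections. This is only a minimal in-memory sketch, not the Berkeley DB backed implementation: `QueuedUri`, its fields, and the assumption that a lower scheduling directive value sorts first are all hypothetical.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical stand-in for a CrawlURI; not the Heritrix class. */
final class QueuedUri {
    final String uri;
    final int schedulingDirective;  // assumption: lower value means higher priority
    final long nextProcessingTime;  // milliseconds, supplied by the caller

    QueuedUri(String uri, int schedulingDirective, long nextProcessingTime) {
        this.uri = uri;
        this.schedulingDirective = schedulingDirective;
        this.nextProcessingTime = nextProcessingTime;
    }
}

/** Orders entries by directive, then time, and keeps a lookup by URI string. */
final class OrderedHostQueueSketch {
    private static final Comparator<QueuedUri> ORDER =
            Comparator.comparingInt((QueuedUri q) -> q.schedulingDirective)
                    .thenComparingLong(q -> q.nextProcessingTime)
                    .thenComparing(q -> q.uri); // tie-breaker so distinct URIs never collide

    private final TreeSet<QueuedUri> ordered = new TreeSet<>(ORDER);
    private final Map<String, QueuedUri> byUri = new HashMap<>();

    /** Inserts the entry, replacing any earlier entry with the same URI. */
    void put(QueuedUri q) {
        QueuedUri old = byUri.put(q.uri, q);
        if (old != null) {
            ordered.remove(old);
        }
        ordered.add(q);
    }

    /** Returns the head of the queue without removing it, or null if empty. */
    QueuedUri peek() {
        return ordered.isEmpty() ? null : ordered.first();
    }

    QueuedUri get(String uri) {
        return byUri.get(uri);
    }

    int size() {
        return byUri.size();
    }
}
```

The sketch performs no time calculations of its own; like the HQ, it relies on the values already set on each entry.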

The HQ does no calculations on the 'time of next processing.' It always relies on values already set on the CrawlURI.

Note: This class is not thread safe. In a multithreaded environment the caller must ensure that two threads do not make overlapping calls.

Any BDB DatabaseException will be converted to an IOException by public methods. This includes preserving the original stacktrace, in favor of the one created for the IOException, so that the true source of the exception is not lost.
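
The exception conversion described above might look roughly like the following. This is a sketch under stated assumptions: `StorageException` is a hypothetical stand-in for the BDB DatabaseException so that the example compiles without Berkeley DB, and the helper names are invented for illustration.

```java
import java.io.IOException;

/** Hypothetical stand-in for BDB's DatabaseException. */
final class StorageException extends Exception {
    StorageException(String message) {
        super(message);
    }
}

final class ExceptionConversionSketch {
    /** Builds an IOException that carries the original exception's stack trace. */
    static IOException toIOException(Exception original) {
        IOException converted = new IOException(original.getMessage());
        // Replace the trace created for the IOException with the original one,
        // so the true source of the failure is not lost.
        converted.setStackTrace(original.getStackTrace());
        return converted;
    }

    /** Simulates an internal database call that fails. */
    static void failingRead() throws StorageException {
        throw new StorageException("simulated database failure");
    }

    /** A public method that only exposes IOException to its callers. */
    static void publicRead() throws IOException {
        try {
            failingRead();
        } catch (StorageException e) {
            throw toIOException(e);
        }
    }
}
```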
author:
   Kristinn Sigurdsson



Field Summary
final public static  int HQSTATE_BUSY
     HQ has the maximum number of CrawlURIs currently being processed.
final public static  int HQSTATE_EMPTY
     HQ contains no queued CrawlURI elements.
final public static  int HQSTATE_READY
    
final public static  int HQSTATE_SNOOZED
    
protected  StoredClassCatalog classCatalog
    
protected  EntryBinding crawlURIBinding
    
final  String hostName
    
 long inProcessing
     Number of URIs belonging to this queue that are being processed at the moment.
 long nextReadyTime
     Time (in milliseconds) when the HQ will next be ready to issue a URI for processing.
protected  EntryBinding primaryKeyBinding
    
protected  Database primaryUriDB
     Database containing the URI priority queue, indexed by the URI string.
protected  Database processingUriDB
     A database containing those URIs that are currently being processed.
protected  SecondaryDatabase secondaryUriDB
     Secondary index into the primary DB (primaryUriDB); URIs are indexed by the time when they can next be processed again.
 long size
     Size of queue.
 int state
     Last known state of HQ -- ALL methods should use getState() to read this value, never read it directly.
protected  CrawlSubstats substats
    
 int valence
     Number of simultaneous connections permitted to this host.
 long[] wakeUpTime
     Time (in milliseconds) when each URI 'slot' becomes available again.

Any positive value larger than the current time signifies a taken slot where the URI has completed processing but the politeness wait has not ended.


Constructor Summary
public  AdaptiveRevisitHostQueue(String hostName, Environment env, StoredClassCatalog catalog, int valence)
     Constructor
Parameters:
  hostName - Name of the host this queue represents.

Method Summary
public  void add(CrawlURI curi, boolean overrideSetTimeOnDups)
     Add a CrawlURI to this host queue.

Calls can optionally choose to have the time of next processing value override existing values for the URI if the existing values are 'later' than the new ones.

protected  void addInProcessing(CrawlURI curi)
     Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are being processed at the moment.
public  void close()
     Clean up all open Berkeley Database objects.
protected  long countCrawlURIs()
     Count all entries in both primaryUriDB and processingUriDB.
protected  void deleteInProcessing(String uri)
     Removes a URI from the list of URIs that belong to this HQ and are currently being processed.
protected  void flushProcessingURIs()
     Flush any CrawlURIs in the processingUriDB into the primaryUriDB.
protected  CrawlURI getCrawlURI(String uri)
     Returns the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ.
public  String getHostName()
    
public  long getNextReadyTime()
     Returns the time when the HQ will next be ready to issue a URI.
public  long getSize()
     Returns the size of the HQ.
public  int getState()
     Returns the current state of the HQ.
public  String getStateByName()
     Same as getState() except this method returns a human-readable name for the state instead of its constant integer value.
public  CrawlSubstats getSubstats()
    
protected  boolean inProcessing(String uri)
     Returns true if this HQ has a CrawlURI matching the uri string currently being processed.
public  CrawlURI next()
     Returns the 'top' URI in the AdaptiveRevisitHostQueue.
public  CrawlURI peek()
     Returns the URI with the earliest time of next processing.
protected  void reorder()
     Method is called whenever something has been done that might have changed the value of the 'published' time of next ready.
public  String report(int max)
     Returns a report detailing the status of this HQ.
Parameters:
  max - Maximum number of URIs to show.
protected  void setNextReadyTime(long newTime)
    
public  void setOwner(AdaptiveRevisitQueueList owner)
     Set the AdaptiveRevisitQueueList object that contains this HQ.
protected  OperationStatus strictAdd(CrawlURI curi, boolean overrideDuplicates)
     An internal method for adding URIs to the queue.
public  void update(CrawlURI curi, boolean needWait, long wakeupTime)
     Update CrawlURI that has completed processing.
Parameters:
  curi - The CrawlURI.
public  void update(CrawlURI curi, boolean needWait, long wakeupTime, boolean forgetURI)
     Update CrawlURI that has completed processing.
Parameters:
  curi - The CrawlURI.

Field Detail
HQSTATE_BUSY
final public static int HQSTATE_BUSY
HQ has the maximum number of CrawlURIs currently being processed. This number is either equal to the 'valence' (maximum number of simultaneous connections to a host) or (if smaller) the total number of CrawlURIs in the HQ.
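
The busy threshold can be written out as a small calculation. The following is a sketch with plain numbers rather than the real HQ fields; the class and method names are hypothetical, and the treatment of zero or negative valence as 1 follows the constructor documentation.

```java
/** Sketch only: models the busy threshold with plain numbers. */
final class BusyThresholdSketch {
    /** Zero and negative valence values are treated the same as 1. */
    static int effectiveValence(int valence) {
        return Math.max(1, valence);
    }

    /** The smaller of the valence and the total number of URIs in the queue. */
    static long busyThreshold(int valence, long queueSize) {
        return Math.min((long) effectiveValence(valence), queueSize);
    }

    /** Assumption: an empty queue is never reported as busy. */
    static boolean isBusy(long inProcessing, int valence, long queueSize) {
        return queueSize > 0 && inProcessing >= busyThreshold(valence, queueSize);
    }
}
```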



HQSTATE_EMPTY
final public static int HQSTATE_EMPTY
HQ contains no queued CrawlURI elements. This state only occurs after queue creation, before the first add. After the first item is added the state can never become empty again.



HQSTATE_READY
final public static int HQSTATE_READY
HQ has a CrawlURI ready for processing.



HQSTATE_SNOOZED
final public static int HQSTATE_SNOOZED
HQ is in a suspended state until it can be woken back up.



classCatalog
protected StoredClassCatalog classCatalog
For BDB serialization of objects.



crawlURIBinding
protected EntryBinding crawlURIBinding
A binding for the CrawlURIARWrapper object.



hostName
final String hostName
Name of the host that this AdaptiveRevisitHostQueue represents.



inProcessing
long inProcessing
Number of URIs belonging to this queue that are being processed at the moment. This number will always be in the range 0 to valence.



nextReadyTime
long nextReadyTime
Time (in milliseconds) when the HQ will next be ready to issue a URI for processing. When setting this value, methods should use the setter method setNextReadyTime(long).



primaryKeyBinding
protected EntryBinding primaryKeyBinding
A binding for the serialization of the primary key (URI string).



primaryUriDB
protected Database primaryUriDB
Database containing the URI priority queue, indexed by the URI string.



processingUriDB
protected Database processingUriDB
A database containing those URIs that are currently being processed.



secondaryUriDB
protected SecondaryDatabase secondaryUriDB
Secondary index into the primary DB (primaryUriDB); URIs are indexed by the time when they can next be processed again.



size
long size
Size of queue. That is, the number of CrawlURIs that have been added to it, including any that are currently being processed.



state
int state
Last known state of HQ -- ALL methods should use getState() to read this value, never read it directly.



substats
protected CrawlSubstats substats



valence
int valence
Number of simultaneous connections permitted to this host. That is, this many URIs can be issued before the state of the HQ becomes busy; it stays busy until one of them is returned via the update method.



wakeUpTime
long[] wakeUpTime
Time (in milliseconds) when each URI 'slot' becomes available again.

Any positive value larger than the current time signifies a taken slot where the URI has completed processing but the politeness wait has not ended.

A zero or positive value smaller than the current time in milliseconds signifies an empty slot.

Any negative value signifies a slot for a URI that is being processed.

Methods should never write directly to this; rather, use the updateWakeUpTimeSlot(long) and useWakeUpTimeSlot() methods as needed.
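
The three slot conditions above can be expressed as a small classification function. This is a sketch, not the class's actual code: `SlotState` and the method names are hypothetical, and a value exactly equal to the current time is assumed to mean an empty slot, since the description does not cover that case.

```java
/** Sketch of how a wakeUpTime slot value might be interpreted. */
final class WakeUpSlotSketch {
    enum SlotState { PROCESSING, POLITENESS_WAIT, EMPTY }

    static SlotState classify(long slotValue, long nowMillis) {
        if (slotValue < 0) {
            return SlotState.PROCESSING;       // a URI is being processed in this slot
        }
        if (slotValue > nowMillis) {
            return SlotState.POLITENESS_WAIT;  // processing done, wait not yet over
        }
        return SlotState.EMPTY;                // zero or a time that has already passed
    }

    /** Counts the slots that could accept a new URI right now. */
    static int countEmpty(long[] wakeUpTime, long nowMillis) {
        int empty = 0;
        for (long slot : wakeUpTime) {
            if (classify(slot, nowMillis) == SlotState.EMPTY) {
                empty++;
            }
        }
        return empty;
    }
}
```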





Constructor Detail
AdaptiveRevisitHostQueue
public AdaptiveRevisitHostQueue(String hostName, Environment env, StoredClassCatalog catalog, int valence) throws IOException
Constructor
Parameters:
  hostName - Name of the host this queue represents. This name must be unique for all HQs in the same Environment.
Parameters:
  env - Berkeley DB Environment. All BDB databases created will use it.
Parameters:
  catalog - Db for bdb class serialization.
Parameters:
  valence - The total number of simultaneous URIs that the HQ can issue for processing. Once this many URIs have been issued for processing, the HQ will go into the busy state (HQSTATE_BUSY) until at least one of the URIs is updated via update(CrawlURI, boolean, long). The value should be larger than zero. Zero and negative values will be treated the same as 1.
throws:
  IOException - if an error occurs opening/creating the database




Method Detail
add
public void add(CrawlURI curi, boolean overrideSetTimeOnDups) throws IOException
Add a CrawlURI to this host queue.

Calls can optionally choose to have the time of next processing value override existing values for the URI if the existing values are 'later' than the new ones.
Parameters:
  curi - The CrawlURI to add.
Parameters:
  overrideSetTimeOnDups - If true then the time of next processing for the supplied URI will override any existing time for it already stored in the HQ. If false, then no changes will be made to any existing values of the URI. Note: Will never override with a later time.
throws:
  IOException - When an error occurs accessing the database
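
The duplicate-handling rule for add can be sketched with a plain map from URI string to time of next processing. This is an illustration under assumptions, not the BDB-backed logic: the class name, the use of strings instead of CrawlURIs, and the `timeOf` helper are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the overrideSetTimeOnDups rule using a plain map. */
final class AddOverrideSketch {
    private final Map<String, Long> nextProcessingTime = new HashMap<>();

    void add(String uri, long time, boolean overrideSetTimeOnDups) {
        Long existing = nextProcessingTime.get(uri);
        if (existing == null) {
            nextProcessingTime.put(uri, time);  // new URI: always stored
            return;
        }
        // Duplicate: only replace when asked to, and never with a later time.
        if (overrideSetTimeOnDups && time < existing) {
            nextProcessingTime.put(uri, time);
        }
    }

    /** Returns the stored time, or null if the URI is unknown. */
    Long timeOf(String uri) {
        return nextProcessingTime.get(uri);
    }
}
```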




addInProcessing
protected void addInProcessing(CrawlURI curi) throws DatabaseException, IllegalStateException
Adds a CrawlURI to the list of CrawlURIs that belong to this HQ and are being processed at the moment.
Parameters:
  curi - The CrawlURI to add to the list
throws:
  DatabaseException -
throws:
  IllegalStateException - if the CrawlURI is already in the list of URIs being processed.



close
public void close() throws IOException
Clean up all open Berkeley Database objects.

Does not close the Environment.
throws:
  IOException - if an error occurs closing a database object




countCrawlURIs
protected long countCrawlURIs() throws DatabaseException
Count all entries in both primaryUriDB and processingUriDB.

This method is needed since BDB does not provide a simple way of counting entries.

Note: This is an expensive operation; it requires a loop through the entire queue!
Returns:
  the number of distinct CrawlURIs in the HQ.
throws:
  DatabaseException -




deleteInProcessing
protected void deleteInProcessing(String uri) throws DatabaseException
Removes a URI from the list of URIs that belong to this HQ and are currently being processed.

Returns true if successful, false if the URI was not found.
Parameters:
  uri - The URI string of the CrawlURI to delete.
throws:
  DatabaseException -
throws:
  IllegalStateException - if the URI was not on the list




flushProcessingURIs
protected void flushProcessingURIs() throws DatabaseException
Flush any CrawlURIs in the processingUriDB into the primaryUriDB. URIs flushed will have their 'time of next fetch' maintained and the nextReadyTime will be updated if needed.

No change is made to the list of available slots.
throws:
  DatabaseException - if one occurs while flushing




getCrawlURI
protected CrawlURI getCrawlURI(String uri) throws DatabaseException
Returns the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ. If a CrawlURI is being processed it is not considered to be queued and this method will return null for any such URIs.
Parameters:
  uri - A string representing the URI
Returns:
  the CrawlURI associated with the specified URI (string) or null if no such CrawlURI is queued in this HQ.
throws:
  DatabaseException - if an error occurs reading the database



getHostName
public String getHostName()
Returns the HQ's name.
Returns:
  the HQ's name



getNextReadyTime
public long getNextReadyTime()
Returns the time when the HQ will next be ready to issue a URI.

If the queue is in a snoozed state (HQSTATE_SNOOZED) then this time will be in the future and reflects either the time when the HQ will again be able to issue URIs for processing because politeness constraints have ended, or the time when a URI next becomes available for visit, whichever is larger.

If the queue is in a ready state (HQSTATE_READY) this time will be in the past and reflects the earliest time when the HQ had a URI ready for processing, taking time spent snoozed for politeness concerns into account.

If the HQ is in any other state then the return value of this method is equal to Long.MAX_VALUE.

This value may change each time a URI is added, issued or updated.
Returns:
  the time when the HQ will next be ready to issue a URI
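
One way to read the state-dependent rules above is as a single function of the state and two timestamps. This is an interpretive sketch, not the class's actual computation: the `State` enum is a hypothetical stand-in for the HQSTATE_* int constants, and both parameters are assumed inputs.

```java
/** Interpretive sketch of the next-ready-time rules. */
final class NextReadyTimeSketch {
    /** Hypothetical stand-in for the HQSTATE_* int constants. */
    enum State { EMPTY, READY, SNOOZED, BUSY }

    /**
     * @param politenessEndsAt time (ms) when the politeness wait ends
     * @param earliestUriTime  time (ms) when the head URI becomes available
     */
    static long nextReadyTime(State state, long politenessEndsAt, long earliestUriTime) {
        switch (state) {
            case READY:
            case SNOOZED:
                // Whichever constraint ends later decides when the HQ is ready.
                return Math.max(politenessEndsAt, earliestUriTime);
            default:
                // Any other state reports Long.MAX_VALUE.
                return Long.MAX_VALUE;
        }
    }
}
```

Under this reading, the same formula yields a future time for a snoozed queue and a past time for a ready one; only the relation to the current time differs.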




getSize
public long getSize()
Returns the size of the HQ. That is, the number of URIs queued, including any that are currently being processed.
Returns:
  the size of the HQ.



getState
public int getState()
Returns the current state of the HQ.
Returns:
  the current state of the HQ.
See Also:   AdaptiveRevisitHostQueue.HQSTATE_BUSY
See Also:   AdaptiveRevisitHostQueue.HQSTATE_EMPTY
See Also:   AdaptiveRevisitHostQueue.HQSTATE_READY
See Also:   AdaptiveRevisitHostQueue.HQSTATE_SNOOZED



getStateByName
public String getStateByName()
Same as getState() except this method returns a human-readable name for the state instead of its constant integer value.

Should only be used for reports, error messages and other strings intended for human eyes.
Returns:
  the human-readable name of the current state




getSubstats
public CrawlSubstats getSubstats()



inProcessing
protected boolean inProcessing(String uri) throws DatabaseException
Returns true if this HQ has a CrawlURI matching the uri string currently being processed. False otherwise.
Parameters:
  uri - Uri to check
Returns:
  true if this HQ has a CrawlURI matching the uri string currently being processed. False otherwise.
throws:
  DatabaseException -



next
public CrawlURI next() throws IllegalStateException, IOException
Returns the 'top' URI in the AdaptiveRevisitHostQueue.

HQ state will be set to busy (HQSTATE_BUSY) if this method returns normally.
Returns:
  a CrawlURI ready for processing
throws:
  IllegalStateException - if the HostQueue's current state is not ready (HQSTATE_READY)
throws:
  IOException - if an error occurs reading from the database




peek
public CrawlURI peek() throws IllegalStateException, IOException
Returns the URI with the earliest time of next processing, i.e. the URI at the head of this host-based priority queue.

Note: This method will return the head CrawlURI regardless of whether it is safe to start processing it or not. The CrawlURI will remain in the queue. The returned CrawlURI should only be used for queue inspection; it cannot be updated and returned to the queue. To get URIs ready for processing use next().
Returns:
  the URI with the earliest time of next processing or null if the queue is empty or all URIs are currently being processed.
throws:
  IllegalStateException -
throws:
  IOException - if an error occurs reading from the database




reorder
protected void reorder()
Method is called whenever something has been done that might have changed the value of the 'published' time of next ready. If an owner has been specified it will be notified that the value may have changed.



report
public String report(int max)
Returns a report detailing the status of this HQ.
Parameters:
  max - Maximum number of URIs to show. 0 equals no limit.
Returns:
  a report detailing the status of this HQ.



setNextReadyTime
protected void setNextReadyTime(long newTime)
Updates nextReadyTime (if smaller) with the supplied value.
Parameters:
  newTime - the new value of nextReadyTime



setOwner
public void setOwner(AdaptiveRevisitQueueList owner)
Set the AdaptiveRevisitQueueList object that contains this HQ. Will cause that object to be notified (via AdaptiveRevisitQueueList.reorder(AdaptiveRevisitHostQueue)) when the value used for sorting the list of HQs changes.
Parameters:
  owner - the AdaptiveRevisitQueueList object that contains this HQ.



strictAdd
protected OperationStatus strictAdd(CrawlURI curi, boolean overrideDuplicates) throws DatabaseException
An internal method for adding URIs to the queue.
Parameters:
  curi - The CrawlURI to add
Parameters:
  overrideDuplicates - If true then any existing CrawlURI in the DB will be overwritten. If false, insertion into the queue is only performed if the key doesn't already exist.
Returns:
  The OperationStatus object returned by the put method.
throws:
  DatabaseException -



update
public void update(CrawlURI curi, boolean needWait, long wakeupTime) throws IllegalStateException, IOException
Update CrawlURI that has completed processing.
Parameters:
  curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method.
Parameters:
  needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot.
Parameters:
  wakeupTime - If the new state is snoozed (HQSTATE_SNOOZED) then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored.
throws:
  IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next() method.
throws:
  IOException - if an error occurs accessing the database



update
public void update(CrawlURI curi, boolean needWait, long wakeupTime, boolean forgetURI) throws IllegalStateException, IOException
Update CrawlURI that has completed processing.
Parameters:
  curi - The CrawlURI. This must be a CrawlURI issued by this HQ's next() method.
Parameters:
  needWait - If true then the URI was processed successfully, requiring a period of suspended action on that host. If valence is > 1 then separate times are maintained for each slot.
Parameters:
  wakeupTime - If the new state is snoozed (HQSTATE_SNOOZED) then this parameter should contain the time (in milliseconds) when it will be safe to wake the HQ up again. Otherwise this parameter will be ignored.
Parameters:
  forgetURI - If true, the URI will be deleted from the queue.
throws:
  IllegalStateException - if the CrawlURI does not match a CrawlURI issued for crawling by this HQ's next() method.
throws:
  IOException - if an error occurs accessing the database
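
The contract between next() and update() can be modelled in a few lines. This is a heavily simplified sketch, not the real class: it uses URI strings instead of CrawlURIs, ignores scheduling times and politeness waits entirely, and every name in it is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Simplified model of issuing URIs and returning them via update. */
final class IssueUpdateSketch {
    private final int valence;
    private final Deque<String> queued = new ArrayDeque<>();
    private final Set<String> inProcessing = new HashSet<>();

    IssueUpdateSketch(int valence) {
        this.valence = Math.max(1, valence); // zero and negative treated as 1
    }

    void add(String uri) {
        queued.addLast(uri);
    }

    /** Ready when something is queued and a processing slot is free. */
    boolean isReady() {
        return !queued.isEmpty() && inProcessing.size() < valence;
    }

    /** Issues the head URI; fails if the queue is not ready. */
    String next() {
        if (!isReady()) {
            throw new IllegalStateException("queue is not in a ready state");
        }
        String uri = queued.removeFirst();
        inProcessing.add(uri);
        return uri;
    }

    /** Returns an issued URI; with forgetUri it is dropped instead of requeued. */
    void update(String uri, boolean forgetUri) {
        if (!inProcessing.remove(uri)) {
            throw new IllegalStateException("URI was not issued by this queue: " + uri);
        }
        if (!forgetUri) {
            queued.addLast(uri);
        }
    }

    int queuedCount() {
        return queued.size();
    }
}
```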



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
