Java Doc for CrawlURI.java in Web Crawler » heritrix » org.archive.crawler.datamodel



java.lang.Object
   org.archive.crawler.datamodel.CandidateURI
      org.archive.crawler.datamodel.CrawlURI

CrawlURI
public class CrawlURI extends CandidateURI implements FetchStatusCodes(Code)
Represents a candidate URI and the associated state it collects as it is crawled.

Core state is in instance variables, but a flexible attribute list is also available. Use this 'bucket' to carry custom data and state extracted during processing from one CrawlURI processing step to the next. See CrawlURI.putString(String,String), CrawlURI.getString(String), etc.
author:
   Gordon Mohr
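
The attribute 'bucket' described above can be pictured with a minimal sketch. The class below is a hypothetical stand-in, not the Heritrix implementation: it only mirrors the putString/getString pattern named in the description, and the key name used is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an object carrying a flexible attribute list.
// It is NOT the Heritrix class; it only illustrates the put/get pattern.
class AttributeBucketSketch {
    private final Map<String, Object> attributes = new HashMap<>();

    // Store a String value under the given key.
    void putString(String key, String value) {
        attributes.put(key, value);
    }

    // Read a String value back; returns null if the key was never set.
    String getString(String key) {
        return (String) attributes.get(key);
    }

    public static void main(String[] args) {
        AttributeBucketSketch curi = new AttributeBucketSketch();
        // An earlier processing step records something it extracted...
        curi.putString("example.title", "Example page");
        // ...and a later step reads it back.
        System.out.println(curi.getString("example.title"));
    }
}
```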



Field Summary
final public static int MAX_OUTLINKS
     Protection against outlink overflow.
final public static int UNCALCULATED
transient Object holder
int holderCost
     Spot for an integer cost to be placed by external facility (frontier).
transient Object holderKey
protected long ordinal
     Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
transient Collection<Object> outLinks
     All discovered outbound Links (navlinks, embeds, etc.). Can either contain Link instances or CandidateURI instances, or both. The LinksScoper processor converts Link instances in this collection to CandidateURI instances.

Constructor Summary
public  CrawlURI(UURI uuri)
     Create a new instance of CrawlURI from a UURI.
public  CrawlURI(CandidateURI caUri, long o)
    

Method Summary
public void aboutToLog()
public static void addAlistPersistentMember(Object key)
     Add the key of alist items you want to persist across processings.
public void addAnnotation(String annotation)
     Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference.
public void addCredentialAvatar(CredentialAvatar ca)
     Add an avatar.
public void addLocalizedError(String processorName, Throwable ex, String message)
     Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere, but allows processing to continue.
public void addOutLink(Link link)
     Add a discovered Link, unless it would exceed the max number to accept.
protected boolean annotationContains(String str2Find)
public void clearOutlinks()
public void createAndAddLink(String url, CharSequence context, char hopType)
public void createAndAddLinkRelativeToBase(String url, CharSequence context, char hopType)
public void createAndAddLinkRelativeToVia(String url, CharSequence context, char hopType)
     Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available.
public Link createLink(String url, CharSequence context, char hopType)
public static String fetchStatusCodesToString(int code)
     Takes a status code and converts it into a human readable string.
public static CrawlURI from(CandidateURI caUri, long ordinal)
     Make a CrawlURI from the passed CandidateURI. It's safe to pass a CrawlURI instance.
public String getAnnotations()
     Get the annotations set for this uri.
public UURI getBaseURI()
     Get the (HTML) Base URI used for derelativizing internal URIs.
protected String getClassSimpleName(Class c)
public Object getContentDigest()
     Return the retained content-digest value, if any.
public String getContentDigestSchemeString()
public String getContentDigestString()
public long getContentLength()
     For completed HTTP transactions, the length of the content-body.
public long getContentSize()
     Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
public String getContentType()
     Get the content type of this URI.
public String getCrawlURIString()
public Set<CredentialAvatar> getCredentialAvatars()
     Credential avatars.
public int getDeferrals()
     Get the deferral count.
public int getEmbedHopCount()
     Get the embedded hop count.
public int getFetchAttempts()
     Get the number of attempts at getting the document referenced by this URI.
public int getFetchStatus()
     Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
public Object getHolder()
     Return the 'holder' for the convenience of an external facility.
public int getHolderCost()
public Object getHolderKey()
     Return the 'holderKey' for convenience of an external facility (Frontier).
public HttpRecorder getHttpRecorder()
     Get the http recorder associated with this uri.
public int getLinkHopCount()
     Get the link hop count.
public long getOrdinal()
     Get the ordinal (serial number) assigned at creation.
public Collection<CandidateURI> getOutCandidates()
     Returns discovered candidate URIs.
public Collection<Link> getOutLinks()
     Returns discovered links.
public Collection<Object> getOutObjects()
     Returns all of the outbound objects.
public AList getPersistentAList()
public Object getPrerequisiteUri()
     Get the prerequisite for this URI.
public long getRecordedSize()
public int getThreadNumber()
     Get the number of the ToeThread responsible for processing this uri.
public String getUserAgent()
     Get the user agent to use for crawling this URI.
public boolean hasBeenLinkExtracted()
     If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
public boolean hasCredentialAvatars()
public boolean hasPrerequisiteUri()
public boolean hasRfc2617CredentialAvatar()
public void incrementDeferrals()
     Increment the deferral count.
public int incrementFetchAttempts()
     Increment the number of attempts at getting the document referenced by this URI.
public boolean is2XXSuccess()
public boolean isHeaderTruncatedFetch()
public boolean isHttpTransaction()
     Return true if this is a http transaction.
public boolean isLengthTruncatedFetch()
public boolean isPost()
     Returns true if this URI should be fetched by sending a HTTP POST request.
public boolean isPrerequisite()
     Returns true if this CrawlURI is a prerequisite.
public boolean isSuccess()
     Ask this URI if it was a success or not. Only makes sense to call this method after execution of HttpMethod#execute.
public boolean isTimeTruncatedFetch()
public boolean isTruncatedFetch()
     TODO: Implement truncation using booleans rather than as this ugly String parse.
public void linkExtractorFinished()
     Note that link extraction has been performed on this CrawlURI.
public void markAsSeed()
     Mark this uri as being a seed.
public void markPrerequisite(String preq, ProcessorChain lastProcessorChain)
     Do all actions associated with setting a CrawlURI as requiring a prerequisite.
public Processor nextProcessor()
     Get the next processor to process this URI.
public ProcessorChain nextProcessorChain()
     Get the processor chain that should be processing this URI after the current chain is finished with it.
public int outlinksSize()
public void processingCleanup()
     Clean up after a run through the processing chain. Called on the end of processing chain by Frontier#finish.
public static boolean removeAlistPersistentMember(Object key)
public boolean removeCredentialAvatar(CredentialAvatar ca)
     Remove the passed credential avatar from this crawl uri.
public void removeCredentialAvatars()
     Remove all credential avatars from this crawl uri.
public void replaceOutlinks(Collection<CandidateURI> links)
     Replace current collection of links with passed list.
public void resetDeferrals()
     Reset deferrals counter.
public void resetFetchAttempts()
     Reset fetchAttempts counter.
public void setBaseURI(String baseHref)
     Set the (HTML) Base URI used for derelativizing internal URIs.
public void setContentDigest(byte[] digestValue)
     Set the retained content-digest value (usually SHA1).
public void setContentDigest(String scheme, byte[] digestValue)
public void setContentSize(long l)
     Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
public void setContentType(String ct)
     Set a fetched uri's content type.
public void setFetchStatus(int newstatus)
     Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
public void setHolder(Object obj)
     Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
public void setHolderCost(int cost)
public void setHolderKey(Object obj)
     Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
public void setHttpRecorder(HttpRecorder httpRecorder)
     Set the http recorder to be associated with this uri.
public void setNextProcessor(Processor processor)
     Set the next processor to process this URI.
public void setNextProcessorChain(ProcessorChain nextProcessorChain)
     Set the next processor chain to process this URI.
public void setPost(boolean b)
     Set whether this URI should be fetched by sending a HTTP POST request. Else a HTTP GET request will be used.
public void setPrerequisite(boolean prerequisite)
     Set if this CrawlURI is itself a prerequisite URI.
public void setPrerequisiteUri(Object link)
     Set a prerequisite for this URI.
public void setThreadNumber(int i)
     Set the number of the ToeThread responsible for processing this uri.
public void setUserAgent(String string)
     Set the user agent to use when crawling this URI.
public void skipToProcessor(ProcessorChain processorChain, Processor processor)
     Set which processor should be the next processor to process this uri instead of using the default next processor.
public void skipToProcessorChain(ProcessorChain processorChain)
     Set which processor chain should be processing this uri next.
public void stripToMinimal()
     Remove all attributes set on this uri.

Field Detail
MAX_OUTLINKS
final public static int MAX_OUTLINKS(Code)
Protection against outlink overflow. Change value by setting alternate maximum in heritrix.properties.



UNCALCULATED
final public static int UNCALCULATED(Code)



holder
transient Object holder(Code)



holderCost
int holderCost(Code)
Spot for an integer cost to be placed by an external facility (frontier). The cost is truncated to 8 bits at times, so it should not exceed 255.
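
To see why 255 is the practical ceiling, here is a small arithmetic sketch. It assumes the truncation keeps only the low 8 bits of the value; the documentation does not say exactly how or where the truncation happens, so the masking below is an illustration rather than the actual mechanism.

```java
// Illustrative only: models "truncated to 8 bits" as keeping the low 8 bits.
class HolderCostSketch {
    // Keep only the low 8 bits of a cost value (assumed form of truncation).
    static int truncateTo8Bits(int cost) {
        return cost & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(truncateTo8Bits(255)); // 255 survives unchanged
        System.out.println(truncateTo8Bits(300)); // 44: the high bits are lost
    }
}
```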



holderKey
transient Object holderKey(Code)



ordinal
protected long ordinal(Code)
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. Will sometimes be truncated to 48 bits, so behavior over 281 trillion instantiated CrawlURIs may be buggy.
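
The "281 trillion" figure is the number of values that fit in 48 bits. The sketch below checks that arithmetic and shows what a low-48-bit mask would do to a larger ordinal; the masking is an assumed form of truncation used for illustration, not a description of the Heritrix code.

```java
// Illustrative only: the arithmetic behind the 48-bit limit.
class OrdinalRangeSketch {
    // Number of distinct values representable in 48 bits.
    static final long LIMIT_48_BITS = 1L << 48;

    // Keep only the low 48 bits of an ordinal (assumed form of truncation).
    static long truncateTo48Bits(long ordinal) {
        return ordinal & (LIMIT_48_BITS - 1);
    }

    public static void main(String[] args) {
        System.out.println(LIMIT_48_BITS);                       // 281474976710656
        System.out.println(truncateTo48Bits(LIMIT_48_BITS + 5)); // 5: wraps around
    }
}
```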



outLinks
transient Collection<Object> outLinks(Code)
All discovered outbound Links (navlinks, embeds, etc.). Can either contain Link instances or CandidateURI instances, or both. The LinksScoper processor converts Link instances in this collection to CandidateURI instances.




Constructor Detail
CrawlURI
public CrawlURI(UURI uuri)(Code)
Create a new instance of CrawlURI from a UURI.
Parameters:
  uuri - the UURI to base this CrawlURI on.



CrawlURI
public CrawlURI(CandidateURI caUri, long o)(Code)
Create a new instance of CrawlURI from a CandidateURI
Parameters:
  caUri - the CandidateURI to base this CrawlURI on.
Parameters:
  o - Monotonically increasing number within a crawl.




Method Detail
aboutToLog
public void aboutToLog()(Code)
Notify CrawlURI it is about to be logged; opportunity for self-annotation



addAlistPersistentMember
public static void addAlistPersistentMember(Object key)(Code)
Add the key of alist items you want to persist across processings.
Parameters:
  key - Key to add.



addAnnotation
public void addAnnotation(String annotation)(Code)
Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference.
Parameters:
  annotation - the annotation to add; should not contain whitespace or a comma
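
The restriction on whitespace and commas makes sense if annotations are kept as a single delimited string. The sketch below assumes a comma-separated string and rejects invalid input outright; both the separator and the validation are assumptions for illustration, and the class and annotation names are hypothetical.

```java
// Hypothetical sketch: annotations kept as one comma-separated String.
// The separator and the strict validation are assumptions, not Heritrix code.
class AnnotationSketch {
    private String annotations = "";

    // Append an annotation, refusing values that would break the format.
    void addAnnotation(String annotation) {
        boolean hasWhitespace = annotation.chars().anyMatch(Character::isWhitespace);
        if (annotation.isEmpty() || annotation.contains(",") || hasWhitespace) {
            throw new IllegalArgumentException("bad annotation: " + annotation);
        }
        annotations = annotations.isEmpty() ? annotation : annotations + "," + annotation;
    }

    String getAnnotations() {
        return annotations;
    }

    // Simple substring test over the accumulated annotations.
    boolean annotationContains(String str2Find) {
        return annotations.contains(str2Find);
    }

    public static void main(String[] args) {
        AnnotationSketch curi = new AnnotationSketch();
        curi.addAnnotation("example-a");
        curi.addAnnotation("example-b");
        System.out.println(curi.getAnnotations()); // example-a,example-b
    }
}
```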



addCredentialAvatar
public void addCredentialAvatar(CredentialAvatar ca)(Code)
Add an avatar. We do lazy instantiation.
Parameters:
  ca - Credential avatar to add to set of avatars.



addLocalizedError
public void addLocalizedError(String processorName, Throwable ex, String message)(Code)
Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere, but allows processing to continue. This is how you add to the local-error log (here 'localized' means making an error local rather than global, not making a Swiss-French version of the error).
Parameters:
  processorName - Name of processor the exception was thrown in.
Parameters:
  ex - Throwable to log.
Parameters:
  message - Extra message to log beyond exception message.



addOutLink
public void addOutLink(Link link)(Code)
Add a discovered Link, unless it would exceed the max number to accept. (If so, increment discarded link counter.)
Parameters:
  link - the Link to add
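
The add-unless-full behavior described above can be sketched as follows. The class, the field names, and the way the limit is supplied are hypothetical; the sketch only shows a bounded collection that counts what it discards.

```java
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical sketch of a bounded outlink collection with a discard counter.
class BoundedOutLinksSketch {
    private final int maxOutLinks;
    private final Collection<Object> outLinks = new ArrayList<>();
    private int discardedOutLinks = 0;

    BoundedOutLinksSketch(int maxOutLinks) {
        this.maxOutLinks = maxOutLinks;
    }

    // Add the link if there is room; otherwise only count it as discarded.
    void addOutLink(Object link) {
        if (outLinks.size() < maxOutLinks) {
            outLinks.add(link);
        } else {
            discardedOutLinks++;
        }
    }

    int outlinksSize() {
        return outLinks.size();
    }

    int discardedCount() {
        return discardedOutLinks;
    }

    public static void main(String[] args) {
        BoundedOutLinksSketch curi = new BoundedOutLinksSketch(2);
        curi.addOutLink("http://example.com/a");
        curi.addOutLink("http://example.com/b");
        curi.addOutLink("http://example.com/c"); // over the limit
        System.out.println(curi.outlinksSize() + " kept, " + curi.discardedCount() + " discarded");
    }
}
```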



annotationContains
protected boolean annotationContains(String str2Find)(Code)



clearOutlinks
public void clearOutlinks()(Code)



createAndAddLink
public void createAndAddLink(String url, CharSequence context, char hopType) throws URIException(Code)
Convenience method for creating a Link with the given string and context
Parameters:
  url - String to use to create Link
Parameters:
  context - CharSequence context to use
Parameters:
  hopType -
throws:
  URIException - if Link UURI cannot be constructed



createAndAddLinkRelativeToBase
public void createAndAddLinkRelativeToBase(String url, CharSequence context, char hopType) throws URIException(Code)
Convenience method for creating a Link with the given string and context, relative to a previously set base HREF if available (or relative to the current CrawlURI if no other base has been set)
Parameters:
  url - String URL to add as destination of link
Parameters:
  context - String context where link was discovered
Parameters:
  hopType - char hop-type indicator
throws:
  URIException -



createAndAddLinkRelativeToVia
public void createAndAddLinkRelativeToVia(String url, CharSequence context, char hopType) throws URIException(Code)
Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available. (If a via is not available, falls back to using #createAndAddLinkRelativeToBase.)
Parameters:
  url - String URL to add as destination of link
Parameters:
  context - String context where link was discovered
Parameters:
  hopType - char hop-type indicator
throws:
  URIException -



createLink
public Link createLink(String url, CharSequence context, char hopType) throws URIException(Code)
Convenience method for creating a Link discovered at this URI with the given string and context
Parameters:
  url - String to use to create Link
Parameters:
  context - CharSequence context to use
Parameters:
  hopType - Link.
throws:
  URIException - if Link UURI cannot be constructed



fetchStatusCodesToString
public static String fetchStatusCodesToString(int code)(Code)
Takes a status code and converts it into a human readable string.
Parameters:
  code - the status code
Returns:
  a human readable string declaring what the status code is.



from
public static CrawlURI from(CandidateURI caUri, long ordinal)(Code)
Make a CrawlURI from the passed CandidateURI. It's safe to pass a CrawlURI instance; in that case it is simply returned as the result. Otherwise, a new CrawlURI instance is created.
Parameters:
  caUri - Candidate URI.
Parameters:
  ordinal -
Returns:
  A CrawlURI made from the passed CandidateURI.
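
The return-as-is-or-wrap behavior can be sketched with two hypothetical classes standing in for CandidateURI and CrawlURI. Only the instanceof check is the point; the fields and constructors are assumptions made for illustration.

```java
// Hypothetical stand-ins; only the factory logic mirrors the description.
class FromFactorySketch {
    static class Candidate {
        final String uri;
        Candidate(String uri) { this.uri = uri; }
    }

    static class Crawl extends Candidate {
        final long ordinal;
        Crawl(Candidate source, long ordinal) {
            super(source.uri);
            this.ordinal = ordinal;
        }
    }

    // Return the argument itself if it is already a Crawl, else wrap it.
    static Crawl from(Candidate candidate, long ordinal) {
        if (candidate instanceof Crawl) {
            return (Crawl) candidate;
        }
        return new Crawl(candidate, ordinal);
    }

    public static void main(String[] args) {
        Crawl first = from(new Candidate("http://example.com/"), 1L);
        Crawl again = from(first, 2L);
        System.out.println(first == again); // true: the same instance comes back
        System.out.println(again.ordinal);  // 1: the original ordinal is kept
    }
}
```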



getAnnotations
public String getAnnotations()(Code)
Get the annotations set for this uri.
Returns:
  the annotations set for this uri.



getBaseURI
public UURI getBaseURI()(Code)
Get the (HTML) Base URI used for derelativizing internal URIs.
Returns:
  UURI base URI previously set



getClassSimpleName
protected String getClassSimpleName(Class c)(Code)



getContentDigest
public Object getContentDigest()(Code)
Return the retained content-digest value, if any.
Returns:
  Digest value.



getContentDigestSchemeString
public String getContentDigestSchemeString()(Code)



getContentDigestString
public String getContentDigestString()(Code)



getContentLength
public long getContentLength()(Code)
For completed HTTP transactions, the length of the content-body.
Returns:
  the length of the content-body.



getContentSize
public long getContentSize()(Code)
Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers. It is the responsibility of the classes which fetch the URI to set this value accordingly -- it is not calculated/verified within CrawlURI. This value is consulted in reporting/logging/writing-decisions.
See Also:   CrawlURI.setContentSize() contentSize



getContentType
public String getContentType()(Code)
Get the content type of this URI.
Returns:
  Fetched URI's content type. May be null.



getCrawlURIString
public String getCrawlURIString()(Code)
This crawl URI as a string wrapped with 'CrawlURI(' +')'.



getCredentialAvatars
public Set<CredentialAvatar> getCredentialAvatars()(Code)
Credential avatars. Null if none set.



getDeferrals
public int getDeferrals()(Code)
Get the deferral count.
Returns:
  the deferral count.



getEmbedHopCount
public int getEmbedHopCount()(Code)
Get the embedded hop count.
Returns:
  the embedded hop count.



getFetchAttempts
public int getFetchAttempts()(Code)
Get the number of attempts at getting the document referenced by this URI.
Returns:
  the number of attempts at getting the document referenced by this URI.



getFetchStatus
public int getFetchStatus()(Code)
Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
Returns:
  a value from FetchStatusCodes



getHolder
public Object getHolder()(Code)
Return the 'holder' for the convenience of an external facility.
Returns:
  holder



getHolderCost
public int getHolderCost()(Code)
Return the 'holderCost' for convenience of an external facility (frontier).
Returns:
  value of holderCost



getHolderKey
public Object getHolderKey()(Code)
Return the 'holderKey' for convenience of an external facility (Frontier).
Returns:
  holderKey



getHttpRecorder
public HttpRecorder getHttpRecorder()(Code)
Get the http recorder associated with this uri.
Returns:
  the httpRecorder. May be null, but it is set early in FetchHttp, so there is an issue if it is null.



getLinkHopCount
public int getLinkHopCount()(Code)
Get the link hop count.
Returns:
  the link hop count.



getOrdinal
public long getOrdinal()(Code)
Get the ordinal (serial number) assigned at creation.
Returns:
  ordinal



getOutCandidates
public Collection<CandidateURI> getOutCandidates()(Code)
Returns discovered candidate URIs. The returned collection will be empty until something like LinksScoper promotes discovered Links into CandidateURIs. Elements can be removed from the returned collection, but not added. To add a candidate URI, use CrawlURI.replaceOutlinks(Collection) or CrawlURI.getOutObjects().
Returns:
  Collection of candidate URIs



getOutLinks
public Collection<Link> getOutLinks()(Code)
Returns discovered links. The returned collection might be empty if no links were discovered, or if something like LinksScoper promoted the links to CandidateURIs. Elements can be removed from the returned collection, but not added. To add a discovered link, use one of the createAndAdd methods or CrawlURI.getOutObjects().
Returns:
  Collection of all discovered outbound Links



getOutObjects
public Collection<Object> getOutObjects()(Code)
Returns all of the outbound objects. The returned Collection will contain Link instances, or CandidateURI instances, or both.
Returns:
  the collection of Links and/or CandidateURIs
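
Since one collection can hold both kinds of object, callers need to tell them apart by type. The sketch below uses hypothetical Link and Candidate stand-ins and returns snapshot copies; the views described for getOutLinks() and getOutCandidates() allow removal from the underlying collection, which this simplified version does not model.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch: selecting one element type out of a mixed collection.
class MixedOutObjectsSketch {
    static class Link {
        final String destination;
        Link(String destination) { this.destination = destination; }
    }

    static class Candidate {
        final String uri;
        Candidate(String uri) { this.uri = uri; }
    }

    // Copy out the elements that are instances of the requested type.
    static <T> List<T> selectByType(Collection<Object> outObjects, Class<T> type) {
        List<T> result = new ArrayList<>();
        for (Object o : outObjects) {
            if (type.isInstance(o)) {
                result.add(type.cast(o));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Collection<Object> outObjects = new ArrayList<>();
        outObjects.add(new Link("http://example.com/a"));
        outObjects.add(new Candidate("http://example.com/b"));
        outObjects.add(new Link("http://example.com/c"));
        System.out.println(selectByType(outObjects, Link.class).size());      // 2
        System.out.println(selectByType(outObjects, Candidate.class).size()); // 1
    }
}
```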



getPersistentAList
public AList getPersistentAList()(Code)



getPrerequisiteUri
public Object getPrerequisiteUri()(Code)
Get the prerequisite for this URI.

A prerequisite is a URI that must be crawled before this URI can be crawled.
Returns:
  the prerequisite for this URI or null if no prerequisite.




getRecordedSize
public long getRecordedSize()(Code)
Get size of data recorded (transferred).
Returns:
  recorded data size



getThreadNumber
public int getThreadNumber()(Code)
Get the number of the ToeThread responsible for processing this uri.
Returns:
  the ToeThread number.



getUserAgent
public String getUserAgent()(Code)
Get the user agent to use for crawling this URI. If null the global setting should be used.
Returns:
  user agent or null



hasBeenLinkExtracted
public boolean hasBeenLinkExtracted()(Code)
If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content. This does not preclude other link extractors that may have an interest in this CrawlURI from also doing link extraction but default behavior should be to not run if link extraction has already been done.

There is an onus on link extractors to set this flag if they have run.

The only extractor of the default Heritrix set that does not respect this flag is org.archive.crawler.extractor.ExtractorHTTP. It runs against HTTP headers, not the document content.
Returns:
  True if a processor has performed link extraction on this CrawlURI
See Also:   CrawlURI.linkExtractorFinished()
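
The flag protocol between extractors can be sketched as below. The class and the runBodyExtractor helper are hypothetical; the sketch only shows the default behavior described above, where a body-based extractor skips work already claimed and sets the flag once it has run.

```java
// Hypothetical sketch of the link-extraction flag protocol.
class LinkExtractionFlagSketch {
    private boolean linkExtracted = false;

    boolean hasBeenLinkExtracted() {
        return linkExtracted;
    }

    // Called by an extractor once it has processed the document body,
    // even if it found no links.
    void linkExtractorFinished() {
        linkExtracted = true;
    }

    // A body-based extractor: returns true if it actually ran.
    static boolean runBodyExtractor(LinkExtractionFlagSketch curi) {
        if (curi.hasBeenLinkExtracted()) {
            return false; // default behavior: do not extract twice
        }
        // ... link extraction on the document content would happen here ...
        curi.linkExtractorFinished();
        return true;
    }

    public static void main(String[] args) {
        LinkExtractionFlagSketch curi = new LinkExtractionFlagSketch();
        System.out.println(runBodyExtractor(curi)); // true: first extractor runs
        System.out.println(runBodyExtractor(curi)); // false: second one skips
    }
}
```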




hasCredentialAvatars
public boolean hasCredentialAvatars()(Code)
True if there are avatars attached to this instance.



hasPrerequisiteUri
public boolean hasPrerequisiteUri()(Code)
True if this CrawlURI has a prerequisite.



hasRfc2617CredentialAvatar
public boolean hasRfc2617CredentialAvatar()(Code)
True if we have an rfc2617 payload.



incrementDeferrals
public void incrementDeferrals()(Code)
Increment the deferral count.



incrementFetchAttempts
public int incrementFetchAttempts()(Code)
Increment the number of attempts at getting the document referenced by this URI.
Returns:
  the number of attempts at getting the document referenced by this URI.



is2XXSuccess
public boolean is2XXSuccess()(Code)
True if status code is in the 2xx range.
See Also:   CrawlURI.isSuccess()



isHeaderTruncatedFetch
public boolean isHeaderTruncatedFetch()(Code)



isHttpTransaction
public boolean isHttpTransaction()(Code)
Return true if this is a http transaction. TODO: Compound this and CrawlURI.isPost() method so that there is one place to go to find out if get http, post http, ftp, dns.
Returns:
  True if this is a http transaction.



isLengthTruncatedFetch
public boolean isLengthTruncatedFetch()(Code)



isPost
public boolean isPost()(Code)
Returns true if this URI should be fetched by sending a HTTP POST request. TODO: Compound this and CrawlURI.isHttpTransaction() method so that there is one place to go to find out if get http, post http, ftp, dns.
Returns:
  true if this CrawlURI instance is to be posted.



isPrerequisite
public boolean isPrerequisite()(Code)
Returns true if this CrawlURI is a prerequisite.
Returns:
  true if this CrawlURI is a prerequisite.



isSuccess
public boolean isSuccess()(Code)
Ask this URI if it was a success or not. Only makes sense to call this method after execution of HttpMethod#execute. Regard any status larger than 0 as success, except for the caveat below regarding 401s. Use CrawlURI.is2XXSuccess() if looking for a status code in the 200 range.

401s caveat: If any rfc2617 credential data is present and we got a 401, assume it got loaded in FetchHTTP on the expectation that we're to go around the processing chain again. Report this condition as a failure so we get another crack at the processing chain, only this time we'll be making use of the loaded credential data.
Returns:
  True if this URI has been successfully processed.
See Also:   CrawlURI.is2XXSuccess()
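
The success rule and its 401 caveat can be written out as a small predicate. The sketch takes the fetch status and the presence of rfc2617 credential data as plain parameters, which is a simplification made for illustration; the class and method shapes are hypothetical.

```java
// Hypothetical sketch of the success rule described above.
class SuccessCheckSketch {
    // Any status above 0 counts as success, except a 401 received while
    // rfc2617 credential data is loaded (so the URI gets retried).
    static boolean isSuccess(int fetchStatus, boolean hasRfc2617Credential) {
        if (fetchStatus == 401 && hasRfc2617Credential) {
            return false;
        }
        return fetchStatus > 0;
    }

    // True only for status codes in the 2xx range.
    static boolean is2XXSuccess(int fetchStatus) {
        return fetchStatus >= 200 && fetchStatus < 300;
    }

    public static void main(String[] args) {
        System.out.println(isSuccess(404, false));  // true: status is above 0
        System.out.println(is2XXSuccess(404));      // false: not in the 2xx range
        System.out.println(isSuccess(401, true));   // false: retry with credentials
        System.out.println(isSuccess(401, false));  // true: no credentials loaded
    }
}
```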




isTimeTruncatedFetch
public boolean isTimeTruncatedFetch()(Code)



isTruncatedFetch
public boolean isTruncatedFetch()(Code)
TODO: Implement truncation using booleans rather than as this ugly String parse.
Returns:
  True if fetch was truncated.



linkExtractorFinished
public void linkExtractorFinished()(Code)
Note that link extraction has been performed on this CrawlURI. A processor doing link extraction should invoke this method once it has finished its work. It should invoke it even if no links are extracted. It should only invoke this method if the link extraction was performed on the document body (not the HTTP headers etc.).
See Also:   CrawlURI.hasBeenLinkExtracted()



markAsSeed
public void markAsSeed()(Code)
Mark this uri as being a seed.



markPrerequisite
public void markPrerequisite(String preq, ProcessorChain lastProcessorChain) throws URIException(Code)
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
Parameters:
  lastProcessorChain - Last processor chain reference. This chain is where this CrawlURI goes next.
Parameters:
  preq - Object to set a prerequisite.
throws:
  URIException -



nextProcessor
public Processor nextProcessor()(Code)
Get the next processor to process this URI.
Returns:
  the processor that should process this URI next.



nextProcessorChain
public ProcessorChain nextProcessorChain()(Code)
Get the processor chain that should be processing this URI after the current chain is finished with it.
Returns:
  the next processor chain to process this URI.



outlinksSize
public int outlinksSize()(Code)
Count of outlinks.



processingCleanup
public void processingCleanup()(Code)
Clean up after a run through the processing chain. Called on the end of processing chain by Frontier#finish. Null out any state gathered during processing.



removeAlistPersistentMember
public static boolean removeAlistPersistentMember(Object key)(Code)

Parameters:
  key - Key to remove.
Returns:
  True if list contained the element.



removeCredentialAvatar
public boolean removeCredentialAvatar(CredentialAvatar ca)(Code)
Remove the passed credential avatar from this crawl uri.
Parameters:
  ca - Avatar to remove.
Returns:
  True if we removed passed parameter. False if no operation performed.



removeCredentialAvatars
public void removeCredentialAvatars()(Code)
Remove all credential avatars from this crawl uri.



replaceOutlinks
public void replaceOutlinks(Collection<CandidateURI> links)(Code)
Replace current collection of links with passed list. Used by Scopers adjusting the list of links (removing those not in scope and promoting Links to CandidateURIs).
Parameters:
  links - collection of CandidateURIs replacing any previously existing outLinks or outCandidates



resetDeferrals
public void resetDeferrals()(Code)
Reset deferrals counter.



resetFetchAttempts
public void resetFetchAttempts()(Code)
Reset fetchAttempts counter.



setBaseURI
public void setBaseURI(String baseHref) throws URIException(Code)
Set the (HTML) Base URI used for derelativizing internal URIs.
Parameters:
  baseHref - String base href to use
throws:
  URIException - if supplied string cannot be interpreted as URI



setContentDigest
public void setContentDigest(byte[] digestValue)(Code)
Set the retained content-digest value (usually SHA1).
Parameters:
  digestValue -
See Also:   CrawlURI.setContentDigest(String, byte[])



setContentDigest
public void setContentDigest(String scheme, byte[] digestValue)(Code)



setContentSize
public void setContentSize(long l)(Code)
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server). (In contrast, content-length matches the HTTP definition, that of the enclosed content-body.) Should be set by a fetcher or other processor as soon as the final size of recorded content is known. Setting to an artificial/incorrect value may affect other reporting/processing.
Parameters:
  l - Content size.



setContentType
public void setContentType(String ct)(Code)
Set a fetched uri's content type.
Parameters:
  ct - Content type. May be null.



setFetchStatus
public void setFetchStatus(int newstatus)(Code)
Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
Parameters:
  newstatus - a value from FetchStatusCodes



setHolder
public void setHolder(Object obj)(Code)
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
Parameters:
  obj -



setHolderCost
public void setHolderCost(int cost)(Code)
Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI
Parameters:
  cost - value to remember



setHolderKey
public void setHolderKey(Object obj)(Code)
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
Parameters:
  obj -



setHttpRecorder
public void setHttpRecorder(HttpRecorder httpRecorder)(Code)
Set the http recorder to be associated with this uri.
Parameters:
  httpRecorder - The httpRecorder to set.



setNextProcessor
public void setNextProcessor(Processor processor)(Code)
Set the next processor to process this URI.
Parameters:
  processor - the next processor to process this URI.



setNextProcessorChain
public void setNextProcessorChain(ProcessorChain nextProcessorChain)(Code)
Set the next processor chain to process this URI.
Parameters:
  nextProcessorChain - the next processor chain to process this URI.



setPost
public void setPost(boolean b)(Code)
Set whether this URI should be fetched by sending an HTTP POST request. Otherwise an HTTP GET request is used.
Parameters:
  b - True if this URI is to be fetched with POST; false if it is to be fetched with GET.



setPrerequisite
public void setPrerequisite(boolean prerequisite)(Code)
Set whether this CrawlURI is itself a prerequisite URI.
Parameters:
  prerequisite - True if this CrawlURI is itself a prerequisite URI.



setPrerequisiteUri
public void setPrerequisiteUri(Object link)(Code)
Set a prerequisite for this URI.

A prerequisite is a URI that must be crawled before this URI can be crawled.
Parameters:
  link - Link to set as prereq.
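
The ordering rule above can be illustrated with a small sketch. PrerequisiteSketch, its nested Uri class and fetchOrder are hypothetical stand-ins, not Heritrix classes; the sketch only models the idea that a URI carrying a prerequisite is fetched after that prerequisite.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Heritrix code.
final class PrerequisiteSketch {

    /** Stand-in for a crawl URI; models only the prerequisite link. */
    static final class Uri {
        final String uri;
        Object prerequisiteUri; // Object-typed, like the setPrerequisiteUri parameter

        Uri(String uri) {
            this.uri = uri;
        }
    }

    /** Returns the order in which the URIs would be fetched, prerequisite first. */
    static List<String> fetchOrder(Uri target) {
        List<String> order = new ArrayList<>();
        if (target.prerequisiteUri != null) {
            order.add(target.prerequisiteUri.toString());
        }
        order.add(target.uri);
        return order;
    }

    public static void main(String[] args) {
        Uri page = new Uri("http://example.com/index.html");
        // Assumed example: a robots.txt fetch acting as the prerequisite.
        page.prerequisiteUri = "http://example.com/robots.txt";
        System.out.println(fetchOrder(page));
    }
}
```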




setThreadNumber
public void setThreadNumber(int i)(Code)
Set the number of the ToeThread responsible for processing this uri.
Parameters:
  i - the ToeThread number.



setUserAgent
public void setUserAgent(String string)(Code)
Set the user agent to use when crawling this URI. If not set the global settings should be used.
Parameters:
  string - user agent to use



skipToProcessor
public void skipToProcessor(ProcessorChain processorChain, Processor processor)(Code)
Set which processor should be the next processor to process this uri instead of using the default next processor.
Parameters:
  processorChain - the processor chain to skip to.
  processor - the processor in the processor chain to skip to.
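
The effect of overriding the default next processor can be pictured with a toy model. SkipSketch and the step names used in main are hypothetical and do not reflect Heritrix's actual Processor or ProcessorChain classes; the sketch assumes processing is a linear walk over named steps in which one step may request a jump to a later one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of a processing loop, not Heritrix code.
final class SkipSketch {

    /**
     * Visits the steps in order. When the step named skipFrom is reached,
     * processing jumps ahead to skipTo instead of the default next step.
     */
    static List<String> run(List<String> steps, String skipFrom, String skipTo) {
        List<String> visited = new ArrayList<>();
        int i = 0;
        while (i < steps.size()) {
            String step = steps.get(i);
            visited.add(step);
            if (step.equals(skipFrom)) {
                int target = steps.indexOf(skipTo);
                if (target > i) {
                    i = target; // override the default next step
                    continue;
                }
            }
            i++; // default: move on to the following step
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> steps = Arrays.asList("preselect", "fetch", "extract", "write");
        System.out.println(run(steps, "preselect", "write"));
    }
}
```

In this model a skip that names an unknown or earlier step is ignored and the default order continues; whether Heritrix behaves the same way is not stated in this documentation.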



skipToProcessorChain
public void skipToProcessorChain(ProcessorChain processorChain)(Code)
Set which processor chain should be processing this uri next.
Parameters:
  processorChain - the processor chain to skip to.



stripToMinimal
public void stripToMinimal()(Code)
Remove all attributes set on this uri.

This method removes the attribute list.




Fields inherited from org.archive.crawler.datamodel.CandidateURI
final public static int HIGH(Code)(Java Doc)
final public static int HIGHEST(Code)(Java Doc)
final public static int MEDIUM(Code)(Java Doc)
final public static int NORMAL(Code)(Java Doc)

Methods inherited from org.archive.crawler.datamodel.CandidateURI
protected void clearAList()(Code)(Java Doc)
public boolean containsKey(String key)(Code)(Java Doc)
public CandidateURI createCandidateURI(UURI baseUURI, Link link) throws URIException(Code)(Java Doc)
public CandidateURI createCandidateURI(UURI baseUURI, Link link, int scheduling, boolean seed) throws URIException(Code)(Java Doc)
public static CandidateURI createSeedCandidateURI(UURI uuri)(Code)(Java Doc)
public String flattenVia()(Code)(Java Doc)
public boolean forceFetch()(Code)(Java Doc)
public static CandidateURI fromString(String uriHopsViaString) throws URIException(Code)(Java Doc)
public AList getAList()(Code)(Java Doc)
public synchronized String getCandidateURIString()(Code)(Java Doc)
public String getClassKey()(Code)(Java Doc)
public int getInt(String key)(Code)(Java Doc)
public long getLong(String key)(Code)(Java Doc)
public Object getObject(String key)(Code)(Java Doc)
public String getPathFromSeed()(Code)(Java Doc)
public String[] getReports()(Code)(Java Doc)
public int getSchedulingDirective()(Code)(Java Doc)
public String getString(String key)(Code)(Java Doc)
public int getTransHops()(Code)(Java Doc)
public String getURIString()(Code)(Java Doc)
public UURI getUURI()(Code)(Java Doc)
public UURI getVia()(Code)(Java Doc)
public CharSequence getViaContext()(Code)(Java Doc)
protected void inheritFrom(CandidateURI ancestor)(Code)(Java Doc)
public boolean isLocation()(Code)(Java Doc)
public boolean isSeed()(Code)(Java Doc)
public Iterator keys()(Code)(Java Doc)
public void makeHeritable(String key)(Code)(Java Doc)
public void makeNonHeritable(String key)(Code)(Java Doc)
public boolean needsImmediateScheduling()(Code)(Java Doc)
public boolean needsSoonScheduling()(Code)(Java Doc)
public void putInt(String key, int value)(Code)(Java Doc)
public void putLong(String key, long value)(Code)(Java Doc)
public void putObject(String key, Object value)(Code)(Java Doc)
public void putString(String key, String value)(Code)(Java Doc)
protected UURI readUuri(String u)(Code)(Java Doc)
public void remove(String key)(Code)(Java Doc)
public void reportTo(String name, PrintWriter writer)(Code)(Java Doc)
public void reportTo(PrintWriter writer) throws IOException(Code)(Java Doc)
public boolean sameDomainAs(CandidateURI other) throws URIException(Code)(Java Doc)
protected void setAList(AList alist)(Code)(Java Doc)
public void setClassKey(String key)(Code)(Java Doc)
public void setForceFetch(boolean b)(Code)(Java Doc)
public void setIsSeed(boolean b)(Code)(Java Doc)
protected void setPathFromSeed(String string)(Code)(Java Doc)
public void setSchedulingDirective(int schedulingDirective)(Code)(Java Doc)
public void setVia(UURI via)(Code)(Java Doc)
public String singleLineLegend()(Code)(Java Doc)
public String singleLineReport()(Code)(Java Doc)
public void singleLineReportTo(PrintWriter w)(Code)(Java Doc)
public String toString()(Code)(Java Doc)

Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException(Code)(Java Doc)
public boolean equals(Object obj)(Code)(Java Doc)
protected void finalize() throws Throwable(Code)(Java Doc)
final native public Class getClass()(Code)(Java Doc)
native public int hashCode()(Code)(Java Doc)
final native public void notify()(Code)(Java Doc)
final native public void notifyAll()(Code)(Java Doc)
public String toString()(Code)(Java Doc)
final native public void wait(long timeout) throws InterruptedException(Code)(Java Doc)
final public void wait(long timeout, int nanos) throws InterruptedException(Code)(Java Doc)
final public void wait() throws InterruptedException(Code)(Java Doc)
