heritrix

Heritrix Crawlers
License: GNU Library or Lesser General Public License (LGPL)
URL: http://crawler.archive.org/
Description: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Package Name / Comment
com.sleepycat.collections
org.apache.commons.httpclient
org.apache.commons.httpclient.cookie
org.apache.commons.pool.impl
org.archive.crawler

Introduction to Heritrix.

Heritrix is designed to be easily extensible via 3rd party modules.

Architecture

The software is divided into several packages of varying importance. The relationship between them will be covered in some greater depth after their introductions.

The root package (this one) contains the executable class {@link org.archive.crawler.Heritrix Heritrix}. That class parses the command line arguments and loads the crawler. If a WUI (web user interface) is to be launched, it launches it. It can also start jobs specified in command line options, with or without the WUI.

framework

{@link org.archive.crawler.framework org.archive.crawler.framework}

The framework package contains most of the core classes for running a crawl. It also contains a number of interfaces for extensible items, whose implementations can be found in other packages.

Heritrix is in effect divided into two types of classes.

  1. Core classes - these can often be configured but not replaced.
  2. Pluggable classes - these must implement a given interface or extend a specific class, but 3rd parties can introduce their own implementations.
The framework thus contains a selection of the core classes and a number of the interfaces and base classes for the pluggable classes.

datamodel

{@link org.archive.crawler.datamodel org.archive.crawler.datamodel}

Contains various classes that make up the crawler's data model, including such essentials as the CandidateURI and CrawlURI classes that wrap discovered URIs for processing.

admin

{@link org.archive.crawler.admin org.archive.crawler.admin}

The admin package contains classes that are used by the Web UI. This includes some core classes and a specific implementation of the Statistics Tracking interface found in the framework package that is designed to provide the UI with information about ongoing crawls.

Pluggable modules

The following is a listing of the types of pluggable modules found in Heritrix, with brief explanations of each and links to their respective API documentation.

Frontier

A Frontier maintains the internal state of a crawl while it is in progress: what URIs have been discovered, which should be crawled next, and so on.

Needless to say this is one of the most important modules in any crawl and the provided implementation should generally be appropriate unless a very different strategy for ordering URIs for crawling is desired.

{@link org.archive.crawler.framework.Frontier Frontier} is the interface that all Frontiers must implement.
{@link org.archive.crawler.frontier org.archive.crawler.frontier} package contains the provided implementation of a Frontier along with its supporting classes.
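
To make the Frontier's role concrete, here is a minimal, self-contained sketch of the two essentials described above: a queue of URIs still to be crawled and a record of what has already been scheduled. It is an illustration only; the class and method names are invented for the example and are not the real {@link org.archive.crawler.framework.Frontier} API.

     import java.util.ArrayDeque;
     import java.util.HashSet;
     import java.util.Queue;
     import java.util.Set;

     /** Toy frontier: remembers what has been discovered and hands out the next URI to crawl. */
     class ToyFrontier {
         private final Queue<String> pending = new ArrayDeque<String>(); // URIs waiting to be fetched
         private final Set<String> seen = new HashSet<String>();         // everything ever scheduled

         /** Schedule a newly discovered URI unless it has been seen before. */
         void schedule(String uri) {
             if (seen.add(uri)) {
                 pending.add(uri);
             }
         }

         /** Hand the next URI to a worker thread, or null if nothing is pending. */
         String next() {
             return pending.poll();
         }

         boolean isEmpty() {
             return pending.isEmpty();
         }

         public static void main(String[] args) {
             ToyFrontier frontier = new ToyFrontier();
             frontier.schedule("http://example.com/");       // seed
             frontier.schedule("http://example.com/a.html"); // discovered link
             frontier.schedule("http://example.com/");       // duplicate, ignored
             while (!frontier.isEmpty()) {
                 System.out.println("crawl next: " + frontier.next());
             }
         }
     }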

Processor

Processing Steps

When a URI is crawled, a {@link org.archive.crawler.framework.ToeThread ToeThread} will execute a series of processors on it.

The processors are split into 5 distinct chains that are executed in sequence:

  1. Pre-fetch processing chain
  2. Fetch processing chain
  3. Extractor processing chain
  4. Write/Index processing chain
  5. Post-processing chain
Each of these chains contains any number of processors. The processors all inherit from a generic {@link org.archive.crawler.framework.Processor Processor}. While the processors are divided into the five categories above, that is strictly a high-level configuration; any processor can be placed in any chain (although doing link extraction before fetching a document is clearly of no use).
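
As a rough illustration of the chain idea, the following self-contained sketch runs a URI through an ordered set of processor-like steps grouped into named chains. It is a simplified model; the real {@link org.archive.crawler.framework.Processor} and ToeThread classes carry far more state (the CrawlURI, per-processor settings, error handling, and so on).

     import java.util.LinkedHashMap;
     import java.util.List;
     import java.util.Map;

     /** Simplified model of a worker running a URI through ordered processing chains. */
     class ChainDemo {

         interface Step {                      // stand-in for a Processor
             void process(String uri);
         }

         public static void main(String[] args) {
             // Chains are executed in this order; each may hold any number of steps.
             Map<String, List<Step>> chains = new LinkedHashMap<>();
             chains.put("pre-fetch", List.of(uri -> System.out.println("check preconditions for " + uri)));
             chains.put("fetch", List.of(uri -> System.out.println("fetch " + uri)));
             chains.put("extract", List.of(uri -> System.out.println("extract links from " + uri)));
             chains.put("write", List.of(uri -> System.out.println("write " + uri + " to an archive file")));
             chains.put("post-process", List.of(uri -> System.out.println("report outcome of " + uri)));

             String uri = "http://example.com/";
             for (List<Step> chain : chains.values()) {
                 for (Step step : chain) {
                     step.process(uri);        // a ToeThread-like worker runs every step in sequence
                 }
             }
         }
     }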

Numerous processors are provided with Heritrix in the following packages:
{@link org.archive.crawler.prefetch org.archive.crawler.prefetch} package contains processors run before the URI is fetched from the Internet.
{@link org.archive.crawler.fetcher org.archive.crawler.fetcher} package contains processors that fetch URIs from the Internet. Typically each processor handles a different protocol.
{@link org.archive.crawler.extractor org.archive.crawler.extractor} package contains processors that perform link extractions on various document types.
{@link org.archive.crawler.writer org.archive.crawler.writer} package contains a processor that writes an ARC file with the fetched document.
{@link org.archive.crawler.postprocessor org.archive.crawler.postprocessor} package contains processors that wrap up the processing, reporting discovered links back to the Frontier, etc.

Filter

Scope

Scopes are special filters that are applied to the crawl as a whole to define its scope. Any given crawl will employ exactly one scope object to define what URIs are considered 'within scope'.

Several implementations covering the most commonly desired scopes are provided (broad, domain, host, etc.). However, custom implementations can be written to define any arbitrary scope. It should be noted, though, that most limitations on the scope of a crawl can be achieved more easily by taking one of the existing scopes and modifying it with appropriate filters.

{@link org.archive.crawler.framework.CrawlScope CrawlScope} - Base class for scopes.
{@link org.archive.crawler.scope org.archive.crawler.scope} package. Contains provided scopes.
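
As a self-contained illustration of the idea (not the real {@link org.archive.crawler.framework.CrawlScope} API), a host-oriented scope boils down to a predicate over candidate URIs:

     import java.net.URI;
     import java.util.Set;

     /** Toy host scope: a URI is in scope if its host matches one of the seed hosts. */
     class ToyHostScope {
         private final Set<String> seedHosts;

         ToyHostScope(Set<String> seedHosts) {
             this.seedHosts = seedHosts;
         }

         boolean inScope(String candidate) {
             try {
                 String host = URI.create(candidate).getHost();
                 return host != null && seedHosts.contains(host.toLowerCase());
             } catch (IllegalArgumentException e) {
                 return false;                 // unparseable URIs are out of scope
             }
         }

         public static void main(String[] args) {
             ToyHostScope scope = new ToyHostScope(Set.of("example.com"));
             System.out.println(scope.inScope("http://example.com/page.html")); // true
             System.out.println(scope.inScope("http://other.org/"));            // false
         }
     }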

Statistics Tracking

Any number of statistics tracking modules can be added to a crawl to gather run time information about its progress.

These modules can interrogate the Frontier for the sparse data it exposes, and they can also subscribe to {@link org.archive.crawler.event.CrawlURIDispositionListener Crawled URI Disposition} events to monitor the completion of each URI that is processed.

An interface for {@link org.archive.crawler.framework.StatisticsTracking statistics tracking} is provided as well as a partial implementation ({@link org.archive.crawler.framework.AbstractTracker AbstractTracker}) that does much of the work common to most statistics tracking modules.

Furthermore, the admin package implements a statistics tracking module ({@link org.archive.crawler.admin.StatisticsTracker StatisticsTracker}) that generates a log of the crawler's progress as well as providing information that the UI uses. It also compiles end-of-crawl reports that contain all of the information it has gathered in the course of the crawl.
It is highly recommended that it always be used when running crawls via the UI.
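
A minimal, self-contained sketch of the event-subscription idea follows; the listener interface here is a simplified stand-in, not the real {@link org.archive.crawler.event.CrawlURIDispositionListener} or {@link org.archive.crawler.framework.AbstractTracker} API.

     import java.util.concurrent.atomic.AtomicLong;

     /** Toy statistics tracker that counts per-URI disposition events. */
     class ToyStatsTracker {

         /** Stand-in for a crawled-URI-disposition listener. */
         interface DispositionListener {
             void crawledURISuccessful(String uri);
             void crawledURIFailure(String uri);
         }

         static class Counter implements DispositionListener {
             private final AtomicLong successes = new AtomicLong();
             private final AtomicLong failures = new AtomicLong();

             public void crawledURISuccessful(String uri) { successes.incrementAndGet(); }
             public void crawledURIFailure(String uri)    { failures.incrementAndGet(); }

             String report() {
                 return successes.get() + " succeeded, " + failures.get() + " failed";
             }
         }

         public static void main(String[] args) {
             Counter stats = new Counter();
             // In Heritrix the crawl would deliver these events; here we fake two of them.
             stats.crawledURISuccessful("http://example.com/");
             stats.crawledURIFailure("http://example.com/missing.html");
             System.out.println(stats.report());
         }
     }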

org.archive.crawler.admin Contains classes that the web UI uses to monitor and control crawls. Some utility classes used exclusively or primarily for the UI are also included.

Most of the heavy duty work is done by the CrawlJobHandler, which manages most of the interaction between the UI and the CrawlController. The CrawlJob class serves to encapsulate the settings needed to launch one crawl.

This package also provides an implementation of the Statistics Tracking interface that contains useful methods to access progress data. This is used for monitoring crawls. While it is technically possible to launch jobs without this statistics tracker, doing so would leave the UI unable to monitor the progress of that crawl.

org.archive.crawler.admin.ui
org.archive.crawler.datamodel
org.archive.crawler.datamodel.credential Contains HTML form login and basic and digest credentials used by Heritrix when logging into sites.

To watch credentials at work, enable logging by setting the following logging level for the FetchHTTP class: org.archive.crawler.fetcher.FetchHTTP.level = FINE

org.archive.crawler.deciderules Provides classes for a simple decision rules framework.

Each 'step' in a decision rule set which can affect an object's ultimate fate is called a DecideRule. Each DecideRule renders a decision (possibly neutral) on the passed object's fate.

Possible decisions are:

  • ACCEPT means the object is ruled-in for further processing
  • REJECT means the object is ruled-out for further processing
  • PASS means this particular DecideRule has no opinion

As previously outlined, each DecideRule is applied in turn; the last one to express a non-PASS preference wins.

For example, if the rules are:

  • AcceptDecideRule -- ACCEPTs all (establishing a default)
  • TooManyHopsDecideRule(max-hops=3) -- REJECTS all with hopsPath.length()>3, PASSes otherwise
  • PrerequisiteAcceptDecideRule -- ACCEPTs any with 'P' as last hop, PASSes otherwise (this allows 'LLL's which need a 'LLLP' prerequisite a chance to complete)
Then you have a crawl that will go 3 hops (of any type) from the seeds, with a special affordance to get prerequisites of 3-hop items (which may be 4 "hops" out).
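
The following self-contained sketch captures the evaluation rule described above (the last non-PASS decision wins). The three rules mirror the example but are simplified stand-ins, not the real org.archive.crawler.deciderules classes.

     import java.util.List;
     import java.util.function.Function;

     /** Toy decide-rule chain: the last rule to say ACCEPT or REJECT wins. */
     class DecideRuleDemo {

         enum Decision { ACCEPT, REJECT, PASS }

         static Decision decide(List<Function<String, Decision>> rules, String hopsPath) {
             Decision result = Decision.PASS;                 // no opinion yet
             for (Function<String, Decision> rule : rules) {
                 Decision d = rule.apply(hopsPath);
                 if (d != Decision.PASS) {
                     result = d;                              // last non-PASS preference wins
                 }
             }
             return result;
         }

         public static void main(String[] args) {
             List<Function<String, Decision>> rules = List.of(
                 hops -> Decision.ACCEPT,                                     // accept all (the default)
                 hops -> hops.length() > 3 ? Decision.REJECT : Decision.PASS, // too many hops
                 hops -> hops.endsWith("P") ? Decision.ACCEPT : Decision.PASS // prerequisite rescue
             );
             System.out.println(decide(rules, "LLL"));   // ACCEPT: within 3 hops
             System.out.println(decide(rules, "LLLL"));  // REJECT: 4 hops
             System.out.println(decide(rules, "LLLP"));  // ACCEPT: prerequisite of a 3-hop item
         }
     }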

To allow this style of decision processing to be plugged into the existing Filter and Scope slots:

  • There's a DecidingFilter which takes an (ordered) map of DecideRules
  • There's a DecidingScope which takes the same

See NewScopingModel for background.

org.archive.crawler.deciderules.recrawl
org.archive.crawler.event
org.archive.crawler.extractor
org.archive.crawler.fetcher
org.archive.crawler.filter
org.archive.crawler.framework
org.archive.crawler.framework.exceptions
org.archive.crawler.frontier
org.archive.crawler.io
org.archive.crawler.postprocessor
org.archive.crawler.prefetch
org.archive.crawler.processor
org.archive.crawler.processor.recrawl
org.archive.crawler.scope
org.archive.crawler.selftest Provides the client-side aspect of the Heritrix integration self test.

The selftest webapp is the repository for the server side of the integration test.

The integration self test is run from the command line. Invocation makes the crawler run against itself, crawling the selftest webapp. When done, the products -- ARC and log files -- are analyzed by code herein to verify whether each test passed or failed.

The integration self test is the aggregation of multiple individual tests, each testing a particular crawler aspect. For example, the Robots test validates the crawler's parse of robots.txt. Each test comprises a directory under the selftest webapp, named for the test, into which we put the server pages that express the scenario to test, and a class from this package named for the test webapp directory with a SelfTest suffix. The selftest class verifies test success. Each selftest class subclasses org.archive.crawler.selftest.SelfTestCase, which is itself a subclass of org.junit.TestCase. All tests need to be registered with the {@link org.archive.crawler.selftest.AllSelfTestCases} class and must live in the org.archive.crawler.selftest package. The class {@link org.archive.crawler.selftest.SelfTestCrawlJobHandler} manages the running of the selftest.

Run one test only by passing its name as the option value to the selftest argument.

The first crop of self tests are derived from tests developed by Parker Thompson < pt at archive dot org >. See Tests. These tests in turn look to have been derived from Testing Search Indexing Systems.

Adding a Self Test

TODO

Related Documentation

TODO

org.archive.crawler.settings Provides classes for the settings framework.

The settings framework is designed to be a flexible way to configure a crawl with special treatment for subparts of the web without adding too much performance overhead.

At its core the settings framework is a way to keep persistent, context sensitive configuration settings for any class in the crawler.

All classes in the crawler that have configurable settings subclass {@link org.archive.crawler.settings.ComplexType} or one of its descendants. The {@link org.archive.crawler.settings.ComplexType} implements the {@link javax.management.DynamicMBean} interface. This gives you a way to ask an object which attributes it supports, and standard methods for getting and setting these attributes.

The entry point into the settings framework is the {@link org.archive.crawler.settings.SettingsHandler}. This class is responsible for loading and saving from persistent storage and for interconnecting the different parts of the framework.


Figure 1. Schematic view of the Settings Framework

Settings hierarchy

The settings framework supports a hierarchy of settings. This hierarchy is built by {@link org.archive.crawler.settings.CrawlerSettings} objects. At the top there is a settings object representing the global settings. This consists of all the settings that a crawl job needs for running. Beneath this global object there is one "per" settings object for each host/domain which has settings that should override the global settings for that particular host or domain.

When the settings framework is asked for an attribute for a specific host, it will first try to see if this attribute is set for this particular host. If it is, the value will be returned. If not, it will go up one level recursively until it eventually reaches the global (order) object and returns the global value. If no value is set there either (normally one would be), a hard-coded default value is returned.
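
A self-contained sketch of that lookup order follows. It is a simplified model of the behavior just described, not the real {@link org.archive.crawler.settings.SettingsHandler} API, and the attribute name used is made up for the example.

     import java.util.HashMap;
     import java.util.Map;

     /** Toy context-sensitive settings: host overrides domain, domain overrides global. */
     class ToySettings {
         private final Map<String, Map<String, Object>> perContext = new HashMap<>();
         private final Map<String, Object> global = new HashMap<>();
         private final Map<String, Object> hardCodedDefaults = Map.of("max-hops", 20);

         void setGlobal(String attr, Object value) {
             global.put(attr, value);
         }

         void setOverride(String context, String attr, Object value) {
             perContext.computeIfAbsent(context, c -> new HashMap<>()).put(attr, value);
         }

         /** Walk host -> parent domains -> global -> hard-coded default. */
         Object get(String host, String attr) {
             String context = host;
             while (!context.isEmpty()) {
                 Map<String, Object> settings = perContext.get(context);
                 if (settings != null && settings.containsKey(attr)) {
                     return settings.get(attr);             // most specific override wins
                 }
                 int dot = context.indexOf('.');
                 context = dot < 0 ? "" : context.substring(dot + 1);
             }
             return global.getOrDefault(attr, hardCodedDefaults.get(attr));
         }

         public static void main(String[] args) {
             ToySettings s = new ToySettings();
             s.setGlobal("max-hops", 3);
             s.setOverride("archive.org", "max-hops", 10);                 // per-domain override
             System.out.println(s.get("crawler.archive.org", "max-hops")); // 10 (domain override)
             System.out.println(s.get("example.com", "max-hops"));         // 3  (global value)
         }
     }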

All per domain/host settings objects only contain those settings which are to be overridden for that particular domain/host. The convention is to name the top level object "global settings" and the objects beneath "per settings" or "overrides" (although the refinements described next also do overriding).

To further complicate the picture, there are also settings objects called refinements. An object of this type belongs to a global or per settings object and overrides the settings of its owner object if certain criteria are met. Such criteria could be that the URI in question matches a regular expression, or that the settings are consulted at a specific time of day limited by a time span.

ComplexType hierarchy

All the configurable modules in the crawler subclass {@link org.archive.crawler.settings.ComplexType} or one of its descendants. The {@link org.archive.crawler.settings.ComplexType} is responsible for keeping the definition of the module's configurable attributes. The actual values are stored in an instance of {@link org.archive.crawler.settings.DataContainer}. The {@link org.archive.crawler.settings.DataContainer} is never accessed directly from user code; instead, the user accesses the attributes through methods on the {@link org.archive.crawler.settings.ComplexType}. The attributes are accessed in different ways depending on whether access is from the user interface or from inside a running crawl.

When an attribute is accessed from the UI (either reading or writing), you want to make sure that you are editing the attribute in the right context. When trying to override an attribute, you don't want the settings framework to traverse up to the effective value for the attribute; instead you want to know whether the attribute is set at this level. To achieve this, there are the {@link org.archive.crawler.settings.ComplexType#getLocalAttribute(CrawlerSettings settings, String name)} and {@link org.archive.crawler.settings.ComplexType#setAttribute(CrawlerSettings settings, Attribute attribute)} methods, which take a settings object as a parameter and work only on the supplied settings object. In addition, the methods {@link org.archive.crawler.settings.ComplexType#getAttribute(String)} and {@link org.archive.crawler.settings.ComplexType#setAttribute(Attribute attribute)} are there for conformance to the Java JMX specification. The latter two always work on the global settings object.

Getting an attribute within a crawl is different in that you always want to get a value even if it is not set in its context. That means that the settings framework should work its way up the settings hierarchy to find the value in effect for the context. The method {@link org.archive.crawler.settings.ComplexType#getAttribute(String name, CrawlURI uri)} should be used to make sure that the right context is used. Figure 2 shows how the settings framework finds the effective value given a context.


Figure 2. Flow of getting an attribute
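
The snippet below strings the three access styles together, using only the method signatures quoted above. Treat it as a hedged sketch: the attribute name is made up, and the exact checked exceptions these methods declare are collapsed into a plain throws Exception.

     import javax.management.Attribute;

     import org.archive.crawler.datamodel.CrawlURI;
     import org.archive.crawler.settings.ComplexType;
     import org.archive.crawler.settings.CrawlerSettings;

     class AttributeAccessExample {

         /** "max-hops" is an illustrative attribute name; exceptions are collapsed for brevity. */
         static void demo(ComplexType module, CrawlerSettings perHostSettings, CrawlURI curi)
                 throws Exception {
             // UI-style access: read and write only what is set on this particular settings object.
             Object localValue = module.getLocalAttribute(perHostSettings, "max-hops");
             module.setAttribute(perHostSettings, new Attribute("max-hops", 10));

             // JMX-conformant access: always works against the global settings object.
             Object globalValue = module.getAttribute("max-hops");

             // Crawl-time access: resolves the effective value for the URI's context.
             Object effectiveValue = module.getAttribute("max-hops", curi);

             System.out.println(localValue + " / " + globalValue + " / " + effectiveValue);
         }
     }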

Each attribute has a type. All allowed types subclass the {@link org.archive.crawler.settings.Type} class. There are three main Types:

  1. {@link org.archive.crawler.settings.SimpleType}
  2. {@link org.archive.crawler.settings.ListType}
  3. {@link org.archive.crawler.settings.ComplexType}
Except for the {@link org.archive.crawler.settings.SimpleType}, the actual type used will be a subclass of one of these main types.

SimpleType

The {@link org.archive.crawler.settings.SimpleType} is mainly for representing Java™ wrappers for the Java™ primitive types. In addition it also handles the {@link java.util.Date} type and a special Heritrix {@link org.archive.crawler.settings.TextField} type. Overrides of a {@link org.archive.crawler.settings.SimpleType} must be of the same type as the initial default value for the {@link org.archive.crawler.settings.SimpleType}.

ListType

The {@link org.archive.crawler.settings.ListType} is further subclassed into versions for some of the wrapped Java™ primitive types ({@link org.archive.crawler.settings.DoubleList}, {@link org.archive.crawler.settings.FloatList}, {@link org.archive.crawler.settings.IntegerList}, {@link org.archive.crawler.settings.LongList}, {@link org.archive.crawler.settings.StringList}). A List holds values in the same order as they were added. If an attribute of type {@link org.archive.crawler.settings.ListType} is overridden, then the complete list of values is replaced at the override level.

ComplexType

The {@link org.archive.crawler.settings.ComplexType} is a map of name/value pairs. The values can be any {@link org.archive.crawler.settings.Type} including new {@link org.archive.crawler.settings.ComplexType MapTypes}. The {@link org.archive.crawler.settings.ComplexType} is defined abstract and you should use one of the subclasses {@link org.archive.crawler.settings.MapType} or {@link org.archive.crawler.settings.ModuleType}. The {@link org.archive.crawler.settings.MapType} allows adding of new name/value pairs at runtime, while the {@link org.archive.crawler.settings.ModuleType} only allows the name/value pairs that it defines at construction time. When overriding the {@link org.archive.crawler.settings.MapType}, the options are either to override the value of an already existing attribute or to add a new one. It is not possible in an override to remove an existing attribute. The {@link org.archive.crawler.settings.ModuleType} doesn't allow additions in overrides, but the predefined attributes' values might be overridden. Since the {@link org.archive.crawler.settings.ModuleType} is defined at construction time, it is possible to set more restrictions on each attribute than in the {@link org.archive.crawler.settings.MapType}. Another consequence of definition at construction time is that you would normally subclass the {@link org.archive.crawler.settings.ModuleType}, while the {@link org.archive.crawler.settings.MapType} is usable as it is. It is possible to restrict the {@link org.archive.crawler.settings.MapType} to only allow attributes of a certain type. There is also a restriction that {@link org.archive.crawler.settings.MapType MapTypes} cannot contain nested {@link org.archive.crawler.settings.MapType MapTypes}.
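
The contrast can be pictured with a small self-contained model (invented classes, not the real settings types): a module-like type fixes its attribute names at construction and only lets overrides change their values, while a map-like type also accepts new entries at runtime.

     import java.util.HashMap;
     import java.util.Map;

     /** Toy contrast between a fixed-key "module" and an open "map" of settings. */
     class TypeDemo {

         /** Keys are fixed at construction; overrides may only change their values. */
         static class ToyModule {
             private final Map<String, Object> values = new HashMap<>();
             ToyModule(Map<String, Object> definition) {
                 values.putAll(definition);
             }
             void override(String name, Object value) {
                 if (!values.containsKey(name)) {
                     throw new IllegalArgumentException("module does not define " + name);
                 }
                 values.put(name, value);
             }
         }

         /** New entries may be added at runtime; existing entries can be overridden but not removed. */
         static class ToyMap {
             private final Map<String, Object> values = new HashMap<>();
             void put(String name, Object value) {
                 values.put(name, value);
             }
         }

         public static void main(String[] args) {
             ToyModule module = new ToyModule(Map.of("timeout-seconds", 20));
             module.override("timeout-seconds", 60);  // allowed: value of a predefined attribute
             // module.override("retries", 3);        // would throw: additions are not allowed

             ToyMap map = new ToyMap();
             map.put("filter-1", "someFilter");       // allowed: maps may grow at runtime
         }
     }
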
org.archive.crawler.settings.refinements
org.archive.crawler.url
org.archive.crawler.url.canonicalize
org.archive.crawler.util
org.archive.crawler.writer
org.archive.extractor
org.archive.httpclient Provides specializations on Apache Jakarta Commons HttpClient.

HttpRecorderGetMethod

Class that marks the passed HttpRecorder with the boundary between HTTP header and content. It also forces a close on the response on call to releaseConnection.

ConfigurableTrustManagerProtocolSocketFactory

A protocol socket factory that allows setting of trust level on construction.

References

Java™ Secure Socket Extension (JSSE): Reference Guide

org.archive.io
org.archive.io.arc ARC file reading and writing.
org.archive.io.warc Experimental WARC Writer and Readers. Code and specification are subject to change with no guarantees of backward compatibility: i.e. newer readers may not be able to parse WARCs written with older writers. This package contains prototyping code for revision 0.12 of the WARC specification. See the latest revision for the current state (version 0.10 code and its documentation have been moved into the v10 subpackage).

Implementation Notes

Tools

Initial implementations of Arc2Warc and Warc2Arc tools can be found in the package above this one, at {@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc} respectively. Pass --help to learn how to use each tool.

TODO

  • Is MIME-Version header needed? MIME Parsers seem fine without (python email lib and java mail).
  • Should we write out a Content-Transfer-Encoding header? (Currently we do not.) The spec. needs a section explicit about our interpretation of MIME and deviations (e.g. content-transfer-encoding should be assumed binary in the case of WARCs, multipart is not disallowed but not encouraged, etc.).
  • Minor: Do WARC-Version: 0.12 like MIME-Version: 1.0 rather than WARC/0.12 for lead in to an ARCRecord?
org.archive.io.warc.v10 Experimental WARC Writer and Readers. Code and specification are subject to change with no guarantees of backward compatibility: i.e. newer readers may not be able to parse WARCs written with older writers. This code, with noted exceptions, is a loose implementation of parts of the (unreleased and unfinished) WARC File Format (Version 0.9). Deviations from 0.9, outlined below in the section Deviations from Spec., are to be proposed as amendments to the specification to make a new revision. Since the new spec. revision will likely be named version 0.10, code in this package writes WARCs of version 0.10 -- not 0.9.

Implementation Notes

Tools

Initial implementations of Arc2Warc and Warc2Arc tools can be found in the package above this one, at {@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc} respectively. Pass --help to learn how to use each tool.

Unique ID Generator

WARC requires a GUID for each record written. A configurable unique ID generator factory, {@link org.archive.uid.GeneratorFactory}, was added; it can be configured to use alternate unique ID generators and defaults to {@link org.archive.uid.UUIDGenerator}. The default implementation generates UUIDs (using the Java 5 java.util.UUID class) in an urn scheme using the uuid namespace [see RFC4122].
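
For reference, generating such an identifier with the standard java.util.UUID class looks like this; the urn:uuid prefix is the RFC4122 URN form.

     import java.util.UUID;

     class RecordIdExample {
         public static void main(String[] args) {
             // A random (version 4) UUID rendered in the urn:uuid namespace, e.g.
             // urn:uuid:f9472055-fbb6-4810-90e8-68fd39e145a6
             String recordId = "urn:uuid:" + UUID.randomUUID();
             System.out.println(recordId);
         }
     }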

{@link org.archive.util.anvl ANVL}

The RFC822-like ANVL format is used when writing Named Fields in WARCs and occasionally for metadata. An implementation was added at {@link org.archive.util.anvl}.

Miscellaneous

Writing WARCs, the response record type is chosen as the core record that all others associate to: i.e. all others have a Related-Record-ID that points back to the response.

Deviations from Spec.

The below deviations from spec. 0.9 have been realized in code and are to be proposed as spec. amendments with new revision likely to be 0.10 (Vocal assent was given by John, Gordon, and Stack to the below at La Honda Meeting, August 8th, 2006).

mimetype in header line

Allow full mimetypes in the header line as per RFC2045, rather than the current, shriveled mimetype that allows only type and subtype. This will mean mimetypes are allowed parameters: e.g. text/plain; charset=UTF-8 or application/http; msgtype=request. Allowing full mimetypes, we can support the following scenarios without further amendment to the specification and without parsers having to resort to metadata records or to custom Named Fields to figure out how to interpret the payload:

  • Consider the case where an archiving organization would store all related to a capture as one record with a mimetype of multipart/mixed; boundary=RECORD-ID. An example record might comprise the parts Content-Type: application/http; msgtype=request, Content-Type: application/http; msgtype=response, and Content-Type: text/xml+rdf (For metadata).
  • Or, an archiving institution would store a capture with multipart/alternatives ranging from most basic (or 'desiccated' in Kunze-speak) -- perhaps a text/plain rendition of a PDF capture -- through to best, the actual PDF binary itself.

To support full mimetypes, we must allow for whitespace between parameters and allow that parameter values themselves might include whitespace ('quoted-string'). The WARC Writer converts any embedded carriage returns and newlines to a single space.

Swap position of recordid and mimetype in the header line

Because of the above amendment allowing full mimetypes on the header line, and to ease parsing since the mimetype may now include whitespace, we move the mimetype to the last position on the header line and the recordid to second-from-last.

Use application/http instead of message/http

The message type has a line length maximum of 1000 characters absent a Content-Transfer-Encoding header set to BINARY. (See the definition of message/http for talk of adherence to MIME message line limits: see 19.1 Internet Media Type message/http and application/http in RFC2616.)

Suggested Spec. Amendments

Apart from the above listed deviations, the below changes are also suggested for inclusion in the 0.10 spec. revision.

Below are mostly suggested edits. Changes are not substantive.

Allow multiple instances of a single Named Parameter

Allow that there may be multiple instances of the same Named Parameter in any one Named Parameter block. E.g. multiple Related-Record-IDs could prove of use. The spec. mentions this in the 8.1 HTTP and HTTPS section, but it better belongs in the 5.2 Named Parameters preamble.

Related, add to Named Field section note on bidirectional Related-Record-ID.

Miscellaneous

LaHonda below refers to the meeting of John, Gordon, and Stack at the LaHonda Cafe on 16th St. on August 8th, 2006.

  • Leave off 9.2 GZIP extra fields. Big section on implementing an option that has little to do with WARCing. AGREED at LaHonda.
  • But, we need to mark gzipped files as being WARC: i.e. that the GZIP is a member per resource. It's useful so readers know how to invoke GZIP (whether it has to be done once to get at any record, or per record). Suggest adding a GZIP extra field in the HEAD of the GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.
  • IP-Address for dns resource is DNS Server. Add note to this effect in 8.2 DNS.
  • Section 6. is truncated -- missing text. What was intended here? SEE ISO DOC.
  • In-line the ANVL definition (from Kunze). Related: can labels have CTLs such as CRLF (they shouldn't)? When it says 'control-chars', does this include Unicode control characters (it should)? CHAR is described as ASCII/UTF-8, but they are not the same (should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda. Postpone to the 0.11 revision.
  • Fix examples. Use output of experimental ARC Writer.
  • Fix ambiguity in the spec. pertaining to 'smallest possible anvl-fields' noted by Mads Alhof Kristiansen in Digital Preservation using the WARC File Format.

Open Issues

Drop response record type

resource is sufficient. Let the mimetype distinguish whether the capture includes response headers or not (as per the comment at the end of 8.1 HTTP and HTTPS, which allows that if there are no response headers, the resource record type and the page mimetype be used rather than the response type plus a mimetype of message/http: the difference in record types is not needed to distinguish between the two types of capture).

Are there other capture methods that would require a response record, that don't have a mimetype that includes response headers and content? SMTP has a rich MIME set to describe responses. Its request is pretty much unrecordable. NNTP and FTP are similar. Because of the rich MIME, there is no need for a special response type here.

Related, do we need the request record? Only makes sense for HTTP?

This proposal is contentious. Gordon drew a scenario where response would be needed to distinguish local from remote capture if an archiving institution purposefully archived without recording headers, or if the payload itself was an archived record. In opposition, it was suggested that should an institution choose to capture in this 'unusual' mode, crawl metadata could be consulted to disambiguate how the capture was done (to be further investigated; in general, the definition of record types is still in need of work).

subject-url

The ISO revision suggests that the positional parameter subject-uri be renamed. Suggest record-url.

Other issues

  • Should we allow freeform creation of custom Named Fields if they have a MIME-like 'X-' or somesuch prefix?
  • Nothing on header-line encoding (Section 11 says UTF-8). For completeness should be US-ASCII or UTF-8, no control-chars (especially CR or LF), etc.
  • warcinfo
    • What for a scheme? Using UUID as per G suggestion.
    • Also, how to populate description of crawl into warcinfo? 'Documentation' Named Field with list of URLs that can be assumed to exist somewhere in the current WARC set (We'd have to make the crawler go get them at start of a crawl).
    • I don't want to repeat the crawl description for every WARC. How to have this warcinfo point at an original? related-record-id seems insufficient.
    • If the crawler config. changes, can I just write a warcinfo with differences? How to express? Or better as metadata about a warcinfo?
    • In the past we used to get the filename from this URL header field when we were unsure of the filename or it was unavailable (we're reading a stream). We won't be able to do that with a UUID for the URL. So, introduce a new optional warcinfo Named Field 'Filename' that will be used when a warcinfo is put at the start of a file. Allow warcinfo to have a named parameter 'Filename'?
  • revisit
    • What to write? Use a description field or just expect this info to be present in the warcinfo? Example has request header (inside XML). Better to use associated request record for this kind of info?
    • Related-Record-ID (RRID) of original is likely an onerous requirement. Envisioning an implementation where we'd write revisit records, we'd write such a record where content was judged same or where date since last fetch had not changed. If we're to write the RRID, then we'd have to maintain table keyed by URL with value of page hash or of last modified-date plus associated RRID (actual RRID URL, not a hash).
  • Should we allow a Description Named Field? E.g. I add an order file as a metadata record and associate it with a warcinfo record. The Description field could say "This is the Heritrix order file". Same for seeds. An alternative is custom XML packaging (a scheme could describe fields such as the 'order' file) or ANVL packaging using ANVL 'comments'.
  • Section 11, why was it we said we don't need a parameter or explicit subtype for the special gzip WARC format? I don't remember? A reader needs to know when it's reading a stream. A client would like to know so it writes the stream to disk with the right suffix? Recap. (Perhaps it was looking at the MAGIC bytes -- if it starts with the GZIP MAGIC and includes extra fields that denote it WARC, that's sufficient?)
  • Section 7, on truncation, on 7.1, suggest values -- 'time', 'length' -- but allow free form description? Leave off 'superior method of indicating truncation' paragraph. This qualifier could be added to all sections of doc -- that a subsequent revision of any aspect of the doc. will be superior. Rather than End-Length, like MIME, last record could have Segment-Number-Total, a count of all segments that make up complete record.

From LaHonda, discussion of the revisit type. The definition was tightened somewhat by saying revisit is used when you chose not to store the capture. It was thought possible that it NOT require a pointer back to an original. It was suggested that it might have a similarity judgment header -- similarity-value -- with values between 0 and 1. It might also have analysis-method and description. Possible methods discussed included: URI same, length same, hash of content same, judgment based off the content of an HTTP HEAD request, etc. Possible payloads might be: nothing, a diff, the hash obtained, etc.

Unimplemented

  • Record Segmentation (4.8 continuation record type and the 5.2 Segment-* Named Parameters). Future TODO.
  • 4.7 conversion type. Future TODO.

TODOs

  • unit tests using multipart/* (JavaMail) reading and writing records? Try record-id as part boundary.
  • Performance: Need to add Record-based buffering. GZIP'd streams have some buffering because of the deflater but could probably do w/ more.
org.archive.net
org.archive.net.md5
org.archive.net.rsync
org.archive.net.s3
org.archive.queue
org.archive.uid A unique ID generator. Default is {@link org.archive.uid.UUIDGenerator}. To use another ID generator, set the System Property org.archive.uid.GeneratorFactory.generator to point at an alternate implementation of {@link org.archive.uid.Generator}.

TODO

  • MIME boundaries have upper-bound of 70 characters total including 'blank line' (CRLFCRLF) and two leading hyphens. Add to {@link org.archive.uid.Generator} interface an upper-bound on generated ID length.
  • Add example of an actionable uid generator: e.g. http://archive.org/UID-SCHEME/ID where scheme might be UUID and an ID might be f9472055-fbb6-4810-90e8-68fd39e145a6;type=metadata or, using ARK: http://archive.org/ark:/13030/f9472055-fbb6-4810-90e8-68fd39e145a6;type=metadata.
org.archive.util
org.archive.util.anvl Parsers and Writers for the (expired) Internet-Draft A Name-Value Language (ANVL). Use {@link org.archive.util.anvl.ANVLRecord} to create new instances of ANVL Records and for parsing.

Implementation Details

The ANVL Internet-Draft of 14 February 2005 is not specific as to the definition of 'blank line' and 'newline'. This parser implementation assumes CRNL (carriage return followed by newline).

Says "An element consists of a label, a colon, and an optional value". Should that be: "An element consists of a label and an optional value, or a comment."

The specification is unclear regarding CR or NL in a label or comment (this implementation disallows CR or NL in labels but lets them pass in comments).

A grammar would help. Here is RFC822:

     field       =  field-name ":" [ field-body ] CRLF
     
     field-name  =  1*<any CHAR, excluding CTLs, SPACE, and ":">
     
     field-body  =  field-body-contents
                    [CRLF LWSP-char field-body]
     
     field-body-contents =
                   <the ASCII characters making up the field-body, as
                    defined in the following sections, and consisting
                    of combinations of atom, quoted-string, and
                    specials tokens, or else consisting of texts>

org.archive.util.bdbje
org.archive.util.fingerprint
org.archive.util.iterator
org.archive.util.ms Memory-efficient reading of .doc files. To extract the text from a .doc file, use {@link org.archive.util.ms.Doc#getText(SeekInputStream)}. That's basically the whole API. The other classes are necessary to make that method work, and you can probably ignore them.

Implementation/Format Details

These APIs differ from the POI API provided by Apache in that POI wants to load complete documents into memory. Though POI does provide an "event-driven" API that is memory efficient, that API cannot be used to scan text across block or piece boundaries.

This package provides a stream-based API for extracting the text of a .doc file. At this time, the package does not provide a way to extract style attributes, embedded images, subdocuments, change tracking information, and so on.

There are two layers of abstraction between the contents of a .doc file and reality. The first layer is the Block File System, and the second layer is the piece table.

The Block File System

All .doc files are secretly file systems, like a .iso file, but insane. A good overview of how this file system is arranged inside the file is available at the Jakarta POIFS system.

Subfiles and directories in a block file system are represented via the {@link org.archive.util.ms.Entry} interface. The root directory can be obtained via the {@link org.archive.util.ms.BlockFileSystem#getRoot()} method. From there, the child entries can be discovered.

The file system divides its subfiles into 512-byte blocks. Those blocks are not necessarily stored in a linear order; blocks from different subfiles may be interspersed with each other. The {@link org.archive.util.ms.Entry#open()} method returns an input stream that provides a continuous view of a subfile's contents. It does so by moving the file pointer of the .doc file behind the scenes.

It's important to keep in mind that any given read on a stream produced by a BlockFileSystem may involve:

  1. Moving the file pointer to the start of the file to look up the main block allocation table.
  2. Navigating the file pointer through various allocation structures located throughout the file.
  3. Finally repositioning the file pointer at the start of the next block to be read.

So, this package lowers memory consumption at the expense of greater IO activity. A future version of this package will use internal caches to minimize IO activity, providing tunable trade-offs between memory and IO.

The Piece Table

The second layer of abstraction between you and the contents of a .doc file is the piece table. Some .doc files are produced using a "fast-save" feature that only writes recent changes to the end of the file. In this case, the text of the document may be fragmented within the document stream itself. Note that this fragmentation is in addition to the block fragmentation described above.

A .doc file contains several subfiles within its filesystem. The two that are important for extracting text are named WordDocument and 0Table. The WordDocument subfile contains the text of the document. The 0Table subfile contains supporting information, including the piece table.

The piece table is a simple map from logical character position to actual subfile stream position. Additionally, each piece table entry describes whether the piece stores text as 16-bit Unicode or as 8-bit ANSI codes. One .doc file can contain both Unicode and ANSI text. A consequence of this is that every .doc file has a piece table, even those that were not "fast-saved".
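
A self-contained sketch of what a piece-table lookup amounts to is shown below; it is a simplified model with made-up numbers, not the code this package uses to read the 0Table stream.

     import java.util.List;

     /** Toy piece table: maps a logical character position to a byte offset in the subfile. */
     class PieceTableDemo {

         /** One entry: characters [startChar, endChar) live at filePos, stored as Unicode or ANSI. */
         static class Piece {
             final int startChar, endChar;
             final long filePos;
             final boolean unicode;
             Piece(int startChar, int endChar, long filePos, boolean unicode) {
                 this.startChar = startChar;
                 this.endChar = endChar;
                 this.filePos = filePos;
                 this.unicode = unicode;
             }
         }

         /** Find the byte offset in the WordDocument stream for a logical character position. */
         static long byteOffset(List<Piece> table, int charPos) {
             for (Piece p : table) {
                 if (charPos >= p.startChar && charPos < p.endChar) {
                     int bytesPerChar = p.unicode ? 2 : 1;   // 16-bit Unicode vs 8-bit ANSI
                     return p.filePos + (long) (charPos - p.startChar) * bytesPerChar;
                 }
             }
             throw new IllegalArgumentException("character position out of range: " + charPos);
         }

         public static void main(String[] args) {
             List<Piece> table = List.of(
                 new Piece(0, 100, 2048L, false),   // ANSI text near the front of the subfile
                 new Piece(100, 150, 9216L, true)   // fast-saved Unicode text appended later
             );
             System.out.println(byteOffset(table, 50));   // 2048 + 50 = 2098
             System.out.println(byteOffset(table, 120));  // 9216 + 40 = 9256
         }
     }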

The reader returned by {@link org.archive.util.ms.Doc#getText(SeekInputStream)} consults the piece table to determine where in the WordDocument subfile the next piece of text is located. It also uses the piece table to determine how bytes should be converted to Unicode characters.

Note, however, that any read from such a reader may involve:

  1. Moving the file pointer to the piece table.
  2. Searching the piece table index for the next piece, which may involve moving the file pointer many times.
  3. Moving the file pointer to that piece's description in the piece table.
  4. Moving the file pointer to the start of the piece indicated by the description.
Since the "file pointer" in this context is the file pointer of the subfile, each move described above may additionally involve:
  1. Moving the file pointer to the start of the file to look up the main block allocation table.
  2. Navigating the file pointer through various allocation structures located throughout the file.
  3. Finally repositioning the file pointer at the start of the next block to be read.
A future implementation will provide an intelligent cache of the piece table, which will hopefully reduce the IO activity required.
st.ata.util