org.archive.io.warc.v10

org.archive.io.warc package Experimental WARC Writer and Readers. Code and specification subject to change with no guarantees of backward compatibility: i.e. newer readers may not be able to parse WARCs written with older writers. This code, with noted exceptions, is a loose implementation of parts of the (unreleased and unfinished) WARC File Format (Version 0.9). Deviations from 0.9, outlined below in the section Deviations from Spec., are to be proposed as amendments to the specification to make a new revision. Since the new spec. revision will likely be named version 0.10, code in this package writes WARCs of version 0.10 -- not 0.9.

Implementation Notes

Tools

Initial implementations of Arc2Warc and Warc2Arc tools can be found in the package above this one, at {@link org.archive.io.Arc2Warc} and {@link org.archive.io.Warc2Arc} respectively. Pass --help to learn how to use each tool.

Unique ID Generator

WARC requires a GUID for each record written. A configurable unique ID generator factory, {@link org.archive.uid.GeneratorFactory}, was added; it can be configured to use alternate unique ID generators and defaults to {@link org.archive.uid.UUIDGenerator}. The default implementation generates UUIDs (using the Java 5 java.util.UUID class) as URNs in the uuid namespace [See RFC4122].
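
As a rough illustration of the default described above, a record ID in the urn:uuid form can be built directly from java.util.UUID. The class and method names here (RecordIdSketch, newRecordId) are hypothetical and are not the org.archive.uid API; this is only a minimal sketch of the idea.

```java
import java.util.UUID;

/** Hypothetical helper for illustration; not part of org.archive.uid. */
final class RecordIdSketch {

    /** Returns a fresh record ID: "urn:uuid:" followed by a random UUID. */
    static String newRecordId() {
        return "urn:uuid:" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // Prints a different ID on every run.
        System.out.println(newRecordId());
    }
}
```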

{@link org.archive.util.anvl ANVL}

The ANVL RFC822-like format is used for writing Named Fields in WARCs and occasionally for metadata. An implementation was added at {@link org.archive.util.anvl}.
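
To give a feel for the RFC822-like shape of such a block, here is a minimal sketch that serializes "label: value" lines followed by a blank line. It is not the org.archive.util.anvl implementation: the class AnvlSketch, the CRLF line terminator, and the folding of embedded line breaks into a space are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified writer for "label: value" blocks. */
final class AnvlSketch {
    // A list (not a map) so that the same label may appear more than once.
    private final List<String[]> fields = new ArrayList<>();

    AnvlSketch add(String label, String value) {
        // Assumption: keep each element on one line by folding CR/LF runs into a space.
        fields.add(new String[] {label, value.replaceAll("[\\r\\n]+", " ")});
        return this;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (String[] f : fields) {
            sb.append(f[0]).append(": ").append(f[1]).append("\r\n");
        }
        // Assumption: an empty line terminates the block.
        return sb.append("\r\n").toString();
    }
}
```

For example, new AnvlSketch().add("Related-Record-ID", "urn:uuid:1234").toString() yields one "Related-Record-ID: urn:uuid:1234" line followed by an empty line.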

Miscellaneous

When writing WARCs, the response record type is chosen as the core record that all others are associated with: i.e. all others have a Related-Record-ID that points back to the response.

Deviations from Spec.

The below deviations from spec. 0.9 have been realized in code and are to be proposed as spec. amendments with new revision likely to be 0.10 (Vocal assent was given by John, Gordon, and Stack to the below at La Honda Meeting, August 8th, 2006).

mimetype in header line

Allow full mimetypes in the header line as per RFC2045 rather than current, shriveled mimetype that allows only type and subtype. This will mean mimetypes are allowed parameters: e.g. text/plain; charset=UTF-8 or application/http; msgtype=request. Allowing full mimetypes, we can support the following scenarios without further amendment to specification and without parsers having to resort to metadata records or to custom Named Fields to figure how to interpret payload:

  • Consider the case where an archiving organization would store all related to a capture as one record with a mimetype of multipart/mixed; boundary=RECORD-ID. An example record might comprise the parts Content-Type: application/http; msgtype=request, Content-Type: application/http; msgtype=response, and Content-Type: text/xml+rdf (For metadata).
  • Or, an archiving institution would store a capture with multipart/alternatives ranging from most basic (or 'desiccated' in Kunze-speak) -- perhaps a text/plain rendition of a PDF capture -- through to best, the actual PDF binary itself.

To support full mimetypes, we must allow for whitespace between parameters and allow that parameter values themselves might include whitespace ('quoted-string'). The WARC Writer converts any embedded carriage returns and newlines to a single space.
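
The line-break handling described above might look like the following sketch. The helper name is hypothetical, and treating a run of consecutive CR/LF characters as one unit (so that "\r\n" becomes a single space) is an assumption; the actual writer code may differ.

```java
/** Hypothetical helper; illustrates the CR/LF-to-space conversion only. */
final class MimetypeSketch {

    /** Replaces each run of carriage returns and newlines with one space. */
    static String stripLineBreaks(String mimetype) {
        return mimetype.replaceAll("[\\r\\n]+", " ");
    }
}
```

With this, a folded value such as "text/plain;\r\ncharset=UTF-8" fits on the one-line header as "text/plain; charset=UTF-8".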

Swap position of recordid and mimetype in the header line

Because the above amendment allows full mimetypes on the header line, and the mimetype may now include whitespace, we ease the parse by moving the mimetype to the last position on the header line and the recordid to second-from-last.
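
A short sketch shows why putting the mimetype last eases the parse: the leading fields can be split on whitespace, and whatever remains is the mimetype, spaces and all. The seven-field layout used here (version, length, record type, subject URI, creation date, record ID, mimetype) and the sample values are assumptions for illustration; only the relative position of the last two fields comes from the text above.

```java
/** Hypothetical header-line splitter; the field layout is assumed. */
final class HeaderLineSketch {
    // Assumed number of fields; the last one absorbs any remaining whitespace.
    static final int FIELD_COUNT = 7;

    static String[] parse(String headerLine) {
        // The limit stops splitting after six separators, so the final
        // element keeps the full mimetype, including its parameters.
        String[] fields = headerLine.trim().split("\\s+", FIELD_COUNT);
        if (fields.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "Expected " + FIELD_COUNT + " fields: " + headerLine);
        }
        return fields;
    }
}
```

Had the mimetype stayed in the middle of the line, a parser would need to understand mimetype parameter syntax just to find where the next field starts.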

Use application/http instead of message/http

The message type has a maximum line length of 1000 characters absent a Content-Transfer-Encoding header set to BINARY. (See the definition of message/http for talk of adherence to MIME message line limits: 19.1 Internet Media Type message/http and application/http in RFC2616).

Suggested Spec. Amendments

Apart from the above listed deviations, the below changes are also suggested for inclusion in 0.10 spec. revision

Below are mostly suggested edits. Changes are not substantive.

Allow multiple instances of a single Named Parameter

Allow multiple instances of the same Named Parameter in any one Named Parameter block. E.g. multiple Related-Record-IDs could prove of use. The spec. mentions this in the 8.1 HTTP and HTTPS section, but it better belongs in the 5.2 Named Parameters preamble.

Related, add to Named Field section note on bidirectional Related-Record-ID.

Miscellaneous

LaHonda in below is reference to meeting of John, Gordon and Stack at LaHonda Cafe on 16th St., on August 8th, 2006.

  • Leave off 9.2 GZIP extra fields. Big section on implementing an option that has little to do with WARCing. AGREED at LaHonda.
  • But, we need to mark gzipped files as being WARC: i.e. that the GZIP is a member per resource. It's useful so readers know how to invoke GZIP (whether it has to be done once to get at any record or needs to be done per record). Suggest adding a GZIP extra field in the HEAD of the GZIP member that says 'WARC' (ARC has such a thing currently). NOT NECESSARY per LaHonda meeting.
  • IP-Address for dns resource is DNS Server. Add note to this effect in 8.2 DNS.
  • Section 6. is truncated -- missing text. What was intended here? SEE ISO DOC.
  • In-line ANVL definition (From Kunze). Related, can labels have CTLs such as CRLF (Shouldn't)? When says 'control-chars', does this include UNICODE control characters (Should)? CHAR is described as ASCII/UTF-8 but they are not same (Should be UTF-8). ANVL OR NOT STILL UP IN AIR AFTER LaHonda. Postpone to 0.11 revision.
  • Fix examples. Use output of experimental ARC Writer.
  • Fix ambiguity in spec. pertaining to 'smallest possible anvl-fields' noted by Mads Alhof Kristiansen in Digital Preservation using the WARC File Format.

Open Issues

Drop response record type

resource is sufficient. Let the mimetype distinguish whether a capture has response headers or not (as per the comment at the end of 8.1 HTTP and HTTPS, which allows that if there are no response headers, the resource record type and page mimetype are used rather than the response type plus a mimetype of message/http: the difference in record types is not needed to distinguish between the two types of capture).

Are there other capture methods that would require a response record, that don't have a mimetype that includes response headers and content? SMTP has rich MIME set to describe responses. Its request is pretty much unrecordable. NNTP and FTP similar. Because of rich MIME, no need of a special response type here.

Related, do we need the request record? Only makes sense for HTTP?

This proposal is contentious. Gordon drew a scenario where response would be needed to distinguish local from remote capture if an archiving institution purposefully archived without recording headers or if the payload itself was an archived record. In opposition, it was suggested that should an institution choose to capture in this 'unusual' mode, crawl metadata could be consulted to disambiguate how the capture was done (To be further investigated. In general, definition of record types is still in need of work).

subject-uri

The ISO revision suggests that the positional parameter subject-uri be renamed. Suggest record-url.

Other issues

  • Should we allow freeform creation of custom Named Fields if have a MIME-like 'X-' or somesuch prefix?
  • Nothing on header-line encoding (Section 11 says UTF-8). For completeness should be US-ASCII or UTF-8, no control-chars (especially CR or LF), etc.
  • warcinfo
    • What for a scheme? Using UUID as per G suggestion.
    • Also, how to populate description of crawl into warcinfo? 'Documentation' Named Field with list of URLs that can be assumed to exist somewhere in the current WARC set (We'd have to make the crawler go get them at start of a crawl).
    • I don't want to repeat crawl description for every WARC. How to have this warcinfo point at an original? related-record-id seems insufficient.
    • If the crawler config. changes, can I just write a warcinfo with differences? How to express? Or better as metadata about a warcinfo?
    • In the past we used to get the filename from this URL header field when we were unsure of the filename or it was unavailable (We're reading a Stream). Won't be able to do that with UUID for URL. So, introducing a new (optional) warcinfo Named Field 'Filename' that will be used when the warcinfo is put at the start of a file. Allow warcinfo to have a named parameter 'Filename'?
  • revisit
    • What to write? Use a description field or just expect this info to be present in the warcinfo? Example has request header (inside XML). Better to use associated request record for this kind of info?
    • Related-Record-ID (RRID) of original is likely an onerous requirement. Envisioning an implementation where we'd write revisit records, we'd write such a record where content was judged same or where date since last fetch had not changed. If we're to write the RRID, then we'd have to maintain table keyed by URL with value of page hash or of last modified-date plus associated RRID (actual RRID URL, not a hash).
  • Should we allow a Description Named Field? E.g. I add an order file as a metadata record and associate it with a warcinfo record. The Description field could say "This is Heritrix Order file". Same for seeds. The alternative is custom XML packaging (Scheme could describe fields such as 'order' file) or ANVL packaging using ANVL 'comments'.
  • Section 11, why was it we said we don't need a parameter or explicit subtype for the special gzip WARC format? I don't remember. A reader needs to know when it's reading a stream. A client would like to know so that it writes the stream to disk with the right suffix? Recap. (Perhaps it was looking at the MAGIC bytes -- if it starts with GZIP MAGIC and includes extra fields that denote it WARC, that's sufficient?).
  • Section 7, on truncation, on 7.1, suggest values -- 'time', 'length' -- but allow free form description? Leave off 'superior method of indicating truncation' paragraph. This qualifier could be added to all sections of doc -- that a subsequent revision of any aspect of the doc. will be superior. Rather than End-Length, like MIME, last record could have Segment-Number-Total, a count of all segments that make up complete record.

From LaHonda, discussion of revisit type. The definition was tightened somewhat by saying revisit is used when you choose not to store the capture. It was thought possible that it NOT require a pointer back to an original. Suggested it might have a similarity judgment header -- similarity-value -- with values between 0 and 1. Might also have analysis-method and description. Possible methods discussed included: URI same, length same, hash of content same, judgement based off content of HTTP HEAD request, etc. Possible payloads might be: nothing, a diff, the hash obtained, etc.
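
The bookkeeping that the Related-Record-ID item above calls onerous can be sketched as a table keyed by URL that remembers the content hash and the record ID of the stored original. Everything here (class, method, and field names, and the use of an in-memory HashMap) is hypothetical; a real crawl would likely need a persistent store.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory table for deciding when to write a revisit record. */
final class RevisitTableSketch {

    /** What is remembered about the last stored capture of a URL. */
    static final class Entry {
        final String contentHash;
        final String recordId;

        Entry(String contentHash, String recordId) {
            this.contentHash = contentHash;
            this.recordId = recordId;
        }
    }

    private final Map<String, Entry> byUrl = new HashMap<>();

    /** Remembers a capture whose content was actually stored. */
    void recordCapture(String url, String contentHash, String recordId) {
        byUrl.put(url, new Entry(contentHash, recordId));
    }

    /**
     * Returns the record ID of the stored original if the content hash is
     * unchanged (a revisit record could point at it), otherwise null.
     */
    String originalFor(String url, String contentHash) {
        Entry e = byUrl.get(url);
        return (e != null && e.contentHash.equals(contentHash)) ? e.recordId : null;
    }
}
```

The table grows with the number of distinct URLs crawled, which is the cost the item above is worried about.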

Unimplemented

  • Record Segmentation (4.8 continuation record type and the 5.2 Segment-* Named Parameters). Future TODO.
  • 4.7 conversion type. Future TODO.

TODOs

  • unit tests using multipart/* (JavaMail) reading and writing records? Try record-id as part boundary.
  • Performance: Need to add Record-based buffering. GZIP'd streams have some buffering because of the deflater but could probably do w/ more.

Java Source Files

  • ExperimentalWARCWriter.java (Class): Experimental WARC implementation. Based on unreleased version 0.9 of WARC File Format document.
  • ExperimentalWARCWriterTest.java (Class): Test Writer and Reader.
  • WARCReader.java (Class): WARCReader.
  • WARCReaderFactory.java (Class): Factory for WARC Readers.
  • WARCRecord.java (Class): A WARC file Record.
  • WARCRecordTest.java (Class)
  • WARCWriterPool.java (Class): A pool of WARCWriters.