it.unimi.dsi.mg4j.document

This package contains all the logic related to and useful for managing documents, document collections and such. Warning: We are still working on the document infrastructure. It should be pretty stable, but changes are still possible. Suggestions are welcome. Note also that most of the classes in this package should be considered examples and suggestions: while a casual user will find them invaluable in indexing data, a custom, large-scale application will usually require writing your own {@link it.unimi.dsi.mg4j.document.DocumentCollection}.

Basic interfaces

The {@link it.unimi.dsi.mg4j.document.Document} interface

MG4J aims at indexing the content of entities called documents. The main classes that describe documents and sets of documents are included in the it.unimi.dsi.mg4j.document package. In particular, documents are instances of the {@link it.unimi.dsi.mg4j.document.Document} interface. A document is characterized abstractly by the following data:

  • a title, a character sequence that represents the document; the document title is returned by the {@link it.unimi.dsi.mg4j.document.Document#title()} method;
  • a URI, that somehow characterizes the document uniquely; the document URI is returned by the {@link it.unimi.dsi.mg4j.document.Document#uri()} method;
  • a number of fields; every field is abstractly represented by a number, the field index; fields are numbered from 0, but the user should know how many fields a document exhibits, because this information is not available in the document itself. For every field, the document exhibits the following data:
    • the field content, that is an Object returned by the {@link it.unimi.dsi.mg4j.document.Document#content(int)} method; the type of object that this method returns must be known by the calling class in advance; in particular, for textual fields (see below), the content will be a {@link java.io.Reader};
    • for textual fields only, a {@linkplain it.unimi.dsi.mg4j.io.WordReader word reader}: an object that is able to split the field content (in this case: a sequence of characters) into a sequence of words; the word reader is returned by the {@link it.unimi.dsi.mg4j.document.Document#wordReader(int)} method (which must be called only for textual fields).

Users should always close a document after usage by calling the {@link it.unimi.dsi.mg4j.document.Document#close()} method: the method is responsible for releasing all resources that the document acquired while it was in use.
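The usage pattern described above can be sketched as follows. This is a minimal illustration, not MG4J code: SketchDocument is a hypothetical stand-in for {@link it.unimi.dsi.mg4j.document.Document} reduced to the methods discussed here, and toyDocument is an invented in-memory implementation. The point is only that the caller must know the field type before casting the content, and that the document is closed in a finally block.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for it.unimi.dsi.mg4j.document.Document, reduced to
// the methods discussed above; it is not the real interface.
interface SketchDocument {
    CharSequence title();
    CharSequence uri();
    Object content(int field) throws IOException;
    void close() throws IOException;
}

final class DocumentUsageSketch {
    // Reads the whole content of a field that the caller knows to be textual,
    // then closes the document whether or not reading succeeded.
    static String readTextField(SketchDocument document, int field) {
        try {
            try {
                Reader reader = (Reader) document.content(field);
                StringBuilder text = new StringBuilder();
                char[] buffer = new char[1024];
                int n;
                while ((n = reader.read(buffer)) != -1) text.append(buffer, 0, n);
                return text.toString();
            } finally {
                document.close(); // always release the document's resources
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // An invented in-memory document with a single textual field (index 0).
    static SketchDocument toyDocument(final String title, final String body) {
        return new SketchDocument() {
            public CharSequence title() { return title; }
            public CharSequence uri() { return "toy:" + title; }
            public Object content(int field) {
                if (field != 0) throw new IndexOutOfBoundsException("No field " + field);
                return new StringReader(body);
            }
            public void close() {}
        };
    }

    public static void main(String[] args) {
        System.out.println(readTextField(toyDocument("greeting", "hello world"), 0));
    }
}
```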

The {@link it.unimi.dsi.mg4j.document.DocumentFactory} interface

Documents usually do not come alone, but they are grouped into collections: documents within a collection are of the same type, and this fact explains why the document structure (number and type of fields) is not contained in the document itself. Indeed, documents are produced by document factories.

A document factory is an instance of the {@link it.unimi.dsi.mg4j.document.DocumentFactory} interface, that in particular is able to produce a document. All documents produced by the same factory are of the same kind, and exhibit the same number and type of fields. A factory gives information about the documents it produces through the following methods:

  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#numberOfFields()}: returns the number of fields contained in each document produced by this factory; recall that fields are indexed starting from 0;
  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldName(int)}: returns a mnemonic explanatory name for the given field;
  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldIndex(String)}: returns the index of a field, given its mnemonic name;
  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldType(int)}: returns (an integer representing) the type of the given field; possible types are static constants declared in the {@link it.unimi.dsi.mg4j.document.DocumentFactory.FieldType} interface; one of the possible types is {@link it.unimi.dsi.mg4j.document.DocumentFactory.FieldType#TEXT}, used for textual fields; note that the type of objects returned by the {@link it.unimi.dsi.mg4j.document.Document#content(int)} method of the {@link it.unimi.dsi.mg4j.document.Document} interface depends on the type of the field.

The abovementioned methods provide information about documents produced by the factory. The actual documents are produced by the {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap) getDocument(rawContent,metadata)} method.

This method returns a new document from the factory. The rawContent parameter is the most important one: it is a stream of bytes that the factory uses to produce the document. The factory knows how the sequence of bytes should be interpreted to produce a document of the desired kind. Note that even though the interpretation of the sequence of bytes representing the raw document content is entirely left to implementors, often you might prefer to think of the input byte sequence as a list of consecutive self-delimiting byte subsequences, one for each field: in this case, the {@link java.io.InputStream#reset()} method of the {@link java.io.InputStream} class is used to divide the subsequences from one another.

The metadata parameter is a map providing some basic data about the document as derived by the collection. The map is a reference map with suitable {@link java.lang.Enum} keys, and as such must be queried using the keys in {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys}, or other similar factory-specific keys. For instance, the key {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#TITLE} gives a suggested title to the document (but the factory may ignore it, if it has a better way to determine a title for the document), whereas {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#URI} specifies a suggested URI for the document (which, once more, may be ignored by the factory).
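The following sketch illustrates the two roles of a factory described above: describing the fields of its documents, and deciding whether to use or ignore the suggested title. Everything in it is an assumption made for illustration: FactorySketch is not an MG4J class, its Key enum stands in for {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys}, a plain java.util.Map replaces the fastutil reference map, returning -1 for an unknown field name is a convention of this sketch, and taking the first line of the raw content as the title is an invented rule.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.EnumMap;
import java.util.Map;

final class FactorySketch {
    // Hypothetical metadata keys (stand-in for MetadataKeys).
    enum Key { TITLE, URI }

    private static final String[] FIELD_NAMES = { "title", "text" };

    static int numberOfFields() { return FIELD_NAMES.length; }

    static String fieldName(int field) { return FIELD_NAMES[field]; }

    static int fieldIndex(String name) {
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            if (FIELD_NAMES[i].equals(name)) return i;
        }
        return -1; // convention of this sketch for an unknown name
    }

    // Invented rule: the first non-blank line of the raw content is a
    // "better" title; otherwise fall back to the suggested one.
    static String titleFor(InputStream rawContent, Map<Key, Object> metadata) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(rawContent, StandardCharsets.UTF_8));
            String first = reader.readLine();
            if (first != null && !first.trim().isEmpty()) return first.trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Object suggested = metadata.get(Key.TITLE);
        return suggested == null ? "" : suggested.toString();
    }

    // Convenience wrapper that builds the stream and the metadata map.
    static String titleFor(String raw, String suggestedTitle) {
        Map<Key, Object> metadata = new EnumMap<Key, Object>(Key.class);
        metadata.put(Key.TITLE, suggestedTitle);
        return titleFor(new ByteArrayInputStream(raw.getBytes(StandardCharsets.UTF_8)), metadata);
    }

    public static void main(String[] args) {
        System.out.println(titleFor("Dune\nA desert planet.", "file-0001"));
        System.out.println(titleFor("", "file-0002"));
    }
}
```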

Usually, a factory is built using a list of properties that define default values for metadata such as charset encoding or MIME types. These properties can be passed in several ways, and usually the main method of a collection provides an option (typically, -p) that lets the user specify default metadata for the factory. The property resolution algorithm is explained in the documentation for {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory}.
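The basic idea of a default value can be sketched as below: a per-document metadata value wins, and a factory-wide default is used only when the metadata lacks the key. This is a deliberate simplification and not the actual resolution algorithm, which is documented in {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory}. The Key enum, the lower-case property names and the helper methods are all hypothetical.

```java
import java.util.EnumMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

final class MetadataDefaultsSketch {
    // Hypothetical metadata keys for this sketch.
    enum Key { ENCODING, MIMETYPE }

    // Per-document metadata first; otherwise the factory-wide default
    // (stored under an assumed lower-case property name), or null.
    static String resolve(Key key, Map<Key, Object> metadata, Properties defaults) {
        Object perDocument = metadata.get(key);
        if (perDocument != null) return perDocument.toString();
        return defaults.getProperty(key.name().toLowerCase(Locale.ROOT));
    }

    // Convenience wrapper; a null perDocumentEncoding means "not provided".
    static String resolveEncoding(String perDocumentEncoding, String defaultEncoding) {
        Map<Key, Object> metadata = new EnumMap<Key, Object>(Key.class);
        if (perDocumentEncoding != null) metadata.put(Key.ENCODING, perDocumentEncoding);
        Properties defaults = new Properties();
        defaults.setProperty("encoding", defaultEncoding);
        return resolve(Key.ENCODING, metadata, defaults);
    }

    public static void main(String[] args) {
        System.out.println(resolveEncoding(null, "UTF-8"));
        System.out.println(resolveEncoding("ISO-8859-1", "UTF-8"));
    }
}
```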

The {@link it.unimi.dsi.mg4j.document.DocumentSequence} interface

Up to this point, we have treated documents and document factories in a very abstract manner, and we have paid no attention to the way the byte sequences representing the raw data are produced.

Basically, a source of documents is a {@link it.unimi.dsi.mg4j.document.DocumentSequence}. More precisely, instances of this interface represent sources that are able to generate documents. Typically, a document sequence is able to produce a stream of documents, one after another, through a special kind of iterator, called {@link it.unimi.dsi.mg4j.document.DocumentIterator}, returned by the {@link it.unimi.dsi.mg4j.document.DocumentSequence#iterator()} method.

A document iterator is not really a Java {@link java.util.Iterator}: it is simply a class that exposes a method {@link it.unimi.dsi.mg4j.document.DocumentIterator#nextDocument()} that returns the next document, if any, or null when there are no more documents. Thus, document iterators can be lazy, which is preferable in several circumstances (e.g., documents coming from an input stream).
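The resulting loop idiom can be sketched as follows. SketchDocumentIterator is a hypothetical stand-in for {@link it.unimi.dsi.mg4j.document.DocumentIterator}, and documents are reduced to plain strings; only the "call nextDocument() until it returns null" contract is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class IterationSketch {
    // Hypothetical stand-in for DocumentIterator; documents are just strings here.
    interface SketchDocumentIterator {
        String nextDocument(); // next document, or null when exhausted
    }

    // Wraps any source lazily: nothing is fetched until nextDocument() is called.
    static SketchDocumentIterator over(final Iterator<String> source) {
        return new SketchDocumentIterator() {
            public String nextDocument() {
                return source.hasNext() ? source.next() : null;
            }
        };
    }

    // The typical consumption loop: there is no hasNext(), only a null check.
    static List<String> drain(SketchDocumentIterator iterator) {
        List<String> seen = new ArrayList<String>();
        String document;
        while ((document = iterator.nextDocument()) != null) seen.add(document);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(drain(over(Arrays.asList("first", "second").iterator())));
    }
}
```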

The {@link it.unimi.dsi.mg4j.document.DocumentCollection} interface

In some cases, the documents only appear as an uninterrupted stream and applications do not have direct access to single documents; in particular, it might be the case that documents just disappear after being enumerated (as happens when the document source is standard input). In all such situations, a {@link it.unimi.dsi.mg4j.document.DocumentSequence} provides the only way to get the documents, because it only guarantees sequential access.

Nonetheless, there are other cases where documents can be easily accessed in a direct fashion, and can be read many times (for example, when documents are files in the file system). In such cases, {@link it.unimi.dsi.mg4j.document.DocumentCollection} (an extension of {@link it.unimi.dsi.mg4j.document.DocumentSequence}) can be used.

Apart from the methods of a document sequence, a document collection provides the following additional access methods to the documents:

  • {@link it.unimi.dsi.mg4j.document.DocumentCollection#size()}, that returns the collection size, i.e., the number of documents in the collection;
  • {@link it.unimi.dsi.mg4j.document.DocumentCollection#document(int)}, that returns the document with given pointer; the document pointer is an integer representing uniquely a document within the collection: the i-th document produced by the collection's {@linkplain it.unimi.dsi.mg4j.document.DocumentCollection#iterator() document iterator} has pointer i−1 (so, document pointers range from 0, inclusive, to size(), exclusive);
  • {@link it.unimi.dsi.mg4j.document.DocumentCollection#stream(int)}, that returns the {@link java.io.InputStream} of raw data that this collection would use to produce the document with given pointer.
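The relation between sequential and random access stated above (the i-th document produced by the iterator has pointer i−1) can be sketched with a toy in-memory collection. CollectionSketch is invented for this illustration: documents are plain strings, and a java.util.Iterator is used for brevity even though a real document iterator is not one.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Iterator;

final class CollectionSketch {
    private final String[] documents;

    CollectionSketch(String... documents) { this.documents = documents.clone(); }

    // Number of documents; valid pointers are 0 (inclusive) to size() (exclusive).
    int size() { return documents.length; }

    // Random access by document pointer.
    String document(int pointer) {
        if (pointer < 0 || pointer >= size()) {
            throw new IndexOutOfBoundsException("No document with pointer " + pointer);
        }
        return documents[pointer];
    }

    // The raw bytes from which the document with the given pointer would be built.
    InputStream stream(int pointer) {
        return new ByteArrayInputStream(document(pointer).getBytes(StandardCharsets.UTF_8));
    }

    Iterator<String> iterator() { return Arrays.asList(documents).iterator(); }

    // Checks that the i-th document of the iteration has pointer i-1.
    boolean sequentialMatchesRandomAccess() {
        int pointer = 0;
        for (Iterator<String> it = iterator(); it.hasNext(); pointer++) {
            if (!it.next().equals(document(pointer))) return false;
        }
        return pointer == size();
    }

    public static void main(String[] args) {
        CollectionSketch collection = new CollectionSketch("a", "b", "c");
        System.out.println(collection.document(collection.size() - 1));
        System.out.println(collection.sequentialMatchesRandomAccess());
    }
}
```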

After a document collection has been created, for example, starting from a set of files in the filesystem, it can usually be saved (serialized) on a file: the extension used for the filename is, by default, {@link it.unimi.dsi.mg4j.document.DocumentCollection#DEFAULT_EXTENSION}. Implementors of this interface should always specify explicitly which assumptions on the existence of external data are made for the consistency of a collection to be preserved. For example, a collection produced from a set of files will remain consistent as long as no file is changed or deleted; if that happens, the collection usually becomes inconsistent, and in any case you should expect that indices built from it will no longer match the content of the collection.

Note that a document collection has very weak requirements (and thus very weak obligations) on the concurrent creation of several objects (documents, iterators, etc.). Please read carefully the {@link it.unimi.dsi.mg4j.document.DocumentCollection class description}.

Relations between document sequences/collections and document factories

As we explained above, document sequences/collections extract raw data (byte sequences) from some source and use a specified document factory to turn such data into documents. Hence, there exists a tight connection between each document sequence and the document factory it uses.

Typically, the document factory is provided to the document sequence at construction time, and this fact provides a form of flexibility, because different sources (e.g., the file system and the standard input) may be coupled with the same document factory (e.g., a document factory parsing HTML documents into text), or, conversely, the same document source may be used to produce documents with different formats.

Users should always be careful, however. Often, document sequences make assumptions about the factory they use, which reduces the number of possible combinations the user may adopt. Implementors of the {@link it.unimi.dsi.mg4j.document.DocumentSequence} interface should always clarify all the assumptions they make about the factories that can be used for the sequence.

Document factories

Recall that a document factory is an object that is capable of producing homogeneous documents (documents with the same number/type of fields). Every document is produced starting from a raw bytestream.

The {@link it.unimi.dsi.mg4j.document.IdentityDocumentFactory}

The simplest possible document factory is the {@link it.unimi.dsi.mg4j.document.IdentityDocumentFactory}: this factory produces documents with a single textual field, called text, that is actually obtained by transforming the byte sequence into a sequence of characters, using some default encoding. Actually, a document factory must also provide a way to break the text into words. With this aim, the identity factory may be provided with a {@link java.util.Locale} that is used to determine how words are best broken in the given locale's language.
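The two steps just described (decoding bytes into characters, then breaking the text into words in a locale-aware way) can be sketched with JDK classes only. This is an analogy, not the MG4J implementation: MG4J splits text with a {@linkplain it.unimi.dsi.mg4j.io.WordReader word reader}, whereas the sketch uses java.text.BreakIterator, and the rule "keep a piece if it starts with a letter or digit" is an assumption of this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class IdentityTextSketch {
    // Decodes the raw bytes with the given encoding and returns the words
    // found by the locale-specific word-boundary rules of the JDK.
    static List<String> words(byte[] raw, Charset encoding, Locale locale) {
        String text = new String(raw, encoding);
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> words = new ArrayList<String>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE; start = end, end = boundaries.next()) {
            String piece = text.substring(start, end);
            // Skip whitespace and punctuation segments (sketch-specific rule).
            if (Character.isLetterOrDigit(piece.codePointAt(0))) words.add(piece);
        }
        return words;
    }

    public static void main(String[] args) {
        byte[] raw = "Hello, world!".getBytes(StandardCharsets.UTF_8);
        System.out.println(words(raw, StandardCharsets.UTF_8, Locale.ENGLISH));
    }
}
```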

Other examples of document factories

Other implementations of document factories that are provided with MG4J are:

  • {@link it.unimi.dsi.mg4j.document.HtmlDocumentFactory}: a factory that produces documents by parsing HTML streams. Bytes are converted into characters using a specified encoding; the resulting HTML character sequence is parsed to extract text (that is returned as a textual field named text) and title (the HTML title element, returned as a textual field named title). The title, if present, is also used as document title (otherwise, the suggested title is used). Note that a document collection might have information about the charset encoding (e.g., by means of HTTP headers): in this case the collection should pass this information through the metadata parameter of {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap)}.
  • {@link it.unimi.dsi.mg4j.document.PdfDocumentFactory}: a factory used to parse PDF documents. Currently, a single textual field named text is exported.

Composing document factories

As we said, many document factories interpret the raw content data (an {@link java.io.InputStream}, i.e., a sequence of bytes) as if it were a concatenation of many {@link java.io.InputStream}s, where each stream is typically parsed to a field; to pass from one stream to the next, the {@link java.io.InputStream#reset()} method is called.

Suppose you have n document factories D1, …, Dn, with f1, …, fn fields, respectively. One may want to build a new factory with f1+…+fn fields, where each document is produced by composing the document factories D1, …, Dn sequentially: in other words, the raw data are first passed to the first factory (that extracts f1 fields, typically resetting the stream as many times), then to the second factory (that extracts f2 fields), and so on.

The {@link it.unimi.dsi.mg4j.document.CompositeDocumentFactory} does the job, and also allows one to change the field names (that are otherwise named as they were in the subfactories).
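The field numbering implied by sequential composition is plain arithmetic, and can be sketched as follows. The class and method names are invented, and nothing here is claimed about how the composite factory is implemented internally: the sketch only shows how a field index of the composite (from 0 to f1+…+fn, exclusive) corresponds to a subfactory and a field index within it.

```java
import java.util.Arrays;

final class CompositeFieldSketch {
    // Total number of fields of the composite: f1 + ... + fn.
    static int totalFields(int[] fieldCounts) {
        int total = 0;
        for (int count : fieldCounts) total += count;
        return total;
    }

    // Maps a composite field index to { subfactory index, local field index }.
    static int[] locate(int[] fieldCounts, int compositeField) {
        if (compositeField >= 0) {
            int offset = 0; // number of fields belonging to earlier subfactories
            for (int i = 0; i < fieldCounts.length; i++) {
                if (compositeField < offset + fieldCounts[i]) {
                    return new int[] { i, compositeField - offset };
                }
                offset += fieldCounts[i];
            }
        }
        throw new IndexOutOfBoundsException("No field " + compositeField);
    }

    public static void main(String[] args) {
        int[] counts = { 2, 1, 3 }; // three subfactories with 2, 1 and 3 fields
        System.out.println(totalFields(counts));
        System.out.println(Arrays.toString(locate(counts, 2)));
        System.out.println(Arrays.toString(locate(counts, 5)));
    }
}
```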

The class {@link it.unimi.dsi.mg4j.io.MultipleInputStream} is a useful tool to produce raw data for composite factories: it allows one to convert an array of input streams into a single input stream: each time the resulting stream is reset, the multiple input stream will offer you the next stream in the array.
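The behaviour described for the multiple input stream can be imitated in a few lines. MultiStreamSketch below is a hypothetical stand-in written for this illustration, not the MG4J class; in particular, what happens when reset() is called after the last stream (here: the stream simply stays at end of file) is a choice of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class MultiStreamSketch extends InputStream {
    private final InputStream[] streams;
    private int current = 0; // index of the stream currently being read

    MultiStreamSketch(InputStream... streams) { this.streams = streams.clone(); }

    @Override
    public int read() throws IOException {
        // End of the current stream looks like end of file until reset() is called.
        return current < streams.length ? streams[current].read() : -1;
    }

    @Override
    public void reset() {
        // Instead of rewinding, move on to the next stream in the array.
        if (current < streams.length) current++;
    }

    // Reads single-byte characters until the stream reports end of file.
    static String readToEnd(InputStream in) {
        try {
            StringBuilder text = new StringBuilder();
            for (int b; (b = in.read()) != -1; ) text.append((char) b);
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One raw "document" made of two per-field streams.
    static String[] demo() {
        MultiStreamSketch raw = new MultiStreamSketch(
                new ByteArrayInputStream("Dune".getBytes(StandardCharsets.US_ASCII)),
                new ByteArrayInputStream("Arrakis".getBytes(StandardCharsets.US_ASCII)));
        String first = readToEnd(raw);
        raw.reset(); // pass to the next field's stream
        String second = readToEnd(raw);
        return new String[] { first, second };
    }

    public static void main(String[] args) {
        for (String field : demo()) System.out.println(field);
    }
}
```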

A special form of composite document factory is obtained using {@link it.unimi.dsi.mg4j.document.ReplicatedDocumentFactory}, that allows one to compose sequentially the same document factory with itself a certain number of times.

Document collections and sequences

The {@link it.unimi.dsi.mg4j.document.InputStreamDocumentSequence}

This is the simplest kind of document sequence: it just breaks a single {@link java.io.InputStream} on the basis of a given separator character; each piece of the stream is interpreted as the raw data corresponding to a document, and it is passed to a factory (specified at construction time) for converting it into a {@link it.unimi.dsi.mg4j.document.Document}.
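Breaking a stream at a separator can be sketched as below. The class is invented for illustration and makes two assumptions that the description above does not settle: a trailing piece with no final separator is still emitted if it is non-empty, and each piece is decoded as UTF-8 text for display, whereas a real sequence would hand the raw bytes to its factory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SeparatorSplitSketch {
    // Splits the stream into pieces at each occurrence of the separator byte.
    static List<String> split(InputStream in, int separator) {
        List<String> pieces = new ArrayList<String>();
        ByteArrayOutputStream piece = new ByteArrayOutputStream();
        try {
            for (int b; (b = in.read()) != -1; ) {
                if (b == separator) {
                    pieces.add(new String(piece.toByteArray(), StandardCharsets.UTF_8));
                    piece.reset(); // start collecting the next document
                } else {
                    piece.write(b);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Sketch-specific choice: keep a non-empty unterminated last piece.
        if (piece.size() > 0) pieces.add(new String(piece.toByteArray(), StandardCharsets.UTF_8));
        return pieces;
    }

    // Convenience wrapper for in-memory text with an ASCII separator.
    static List<String> split(String text, char separator) {
        return split(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)), separator);
    }

    public static void main(String[] args) {
        System.out.println(split("first doc\nsecond doc\nthird doc", '\n'));
    }
}
```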

The {@link it.unimi.dsi.mg4j.document.FileSetDocumentCollection}

This kind of collection is built starting from a set of files in the file system. Each file is interpreted as a document, and passed to a factory (specified at construction time). The suggested title for a document is the corresponding filename, and the suggested URI is the URI of the file.

The {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} facility

There are cases in which one would like to turn a document sequence into a document collection. This may happen for one of the following reasons:

  • the sequence is, by its very nature, volatile (e.g., it is coming from standard input, and cannot be re-produced), but we would like to make it into a resident non-volatile collection;
  • the sequence is not amenable to be accessed at random;
  • the documents in the sequence are difficult to parse, and it is not advisable to repeat the parsing process every time they are accessed.

In all such cases, it may be advisable to produce a compact copy of the sequence that is easily and efficiently accessible at random. To do this, one may use the {@link it.unimi.dsi.mg4j.document.ZipDocumentCollectionBuilder}, that takes a document sequence and produces a "zipped clone" of the documents in the sequence: there are some mild limitations to the sequences that can be used in this context, and the resulting collection is only a partial copy of the original one, but in most cases this is sufficient for all indexing purposes. The builder will save two files: one contains the essential data concerning the zipped collection, and the other contains the zipped version of the documents.

After this, the produced {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} may be used as any other collection.
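The general idea of turning a one-pass sequence into a compact, randomly accessible copy can be sketched with java.util.zip. This is only an analogy under stated assumptions: the file layout actually written by the builder is not described here, so the sketch invents its own (one zip entry per document, named after the document pointer), reduces documents to strings, and writes to a temporary file.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipCloneSketch {
    // Consumes the documents once, in order, and stores each under its pointer.
    static File build(List<String> documents) {
        try {
            File zip = File.createTempFile("collection", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
                for (int pointer = 0; pointer < documents.size(); pointer++) {
                    out.putNextEntry(new ZipEntry(Integer.toString(pointer)));
                    out.write(documents.get(pointer).getBytes(StandardCharsets.UTF_8));
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Random access: fetches a single document without reading the others.
    static String document(File zip, int pointer) {
        try (ZipFile file = new ZipFile(zip)) {
            ZipEntry entry = file.getEntry(Integer.toString(pointer));
            if (entry == null) throw new IndexOutOfBoundsException("No document " + pointer);
            try (InputStream in = file.getInputStream(entry)) {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                byte[] buffer = new byte[1024];
                for (int n; (n = in.read(buffer)) != -1; ) bytes.write(buffer, 0, n);
                return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File zip = build(Arrays.asList("first", "second", "third"));
        System.out.println(document(zip, 1));
    }
}
```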

Java source files (name, type, comment)

AbstractDocument.java (Class): An abstract, {@linkplain it.unimi.dsi.io.SafelyCloseable safely closeable} implementation of a document.
AbstractDocumentCollection.java (Class): An abstract, {@linkplain it.unimi.dsi.io.SafelyCloseable safely closeable} implementation of a document collection.
AbstractDocumentFactory.java (Class): An abstract implementation of a factory, providing a protected method to check for field indices.
AbstractDocumentIterator.java (Class): An abstract, {@linkplain it.unimi.dsi.io.SafelyCloseable safely closeable} implementation of a document iterator.
AbstractDocumentSequence.java (Class): An abstract, {@linkplain it.unimi.dsi.io.SafelyCloseable safely closeable} implementation of a document sequence.
CompositeDocumentFactory.java (Class): A composite factory that passes the input stream to a sequence of factories in turn.

Factories can be composed.

CompositeDocumentSequence.java (Class): A document sequence composing a list of underlying sequences.

An instance of this class exposes documents formed by juxtaposition of the content of the underlying document sequences.

CSVDocumentCollection.java (Class): A {@link it.unimi.dsi.mg4j.document.DocumentCollection} corresponding to a given set of records in a comma-separated file.
DispatchingDocumentFactory.java (Class): A document factory that actually dispatches the task of building documents to various factories according to some strategy.

The strategy is specified as (an object embedding) a method that determines which factory should be used on the basis of the metadata that are provided to the {@link it.unimi.dsi.mg4j.document.DispatchingDocumentFactory#getDocument(InputStream,Reference2ObjectMap)} method.

Document.java (Interface): An indexable document.

An instance of this class represents a single document.

DocumentCollection.java (Interface): A collection of documents.

Classes implementing this interface have additional responsibilities w.r.t.

DocumentFactory.java (Interface): A factory parsing and building documents of the same type.

Each document produced by the same factory has a number of fields, which represent units of information that should be indexed separately.

DocumentIterator.java (Interface): An iterator over documents.

This interface provides a {@link it.unimi.dsi.mg4j.document.DocumentIterator#nextDocument()} method returning the next document, or null if no more documents are available.

DocumentSequence.java (Interface): A sequence of documents.

This is the most basic class available in MG4J for representing a sequence of documents to be indexed.

FileSetDocumentCollection.java (Class): A {@link it.unimi.dsi.mg4j.document.DocumentCollection} corresponding to a given set of files.

This class provides a main method with a flexible syntax that serialises into a document collection a list of files given on the command line or piped into standard input.

HtmlDocumentFactory.java (Class): A factory that provides fields for body and title of HTML documents.
IdentityDocumentFactory.java (Class): A factory that provides a single field containing just the raw input stream; the encoding is set using the property {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#ENCODING}.
InputStreamDocumentSequence.java (Class): A document sequence obtained by breaking an input stream at a specified separator.
JavamailDocumentCollection.java (Class): A {@link it.unimi.dsi.mg4j.document.DocumentCollection} corresponding to a Javamail {@link javax.mail.Store}.

This class is very simple: for instance, it will not correctly understand multipart MIME messages, which will be seen as having no content.

JdbcDocumentCollection.java (Class): A {@link it.unimi.dsi.mg4j.document.DocumentCollection} corresponding to the result of a query in a relational database.

An instance of this class is based on a query.

PdfDocumentFactory.java (Class): A factory that converts PDF (Portable Document Format) documents into text. Presently this class is very inefficient; it is mainly useful for debugging and exemplification purposes.
PropertyBasedDocumentFactory.java (Class): A document factory initialised by default properties.

Many document factories need a number of default values that are used when the metadata passed to {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,Reference2ObjectMap)} is not sufficient or lacks some key.

ReplicatedDocumentFactory.java (Class): A factory that replicates a given factory several times.
TRECDocumentCollection.java (Class): A collection for the TREC GOV2 data set.

The documents are stored as a set of descriptors, representing the (possibly gzipped) file they are contained in and the start and stop position in that file.

TRECHeaderDocumentFactory.java (Class): A factory without fields that is used to interpret the header of a TREC GOV2 document.
ZipDocumentCollection.java (Class): A {@link it.unimi.dsi.mg4j.document.DocumentCollection} produced from a document sequence using {@link it.unimi.dsi.mg4j.document.ZipDocumentCollectionBuilder}.

The collection will produce the same documents as the original sequence whence it was produced, in the following sense:

  • the resulting collection has as many documents as the original sequence, in the same order, with the same titles and URIs;
  • every document has the same number of fields, with the same names and types;
  • non-textual non-virtual fields will be written out as objects, so they need to be serializable;
  • virtual fields will be written as a sequence of strings starting with the number of fragments (converted into a string with {@link String#valueOf(int)}), followed by a pair of strings for each fragment (the first string being the document specifier, and the second being the associated text);
  • textual fields will be written out in such a way that, when reading them, the same sequence of words and non-words will be produced; alternatively, one may produce a collection that only copies words (non-words are not copied).
ZipDocumentCollectionBuilder.java (Class): A builder to create {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection}s.

After creating an instance of this class, it is possible to add incrementally new documents.
