mg4j

Java Source Code / Java Documentation
1. 6.0 JDK Core
2. 6.0 JDK Modules
3. 6.0 JDK Modules com.sun
4. 6.0 JDK Modules com.sun.java
5. 6.0 JDK Modules sun
6. 6.0 JDK Platform
7. Ajax
8. Apache Harmony Java SE
9. Aspect oriented
10. Authentication Authorization
11. Blogger System
12. Build
13. Byte Code
14. Cache
15. Chart
16. Chat
17. Code Analyzer
18. Collaboration
19. Content Management System
20. Database Client
21. Database DBMS
22. Database JDBC Connection Pool
23. Database ORM
24. Development
25. EJB Server geronimo
26. EJB Server GlassFish
27. EJB Server JBoss 4.2.1
28. EJB Server resin 3.1.5
29. ERP CRM Financial
30. ESB
31. Forum
32. GIS
33. Graphic Library
34. Groupware
35. HTML Parser
36. IDE
37. IDE Eclipse
38. IDE Netbeans
39. Installer
40. Internationalization Localization
41. Inversion of Control
42. Issue Tracking
43. J2EE
44. JBoss
45. JMS
46. JMX
47. Library
48. Mail Clients
49. Net
50. Parser
51. PDF
52. Portal
53. Profiler
54. Project Management
55. Report
56. RSS RDF
57. Rule Engine
58. Science
59. Scripting
60. Search Engine
61. Security
62. Sevlet Container
63. Source Control
64. Swing Library
65. Template Engine
66. Test Coverage
67. Testing
68. UML
69. Web Crawler
70. Web Framework
71. Web Mail
72. Web Server
73. Web Services
74. Web Services apache cxf 2.0.1
75. Web Services AXIS2
76. Wiki Engine
77. Workflow Engines
78. XML
79. XML UI
Java
Java Tutorial
Java Open Source
Jar File Download
Java Articles
Java Products
Java by API
Photoshop Tutorials
Maya Tutorials
Flash Tutorials
3ds-Max Tutorials
Illustrator Tutorials
GIMP Tutorials
C# / C Sharp
C# / CSharp Tutorial
C# / CSharp Open Source
ASP.Net
ASP.NET Tutorial
JavaScript DHTML
JavaScript Tutorial
JavaScript Reference
HTML / CSS
HTML CSS Reference
C / ANSI-C
C Tutorial
C++
C++ Tutorial
Ruby
PHP
Python
Python Tutorial
Python Open Source
SQL Server / T-SQL
SQL Server / T-SQL Tutorial
Oracle PL / SQL
Oracle PL/SQL Tutorial
PostgreSQL
SQL / MySQL
MySQL Tutorial
VB.Net
VB.Net Tutorial
Flash / Flex / ActionScript
VBA / Excel / Access / Word
XML
XML Tutorial
Microsoft Office PowerPoint 2007 Tutorial
Microsoft Office Excel 2007 Tutorial
Microsoft Office Word 2007 Tutorial
Java Source Code / Java Documentation » Search Engine » mg4j 
MG4J Search Engines
License:GNU Library or Lesser General Public License (LGPL)
URL:http://mg4j.dsi.unimi.it/
Description:
Package NameComment
it.unimi.dsi.mg4j.compression
it.unimi.dsi.mg4j.document This package contains all the logics related to and useful for managing documents, document collections and such. Warning: We are still working on the document infrastructure. It should be pretty stable, but changes should not be unexpected. Suggestions are welcome. Note also that most of the classes in this package should be considered examples and suggestions: while a casual user will find them invaluable in indexing data, a custom, large-scale application will usually require writing your own {@link it.unimi.dsi.mg4j.document.DocumentCollection}.

Basic interfaces

The {@link it.unimi.dsi.mg4j.document.Document} interface

MG4J aims at indexing the content of entities called documents. The main classes that describe documents and sets of documents are included in the it.unimi.dsi.mg4j.document package. In particular, documents are instances of the {@link it.unimi.dsi.mg4j.document.Document} interface. A document is characterized abstractly by the following data:

  • a title, a character sequence that represents the document; the document title is returned by the {@link it.unimi.dsi.mg4j.document.Document#title()} method;
  • a URI, that somehow characterizes the document uniquely; the document URI is returned by the {@link it.unimi.dsi.mg4j.document.Document#uri()} method;
  • a number of fields; every field is abstractly represented by a number, the field index; fields are numbered from 0, but the user should know how many fields a document exhibits, because this information is not available in the document itself. For every field, the document exhibits the following data:
    • the field content, that is an Object returned by the {@link it.unimi.dsi.mg4j.document.Document#content(int)} method; the type of object that this method returns must be known by the calling class in advance; in particular, for textual fields (see below), the content will be a {@link java.io.Reader};
    • for textual fields only, a {@linkplain it.unimi.dsi.mg4j.io.WordReader word reader}: an object that is able to split the field content (in this case: a sequence of characters) into a sequence of words; the word reader is returned by the {@link it.unimi.dsi.mg4j.document.Document#wordReader(int)} method (which must be called only for textual fields).

Users should always close a document after usage by calling the {@link it.unimi.dsi.mg4j.document.Document#close()} method: the method is responsible for relinquishing all resources that a document instantiated for its very existence.

The {@link it.unimi.dsi.mg4j.document.DocumentFactory} interface

Documents usually do not come alone, but they are grouped into collections: documents within a collection are of the same type, and this fact explains why the document structure (number and type of fields) are not contained in the document itself. Indeed, documents are produced by document factories.

A document factory is an instance of the {@link it.unimi.dsi.mg4j.document.DocumentFactory} interface, that in particular is able to produce a document. All documents produced by the same factory are of the same kind, and exhibit the same number and type of fields. A factory gives information about the documents it produces through the following methods:

  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#numberOfFields()}: returns the number of fields contained in each document produced by this factory; recall that fields are indexed starting from 0;
  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldName(int)}: returns a mnemonic explanatory name for the given field;
  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldIndex(String)}: returns the index of a field, given its mnemonic name;
  • {@link it.unimi.dsi.mg4j.document.DocumentFactory#fieldType(int)}: returns (an integer representing) the type of the given field; possible types are static constants declared in the {@link it.unimi.dsi.mg4j.document.DocumentFactory.FieldType} interface; one of the possible types is {@link it.unimi.dsi.mg4j.document.DocumentFactory.FieldType#TEXT}, used for textual fields; note that the type of objects returned by the {@link it.unimi.dsi.mg4j.document.Document#content(int)} method of the {@link it.unimi.dsi.mg4j.document.Document} interface depend on the type of the field.

The abovementioned methods provide information about documents produced by the factory. The actual documents are produced by the {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap) getDocument(rawContent,metadata)} method.

This method returns a new document from the factory. The rawContent parameter is the most important one: it is a stream of bytes that the factory uses to produce the document. The factory knows how the sequence of bytes should be interpreted to produce a document of the desired kind. Note that even though the interpretation of the sequence of bytes representing the raw document content is entirely left to implementors, often you might prefer to think of the input byte sequence as of a list of consecutive self-delimiting byte subsequences, one for each field: in this case, the {@link java.io.InputStream#reset()} method of the {@link java.io.InputStream} class is used to divide the subsequences from one another.

The metadata parameter is a map providing some basic data about the document as derived by the collection. The map is a reference map with suitable {@link java.util.Enum} keys, and as such must be queried using the keys in {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys}, or other similar factory-specific keys. For instance, the key {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#TITLE} gives a suggested title to the document (but the factory may ignore it, if it has a better way to determine a title for the document), whereas {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory.MetadataKeys#URI} specifies a suggested URI for the document (which, once more, may be ignored by the factory).

Usually, a factory is built using a list of properties that define default values for metadata such as charset encoding or MIME types. There properties can be passed in several ways, and usually the main method of a collection provides an option (typically, -p) that let the user specify default metadata for the factory. The property resolution algorithm is explained in the documentation for {@link it.unimi.dsi.mg4j.document.PropertyBasedDocumentFactory}.

The {@link it.unimi.dsi.mg4j.document.DocumentSequence} interface

Up to this point, we have interpreted documents and document factories in a very abstract manner, but we gave no importance on the way the byte sequences representing the raw data are produced.

Basically, a source of documents is a {@link it.unimi.dsi.mg4j.document.DocumentSequence}. More precisely, instances of this class represent sources that are able to generate documents. Typically, a document sequence is able to produce a stream of documents after one another, through a special kind of iterator, called {@link it.unimi.dsi.mg4j.document.DocumentIterator}, returned by the {@link it.unimi.dsi.mg4j.document.DocumentSequence#iterator()} method.

A document iterator is not really a Java {@link java.util.Iterator}: it is simply a class that exposes a method {@link it.unimi.dsi.mg4j.document.DocumentIterator#nextDocument()} that returns the next document, if any, or null when there are no more documents. Thus, document iterators can be lazy, which is preferrable in several circumstance (e.g., documents coming from an input stream).

The {@link it.unimi.dsi.mg4j.document.DocumentCollection} interface

In some cases, the documents only appear as an uninterrupted stream and applications do not have direct accesses to single documents; in particular, it might be the case that documents just disappear after being enumerated (as it happens when the document source is standard input). In all such situations, a {@link it.unimi.dsi.mg4j.document.DocumentSequence} provides the only way to get the documents, because it only guarantees sequential access.

Nonetheless, there are other cases where documents can be easily accessed in a direct fashion, and can be read many times (for example, when documents are files in the file system). In such cases, {@link it.unimi.dsi.mg4j.document.DocumentCollection} (an extension of {@link it.unimi.dsi.mg4j.document.DocumentSequence}) can be used.

Apart for the methods of a document sequence, a document collection provides the following additional access methods to the documents:

  • {@link it.unimi.dsi.mg4j.document.DocumentCollection#size()}, that returns the collection size, i.e., the number of documents in the collection;
  • {@link it.unimi.dsi.mg4j.document.DocumentCollection#document(int)}, that returns the document with given pointer; the document pointer is an integer representing uniquely a document within the collection: the i-th document produced by the collection's {@linkplain it.unimi.dsi.mg4j.document.DocumentCollection#iterator() document iterator} has pointer i−1 (so, document pointers range from 0, inclusive, to size(), exclusive);
  • {@link it.unimi.dsi.mg4j.document.DocumentCollection#stream(int)}, that returns the {@link java.io.InputStream} of raw data that this collection would use to produce the document with given pointer.

After a document collection has been created, for example, starting from a set of files in the filesystem, it can usually be saved (serialized) on a file: the extension used for the filename is, by default, {@link it.unimi.dsi.mg4j.document.DocumentCollection#DEFAULT_EXTENSION}. Implementors of this interface should always specify explicitly which assumptions on the existence of external data are made for the consistency of a collection to be preserved. For example, a collection produced from a set of files will be consistent until no file has been changed or deleted; if the latter situation happens, the collection usually becomes inconsistent, and in any case you might expect that the indices thereof produced will no longer match the content of the collection.

Note that a document collection has very weak requirements (and thus very weak obligations) on the concurrent creation of several objects (documents, iterators, etc.). Please read carefully the {@link it.unimi.dsi.mg4j.document.DocumentCollection class description}.

Relations between document sequences/collections and document factories

As we explained above, document sequences/collections extract raw data (byte sequences) from some source and use a specified document factory to turn such data into documents. Hence, there exists a tight connection between each document sequence and the document factory it uses.

Typically, the document factory is provided to the document sequence at construction time, and this fact provides a form of flexibility, because different sources (e.g., the file system and the standard input) may be coupled with the same document factory (e.g., a document factory parsing HTML documents into text), or, conversely, the same document source may be used to produce documents with different formats.

Users should always be careful, however. Often, document sequences make assumptions about the factory they use, which reduces the number of possible combinations the user may adopt. Implementors of the {@link it.unimi.dsi.mg4j.document.DocumentSequence} interface should always clarify all the assumptions they make about the factories that can be used for the sequence.

Document factories

Recall that a document factory is an object that is capable of producing homogeneous documents (documents with the same number/type of fields). Every document is produced starting from a raw bytestream.

The {@link it.unimi.dsi.mg4j.document.IdentityDocumentFactory}

The simplest possible document factory is the {@link it.unimi.dsi.mg4j.document.IdentityDocumentFactory}: this factory produces documents with a single textual field, called text, that is actually obtained by transforming the byte sequence into a sequence of characters, using some default encoding. Actually, a document factory must also provide a way to break the text into words. With this aim, the identity factory may be provided with a {@link java.util.Locale} that is used to determine how words are best broken in the given locale's language.

Other examples of document factories

Other implementations of document factories that are provided with MG4J are:

  • {@link it.unimi.dsi.mg4j.document.HtmlDocumentFactory}: a factory used to parse HTML documents; this factory produces documents by parsing HTML streams. Bytes are converted into characters using a specified encoding; the resulting HTML character sequence is parsed to extract text (that is returned as a textual field named text) and title (the HTML title element, returned as a textual field named title). The title, if present, is also used as document title (otherwise, the suggested title is used). Note that a document collection might have information about the charset encoding (e.g., by means of HTTP headers): in this case the metadata field of {@link it.unimi.dsi.mg4j.document.DocumentFactory#getDocument(java.io.InputStream,it.unimi.dsi.fastutil.objects.Reference2ObjectMap)} should pass this information.
  • {@link it.unimi.dsi.mg4j.document.PdfDocumentFactory}: a factory used to parse PDF documents. Currently, a single textual field named text is exported.

Composing document factories

As we said, many document factories interpret the raw content data (an {@link java.io.InputStream}, i.e., a sequence of bytes) as if it is really made by a concatenation of many {@link java.io.InputStream}s, where each stream is typically parsed to a field; to pass from one stream to the next, the {@link java.io.InputStream#reset()} method is called.

Suppose you have n document factories D1, …, Dn, with f1, …, fn fields, respectively. One may want to build a new factory with f1+…+fn fields, where each document is produced by composing the document factories D1, …, Dn sequentially: in other words, the raw data are first passed to the first factory (that extracts f1 fields, typically resetting the stream as many times), then it is passed to the second factory (that extracts f2 fields) etc.

The {@link it.unimi.dsi.mg4j.document.CompositeDocumentFactory} does the job, and also allows one to change the field names (that are otherwise named as they were in the subfactories).

The class {@link it.unimi.dsi.mg4j.io.MultipleInputStream} is a useful tool to produce raw data for composite factories: it allows one to convert an array of input streams into a single input stream: each time the resulting stream is reset, the multiple input stream will offer you the next stream in the array.

A special form of composite document factory is obtained using {@link it.unimi.dsi.mg4j.document.ReplicatedDocumentFactory}, that allows one to compose sequentially the same document factory with itself a certain number of times.

Document collections and sequences

The {@link it.unimi.dsi.mg4j.document.InputStreamDocumentSequence}

This is the simplest kind of document sequence: it just breaks a single {@link java.io.InputStream} on the basis of a given separator character; each piece of the stream is interpreted as the raw data corresponding to a document, and it is passed to a factory (specified at construction time) for converting it into a {@link it.unimi.dsi.mg4j.document.Document}.

The {@link it.unimi.dsi.mg4j.document.FileSetDocumentCollection}

This kind of collection is built starting from a set of files in the file system. Each file is interpreted as a document, and passed to a factory (specified at construction time). The suggested title for a document is the corresponding filename, and the suggested URI is the URI of the file.

The {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} facility

There are cases in which one would like to turn a document sequence into a document collection. This may happen for one of the following reasons:

  • the sequence is, by its very nature, volatile (e.g., it is coming from standard input, and cannot be re-produced), but we would like to make it into a resident non-volatile collection;
  • the sequence is not amenable to be accessed at random;
  • the documents in the sequence are difficult to parse, and it is not advisable to repeat the parsing process every time they are accessed.

In all such cases, it may be advisable to produce a compact copy of the sequence that is easily and efficiently accessible at random. To do this, one may use the {@link it.unimi.dsi.mg4j.document.ZipDocumentCollectionBuilder}, that takes a document sequence and produces a "zipped clone" of the documents in the sequence: there are some mild limitations to the sequences that can be used in this context, and the resulting collection is only a partial copy of the original one, but in most cases this is sufficient for all indexing purposes. The builder will save two files: one contains the essential data concerning the zipped collection, and the other contains the zipped version of the documents.

After this, the produced {@link it.unimi.dsi.mg4j.document.ZipDocumentCollection} may be used as any other collection.

it.unimi.dsi.mg4j.index MG4J: Managing Gigabytes for Java

Index generation and access.

This package contains the classes that handle index generation and access. The interval iterators defined in {@link it.unimi.dsi.mg4j.search} build upon the classes of this package to provide answer to queries using interval semantics, but it is also possible to access an index directly.

You can easily build indices using the tools in {@link it.unimi.dsi.mg4j.tool}. Once an index has been built, it can be opened using an {@link it.unimi.dsi.mg4j.index.Index} object, which gathers metadata that is necessary to access the index. You do not create an {@link it.unimi.dsi.mg4j.index.Index} with a constructor: rather, you use the static factory {@link it.unimi.dsi.mg4j.index.Index#getInstance(CharSequence)} (or one of its variants) to create an instance. This is necessary so that different kind of indices can be treated transparently: for example, the factory may return a {@link it.unimi.dsi.mg4j.index.cluster.IndexCluster} if the index is actually a cluster, but you do not need to know that.

From an {@link it.unimi.dsi.mg4j.index.Index}, you can easily obtain either an {@link it.unimi.dsi.mg4j.index.IndexReader}, which allows to scan sequentially or randomly the index. In turn from an {@link it.unimi.dsi.mg4j.index.IndexReader} you can obtain a {@link it.unimi.dsi.mg4j.index.IndexIterator} returning the documents containing a certain term and the position of the term within the document.

But there is more: an {@link it.unimi.dsi.mg4j.index.IndexIterator} is a kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator}, and {@link it.unimi.dsi.mg4j.search.DocumentIterator}s can be combined in several ways using the classes of the package {@link it.unimi.dsi.mg4j.search}: for instance, you can combine document iterators using AND/OR. Note that you can combine document iterators on different indices, but of course the operation is meaningful only if the two indices contain different information about the same document collection (e.g., title and main text).

More importantly, if the index is full text (the default) for each document containing the term you can get interval iterators that return intervals representing extents of text satisfying the query: for instance, in case of an AND of two terms, the intervals will contain both terms.

Structure of an inverted index

An inverted index is made by a sequence of inverted lists (one inverted list for each term). Inverted lists are made by document records: each document record contains information about the occurrences of the term within a certain document.

More precisely, each inverted list starts with a suitably encoded integer, called the frequency, which is the number of document records that will follow (i.e., the number of documents in which the term appears). After that, there are exactly as many document records as the frequency.

Each document record is made by two parts:

  • a suitably encoded integer, called the (document) pointer, which identifies the document within the collection;
  • a (possibly empty) sequence of bits, called the data; the data have no special structure per se: the only assumption is that they are a self-delimiting bit sequence (i.e., one knows when the sequence is over).

As a basic and fundamental implementation, the classes of this package provide methods that write and read document data in a default form. In this default structure, each document data is a suitable coding of a (strictly increasing) sequence of integers, that correspond to the positions where the term occurs within the document. The length of the sequence (i.e., the number of positions in at which the term appears) is called the count (it is also common to call it “within-document frequency”, but we find this usage confusing).

it.unimi.dsi.mg4j.index.cluster MG4J: Managing Gigabytes for Java

Index partitioning and clustering.

This package contains the classes that provide the infrastructure for index partitioning and clustering. The tools the actually perform partitioning can be found in {@link it.unimi.dsi.mg4j.tool}.

An index cluster is a set of local indices that are viewed as a single index. In a {@linkplain it.unimi.dsi.mg4j.index.cluster.LexicalCluster lexical cluster} each local index has a disjoint set of terms, but the document pointers contained in each local index refer to the same documents. In a {@linkplain it.unimi.dsi.mg4j.index.cluster.DocumentalCluster documental cluster} each index contains postings referring to a disjoint subset of a collection.

Clustering indices requires mapping term number and document pointers back and forth between the global index and local indices. This mapping is provided by {@linkplain it.unimi.dsi.mg4j.index.cluster.DocumentalClusteringStrategy documental clustering strategies} and {@linkplain it.unimi.dsi.mg4j.index.cluster.LexicalClusteringStrategy lexical clustering strategies}.

Clusters are often generated by partitioning an index (albeit, for instance, {@link it.unimi.dsi.mg4j.tool.Scan} produces a cluster as output of the indexing process). In this case, a {@linkplain it.unimi.dsi.mg4j.index.cluster.DocumentalPartitioningStrategy documental partitioning strategy} or a {@linkplain it.unimi.dsi.mg4j.index.cluster.LexicalPartitioningStrategy lexical partitioning strategy} explain how to divide and remap term numbers and document pointers. Of course, the clustering and partitioning strategy must be suitably matched.

it.unimi.dsi.mg4j.index.payload
it.unimi.dsi.mg4j.index.remote MG4J: Managing Gigabytes for Java Remote index classes.

Package Specification

The classes in this package are extremely experimental and need considerable rework and optimisation to be actually useful. They are distributed with MG4J because excising them from the MG4J structure is rather complicated, and because they are fully usable: feedback is more than welcome.

it.unimi.dsi.mg4j.index.snowball MG4J: Managing Gigabytes for Java

Snowball-based {@linkplain it.unimi.dsi.mg4j.index.TermProcessor term processors}.

Package Specification

This package gathers a number of algorithmic stemmer (most notably, the famous {@link it.unimi.dsi.mg4j.index.snowball.PorterStemmer}) automatically generated from scripts written in Snowball—a string language developped by Martin Porter to describe easily algorithmic stemmers. Please refer to the documentation on the Snowball site for information about the stemmers.

The code in this package is distributed under the BSD License. It is Copyright (C) 2001 Martin Porter, and (for the Java developments) Copyright (C) 2002 Richard Boulton.

it.unimi.dsi.mg4j.index.wired
it.unimi.dsi.mg4j.io MG4J: Managing Gigabytes for Java

Bit-level I/O classes.

Package Specification

The standard Java API lacks bit-level I/O classes: to this purpose, MG4J provides {@link it.unimi.dsi.mg4j.io.InputBitStream} and {@link it.unimi.dsi.mg4j.io.OutputBitStream}, which can wrap any standard Java corresponding stream and make it work at the bit level; moreover, they provide support for several useful formats (such as unary, binary, minimal binary, γ, δ and Golomb encoding).

Compression can be achieved using self-delimiting formats supported by the classes above, or also by arithmetic coding, using the classes {@link it.unimi.dsi.mg4j.io.ArithmeticCoder} and {@link it.unimi.dsi.mg4j.io.ArithmeticDecoder}. Note that arithmetic coding is not very efficient in the present implementation, as it does not allow a varying number of symbols.

Bit input and output streams offer also efficient buffering and a way to reposition the bit stream in case the underlying byte stream is a file-based stream or a {@link it.unimi.dsi.fastutil.io.RepositionableStream}.

Conventions

All coding methods work on natural numbers. The encoding of zero is very natural for some techniques, and much less natural for others. To keep methods rationally organised, all methods are able to encode any natural number. If, for instance, you want to write positive numbers in unary encoding and you do not want to waste a bit, you have to decrement them first (i.e., instead of p you must encode p−1).

it.unimi.dsi.mg4j.query MG4J: Managing Gigabytes for Java

User interfaces for querying indices.

Most classes in this package are just suggestions or experiments. The {@link it.unimi.dsi.mg4j.query.Query} tool lets you query one or more indices on the command line or possibly using a browser. The displayed results can be linked to actual data (files, collections, etc.) using items such as {@link it.unimi.dsi.mg4j.query.GenericItem} or {@link it.unimi.dsi.mg4j.query.FileSystemItem}.

it.unimi.dsi.mg4j.query.nodes MG4J: Managing Gigabytes for Java

Composite representation for queries

This package contains the classes that represent queries as an abstract syntax tree, or, in design-pattern jargon, as a composite.

Warning: Before reading this overview, the reader is encouraged to consult the first part of the {@link it.unimi.dsi.mg4j.search} package documentation, where the notion of query is briefly presented.

The basic idea is to perform a complete decoupling of the query construction and resolution into three levels:

  • String level: the query is a string, written in a suitable language where, for example, every operator has a certain syntax, uses a certain keyword etc.
  • Tree level: the query is an abstract tree, that contains one internal node for each operator, but that has nothing to do with the actual way the query was written; for example, one may want to give the user the possibility to write the query by using a visual editor, in which case there will be some way to obtain the tree representation of the query without having any string-level representation for it.
  • Iterator level: the document iterator that lazily solves the query over a certain index or set of indices; this is the semantic counterpart of a query, and it is discussed at large in the overview of the {@link it.unimi.dsi.mg4j.search} package.

This package overview shows how to pass from the string to the tree level using the built-in {@link it.unimi.dsi.mg4j.query.parser.QueryParser}, and explains what is needed to implement the tree level, as well as the basic instruments needed to convert the tree level into the iterator level.

In MG4J, a query parser is an implementation of {@link it.unimi.dsi.mg4j.query.parser.QueryParser}: essentially, an object with a method that transforms strings in {@linkplain it.unimi.dsi.mg4j.query.nodes.Query queries}. You can use any parser you like, or build your queries programmatically.

The parser provided with MG4J and whose syntax is described in the manual is currently generated using JavaCC: unfortunately, this tool produces some public classes that somehow clutter the package documentation. All the parsing logic is contained in the {@link it.unimi.dsi.mg4j.query.parser.SimpleParser} class, which is generated by the JavaCC source SimpleParser.jj.

Building a query tree out of a query string

MG4J allows one to perform queries on multiple indices; by saying so, we mean that you may have many indices concerning the same document collection, and you want to perform a query that may search on some of them. Think, for example, of a collection of emails. You might have generated a number of indices, to index their subjects, sender, recipient(s), body etc. All these indices can be thought of as individual entities, their only relation being that they index collections with the same number of documents, and that same-numbered documents conceptually "come" from the same source.

To get a query tree from a string representation, you must first build a {@link it.unimi.dsi.mg4j.query.parser.QueryParser} object. For instance, if you use the built-in {@link it.unimi.dsi.mg4j.query.parser.SimpleParser} you have to provide a map whose keys are index names (called index aliases) and with each index alias mapped to the corresponding {@link it.unimi.dsi.mg4j.index.Index}. Moreover, one of the aliases is taken to be the default index alias used for the queries.

In our example, we will assume that we have three index aliases (say, subject, from and body), and that subject is the default index alias. The parser would then be created as follows (here, we are stipulating that index with alias x has basename mail-x):

		Object2ReferenceMap<String,Index> indexAlias2Index = new Object2ReferenceOpenHashMap<String,Index>();
		Index defaultIndex;
		indexAlias2Index.put( "subject", defaultIndex = Index.getInstance( "mail-subject" ) );
		indexAlias2Index.put( "from", Index.getInstance( "mail-from" ) );
		indexAlias2Index.put( "body", Index.getInstance( "mail-body" ) );
		
		QueryParser parser = new QueryParser( indexAlias2Index.keySet(), "subject" );
		
after which you can use it for example as follows:
			Query query = parser.parse( "meeting AND body:(schedule OR urgent)" );
			DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( indexAlias2Index, defaultIndex, 0 );
			DocumentIterator it = query.accept( visitor );
			while ( it.hasNext() ) {
				System.out.println( "Document #: " + it.nextDocument() );
				System.out.print( "\tPositions:" );
				for ( Interval interval: it )
					System.out.print( " " + interval );
				System.out.println();
			}
		

Structure of a {@link it.unimi.dsi.mg4j.query.nodes.Query}

As explained above, in this package a {@link it.unimi.dsi.mg4j.query.nodes.Query} is simply an abstract tree: its leaves will correspond to ground queries, whereas internal nodes correspond to query operators.

Ground queries can be {@linkplain it.unimi.dsi.mg4j.query.nodes.Term term queries}, {@linkplain it.unimi.dsi.mg4j.query.nodes.Prefix prefix queries}, and {@linkplain it.unimi.dsi.mg4j.query.nodes.MultiTerm multiterm queries} (the latter usually generated by some query-expansion mechanism). Any other query is either a {@linkplain it.unimi.dsi.mg4j.query.nodes.Composite composite query} (e.g., a {@linkplain it.unimi.dsi.mg4j.query.nodes.And conjunctive query}, a {@linkplain it.unimi.dsi.mg4j.query.nodes.Or disjunctive query} etc.), that is, a query composed by other subqueries, or it is obtained by some other operator (e.g., a {@linkplain it.unimi.dsi.mg4j.query.nodes.LowPass low-pass query} etc.).

Every query has a method {@link it.unimi.dsi.mg4j.query.nodes.Query#accept(it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor)} a {@link it.unimi.dsi.mg4j.query.nodes.QueryBuilderVisitor} object, through which you can visit the query tree. More precisely, when the accept method is invoked on a query, a recursive visit of the query tree is performed: at every node n of type Q (Q being any class implementing {@link it.unimi.dsi.mg4j.query.nodes.Query}), the following steps are performed:

  • the visitor's visitPre(Q n) method is called; if the method returns false, the visit is interrupted; otherwise…
  • the subtrees are recursively visited, obtaining an array of results T[] x; if at a certain point during this visit some call decides that the visit should be interrupted, we stop; otherwise…
  • the visitor's visitPost(Q n, T[] x) method is called, and the result (of type T) is returned.

To make the idea of a visitor easier to understand, consider the following simple example of a visitor:

    	        static class PrinterVisitor extends AbstractQueryBuilderVisitor<String> {
                private static void appendArray( MutableString res, String t[], char sep ) {
                        for ( int i = 0; i < t.length - 1; i++ ) res.append( t[ i ] + sep );
                        res.append( t[ t.length - 1 ] );
                }
                public String[] newArray( int len ) { return new String[ len ]; }
                public String visitPost( OrderedAnd node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "OAND(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( Or node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "OR(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( And node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "AND(" );
                        appendArray( res, t, ',' );
                        res.append( ")" );
                        return res.toString();
                }
                public String visitPost( Consecutive node, String[] t ) {
                        MutableString res = new MutableString();
                        res.append( "\"" );
                        appendArray( res, t, ' ' );
                        res.append( "\"" );
                        return res.toString();
                }
                public String visitPost( Not node, String t ) {
                        return "NOT(" + t + ")";
                }
                public String visitPost( LowPass node, String t ) {
                        return t + "~" + node.k;
                }
                public String visitPost( Select node, String t ) {
                        return node.index + ":" + t;
                }
                public String visit( Term node ) { return node.term.toString(); }
                public String visit( Prefix node ) { return node.prefix + "*"; }
        }
   	

Now, you can pass an instance of this visitor to a query, and as a result get a string (linear) representation of the query itself. (Of course, there is a simple way to get the same result, that is, calling the toString() method directly, but for this example is easy enough for illustrative purposes.)

Here is an example of how the above visitor can be used:

          public static void main( String arg[] ) throws Exception {
                Query qa = new Term( "a" );
                Query qb = new Prefix( "b" );
                Query qc = new Term( "c" );
                Query qd = new Term( "d" );

                Query q = new LowPass( new And( qa, new Select( "another_index", qb ), new Or( qc, qd ) ), 20 );

                System.out.println( q.accept( new PrinterVisitor() ) );
        }
  		
that produces AND(a,another_index:b*,OR(c,d))~20.

Building a document iterator out of a query

The interface {@link it.unimi.dsi.mg4j.query.nodes.Query} and its implementations provide the basic classes for the tree-level representation of a query. Building an iterator out of it can be thought of as a process of recursive instantiation of a query tree into an iterator tree; this is actually a sort of a copy visit, and it is indeed implemented as a special kind of visitor: the {@link it.unimi.dsi.mg4j.search.DocumentIteratorBuilderVisitor}. You can create one such visitor and use it to visit a query tree, obtaining as a result a document iterator that you can use to get the results.

To be more precise, let us recall the example presented in the last part of the {@linkplain it.unimi.dsi.mg4j.search overview of the search package}:

			Index subjectIndex = Index.getInstance( "mail-subject" );
			DocumentIterator it = AndDocumentIterator.getInstance( 
				subjectIndex.documents( "meeting" ), 
				subjectIndex.documents( "schedule" ), 
				subjectIndex.documents( "monday" ) 
			);
			while ( it.hasNext() ) {
					System.out.println( "Document #: " + it.nextDocument() );
					System.out.print( "\tPositions:" );
					for ( Interval interval: it )
						System.out.print( " " + interval );
					System.out.println();
			}
		

Here is an equivalent piece of code that uses the idea of decoupling the tree level from the iterator level.

			Index subjectIndex = Index.getInstance( "mail-subject" );
			Query q = new And( new Term( "meeting" ), new Term( "schedule" ), new Term( "monday" ) );
			DocumentIteratorBuilderVisitor visitor = new DocumentIteratorBuilderVisitor( null, subjectIndex, 0 );
			DocumentIterator it = q.accept( visitor );
			while ( it.hasNext() ) {
				System.out.println( "Document #: " + it.nextDocument() );
				System.out.print( "\tPositions:" );
				for ( Interval interval: it )
					System.out.print( " " + interval );
				System.out.println();
			}
		
it.unimi.dsi.mg4j.query.parser MG4J: Managing Gigabytes for Java

A simple JavaCC-generated parser used by the {@link it.unimi.dsi.mg4j.query.Query} class. For details about the syntax, please consult the {@link it.unimi.dsi.mg4j.search} package documentation. If you need to know the relation between queries, syntax trees and document iterators please consult the {@link it.unimi.dsi.mg4j.query.nodes} package documentation.

it.unimi.dsi.mg4j.search MG4J: Managing Gigabytes for Java

Classes that compose {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator iterators over documents}. Such iterators are returned, for instance, by {@link it.unimi.dsi.mg4j.index.IndexReader#documents(int)}.

Minimal-interval semantics

MG4J provides minimal-interval semantics. That is, if the index is full-text, a {@link it.unimi.dsi.mg4j.search.DocumentIterator} will provide a list of documents and, for each document, a list of minimal intervals. This intervals denote ranges of positions in the document that satisfy the iterator: for instance, if you compose two documents iterators using an {@link it.unimi.dsi.mg4j.search.AndDocumentIterator}, you will get as a result the intersection of the document lists of the underlying iterators. Moreover, for each document you will get the minimal set of intervals that contain one interval both from the first iterators and from the second one.

This information is of course very useful if you're going to assign a score to the document, as smaller intervals mean a more precise match. At the basic level (e.g., iterators returned by an index), the intervals returned upon a document are intervals of length one containing the term that was used to generate the iterator. Intervals for compound iterators are built in a natural way, preserving minimality. More details can be found in Charles L. A. Clarke and Gordon V. Cormack, Shortest-Substring Retrieval and Ranking (ACM Transactions on Information Systems, vol. 18, no. 1, Jan 2000, pages 44−78). Scorers for documents may be found in the {@link it.unimi.dsi.mg4j.search.score} package.

The algorithms used by classes in this package to compute minimal-interval operators are new: details can be found here.

Note that MG4J provides minimal-interval semantics for a set of indices. This extension is a significant improvement over single-index semantics. However, defining the exact meaning of a query is a nontrivial problem that will be fully dealt with in a forthcoming paper.

Searching with minimal-interval semantics

The aim of this section is to provide a minimal insight of how minimal-interval semantics works, and explain the basic syntax used by the {@link it.unimi.dsi.mg4j.query.Query} command-line tool. In this section we shall try to discuss this issue only through examples; we shall later explain how you can actually perform searches of this kind using MG4J.

Note that you do not have to understand the details of minimal-interval semantics to fruitfully use MG4J. Several natural operators (ordered conjunction, proximity limitation, etc.) are computed by MG4J very efficiently using minimal-interval semantics just under the hood.

MG4J solves queries on multiple indices; by saying so, we mean that you may have many indices concerning the same document collection, and you want to perform a query that may search on some of them. Think, for example, of a collection of emails: you might have generated a number of indices, to index their subjects, sender, recipient(s), body etc. All these indices can be thought of as individual entities, their only relation being that they index collections with the same number of documents, and that same-numbered documents conceptually "come" from the same source. The notion of multiple indices should not be new to the reader that is familiar with the {@link it.unimi.dsi.mg4j.document} package.

In our examples, we will assume that we have three indices (say, subject, from and body), and that subject is the used as default. Be warned that the actual syntax of queries in this section is immaterial (even though we shall stick to the syntax of {@link it.unimi.dsi.mg4j.query.parser.SimpleParser}).

Two different aspects should be taken into consideration when trying to determine which document actually match (i.e., satisfy) the query:

  • first, one can consider this in a purely Boolean (true/false) setting: thus a document may either satisfy the query or not; this is actually the only information you can get for indices that do not contain positions;
  • second, one can consider, for a document that matches the query in the above sense, which intervals (i.e., minimal sequences of consecutive words within the document) actually witness the match; this information will be available if the index contains positions.

In the following subsections, we shall give information about both kind of satisfiability.

Queries available on all indices

Simple queries

The simplest possible query consists in a single search term. The documents matching such a query are exactly those that contain the given term, with respect to the default index. In our example, the query

		 meeting
		 
will be matched by the documents (emails) that contain the term "meeting" in their subject. If you want, you can perform the query on another index (different from the default one); thus, for example, the query
		body: meeting
		 
will be matched by the documents that contain the term "meeting" in their body. In both cases, the intervals witnessing the match will be the single occurrences of the term "meeting" in the subject and in the body field, respectively.

Conjunctive queries

You can specify that more than one condition should be met in conjunction by using the AND operator. For example:

			meeting AND schedule
			
will be matched by those document whose subject contains both the term "meeting" and the term "schedule" (not necessarily in this order). The witnesses will be minimal intervals in the subjects that contain both terms. For example, if the subject was
			schedule the meeting (should we schedule this meeting or not)?
			
then the above query will have three witnesses: "schedule the meeting", "meeting (should we schedule" and "schedule this meeting".

The keyword AND can be subsituted with the symbol & or can even be omitted. So the above query is equivalent to:

			meeting & schedule
			
and to
			meeting schedule
			
Also in this case, you can select a different index for the query to be matched. For example:
			body: meeting schedule
			
(or, equivalently, body: meeting AND schedule or body: meeting & schedule) will be matched by documents that contain the term "meeting" in their body and the term "schedule" in their subject. In this case, witnesses come from different sources: a witness will be any single occurrence of the word "meeting" in the body (there should be at least one to make the document match the query) and any single occurrence of the word "schedule" in the subject (again, there should be at least one to make the document match the query). If you want both terms to be searched for in the body index, you can use:
			body: meeting body: schedule
			
or, simply,
			body: (meeting schedule)
			

Disjunctive queries

You can also introduce a disjunctive (OR) query, like in

			meeting OR schedule
			
that will be matched by the documents that contain the term "meeting" or the term "schedule" (or both) in their subject. A witness will then be every single occurrence of either word in the subject. The keyword OR can be substituted with |, hence the previous query is equivalent to
			meeting | schedule
			

Conjunctive and disjunctive operators can appear in the same query, with the rule that AND has higher priority than OR. So, for example:

			meeting AND schedule OR time
			
will be matched by documents whose subject contains both "meeting" and "schedule", and by documents whose subject contains "time". In this case, a witness will either be a (possibly long) interval containing both the words "meeting" and "schedule", or a one-word interval containing the word "time". If you want to change this behaviour, you should use parenthesis, like:
			meeting AND (schedule OR time)
			

Again, you can use index selectors, like in:

			body:meeting AND (schedule OR time)
			
that will be matched by documents containing "meeting" in their body (a witness being every single occurrence of the word in the body), and "schedule" or "time" in their subject (a witness being every single occurrence of either word in the subject). Similarly:
			body:(meeting AND (schedule OR time))
			
will match documents that contain "meeting" and either "schedule" or "time" (or both) in their body.

Negative (NOT) queries

You can specify that you want to exclude documents containing a certain term, or, more in general, satisfying a certain query, by using the (unary, prefix) operator NOT. For example:

				body:(meeting AND NOT tomorrow) AND subject:schedule
				
will be satistied by the emails that contain the term "schedule" in their subject, and the term "meeting" but not the term "tomorrow" in their body. The operator NOT can be substituted with !, like in:
				body:(meeting !tomorrow) subject:schedule
				

Negative queries are easily understood in a Boolean context, but may be more difficult as far as witnesses are concerned. Basically, the implementation of NOT works in such a way that NOT is actually used only for the Boolean match, but does not influence witnesses. In more detail, the only witness associated to a true NOT query is an empty interval.

Prefix and multiterm queries

A prefix query is a simple query that is matched by all terms starting from the same nonempty prefix. For example:

			   govern*
			   
is matched by all documents containing any word starting with "govern". For the prefix operator * to work, you have to endow your index with a {@link it.unimi.dsi.mg4j.index.PrefixMap}. What really happens in this case is that the query is essentially expanded into a disjunction that contains all the words in the dictionary that start with "govern".

To be true, the expansion of a prefix query does not really lead to an OR, but rather to what we call a multiterm query: a multiterm query is like an OR, but it can only contain terms as subquery, and behaves under many respects like a single term. It is unusual to specify manually a multiterm query—rather, some query expansion mechanism like prefixes should be used, but if you want to try manually, a multiterm query can be obtained using the + operator. For example:

			   house + houses + housing
			   
is a correct multiterm query, and it is loosely equivalent to
			   house OR houses OR housing
			   

Note, however, that trying to use + instead of OR does not work if the subqueries are not simple queries, or if they concern different indices. For example:

			   house + title:meeting
			   
would produce an error.

You may wonder why multiterm queries are needed, if they are essentially the same as OR queries. The first answer is efficiency: a multiterm query should be more efficient than an OR query.

The second answer is more subtle, and has to do with scorers. A scorer is a way to assign a score to a document that satisfies a query. Many scorers actually work by summing up suitable partial scores that depend on the document and on one of the terms in the query. Such partial scores are often function of the count (number of times the term appears in the document) and on the frequency (number of documents where the term appears), and they are often really high when the term has a low frequency. The idea behind this is that if I write:

			   computer OR methacrylic
			   
a document that satisfies the query because it contains "methacrylic" is more valuable than one that contains the word "computer", being the former much more infrequent.

Nonetheless, trying to use these scorers on automatically expanded queries may lead to many problems. For example, suppose you expanded

			   govern*
			   
as
			   government OR governance OR governor OR governing 
			   
(we are here assuming that the four terms above are the only ones that appear in the dictionary and start with "govern"). Now, since "governance" is presumably much rarer than "government", we expect all documents containing only "governance" to be given a high score. using
			   government + governance + governor + governing 
			   
the scorer acts on this bunch of words as a whole, and the frequency is assumed to be the maximum frequency (hence, it is the same for all words), avoiding the "governance"-prevalence problem.

Queries available on indices with positions

Ordered conjunctive queries

The operator of ordered conjunction < works like AND, but requires the subqueries to be satisfied in the exact order in which they are specified, even though not necessarily consecutively. For example:

			meeting < schedule
			
will only be matched by documents that contain in their subject at least one occurrence of the word "meeting" followed (maybe not immediately) by at least one occurrence of the word schedule. Again, for example, if the subject was
			schedule the meeting (should we schedule this meeting or not)?
			
then the above query will have only one witness: "meeting (should we schedule"; the other two minimal intervals that contain both words ("schedule the meeting" and "schedule this meeting") are not witnesses because words appear in the wrong order.

Note that the ordering between witnesses is strict: for instance, the query

			meeting < meeting
			
has as only witness "meeting (should we schedule meeting". The single word "meeting" alone is not a witness for the query.

In this case, it makes no sense (and it is indeed forbidden) to select a different index for the subqueries to be matched.

Consecutivity (phrasal queries)

You can specify that you want that some terms appear consecutively by using " (quotes). For example:

				"meeting schedule"
				
will be matched if the terms "meeting" and "schedule" appear in this order, and consecutively, in the subject. Inside quotes, you can also use subqueries, surrounding them with parenthesis, like in:
				"meeting (schedule OR time)"
				
that is matched by documents whose subject contains the term "meeting" followed by either "schedule" or "time". A witness will this time be necessarily an interval of exactly two words (the first being "meeting" and the second being either "schedule" or "time").

More precisely, the quotes operators is satisfied if there is a sequence of consecutive witnesses, with each witness coming from a different subquery, in the same order in which the queries appear.

Note that

				"meeting schedule OR time"
				
would be invalid: if you want to use operators within quotes, you should do so between parenthesis. Moreover, within quotes you cannot change index. So you can say
				body:"meeting schedule"
				
but you cannot use
				"body:meeting subject:schedule"
				

The symbol $ (dollar) can be used to specify an arbitrary word in a consecutive query. For instance,

			meeting $ schedule
			
will match "meeting our schedule" as well as "meeting my schedule". You can add dollars also at the start of a phrase, but not at the end (in the latter case, they will be ignored).

Proximity limit

As we have discussed, when a document matches a given query, there will be one or more witnesses within the document. Each such witness is a consecutive sequence of positions in the document that witness the matching. For example, consider the query

			  body:((meeting schedule) OR "John Smith") OR subject:alarm
			  

This query will be matched by documents that contain the term "alarm" in their subject, and by documents that contain either the terms "meeting" and "schedule" or the (exact) sentence "John Smith" in their body. For every document that matches the query, there will be two sets of matching intervals, one about the body and the other about the subject; at least one of these two sets will be nonempty (because of the OR keyword). Intervals concerning the subject will simply be intervals of length one that correspond to the positions where the term "alarm" appears in the subject. Intervals concerning the body will be either intervals of length two corresponding to the positions where the sentence "John Smith" appears in the body, or intervals of length two or more where both "meeting" and "schedule" appear.

You might want to accept only matching intervals up to a certain length; for example, suppose you don't want to take into considerations intervals that contain "meeting" and "schedule" too far apart, say at a distance greater than 10 words. You can do this by using the proximity limit operator ~. Just rewrite the previous query as

			  body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
			  

This way, you are simply discarding the matching intervals that contain the terms "meeting" and "schedule" if their length (number of words) is greater than 10 (i.e., if "meeting" and "schedule" are separated by more than 8 words).

The proximity limit operator can be used at any point, and limits the length of all matching intervals of the query it is applied to. Note, however, that it may only be used on full-text indices.

Difference

The Brouwerian difference operator is specified using - (minus). It is a rather esoteric operator that is rarely met by the end user, and that, given two subqueries, kills the witnesses of the first query (the minuend) that contain one or more witnesses of the second query (the subtrahend). By definition, for documents that satisfy the minuend, but not the subtrahend, the witnesses are unchanged. For instance, the following query

				schedule < meeting - this
				
will be matched only if the term "schedule" and the term "meeting" appear in this order without the term "this" inbetween. If the subject is
				schedule the meeting (should we schedule this meeting or not)?
				
the only valid witness is "schedule this meeting", and indeed, the following query
				schedule < meeting - (this | the)
				
will not match at all the subject above, as all witnesses of the minuend are killed by witnesses of the subtrahend.

As an additional feature, you can specify a left and a right margin that will be used to enlarge the intervals of the subtrahend. For instance,

				"schedule < meeting - [[1,2]] this"
				

will kill intervals of the minuend that contain the whole fragment "schedule this meeting or" (so no interval will be killed at all).

Queries available on payload-based indices

Actually, the atomic queries discussed above (term, prefix, etc.) can be used with standard indices, that is, indices of fields containing text. For payload-based indices, which represent document metadata such as dates, the standard query available in MG4J is a range query in which the first and last valid values are specified by the user. The resulting query is satisfying by all documents whose field is in the range. Both the first and the last value can be omitted. for instance, the following query

				date:[ 20/2/2007 .. 23/2/2007 ]
				
will search for documents between 20 February and 23 February 2007, inclusive, whereas the query
				date:[ .. 23/2/2007 ]
				
will search for documents up to 23 February 2007. Note that in the built-in parser spaces are necessary. They make it possible to separate the different tokens composing the query.

Range queries must not be used as a generic query mechanism, but rather to refine the result of a query over document content: a ranked query composed uniquely by a range query will have to scan the whole payload-based index just to return a few results.

Building and composing document iterators

The {@link it.unimi.dsi.mg4j.search} package contains all the classes needed to build a query and to match it against a certain collection of indices. This is actually only the semantic counterpart to a query; for the syntactic aspects, please refer to the {@link it.unimi.dsi.mg4j.query.nodes} package.

Basic classes

An {@link it.unimi.dsi.mg4j.search.Interval} represents a consecutive set of natural numbers, that is, a witness within a document (in this case, numbers represent the positions within a document: 0 is the position of the first word, 1 is the position of the second and so on). An {@link it.unimi.dsi.mg4j.search.IntervalIterator} is an iterator that returns intervals: typically, an interval iterator will return all intervals witnessing a certain query for a certain document (and a certain index).

For example, the query

			  body:((meeting schedule)~10 OR "John Smith") OR subject:alarm
			  
will give rise to an interval iterator for the body and an interval iterator for the subject: the former will return intervals within the body witnessing the first part of the query, and the latter will return the intervals the intervals witnessing the second part of the query. Note that even upon a matching document either iterator may actually return no interval (because the overall query is disjunctive); nonetheless, the two iterators cannot be both empty.

It is always understood that intervals are returned in increasing order (of their left, or equivalently right, extreme).

A {@link it.unimi.dsi.mg4j.search.DocumentIterator} is used to scan a whole collection of indices for a query. At every given moment, the iterator will be able to return the next document matching the query, and, for full-text indices, you will also be able to {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator(it.unimi.dsi.mg4j.index.Index) get the interval iterators of the witnesses for that document and for a specific index}.

Obtaining and composing document iterators

The simplest kind of {@link it.unimi.dsi.mg4j.search.DocumentIterator} you can build is an {@link it.unimi.dsi.mg4j.index.IndexIterator}: it is a document iterator that scans a specific index for a specific term. You don't actually build an index iterator directly, but you rather obtain one by calling the {@link it.unimi.dsi.mg4j.index.Index#documents(CharSequence)} (or, equivalently, {@link it.unimi.dsi.mg4j.index.IndexReader#documents(CharSequence)}) method, that returns the set of documents containing a given term (and witnesses will be the single occurrences of such term).

Hence, for example, the following snippet opens a full-text index whose basename is mail-subject, and prints out all documents containing the word "meeting", each with the sequence of positions where the word appears (all intervals will be actually singletons). (Note that a document iterator over a single index is itself iterable, and the {@link it.unimi.dsi.mg4j.search.DocumentIterator#iterator()} method is actually an alias for {@link it.unimi.dsi.mg4j.search.DocumentIterator#intervalIterator()}).

			Index subjectIndex = Index.getInstance( "mail-subject" );
			DocumentIterator it = subjectIndex.documents( "meeting" );
			while ( it.hasNext() ) {
					System.out.println( "Document #: " + it.nextDocument() );
					System.out.print( "\tPositions:" );
					for ( Interval interval: it )
						System.out.print( " " + interval );
					System.out.println();
			}
		

A number of classes in this package can be used to compose iterators; more precisely, for each query operator discussed above there is a corresponding class in this package. Each such class has a factory method that allows one to build new document iterators by composing existing iterators.

For example, the following snippet shows how to search for mails containing the words "meeting", "schedule" and "monday".

			Index subjectIndex = Index.getInstance( "mail-subject" );
			DocumentIterator it = AndDocumentIterator.getInstance( 
				subjectIndex.documents( "meeting" ), 
				subjectIndex.documents( "schedule" ), 
				subjectIndex.documents( "monday" ) 
			);
			while ( it.hasNext() ) {
					System.out.println( "Document #: " + it.nextDocument() );
					System.out.print( "\tPositions:" );
					for ( Interval interval: it )
						System.out.print( " " + interval );
					System.out.println();
			}
		

The following table shows the correspondence between query operators and composition classes:

OperatorClass
AND & (conjunction){@link it.unimi.dsi.mg4j.search.AndDocumentIterator}
OR | (disjunction){@link it.unimi.dsi.mg4j.search.OrDocumentIterator}
NOT ! (negation){@link it.unimi.dsi.mg4j.search.NotDocumentIterator}
+ (multiterm){@link it.unimi.dsi.mg4j.index.MultiTermIndexIterator}
"..." (phrase){@link it.unimi.dsi.mg4j.search.ConsecutiveDocumentIterator}
< (ordered conjunction){@link it.unimi.dsi.mg4j.search.OrderedAndDocumentIterator}
~ (proximity){@link it.unimi.dsi.mg4j.search.LowPassDocumentIterator}
- (difference){@link it.unimi.dsi.mg4j.search.DifferenceDocumentIterator}
[ .. ] (range){@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator}

Note, however, that {@link it.unimi.dsi.mg4j.search.PayloadPredicateDocumentIterator} is actually a completely generic predicate-based class that just returns documents whose payload satisfis a predicate.

Queries and document iterators

Even though it is perfectly legal to build document iterators by using these classes directly, this is not the natural way to do that. One should rather build a syntactic object corresponding to a query, and then make it into a document iterator that is, in some sense, the semantic counterpart of the query itself. To have more information about how this works exaclty, please consult the overview of the {@link it.unimi.dsi.mg4j.query.nodes} package.

it.unimi.dsi.mg4j.search.score MG4J: Managing Gigabytes for Java

Classes for assigning scores to documents.

The content of this package has changed significantly in MG4J 1.1 (hopefully for the better). A {@link it.unimi.dsi.mg4j.search.score.Scorer} is an object that wraps an underlying {@link it.unimi.dsi.mg4j.search.DocumentIterator} and assigns scores to the documents returned by the underlying iterator. In general, once a scorer has {@linkplain it.unimi.dsi.mg4j.search.score.Scorer#wrap(it.unimi.dsi.mg4j.search.DocumentIterator) wrapped a document iterator} one just calls {@link it.unimi.dsi.fastutil.ints.IntIterator#nextInt() nextInt()} and {@link it.unimi.dsi.mg4j.search.score.Scorer#score()} to get scored documents (some iterators might support {@linkplain it.unimi.dsi.mg4j.search.score.Scorer#score(it.unimi.dsi.mg4j.index.Index) index-restricted scoring}, but this is optional).

If the scorer is a {@link it.unimi.dsi.mg4j.search.score.DelegatingScorer}, then by contract it just delegates all {@link it.unimi.dsi.fastutil.ints.IntIterator}'s methods to the underlying iterator. In this case, it is possible to advance manually the underlying iterator and call {@link it.unimi.dsi.mg4j.search.score.Scorer#score()}. While this behaviour is useless for general users, it is essential for {@linkplain it.unimi.dsi.mg4j.search.score.AbstractAggregator aggregated scorers}, which combine several delegating scorers and provide services such as equalisation and interval caching (in case more than one component scorer uses intervals). See, for instance, {@link it.unimi.dsi.mg4j.search.score.LinearAggregator}.

it.unimi.dsi.mg4j.search.visitor MG4J: Managing Gigabytes for Java

Visitors for composite {@linkplain it.unimi.dsi.mg4j.search.DocumentIterator document iterators}.

Composites and visitors

A {@link it.unimi.dsi.mg4j.search.DocumentIterator} (in particular, those provided by MG4J in the package {@link it.unimi.dsi.mg4j.search}) is usually structured as a composite, with operators as internal nodes and {@link it.unimi.dsi.mg4j.index.IndexIterator}s as leaves. A composite can be explored using a visitor: thus, the {@link it.unimi.dsi.mg4j.search.DocumentIterator} interface provides two methods, {@link it.unimi.dsi.mg4j.search.DocumentIterator#accept(it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor) accept(DocumentIteratorVisitor)} and {@link it.unimi.dsi.mg4j.search.DocumentIterator#acceptOnTruePaths(it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor) acceptOnTruePaths(DocumentIteratorVisitor)}, that let a {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor} visit the composite structure.

A {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor} provides methods for visiting in {@linkplain it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#visitPre(it.unimi.dsi.mg4j.search.DocumentIterator) preorder} and in {@linkplain it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#visitPost(it.unimi.dsi.mg4j.search.DocumentIterator) postorder} all internal nodes. Leaves have a {@linkplain it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#visit(it.unimi.dsi.mg4j.index.IndexIterator) single visit method} instead. Note that a {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor} must be (re)usable after each call to {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#prepare() prepare()}.

The abstract class {@link it.unimi.dsi.mg4j.search.visitor.AbstractDocumentIteratorVisitor} provides stubs implementing internal visits and {@link it.unimi.dsi.mg4j.search.visitor.DocumentIteratorVisitor#prepare() prepare()} as no-ops.

Counting term occurrences

An example of the utility of visitors for document iterators is given by term counting: using a number of coordinated visitors, it is possible to compute a count for each term appearing in a (no matter how complex) query. The count can be used as an input for counting-based scoring schemes, such as BM25 or cosine-based measures. For more information, please read the documentation of {@link it.unimi.dsi.mg4j.search.visitor.CounterCollectionVisitor}.

it.unimi.dsi.mg4j.test
it.unimi.dsi.mg4j.tool MG4J: Managing Gigabytes for Java

Line-command tools for index construction.

The classes in this package contain a main method, and can be used to build indices starting from a document sequence. Please have a look at the MG4J manual to learn how to build an index.
it.unimi.dsi.mg4j.util MG4J: Managing Gigabytes for Java General-purpose utility classes.

Package Specification

The classes in this package are completely general utility classes that may be useful outside of MG4J. In particular, {@link it.unimi.dsi.mg4j.util.MutableString} is a highly-tuned, versatile class for content-based mutable strings.

Another general-purpose important component is {@link it.unimi.dsi.mg4j.util.MinimalPerfectHash}, which implements state-of-the-art, hypergraph-based algorithms for the construction of ordered minimal perfect hash tables.

If minimal perfect hash tables are too large, a good alternative is an {@link it.unimi.dsi.mg4j.util.ImmutableExternalPrefixDictionary}, which keeps most data in compressed form on disk, and uses a small in-memory index to access those data.

it.unimi.dsi.mg4j.util.parser MG4J: Managing Gigabytes for Java A fast, lightweight, on-demand (X)HTML parser.
it.unimi.dsi.mg4j.util.parser.callback MG4J: Managing Gigabytes for Java Callbacks for the {@link it.unimi.dsi.mg4j.util.parser.BulletParser}.
test.it.unimi.dsi.mg4j
test.it.unimi.dsi.mg4j.document
test.it.unimi.dsi.mg4j.index
test.it.unimi.dsi.mg4j.index.cluster
test.it.unimi.dsi.mg4j.index.payload
test.it.unimi.dsi.mg4j.query.nodes
test.it.unimi.dsi.mg4j.search
test.it.unimi.dsi.mg4j.search.score
test.it.unimi.dsi.mg4j.tool
test.it.unimi.dsi.mg4j.util
www.java2java.com | Contact Us
Copyright 2009 - 12 Demo Source and Support. All rights reserved.
All other trademarks are property of their respective owners.