| java.lang.Object it.unimi.dsi.mg4j.tool.Scan
Scan | public class Scan (Code) | | Scans a document sequence, dividing it into batches of occurrences and writing for each batch a
corresponding subindex.
This class (more precisely, its
run() method) reads a document sequence and produces several batches, that is, subindices
corresponding to subsets of term/document pairs of the collection. A set of batches is generated
for each indexed field of the collection. A main method invokes the above method setting its
parameters using suitable options.
Unless a serialised
it.unimi.dsi.mg4j.document.DocumentSequence is specified using
the suitable option, an implicit
it.unimi.dsi.mg4j.document.InputStreamDocumentSequence is created using a separator byte (default is 10, i.e., newline). In the latter case, the factory
and its properties can be set with command-line options.
The only mandatory argument is a basename, which will be used to stem the names
of all files generated. The first batch of a field named field will use the basename
basename-field@0, the second batch basename-field@1
and so on. It is also possible to specify a separate directory for batch files (e.g., for easier
deletion when they are no longer necessary).
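The naming scheme above can be sketched as follows. This is a minimal illustration, not MG4J's own code: the class BatchNames and its helper are hypothetical, and the handling of the batch directory (resolving the name inside it) is an assumption about what relativising the name means.

```java
import java.io.File;

public class BatchNames {
    /**
     * Hypothetical helper: builds the basename of a batch for a given field,
     * following the basename-field@batch scheme described in the text.
     */
    public static String batchBasename(String basename, String field, int batch, File batchDir) {
        String name = basename + "-" + field + "@" + batch;
        // Assumption: when a batch directory is given, the name is resolved inside it.
        return batchDir == null ? name : new File(batchDir, name).toString();
    }

    public static void main(String[] args) {
        // First three batches of a field named "title" for an index called "index".
        for (int batch = 0; batch < 3; batch++) {
            System.out.println(batchBasename("index", "title", batch, null));
        }
        // The same first batch, placed in a separate directory for batch files.
        System.out.println(batchBasename("index", "title", 0, new File("batches")));
    }
}
```

Keeping batches in their own directory means that removing them after the final index has been built amounts to deleting that directory's contents.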
Since documents are read sequentially, every document has a natural
index starting
from 0. If no remapping (i.e., renumbering) is specified, the document
index of each document
corresponds to its natural index. If, however, a remapping is specified, under the form of a
list of integers, the document index of a document is the integer found in the corresponding
position of the list. More precisely, a remapping for N documents is a list of
N distinct integers, and a document with natural index i has document
index given by the i-th element of the list. This is useful when indexing statically
ranked documents (e.g., if you are indexing a part of the web and would like the index to return
documents with higher static rank first). If the remapping file is provided, it must be a
sequence of integers, written using the
java.io.DataOutputStream.writeInt(int) method; if
N is the number of documents, the file is to contain exactly N distinct
integers. The integers need not be between 0 and N-1, to allow the remapping of
subindices (but a warning will be logged in this case, just to be sure you know what you're doing).
Also every term has an associated number starting from 0, assigned in lexicographic order.
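As an illustration of the remapping file format described above, the following sketch encodes a remapping as a sequence of integers written with java.io.DataOutputStream.writeInt(int) and decodes it again. The class RemappingFile and its method names are hypothetical; the example works on byte arrays, so the resulting bytes would still have to be saved to a file (for instance with java.nio.file.Files.write) before being passed to the scanner.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class RemappingFile {
    /** Encodes a remapping: element i is the document index of the document with natural index i. */
    public static byte[] encode(int[] map) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int documentIndex : map) out.writeInt(documentIndex); // 4 bytes, big-endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes the content of a remapping file back into an array of document indices. */
    public static int[] decode(byte[] content) {
        int[] map = new int[content.length / Integer.BYTES];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(content))) {
            for (int i = 0; i < map.length; i++) map[i] = in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    /** Checks the requirement that a remapping contains N distinct integers. */
    public static boolean allDistinct(int[] map) {
        return Arrays.stream(map).distinct().count() == map.length;
    }

    public static void main(String[] args) {
        int[] map = {2, 0, 1}; // natural index 0 -> document index 2, and so on
        byte[] content = encode(map);
        System.out.println(content.length + " bytes");
        System.out.println(Arrays.toString(decode(content)));
        System.out.println("distinct: " + allDistinct(map));
    }
}
```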
Index types and indexing types
A standard index contains a list of terms, and for each term a posting list. Each
posting must contain a document pointer and may optionally contain the count and the
positions of the term (whether the last two elements appear can be specified using suitable
flags).
The indexing type of a standard index can be
IndexingType.STANDARD,
IndexingType.REMAPPED or
IndexingType.VIRTUAL. In the first case, we index the
words occurring in documents as usual. In the second case, before writing the index all documents
are renumbered following a provided map. In the third case (used only with
it.unimi.dsi.mg4j.document.DocumentFactory.FieldType.VIRTUAL fields) indexing is performed on a virtual document
obtained by collating a number of
fragments.
Fragments are associated to documents by some key,
and a
VirtualDocumentResolver turns a key into a document natural number, so that the
collation process can take place (a settable gap is inserted between fragments).
Besides storing document pointers, counts, and positions, MG4J makes it possible to
store an arbitrary payload with each posting. This feature is presently used only to
create payload-based indices—indices without counts and positions that contain
a single, dummy word #. They are actually used to store arbitrary data associated
to each document, such as dates and integers: using a special syntax, it is then possible to specify
range queries on the values of such fields.
The main difference between standard and payload-based indices is that the first type is
handled by instances of this class, whereas the second type is handled by instances of
Scan.PayloadAccumulator. The
run() method creates a set of suitable instances, one for each indexed field, and feeds them in
parallel with data from the appropriate field of the same document.
Batch subdivision and content
The scanning process uses a user-settable number of documents per batch, and will try to
build batches containing exactly that number of documents (for all indexed fields). There are of
course space constraints that could make building exact batches impossible, as the entire data of
a batch must fit into core memory. If memory is too low, a batch will be generated with fewer
documents than expected.
In some extreme cases, it could be impossible to produce cleanly a set of batches for all
fields: in that case, emergency dumps will create fragmented batches—instead
of a single batch containing k documents a certain field will generate two separate
batches. As a consequence, different fields will have a different number of batches, but a simple
inspection of the property files (see below) will reveal the details of the emergency dumps (and
Combine can be used to rebuild the desired exact batches, if necessary).
The larger the number of documents in a batch is, the quicker index construction will be.
Usually, some experiments and a look at the logs are all it takes to find good parameters
for the Java virtual machine maximum memory setting and for the number of documents per batch.
These are the files currently generated for each batch (basename
denotes the basename of the batch, not of the index):
- basename.terms
- For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely,
the i-th line of the file (starting from 0) contains the literal string corresponding
to term index i.
- basename.terms.unsorted
- The list of indexed terms in the same order in which they were met in the document
collection. This list is not produced unless you ask for it explicitly with a suitable option.
- basename.frequencies
- For each term, the number of documents in which the term appears in γ coding. More
precisely, the i-th integer of the file (starting from 0) is the number of documents in
which the term of index i appears.
- basename.sizes (not generated for payload-based indices)
- For each indexed document, the corresponding size (=number of words) in γ coding. More
precisely, the i-th integer of the file (starting from 0) is the size in words of the
document of index i.
- basename.index
- The inverted index.
- basename.offsets (not generated for payload-based indices)
- For each term, the bit offset in basename.index at which the
inverted lists start. More precisely, the first integer is the offset for term 0 in γ
coding, and then the i-th integer is the difference between the i-th and
the i−1-th offset in γ coding. If T terms were indexed, this
file will contain T+1 integers, the last being the difference (in bits) between the
length of the entire inverted index and the offset of the last inverted list.
- basename.globcounts (not generated for payload-based indices)
- For each term, the number of its occurrences throughout the whole document collection, in
γ coding. More precisely, the i-th integer of the file (starting from 0) is the
number of occurrences of the term of index i.
- basename.properties
- A Java properties file
containing information about the index.
Currently, the following keys (taken from
it.unimi.dsi.mg4j.index.Index.PropertyKeys)
are generated:
- indexclass
- the class used to generate the batch (presently,
BitStreamIndexWriter );
- documents
- number of documents in the collection;
- terms
- number of indexed terms;
- occurrences
- number of words throughout the whole collection;
- postings
- number
of postings (pairs term/document) throughout the whole collection;
- maxdocsize
- maximum
size of a document in words;
- termprocessor
- the term processor (if any) used during the
index construction;
- coding
- one or more items, each defining a key/value pair for the
flag map of the index; each pair is of the form component:coding
(see
it.unimi.dsi.mg4j.index.CompressionFlags);
- field
- the name of the field
that generated this batch (optional)
- maxcount
- the maximum count in the collection, that
is, the maximum count of a term maximised on all terms and documents;
- size
- the index
size in bits;
- basename.cluster.properties
- A Java properties file
containing information about the set of batches
seen as a
it.unimi.dsi.mg4j.index.cluster.DocumentalCluster. The keys are the same as in the
previous case, but additionally a number of localindex entries specify the basename
of the batches, and a splitstrategy. After creating manually suitable term maps for
each batch, you will be able to access the set of batches as a single index (but note that
standard batches have no skip structure, and should not be used
in production; if you intend to do so, you have to write a customised scanning procedure).
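The layout of the offsets file described in the list above can be made concrete with a small sketch. It assumes that the γ-coded integers have already been decoded into an array (the decoding itself is not shown), and the class name OffsetGaps is hypothetical. The first value is the offset of term 0, each following value is the difference from the previous offset, and with T terms the last of the T+1 accumulated values is the length in bits of the whole inverted index.

```java
import java.util.Arrays;

public class OffsetGaps {
    /**
     * Turns the gap-encoded content of an offsets file into absolute bit offsets.
     * Input: gaps[0] is the offset of term 0, gaps[i] the difference between
     * the i-th and the (i-1)-th offset. Output: one absolute offset per term,
     * plus a final entry equal to the total length of the index in bits.
     */
    public static long[] toAbsolute(long[] gaps) {
        long[] offsets = new long[gaps.length];
        long current = 0;
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i];
            offsets[i] = current;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical decoded content for T = 3 terms (T + 1 = 4 integers).
        long[] gaps = {0, 120, 35, 400};
        long[] offsets = toAbsolute(gaps);
        System.out.println(Arrays.toString(offsets)); // [0, 120, 155, 555]
        // The inverted list of term i spans offsets[i + 1] - offsets[i] bits.
        for (int term = 0; term < offsets.length - 1; term++) {
            System.out.println("term " + term + ": " + (offsets[term + 1] - offsets[term]) + " bits");
        }
    }
}
```

Storing differences rather than absolute offsets keeps the integers small, which is what makes a variable-length code such as γ effective for this file.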
author: Sebastiano Vigna since: 1.0 |
Inner Class :public static enum IndexingType | |
Inner Class :public static interface VirtualDocumentFragment extends Serializable | |
Inner Class :protected static class PayloadAccumulator | |
Constructor Summary | |
public | Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) Creates a new scanner instance. | public | Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) Creates a new scanner instance. | public | Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) Creates a new scanner instance. |
Method Summary | |
protected static String | batchBasename(int batch, String basename, File batchDir) Returns the name of a batch.
You can override this method if you prefer a different batch naming scheme.
Parameters: batch - the batch number. Parameters: basename - the index basename. Parameters: batchDir - if not null , a temporary directory for batches. | public static void | cleanup(String basename, int batches, File batchDir) Cleans all intermediate files generated by a run of this class. | public void | close() Closes this pass, releasing all resources. | protected long | dumpBatch() Dumps the current batch on disk as an index.
Returns: the number of occurrences contained in the batch. | public static DocumentSequence | getSequence(String sequenceName, Class<?> factoryClass, String[] property, int delimiter, Logger logger) Returns the document sequence to be indexed.
Parameters: sequenceName - the name of a serialised document sequence, or null for standard input. Parameters: factoryClass - the class of the DocumentFactory that should be passed to the document sequence. Parameters: property - an array of property strings to be used in the factory initialisation. Parameters: delimiter - a delimiter in case we want to use standard input. Parameters: logger - a logger. | public static void | main(String[] arg) | protected void | openSizeBitStream() | public static int[] | parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported) | public static int[] | parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory) | public static int[] | parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory) | public static VirtualDocumentResolver[] | parseVirtualDocumentResolver(String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory) | public void | processDocument(int documentPointer, WordReader wordReader) Processes a document. | public static void | run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName) Runs in parallel a number of instances. | public static void | run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName) Runs in parallel a number of instances.
This convenience method takes care of instantiating one instance per indexed field and of
passing the right information to each instance. | public String | toString() |
CLUSTER_PROPERTIES_EXTENSION | final public static String CLUSTER_PROPERTIES_EXTENSION(Code) | | The extension of the strategy for the cluster associated to a scan.
|
DEFAULT_BATCH_SIZE | final public static int DEFAULT_BATCH_SIZE(Code) | | The default batch size.
|
DEFAULT_BUFFER_SIZE | final public static int DEFAULT_BUFFER_SIZE(Code) | | The default buffer size.
|
DEFAULT_DELIMITER | final public static int DEFAULT_DELIMITER(Code) | | The default delimiter separating two documents read from standard input (a newline).
|
DEFAULT_VIRTUAL_DOCUMENT_GAP | final public static int DEFAULT_VIRTUAL_DOCUMENT_GAP(Code) | | The default virtual field gap.
|
currMaxPos | protected int[] currMaxPos(Code) | | The current maximum position for each document, if the field indexed is virtual.
|
flags | final Map<Component, Coding> flags(Code) | | The flag map for batches.
|
maxDocSize | int maxDocSize(Code) | | Maximum size in words of documents seen so far in the current batch.
|
nonWord | final MutableString nonWord(Code) | | |
outOfMemoryError | public boolean outOfMemoryError(Code) | | If true, this class experienced an
OutOfMemoryError during some buffer reallocation.
|
virtualDocumentGap | protected int virtualDocumentGap(Code) | | The width of the artificial gap introduced between virtual-document fragments.
|
word | final MutableString word(Code) | | |
Scan | public Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException(Code) | | Creates a new scanner instance.
Parameters: basename - the basename (usually a global filename followed by the field name, separated by a dash). Parameters: field - the field to be indexed. Parameters: termProcessor - the term processor for this index. Parameters: documentsAreInOrder - if true, documents will be served in increasing order. Parameters: bufferSize - the buffer size used in all I/O. Parameters: builder - a builder used to create a compressed document collection on the fly. Parameters: batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null . throws: FileNotFoundException - |
Scan | public Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException(Code) | | Creates a new scanner instance.
Parameters: basename - the basename (usually a global filename followed by the field name, separated by a dash). Parameters: field - the field to be indexed. Parameters: termProcessor - the term processor for this index. Parameters: indexingType - the type of indexing procedure. Parameters: numVirtualDocs - the number of virtual documents that will be used, in case of a virtual index; otherwise, immaterial. Parameters: virtualDocumentGap - the artificial gap introduced between virtual-document fragments, in case of a virtual index; otherwise, immaterial. Parameters: bufferSize - the buffer size used in all I/O. Parameters: builder - a builder used to create a compressed document collection on the fly. Parameters: batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null . |
batchBasename | protected static String batchBasename(int batch, String basename, File batchDir)(Code) | | Returns the name of a batch.
You can override this method if you prefer a different batch naming scheme.
Parameters: batch - the batch number. Parameters: basename - the index basename. Parameters: batchDir - if not null , a temporary directory for batches. Returns: simply basename@batch , if batchDir is null ; otherwise, we relativise the name to batchDir . |
cleanup | public static void cleanup(String basename, int batches, File batchDir) throws IOException(Code) | | Cleans all intermediate files generated by a run of this class.
Parameters: basename - the basename of the run. Parameters: batches - the number of generated batches. Parameters: batchDir - if not null , a temporary directory where the batches are located. |
close | public void close() throws ConfigurationException, IOException(Code) | | Closes this pass, releasing all resources.
|
dumpBatch | protected long dumpBatch() throws IOException, ConfigurationException(Code) | | Dumps the current batch on disk as an index.
Returns: the number of occurrences contained in the batch. |
getSequence | public static DocumentSequence getSequence(String sequenceName, Class<?> factoryClass, String[] property, int delimiter, Logger logger) throws IllegalAccessException, InvocationTargetException, NoSuchMethodException, IOException, ClassNotFoundException, InstantiationException(Code) | | Returns the document sequence to be indexed.
Parameters: sequenceName - the name of a serialised document sequence, or null for standard input. Parameters: factoryClass - the class of the DocumentFactory that should be passed to the document sequence. Parameters: property - an array of property strings to be used in the factory initialisation. Parameters: delimiter - a delimiter in case we want to use standard input. Parameters: logger - a logger. Returns: the document sequence to be indexed. |
parseQualifiedSizes | public static int[] parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory) throws ParseException(Code) | | |
parseVirtualDocumentGap | public static int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)(Code) | | |
processDocument | public void processDocument(int documentPointer, WordReader wordReader) throws IOException(Code) | | Processes a document.
Parameters: documentPointer - the integer pointer associated to the document. Parameters: wordReader - the word reader associated to the document. |
run | public static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName) throws ConfigurationException, IOException(Code) | | Runs in parallel a number of instances.
|
run | public static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName) throws ConfigurationException, IOException(Code) | | Runs in parallel a number of instances.
This convenience method takes care of instantiating one instance per indexed field and of
passing the right information to each instance. All options are common to all fields, except for
the number of occurrences in a batch, which can be tuned for each field separately.
Parameters: basename - the index basename. Parameters: documentSequence - a document sequence. Parameters: termProcessor - the term processor for this index. Parameters: zipCollectionBasename - if not null , the basename of a new GZIP'd collection built using documentSequence . Parameters: bufferSize - the buffer size used in all I/O. Parameters: documentsPerBatch - the number of documents that we should try to put in each segment. Parameters: indexedField - the fields that should be indexed, in increasing order. Parameters: virtualDocumentResolver - the array of virtual document resolvers to be used, parallel to indexedField : it can safely contain anything (even null ) in correspondence to non-virtual fields, and can safely be null if no fields are virtual. Parameters: virtualGap - the array of virtual field gaps to be used, parallel to indexedField : it can safely contain anything in correspondence to non-virtual fields, and can safely be null if no fields are virtual. Parameters: mapFile - the name of a file containing a map to be applied to document indices. Parameters: logInterval - the minimum time interval between activity logs in milliseconds. Parameters: tempDirName - a directory for temporary files. throws: IOException - throws: ConfigurationException - |
|
|