Java Doc for Scan.java in mg4j (package it.unimi.dsi.mg4j.tool)

java.lang.Object

it.unimi.dsi.mg4j.tool.Scan

Scan

public class Scan

Scans a document sequence, dividing it into batches of occurrences and writing for each batch a corresponding subindex.

This class (more precisely, its run() method, Scan.run(String,DocumentSequence,TermProcessor,String,int,int,int[],VirtualDocumentResolver[],int[],String,long,String)) reads a document sequence and produces several batches, that is, subindices corresponding to subsets of term/document pairs of the collection. A set of batches is generated for each indexed field of the collection. A main method invokes the above method, setting its parameters using suitable options.

Unless a serialised it.unimi.dsi.mg4j.document.DocumentSequence is specified using the suitable option, an implicit it.unimi.dsi.mg4j.document.InputStreamDocumentSequence is created using a separator byte (default is 10, i.e., newline). In the latter case, the factory and its properties can be set with command-line options.

The only mandatory argument is a basename, which will be used to stem the names of all files generated. The first batch of a field named field will use the basename basename-field@0, the second batch basename-field@1, and so on. It is also possible to specify a separate directory for batch files (e.g., for easier deletion when they are no longer necessary).
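The naming scheme above can be sketched in a few lines. This is only an illustration, not the MG4J implementation: the class BatchNames is hypothetical, and placing the name under batchDir with java.io.File is an assumption about what "a separate directory for batch files" means in practice.

```java
import java.io.File;

/** Hypothetical helper illustrating the batch naming scheme (not MG4J code). */
final class BatchNames {
    /**
     * Returns basename@batch; if batchDir is not null, the name is placed
     * under that directory (an assumption made for this sketch).
     */
    static String batchBasename(int batch, String basename, File batchDir) {
        String name = basename + "@" + batch;
        return batchDir == null ? name : new File(batchDir, name).toString();
    }

    public static void main(String[] args) {
        // A global basename "idx" and a field "title" give the per-field basename "idx-title".
        String fieldBasename = "idx" + "-" + "title";
        for (int batch = 0; batch < 3; batch++) {
            System.out.println(batchBasename(batch, fieldBasename, null));
        }
    }
}
```

With a null batch directory, the loop prints idx-title@0, idx-title@1 and idx-title@2, one per line.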

Since documents are read sequentially, every document has a natural index starting from 0. If no remapping (i.e., renumbering) is specified, the document index of each document corresponds to its natural index. If, however, a remapping is specified, in the form of a list of integers, the document index of a document is the integer found in the corresponding position of the list. More precisely, a remapping for N documents is a list of N distinct integers, and a document with natural index i has document index given by the i-th element of the list. This is useful when indexing statically ranked documents (e.g., if you are indexing a part of the web and would like the index to return documents with higher static rank first). If a remapping file is provided, it must be a sequence of integers written using the java.io.DataOutputStream.writeInt(int) method; if N is the number of documents, the file must contain exactly N distinct integers. The integers need not be between 0 and N-1, to allow the remapping of subindices (but a warning will be logged in this case, just to be sure you know what you're doing).
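A minimal sketch of the remapping format described above: N distinct integers, each written with DataOutputStream.writeInt(int), that is, as four big-endian bytes. The class RemapFileSketch and its methods are hypothetical names used only for illustration; the sketch works on byte arrays so that it is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: encodes and decodes a remapping as a sequence of writeInt() values. */
final class RemapFileSketch {
    /** Encodes the map; position i holds the document index of the document with natural index i. */
    static byte[] encode(int[] map) {
        Set<Integer> seen = new HashSet<>();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int documentIndex : map) {
                // The format requires the N integers to be distinct.
                if (!seen.add(documentIndex)) {
                    throw new IllegalArgumentException("Duplicate document index " + documentIndex);
                }
                out.writeInt(documentIndex);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes a byte sequence produced by encode(). */
    static int[] decode(byte[] content) {
        int[] map = new int[content.length / 4];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(content))) {
            for (int i = 0; i < map.length; i++) map[i] = in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        // Natural index 0 becomes document 2, 1 becomes 0, 2 becomes 1.
        byte[] content = encode(new int[] { 2, 0, 1 });
        System.out.println(content.length + " bytes");
    }
}
```

To produce an actual remapping file, the same loop can write to a DataOutputStream wrapped around a FileOutputStream instead of a byte array.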

Every term also has an associated number, starting from 0 and assigned in lexicographic order.

Index types and indexing types

A standard index contains a list of terms, and for each term a posting list. Each posting must contain a document pointer, and then, optionally, the count and the positions of the term (whether the last two elements appear can be specified using suitable flags).

The indexing type of a standard index can be IndexingType.STANDARD, IndexingType.REMAPPED or IndexingType.VIRTUAL. In the first case, we index the words occurring in documents as usual. In the second case, before writing the index all documents are renumbered following a provided map. In the third case (used only with it.unimi.dsi.mg4j.document.DocumentFactory.FieldType.VIRTUAL fields) indexing is performed on a virtual document obtained by collating a number of fragments. Fragments are associated to documents by some key, and a VirtualDocumentResolver turns a key into a document natural number, so that the collation process can take place (a settable gap is inserted between fragments).
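The collation of fragments with a gap can be pictured with a small position counter per virtual document. The sketch below is an assumption about how such bookkeeping might work, loosely modelled on the currMaxPos and virtualDocumentGap fields listed further down; the class VirtualPositions and its append() method are hypothetical and are not part of MG4J.

```java
/** Hypothetical sketch of position bookkeeping for virtual documents (not MG4J code). */
final class VirtualPositions {
    /** Next free position for each virtual document. */
    private final int[] currMaxPos;
    /** Width of the artificial gap inserted after each fragment. */
    private final int gap;

    VirtualPositions(int numVirtualDocs, int gap) {
        this.currMaxPos = new int[numVirtualDocs];
        this.gap = gap;
    }

    /**
     * Appends a fragment of the given length (in words) to a virtual document
     * and returns the position assigned to the first word of the fragment.
     */
    int append(int document, int fragmentLength) {
        int start = currMaxPos[document];
        // Leave a gap after the fragment so that the next one does not look adjacent.
        currMaxPos[document] = start + fragmentLength + gap;
        return start;
    }

    public static void main(String[] args) {
        VirtualPositions positions = new VirtualPositions(2, 64);
        System.out.println(positions.append(0, 10)); // first fragment of document 0
        System.out.println(positions.append(0, 5));  // second fragment, after 10 words and the gap
    }
}
```

Under these assumptions, a gap wider than any phrase or proximity window keeps words from different fragments from matching as if they were adjacent.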

Besides storing document pointers, document counts, and positions, MG4J makes it possible to store an arbitrary payload with each posting. This feature is presently used only to create payload-based indices, that is, indices without counts and positions that contain a single, dummy word #. They are actually used to store arbitrary data associated to each document, such as dates and integers: using a special syntax, it is then possible to specify range queries on the values of such fields.

The main difference between standard and payload-based indices is that the first type is handled by instances of this class, whereas the second type is handled by instances of Scan.PayloadAccumulator. The run() method, Scan.run(String,DocumentSequence,TermProcessor,String,int,int,int[],VirtualDocumentResolver[],int[],String,long,String), creates a set of suitable instances, one for each indexed field, and feeds them in parallel with data from the appropriate field of the same document.

Batch subdivision and content

The scanning process uses a user-settable number of documents per batch, and will try to build batches containing exactly that number of documents (for all indexed fields). There are of course space constraints that could make building exact batches impossible, as the entire data of a batch must fit into core memory. If memory is too low, a batch will be generated with fewer documents than expected.

In some extreme cases, it could be impossible to produce cleanly a set of batches for all fields: in that case, emergency dumps will create fragmented batches, that is, instead of a single batch containing k documents, a certain field will generate two separate batches. As a consequence, different fields will have a different number of batches, but a simple inspection of the property files (see below) will reveal the details of the emergency dumps (and Combine can be used to rebuild the desired exact batches, if necessary).

The larger the number of documents in a batch, the quicker index construction will be. Usually, some experiments and a look at the logs are all that is needed to find good parameters for the Java virtual machine maximum memory setting and for the number of documents per batch.

These are the files currently generated for each batch (basename denotes the basename of the batch, not of the index):

basename.terms

For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i.

basename.terms.unsorted

The list of indexed terms in the same order in which they were met in the document collection. This list is not produced unless you ask for it explicitly with a suitable option.

basename.frequencies

For each term, the number of documents in which the term appears in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of documents in which the term of index i appears.

basename.sizes (not generated for payload-based indices)

For each indexed document, the corresponding size (=number of words) in γ coding. More precisely, the i-th integer of the file (starting from 0) is the size in words of the document of index i.
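For readers unfamiliar with γ coding, the following sketch shows the textbook Elias γ code of a positive integer n: floor(log2 n) zeros followed by n written in binary. It represents bits as characters for readability. The class GammaCode is hypothetical, and the exact bit-level conventions of the files above (for example, how zero is represented) are not stated in this page, so this should not be read as a decoder for actual MG4J files.

```java
import java.util.ArrayList;
import java.util.List;

/** Textbook Elias gamma coding on strings of '0' and '1' (illustration only). */
final class GammaCode {
    /** Returns the gamma code of n >= 1: floor(log2 n) zeros, then n in binary. */
    static String encode(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        String binary = Integer.toBinaryString(n);
        StringBuilder bits = new StringBuilder();
        // One leading zero for each binary digit after the first.
        for (int i = 1; i < binary.length(); i++) bits.append('0');
        return bits.append(binary).toString();
    }

    /** Decodes a concatenation of gamma codes. */
    static int[] decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') {
                zeros++;
                pos++;
            }
            // The next zeros + 1 bits are the binary representation of the value.
            values.add(Integer.parseInt(bits.substring(pos, pos + zeros + 1), 2));
            pos += zeros + 1;
        }
        return values.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) System.out.println(n + " -> " + encode(n));
    }
}
```

Small values get short codes (1 is a single bit, 2 and 3 take three bits), which suits files such as frequencies and sizes, where small integers are common.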

basename.index

The inverted index.

basename.offsets (not generated for payload-based indices)

For each term, the bit offset in basename.index at which its inverted list starts. More precisely, the first integer is the offset for term 0 in γ coding, and then the i-th integer is the difference between the i-th and the (i−1)-th offset in γ coding. If T terms were indexed, this file will contain T+1 integers, the last being the difference (in bits) between the length of the entire inverted index and the offset of the last inverted list.
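Since the offsets are stored as differences, absolute bit offsets are recovered with a running sum. The sketch below assumes the T+1 integers have already been decoded into an array; the class OffsetsSketch and the sample values are hypothetical.

```java
/** Hypothetical helper: rebuilds absolute bit offsets from stored differences. */
final class OffsetsSketch {
    /**
     * Given the T+1 decoded integers of an offsets file, returns T+1 absolute
     * values: entry i (i < T) is the bit offset of the inverted list of term i,
     * and entry T is the length in bits of the whole inverted index.
     */
    static long[] absolute(long[] stored) {
        long[] offsets = new long[stored.length];
        long current = 0;
        for (int i = 0; i < stored.length; i++) {
            current += stored[i];
            offsets[i] = current;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Made-up values for T = 3 terms.
        long[] offsets = absolute(new long[] { 0, 120, 45, 300 });
        for (int term = 0; term < offsets.length - 1; term++) {
            System.out.println("term " + term + ": bits [" + offsets[term] + ", " + offsets[term + 1] + ")");
        }
    }
}
```

Thanks to the final entry, the inverted list of term i always spans the bits from entry i (inclusive) to entry i+1 (exclusive), with no special case for the last term.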

basename.globcounts (not generated for payload-based indices)

For each term, the number of its occurrences throughout the whole document collection, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of occurrences of the term of index i.

basename.properties

A Java property file containing information about the index. Currently, the following keys (taken from it.unimi.dsi.mg4j.index.Index.PropertyKeys) are generated:

indexclass: the class used to generate the batch (presently, BitStreamIndexWriter);
documents: number of documents in the collection;
terms: number of indexed terms;
occurrences: number of words throughout the whole collection;
postings: number of postings (pairs term/document) throughout the whole collection;
maxdocsize: maximum size of a document in words;
termprocessor: the term processor (if any) used during the index construction;
coding: one or more items, each defining a key/value pair for the flag map of the index; each pair is of the form component:coding (see it.unimi.dsi.mg4j.index.CompressionFlags);
field: the name of the field that generated this batch (optional);
maxcount: the maximum count in the collection, that is, the maximum count of a term maximised on all terms and documents;
size: the index size in bits.
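As an illustration of how the keys above might be inspected, the sketch below parses a small property text with java.util.Properties. The class BatchProperties and all the values are made up for the example, and it is an assumption that the generated file uses plain key=value lines. Note that java.util.Properties keeps only the last value of a repeated key, so it is not suitable for reading multiple coding items.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: reads a few numeric keys from a batch property text. */
final class BatchProperties {
    /** Parses key=value lines into a Properties object. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Returns the value of a numeric key. */
    static long getLong(Properties properties, String key) {
        return Long.parseLong(properties.getProperty(key).trim());
    }

    /** Average document length in words: occurrences divided by documents. */
    static double averageDocumentSize(Properties properties) {
        return (double) getLong(properties, "occurrences") / getLong(properties, "documents");
    }

    public static void main(String[] args) {
        // All values below are invented for illustration.
        String text = "documents=1000\nterms=52341\noccurrences=250000\n"
                + "postings=180000\nmaxdocsize=912\nfield=title\n";
        Properties properties = parse(text);
        System.out.println("average document size: " + averageDocumentSize(properties));
    }
}
```

Comparing the documents value across the batches of different fields is one simple way to spot the emergency dumps mentioned earlier.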

basename.cluster.properties

A Java property file containing information about the set of batches seen as an it.unimi.dsi.mg4j.index.cluster.DocumentalCluster. The keys are the same as in the previous case, but additionally a number of localindex entries specify the basename of the batches, and a splitstrategy. After manually creating suitable term maps for each batch, you will be able to access the set of batches as a single index (but note that standard batches have no skip structure, and should not be used in production; if you intend to do so, you have to write a customised scanning procedure).
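When the batches are viewed as a documental cluster split at contiguous cutpoints, finding the batch that holds a given document amounts to a binary search. The sketch below is not the ContiguousDocumentalStrategy implementation: the class CutPointLookup is hypothetical, and it assumes the cutpoints are strictly increasing, start at 0, and end at the total number of documents.

```java
import java.util.Arrays;

/** Hypothetical lookup of the batch containing a document, given batch cutpoints. */
final class CutPointLookup {
    /**
     * Returns the batch k such that cutPoints[k] <= document < cutPoints[k + 1].
     * Assumes strictly increasing cutpoints (an assumption of this sketch).
     */
    static int batchOf(int[] cutPoints, int document) {
        if (document < cutPoints[0] || document >= cutPoints[cutPoints.length - 1]) {
            throw new IndexOutOfBoundsException("No batch contains document " + document);
        }
        int pos = Arrays.binarySearch(cutPoints, document);
        // An exact hit is the first document of batch pos; otherwise binarySearch
        // returns -(insertionPoint) - 1, and the batch is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        // Three batches: documents 0..999, 1000..1999 and 2000..2499.
        int[] cutPoints = { 0, 1000, 2000, 2500 };
        System.out.println(batchOf(cutPoints, 1500));
    }
}
```

Within the batch found this way, the document would then be addressed by its offset from the batch's first cutpoint.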

author:
Sebastiano Vigna
since:
1.0

Inner Class: public static enum IndexingType

Inner Class: public static interface VirtualDocumentFragment extends Serializable

Inner Class: protected static class PayloadAccumulator

Field Summary
final public static String	CLUSTER_PROPERTIES_EXTENSION The extension of the strategy for the cluster associated to a scan.
final public static int	DEFAULT_BATCH_SIZE The default batch size.
final public static int	DEFAULT_BUFFER_SIZE The default buffer size.
final public static int	DEFAULT_DELIMITER The default delimiter separating two documents read from standard input (a newline).
final public static int	DEFAULT_VIRTUAL_DOCUMENT_GAP The default virtual field gap.
protected int[]	currMaxPos The current maximum position for each document, if the field indexed is virtual.
final protected IntArrayList	cutPoints The cutpoints of the batches (for building later a it.unimi.dsi.mg4j.index.cluster.ContiguousDocumentalStrategy ).
final Map<Component, Coding>	flags The flag map for batches.
int	maxDocSize Maximum size in words of documents seen so far in the current batch.
final MutableString	nonWord
public boolean	outOfMemoryError If true, this class experienced an OutOfMemoryError during some buffer reallocation.
protected int	virtualDocumentGap The width of the artificial gap introduced between virtual-document fragments.
final MutableString	word

Constructor Summary
public	Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) Creates a new scanner instance.
public	Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) Creates a new scanner instance.
public	Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) Creates a new scanner instance.

Method Summary
protected static String	batchBasename(int batch, String basename, File batchDir) Returns the name of a batch. You can override this method if you prefer a different batch naming scheme.
public static void	cleanup(String basename, int batches, File batchDir) Cleans all intermediate files generated by a run of this class.
public void	close() Closes this pass, releasing all resources.
protected long	dumpBatch() Dumps the current batch on disk as an index.
public static DocumentSequence	getSequence(String sequenceName, Class factoryClass, String[] property, int delimiter, Logger logger) Returns the document sequence to be indexed.
public static void	main(String[] arg)
protected void	openSizeBitStream()
public static int[]	parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported)
public static int[]	parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory)
public static int[]	parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)
public static VirtualDocumentResolver[]	parseVirtualDocumentResolver(String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory)
public void	processDocument(int documentPointer, WordReader wordReader) Processes a document.
public static void	run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName) Runs in parallel a number of instances.
public static void	run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName) Runs in parallel a number of instances. This convenience method takes care of instantiating one instance per indexed field and of passing the right information to each instance.
public String	toString()

Field Detail

CLUSTER_PROPERTIES_EXTENSION
final public static String CLUSTER_PROPERTIES_EXTENSION
	The extension of the strategy for the cluster associated to a scan.

DEFAULT_BATCH_SIZE
final public static int DEFAULT_BATCH_SIZE
	The default batch size.

DEFAULT_BUFFER_SIZE
final public static int DEFAULT_BUFFER_SIZE
	The default buffer size.

DEFAULT_DELIMITER
final public static int DEFAULT_DELIMITER
	The default delimiter separating two documents read from standard input (a newline).

DEFAULT_VIRTUAL_DOCUMENT_GAP
final public static int DEFAULT_VIRTUAL_DOCUMENT_GAP
	The default virtual field gap.

currMaxPos
protected int[] currMaxPos
	The current maximum position for each document, if the field indexed is virtual.

cutPoints
final protected IntArrayList cutPoints
	The cutpoints of the batches (for building later a it.unimi.dsi.mg4j.index.cluster.ContiguousDocumentalStrategy).

flags
final Map<Component, Coding> flags
	The flag map for batches.

maxDocSize
int maxDocSize
	Maximum size in words of documents seen so far in the current batch.

nonWord
final MutableString nonWord

outOfMemoryError
public boolean outOfMemoryError
	If true, this class experienced an OutOfMemoryError during some buffer reallocation.

virtualDocumentGap
protected int virtualDocumentGap
	The width of the artificial gap introduced between virtual-document fragments.

word
final MutableString word

Constructor Detail

Scan
public Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException
	Creates a new scanner instance.
	Parameters:
		basename - the basename (usually a global filename followed by the field name, separated by a dash).
		field - the field to be indexed.
		termProcessor - the term processor for this index.
		documentsAreInOrder - if true, documents will be served in increasing order.
		bufferSize - the buffer size used in all I/O.
		builder - a builder used to create a compressed document collection on the fly.
		batchDir - a directory for batch files; batch names will be relativised to this directory if it is not `null`.
	Throws:
		FileNotFoundException

Scan
public Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException
	Creates a new scanner instance.
	Throws:
		FileNotFoundException

Scan
public Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException
	Creates a new scanner instance.
	Parameters:
		basename - the basename (usually a global filename followed by the field name, separated by a dash).
		field - the field to be indexed.
		termProcessor - the term processor for this index.
		indexingType - the type of indexing procedure.
		numVirtualDocs - the number of virtual documents that will be used, in case of a virtual index; otherwise, immaterial.
		virtualDocumentGap - the artificial gap introduced between virtual document fragments, in case of a virtual index; otherwise, immaterial.
		bufferSize - the buffer size used in all I/O.
		builder - a builder used to create a compressed document collection on the fly.
		batchDir - a directory for batch files; batch names will be relativised to this directory if it is not `null`.

Method Detail

batchBasename
protected static String batchBasename(int batch, String basename, File batchDir)
	Returns the name of a batch. You can override this method if you prefer a different batch naming scheme.
	Parameters:
		batch - the batch number.
		basename - the index basename.
		batchDir - if not `null`, a temporary directory for batches.
	Returns:
		simply `basename@batch`, if `batchDir` is `null`; otherwise, we relativise the name to `batchDir`.

cleanup
public static void cleanup(String basename, int batches, File batchDir) throws IOException
	Cleans all intermediate files generated by a run of this class.
	Parameters:
		basename - the basename of the run.
		batches - the number of generated batches.
		batchDir - if not `null`, a temporary directory where the batches are located.

close
public void close() throws ConfigurationException, IOException
	Closes this pass, releasing all resources.

dumpBatch
protected long dumpBatch() throws IOException, ConfigurationException
	Dumps the current batch on disk as an index.
	Returns:
		the number of occurrences contained in the batch.

getSequence
public static DocumentSequence getSequence(String sequenceName, Class factoryClass, String[] property, int delimiter, Logger logger) throws IllegalAccessException, InvocationTargetException, NoSuchMethodException, IOException, ClassNotFoundException, InstantiationException
	Returns the document sequence to be indexed.
	Parameters:
		sequenceName - the name of a serialised document sequence, or `null` for standard input.
		factoryClass - the class of the DocumentFactory that should be passed to the document sequence.
		property - an array of property strings to be used in the factory initialisation.
		delimiter - a delimiter in case we want to use standard input.
		logger - a logger.
	Returns:
		the document sequence to be indexed.

main
public static void main(String[] arg) throws JSAPException, InvocationTargetException, NoSuchMethodException, ConfigurationException, ClassNotFoundException, IOException, IllegalAccessException, InstantiationException

openSizeBitStream
protected void openSizeBitStream() throws FileNotFoundException

parseFieldNames
public static int[] parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported)

parseQualifiedSizes
public static int[] parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory) throws ParseException

parseVirtualDocumentGap
public static int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)

parseVirtualDocumentResolver
public static VirtualDocumentResolver[] parseVirtualDocumentResolver(String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory)

processDocument
public void processDocument(int documentPointer, WordReader wordReader) throws IOException
	Processes a document.
	Parameters:
		documentPointer - the integer pointer associated to the document.
		wordReader - the word reader associated to the document.

run
public static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName) throws ConfigurationException, IOException
	Runs in parallel a number of instances.

run
public static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName) throws ConfigurationException, IOException
	Runs in parallel a number of instances. This convenience method takes care of instantiating one instance per indexed field and of passing the right information to each instance. All options are common to all fields, except for the number of occurrences in a batch, which can be tuned for each field separately.
	Parameters:
		basename - the index basename.
		documentSequence - a document sequence.
		termProcessor - the term processor for this index.
		zipCollectionBasename - if not `null`, the basename of a new GZIP'd collection built using `documentSequence`.
		bufferSize - the buffer size used in all I/O.
		documentsPerBatch - the number of documents that we should try to put in each segment.
		indexedField - the fields that should be indexed, in increasing order.
		virtualDocumentResolver - the array of virtual document resolvers to be used, parallel to `indexedField`: it can safely contain anything (even `null`) in correspondence to non-virtual fields, and can safely be `null` if no fields are virtual.
		virtualGap - the array of virtual field gaps to be used, parallel to `indexedField`: it can safely contain anything in correspondence to non-virtual fields, and can safely be `null` if no fields are virtual.
		mapFile - the name of a file containing a map to be applied to document indices.
		logInterval - the minimum time interval between activity logs in milliseconds.
		tempDirName - a directory for temporary files.
	Throws:
		IOException
		ConfigurationException

toString
public String toString()

Methods inherited from java.lang.Object

native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException
