Java Doc for Scan.java (mg4j, package it.unimi.dsi.mg4j.tool)

java.lang.Object
   it.unimi.dsi.mg4j.tool.Scan

Scan
public class Scan
Scans a document sequence, dividing it into batches of occurrences and writing for each batch a corresponding subindex.

This class (more precisely, its run() method) reads a document sequence and produces several batches, that is, subindices corresponding to subsets of term/document pairs of the collection. A set of batches is generated for each indexed field of the collection. A main method invokes the above method, setting its parameters using suitable options.

Unless a serialised it.unimi.dsi.mg4j.document.DocumentSequence is specified using the suitable option, an implicit it.unimi.dsi.mg4j.document.InputStreamDocumentSequence is created using a separator byte (default is 10, i.e., newline). In the latter case, the factory and its properties can be set with command-line options.

The only mandatory argument is a basename, which will be used to stem the names of all files generated. The first batch of a field named field will use the basename basename-field@0, the second batch basename-field@1 and so on. It is also possible to specify a separate directory for batch files (e.g., for easier cleanup when they are no longer necessary).
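The naming scheme above can be sketched as follows. This is an illustration only, not the actual implementation: the class and method names are hypothetical, and the way the name is placed inside the batch directory (keeping only the last path component of the basename) is an assumption.

```java
import java.io.File;

public class BatchNames {
    /** Joins the index basename and the field name with a dash, e.g. "web-title". */
    static String fieldBasename(String basename, String field) {
        return basename + "-" + field;
    }

    /**
     * Builds the basename of a batch as basename@batch.
     * If batchDir is not null, the name is moved into that directory;
     * keeping only the last path component is an assumption of this sketch.
     */
    static String batchBasename(int batch, String basename, File batchDir) {
        String name = batchDir == null
                ? basename
                : new File(batchDir, new File(basename).getName()).toString();
        return name + "@" + batch;
    }

    public static void main(String[] args) {
        String perField = fieldBasename("web", "title");
        System.out.println(batchBasename(0, perField, null));
        System.out.println(batchBasename(1, perField, null));
        System.out.println(batchBasename(0, perField, new File("batches")));
    }
}
```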

Since documents are read sequentially, every document has a natural index starting from 0. If no remapping (i.e., renumbering) is specified, the document index of each document corresponds to its natural index. If, however, a remapping is specified, in the form of a list of integers, the document index of a document is the integer found in the corresponding position of the list. More precisely, a remapping for N documents is a list of N distinct integers, and a document with natural index i has document index given by the i-th element of the list. This is useful when indexing statically ranked documents (e.g., if you are indexing a part of the web and would like the index to return documents with higher static rank first). If the remapping file is provided, it must be a sequence of integers, written using the java.io.DataOutputStream.writeInt(int) method; if N is the number of documents, the file must contain exactly N distinct integers. The integers need not be between 0 and N-1, to allow the remapping of subindices (but a warning will be logged in this case, just to be sure you know what you're doing).
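As a minimal sketch of the remapping format described above, the following code serialises a list of integers with DataOutputStream.writeInt(int) and reads it back. The class and method names are hypothetical, and the two checks (distinctness and range) only mirror the constraints stated in the text; they are not taken from the MG4J sources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

public class RemapFile {
    /** Serialises the map as consecutive 4-byte integers written by writeInt(). */
    static byte[] encode(int[] map) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int documentIndex : map) out.writeInt(documentIndex);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads back data.length / 4 integers; element i is the document index of natural index i. */
    static int[] decode(byte[] data) {
        int[] map = new int[data.length / 4];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < map.length; i++) map[i] = in.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return map;
    }

    /** A remapping must consist of distinct integers. */
    static boolean allDistinct(int[] map) {
        Set<Integer> seen = new HashSet<>();
        for (int x : map) if (!seen.add(x)) return false;
        return true;
    }

    /** True if every value lies in [0, N); values outside that range are legal but unusual. */
    static boolean withinRange(int[] map) {
        for (int x : map) if (x < 0 || x >= map.length) return false;
        return true;
    }

    public static void main(String[] args) {
        int[] map = { 2, 0, 1 }; // natural index 0 becomes document index 2, and so on
        byte[] data = encode(map);
        System.out.println(data.length + " bytes, distinct: " + allDistinct(decode(data)));
    }
}
```

The resulting byte array can be stored with java.nio.file.Files.write(path, data) to obtain a file in the layout the text describes.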

Every term also has an associated number starting from 0, assigned in lexicographic order.
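A small sketch of such a numbering, under the assumption that the natural ordering of Java strings is an acceptable stand-in for the lexicographic order used by the indexer (the text does not say which comparator is used). The class name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

public class TermNumbering {
    /** Assigns 0, 1, 2, ... to the distinct terms, in sorted order. */
    static Map<String, Integer> number(Collection<String> terms) {
        Map<String, Integer> index = new LinkedHashMap<>();
        // TreeSet removes duplicates and iterates in the natural String order.
        for (String term : new TreeSet<>(terms)) index.put(term, index.size());
        return index;
    }

    public static void main(String[] args) {
        System.out.println(number(Arrays.asList("search", "engine", "index", "engine")));
    }
}
```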

Index types and indexing types

A standard index contains a list of terms, and for each term a posting list. Each posting necessarily contains a document pointer, and then, optionally, the count and the positions of the term (whether the last two elements appear can be specified using suitable flags).

The indexing type of a standard index can be IndexingType.STANDARD, IndexingType.REMAPPED or IndexingType.VIRTUAL. In the first case, we index the words occurring in documents as usual. In the second case, before writing the index all documents are renumbered following a provided map. In the third case (used only with it.unimi.dsi.mg4j.document.DocumentFactory.FieldType.VIRTUAL fields) indexing is performed on a virtual document obtained by collating a number of fragments. Fragments are associated to documents by some key, and a VirtualDocumentResolver turns a key into a document natural number, so that the collation process can take place (a settable gap is inserted between fragments).
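The collation of fragments with a gap can be pictured with the following sketch. It is only a model: the class is hypothetical, the resolver is replaced by a plain document number, and the exact position arithmetic used by Scan (for instance, whether the gap is added before or after a fragment) is an assumption.

```java
public class VirtualCollation {
    /** Next free position for each virtual document (compare the currMaxPos field). */
    private final int[] currMaxPos;
    /** Width of the artificial gap left between two fragments of the same document. */
    private final int gap;

    VirtualCollation(int numVirtualDocs, int gap) {
        this.currMaxPos = new int[numVirtualDocs];
        this.gap = gap;
    }

    /**
     * Appends a fragment of the given length to a virtual document and
     * returns the positions assigned to its words.
     */
    int[] append(int document, int fragmentLength) {
        int base = currMaxPos[document];
        int[] positions = new int[fragmentLength];
        for (int i = 0; i < fragmentLength; i++) positions[i] = base + i;
        // Leave a gap so that words of different fragments are never adjacent.
        currMaxPos[document] = base + fragmentLength + gap;
        return positions;
    }

    public static void main(String[] args) {
        VirtualCollation collation = new VirtualCollation(2, 64);
        System.out.println(java.util.Arrays.toString(collation.append(0, 3)));
        System.out.println(java.util.Arrays.toString(collation.append(0, 2)));
    }
}
```

Keeping fragments apart by a gap means that a phrase or proximity query cannot match across two unrelated fragments of the same virtual document.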

Besides storing document pointers, counts, and positions, MG4J makes it possible to store an arbitrary payload with each posting. This feature is presently used only to create payload-based indices, that is, indices without counts and positions that contain a single, dummy word #. They are actually used to store arbitrary data associated to each document, such as dates and integers: using a special syntax, it is then possible to specify range queries on the values of such fields.

The main difference between standard and payload-based indices is that the first type is handled by instances of this class, whereas the second type is handled by instances of Scan.PayloadAccumulator. The run() method creates a set of suitable instances, one for each indexed field, and feeds them in parallel with data from the appropriate field of the same document.

Batch subdivision and content

The scanning process uses a user-settable number of documents per batch, and will try to build batches containing exactly that number of documents (for all indexed fields). There are of course space constraints that could make building exact batches impossible, as the entire data of a batch must fit into core memory. If memory is too low, a batch will be generated with fewer documents than expected.
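In the ideal case, where memory never forces an early dump, the batch boundaries follow directly from the number of documents per batch. The sketch below computes them; the class is hypothetical, and the convention of listing the first document of every batch followed by the total number of documents is an assumption, not something stated for the cutPoints field.

```java
import java.util.ArrayList;
import java.util.List;

public class CutPoints {
    /**
     * Returns the boundaries of batches of docsPerBatch documents each
     * (the last batch may be smaller): batch i covers documents
     * [cuts.get(i), cuts.get(i + 1)).
     */
    static List<Integer> cutPoints(int numDocs, int docsPerBatch) {
        if (docsPerBatch <= 0) throw new IllegalArgumentException("docsPerBatch must be positive");
        List<Integer> cuts = new ArrayList<>();
        cuts.add(0);
        for (int start = 0; start < numDocs; start += docsPerBatch) {
            cuts.add(Math.min(start + docsPerBatch, numDocs));
        }
        return cuts;
    }

    public static void main(String[] args) {
        System.out.println(cutPoints(10, 4));
    }
}
```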

In some extreme cases, it could be impossible to cleanly produce a set of batches for all fields: in that case, emergency dumps will create fragmented batches: instead of a single batch containing k documents, a certain field will generate two separate batches. As a consequence, different fields will have a different number of batches, but a simple inspection of the property files (see below) will reveal the details of the emergency dumps (and Combine can be used to rebuild the desired exact batches, if necessary).

The larger the number of documents in a batch, the quicker index construction will be. Usually, a few experiments and a look at the logs are all that is needed to find good values for the Java virtual machine maximum memory setting and for the number of documents per batch.

These are the files currently generated for each batch (basename denotes the basename of the batch, not of the index):

basename.terms
For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i.
basename.terms.unsorted
The list of indexed terms in the same order in which they were met in the document collection. This list is not produced unless you ask for it explicitly with a suitable option.
basename.frequencies
For each term, the number of documents in which the term appears in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of documents in which the term of index i appears.
basename.sizes (not generated for payload-based indices)
For each indexed document, the corresponding size (=number of words) in γ coding. More precisely, the i-th integer of the file (starting from 0) is the size in words of the document of index i.
basename.index
The inverted index.
basename.offsets (not generated for payload-based indices)
For each term, the bit offset in basename.index at which its inverted list starts. More precisely, the first integer is the offset for term 0 in γ coding, and then the i-th integer is the difference between the i-th and the (i−1)-th offset in γ coding. If T terms were indexed, this file will contain T+1 integers, the last being the difference (in bits) between the length of the entire inverted index and the offset of the last inverted list.
basename.globcounts (not generated for payload-based indices)
For each term, the number of its occurrences throughout the whole document collection, in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of occurrences of the term of index i.
basename.properties
A Java property file containing information about the index. Currently, the following keys (taken from it.unimi.dsi.mg4j.index.Index.PropertyKeys) are generated:
indexclass
the class used to generate the batch (presently, BitStreamIndexWriter);
documents
number of documents in the collection;
terms
number of indexed terms;
occurrences
number of words throughout the whole collection;
postings
number of postings (pairs term/document) throughout the whole collection;
maxdocsize
maximum size of a document in words;
termprocessor
the term processor (if any) used during the index construction;
coding
one or more items, each defining a key/value pair for the flag map of the index; each pair is of the form component:coding (see it.unimi.dsi.mg4j.index.CompressionFlags);
field
the name of the field that generated this batch (optional);
maxcount
the maximum count in the collection, that is, the maximum count of a term maximised on all terms and documents;
size
the index size in bits;
basename.cluster.properties
A Java property file containing information about the set of batches seen as a it.unimi.dsi.mg4j.index.cluster.DocumentalCluster. The keys are the same as in the previous case, but additionally a number of localindex entries specify the basename of the batches, and a splitstrategy. After manually creating suitable term maps for each batch, you will be able to access the set of batches as a single index (but note that standard batches have no skip structure, and should not be used in production; if you intend to do so, you have to write a customised scanning procedure).
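Several of the files listed above store integers in γ coding, and the offsets file additionally stores differences rather than absolute values. The following sketch illustrates both ideas on strings of '0' and '1' characters. It uses the textbook Elias γ code applied to x+1 so that 0 can be represented; this shift, the class and its method names are assumptions made for illustration, and the exact bit-level conventions of the MG4J bit streams are not stated in this document.

```java
public class GammaOffsets {
    /** Elias gamma code of x + 1: as many zeros as the bit length minus one, then the binary digits. */
    static String gamma(long x) {
        long n = x + 1;
        int msb = 63 - Long.numberOfLeadingZeros(n);
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < msb; i++) bits.append('0');
        return bits.append(Long.toBinaryString(n)).toString();
    }

    /** Decodes count consecutive gamma-coded integers from a bit string. */
    static long[] decodeGammas(String bits, int count) {
        long[] out = new long[count];
        int pos = 0;
        for (int k = 0; k < count; k++) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') { zeros++; pos++; }
            out[k] = Long.parseLong(bits.substring(pos, pos + zeros + 1), 2) - 1;
            pos += zeros + 1;
        }
        return out;
    }

    /** Turns T absolute offsets plus the index length into the T + 1 stored differences. */
    static long[] toGaps(long[] offsets, long indexLengthInBits) {
        long[] gaps = new long[offsets.length + 1];
        long previous = 0;
        for (int i = 0; i < offsets.length; i++) {
            gaps[i] = offsets[i] - previous;
            previous = offsets[i];
        }
        gaps[offsets.length] = indexLengthInBits - previous;
        return gaps;
    }

    /** Prefix sums: recovers the T offsets, followed by the index length. */
    static long[] fromGaps(long[] gaps) {
        long[] absolute = new long[gaps.length];
        long sum = 0;
        for (int i = 0; i < gaps.length; i++) absolute[i] = sum += gaps[i];
        return absolute;
    }

    public static void main(String[] args) {
        long[] gaps = toGaps(new long[] { 0, 5, 12 }, 20);
        StringBuilder file = new StringBuilder();
        for (long g : gaps) file.append(gamma(g));
        long[] decoded = fromGaps(decodeGammas(file.toString(), gaps.length));
        System.out.println(java.util.Arrays.toString(decoded));
    }
}
```

Storing differences keeps the coded integers small, which is what makes a code such as γ, whose length grows with the logarithm of the value, compact for monotone sequences like offsets.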

author:
   Sebastiano Vigna
since:
   1.0

Inner Class: public static enum IndexingType
Inner Class: public static interface VirtualDocumentFragment extends Serializable
Inner Class: protected static class PayloadAccumulator

Field Summary
final public static  String CLUSTER_PROPERTIES_EXTENSION
     The extension of the strategy for the cluster associated to a scan.
final public static  int DEFAULT_BATCH_SIZE
     The default batch size.
final public static  int DEFAULT_BUFFER_SIZE
     The default buffer size.
final public static  int DEFAULT_DELIMITER
     The default delimiter separating two documents read from standard input (a newline).
final public static  int DEFAULT_VIRTUAL_DOCUMENT_GAP
     The default virtual field gap.
protected  int[] currMaxPos
     The current maximum position for each document, if the field indexed is virtual.
final protected  IntArrayList cutPoints
     The cutpoints of the batches (for building later a it.unimi.dsi.mg4j.index.cluster.ContiguousDocumentalStrategy).
final  Map<Component, Coding> flags
     The flag map for batches.
 int maxDocSize
     Maximum size in words of documents seen so far in the current batch.
final  MutableString nonWord
public  boolean outOfMemoryError
     If true, this class experienced an OutOfMemoryError during some buffer reallocation.
protected  int virtualDocumentGap
     The width of the artificial gap introduced between virtual-document fragments.
final  MutableString word

Constructor Summary
public  Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir)
     Creates a new scanner instance.
public  Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir)
     Creates a new scanner instance.
public  Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir)
     Creates a new scanner instance.

Method Summary
protected static  String batchBasename(int batch, String basename, File batchDir)
     Returns the name of a batch.

You can override this method if you prefer a different batch naming scheme.
Parameters:
  batch - the batch number.
Parameters:
  basename - the index basename.
Parameters:
  batchDir - if not null, a temporary directory for batches.

public static  void cleanup(String basename, int batches, File batchDir)
     Cleans all intermediate files generated by a run of this class.
public  void close()
     Closes this pass, releasing all resources.
protected  long dumpBatch()
     Dumps the current batch on disk as an index. Returns the number of occurrences contained in the batch.
public static  DocumentSequence getSequence(String sequenceName, Class factoryClass, String[] property, int delimiter, Logger logger)
     Returns the document sequence to be indexed.
Parameters:
  sequenceName - the name of a serialised document sequence, or null for standard input.
Parameters:
  factoryClass - the class of the DocumentFactory that should be passed to the document sequence.
Parameters:
  property - an array of property strings to be used in the factory initialisation.
Parameters:
  delimiter - a delimiter in case we want to use standard input.
Parameters:
  logger - a logger.
public static  void main(String[] arg)
protected  void openSizeBitStream()
public static  int[] parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported)
public static  int[] parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory)
public static  int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)
public static  VirtualDocumentResolver[] parseVirtualDocumentResolver(String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory)
public  void processDocument(int documentPointer, WordReader wordReader)
     Processes a document.
public static  void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName)
     Runs in parallel a number of instances.
public static  void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName)
     Runs in parallel a number of instances.

This convenience method takes care of instantiating one instance per indexed field, and of passing the right information to each instance.

public  String toString()

Field Detail
CLUSTER_PROPERTIES_EXTENSION
final public static String CLUSTER_PROPERTIES_EXTENSION
The extension of the strategy for the cluster associated to a scan.

DEFAULT_BATCH_SIZE
final public static int DEFAULT_BATCH_SIZE
The default batch size.

DEFAULT_BUFFER_SIZE
final public static int DEFAULT_BUFFER_SIZE
The default buffer size.

DEFAULT_DELIMITER
final public static int DEFAULT_DELIMITER
The default delimiter separating two documents read from standard input (a newline).

DEFAULT_VIRTUAL_DOCUMENT_GAP
final public static int DEFAULT_VIRTUAL_DOCUMENT_GAP
The default virtual field gap.

currMaxPos
protected int[] currMaxPos
The current maximum position for each document, if the field indexed is virtual.

cutPoints
final protected IntArrayList cutPoints
The cutpoints of the batches (for building later a it.unimi.dsi.mg4j.index.cluster.ContiguousDocumentalStrategy).

flags
final Map<Component, Coding> flags
The flag map for batches.

maxDocSize
int maxDocSize
Maximum size in words of documents seen so far in the current batch.

nonWord
final MutableString nonWord

outOfMemoryError
public boolean outOfMemoryError
If true, this class experienced an OutOfMemoryError during some buffer reallocation.

virtualDocumentGap
protected int virtualDocumentGap
The width of the artificial gap introduced between virtual-document fragments.

word
final MutableString word




Constructor Detail
Scan
public Scan(String basename, String field, TermProcessor termProcessor, boolean documentsAreInOrder, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException
Creates a new scanner instance.
Parameters:
  basename - the basename (usually a global filename followed by the field name, separated by a dash).
Parameters:
  field - the field to be indexed.
Parameters:
  termProcessor - the term processor for this index.
Parameters:
  documentsAreInOrder - if true, documents will be served in increasing order.
Parameters:
  bufferSize - the buffer size used in all I/O.
Parameters:
  builder - a builder used to create a compressed document collection on the fly.
Parameters:
  batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.
throws:
  FileNotFoundException -



Scan
public Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException
Creates a new scanner instance.
throws:
  FileNotFoundException -



Scan
public Scan(String basename, String field, TermProcessor termProcessor, IndexingType indexingType, int numVirtualDocs, int virtualDocumentGap, int bufferSize, ZipDocumentCollectionBuilder builder, File batchDir) throws FileNotFoundException
Creates a new scanner instance.
Parameters:
  basename - the basename (usually a global filename followed by the field name, separated by a dash).
Parameters:
  field - the field to be indexed.
Parameters:
  termProcessor - the term processor for this index.
Parameters:
  indexingType - the type of indexing procedure.
Parameters:
  numVirtualDocs - the number of virtual documents that will be used, in case of a virtual index; otherwise, immaterial.
Parameters:
  virtualDocumentGap - the artificial gap introduced between virtual-document fragments, in case of a virtual index; otherwise, immaterial.
Parameters:
  bufferSize - the buffer size used in all I/O.
Parameters:
  builder - a builder used to create a compressed document collection on the fly.
Parameters:
  batchDir - a directory for batch files; batch names will be relativised to this directory if it is not null.




Method Detail
batchBasename
protected static String batchBasename(int batch, String basename, File batchDir)
Returns the name of a batch.

You can override this method if you prefer a different batch naming scheme.
Parameters:
  batch - the batch number.
Parameters:
  basename - the index basename.
Parameters:
  batchDir - if not null, a temporary directory for batches.
Returns:
  simply basename@batch, if batchDir is null; otherwise, we relativise the name to batchDir.




cleanup
public static void cleanup(String basename, int batches, File batchDir) throws IOException
Cleans all intermediate files generated by a run of this class.
Parameters:
  basename - the basename of the run.
Parameters:
  batches - the number of generated batches.
Parameters:
  batchDir - if not null, a temporary directory where the batches are located.



close
public void close() throws ConfigurationException, IOException
Closes this pass, releasing all resources.



dumpBatch
protected long dumpBatch() throws IOException, ConfigurationException
Dumps the current batch on disk as an index.
Returns:
  the number of occurrences contained in the batch.



getSequence
public static DocumentSequence getSequence(String sequenceName, Class factoryClass, String[] property, int delimiter, Logger logger) throws IllegalAccessException, InvocationTargetException, NoSuchMethodException, IOException, ClassNotFoundException, InstantiationException
Returns the document sequence to be indexed.
Parameters:
  sequenceName - the name of a serialised document sequence, or null for standard input.
Parameters:
  factoryClass - the class of the DocumentFactory that should be passed to the document sequence.
Parameters:
  property - an array of property strings to be used in the factory initialisation.
Parameters:
  delimiter - a delimiter in case we want to use standard input.
Parameters:
  logger - a logger.
Returns:
  the document sequence to be indexed.



main
public static void main(String[] arg) throws JSAPException, InvocationTargetException, NoSuchMethodException, ConfigurationException, ClassNotFoundException, IOException, IllegalAccessException, InstantiationException

openSizeBitStream
protected void openSizeBitStream() throws FileNotFoundException

parseFieldNames
public static int[] parseFieldNames(String[] indexedFieldName, DocumentFactory factory, boolean allSupported)

parseQualifiedSizes
public static int[] parseQualifiedSizes(String[] qualifiedSizes, String defaultSize, int[] indexedField, DocumentFactory factory) throws ParseException

parseVirtualDocumentGap
public static int[] parseVirtualDocumentGap(String[] virtualDocumentGapSpec, int[] indexedField, DocumentFactory factory)

parseVirtualDocumentResolver
public static VirtualDocumentResolver[] parseVirtualDocumentResolver(String[] virtualDocumentSpec, int[] indexedField, DocumentFactory factory)



processDocument
public void processDocument(int documentPointer, WordReader wordReader) throws IOException
Processes a document.
Parameters:
  documentPointer - the integer pointer associated to the document.
Parameters:
  wordReader - the word reader associated to the document.



run
public static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, String renumberingFile, long logInterval, String tempDirName) throws ConfigurationException, IOException
Runs in parallel a number of instances.



run
public static void run(String basename, DocumentSequence documentSequence, TermProcessor termProcessor, String zipCollectionBasename, int bufferSize, int documentsPerBatch, int[] indexedField, VirtualDocumentResolver[] virtualDocumentResolver, int[] virtualGap, String mapFile, long logInterval, String tempDirName) throws ConfigurationException, IOException
Runs in parallel a number of instances.

This convenience method takes care of instantiating one instance per indexed field, and of passing the right information to each instance. All options are common to all fields, except for the number of occurrences in a batch, which can be tuned for each field separately.
Parameters:
  basename - the index basename.
Parameters:
  documentSequence - a document sequence.
Parameters:
  termProcessor - the term processor for this index.
Parameters:
  zipCollectionBasename - if not null, the basename of a new GZIP'd collection built using documentSequence.
Parameters:
  bufferSize - the buffer size used in all I/O.
Parameters:
  documentsPerBatch - the number of documents that we should try to put in each segment.
Parameters:
  indexedField - the fields that should be indexed, in increasing order.
Parameters:
  virtualDocumentResolver - the array of virtual document resolvers to be used, parallel to indexedField: it can safely contain anything (even null) in correspondence to non-virtual fields, and can safely be null if no fields are virtual.
Parameters:
  virtualGap - the array of virtual field gaps to be used, parallel to indexedField: it can safely contain anything in correspondence to non-virtual fields, and can safely be null if no fields are virtual.
Parameters:
  mapFile - the name of a file containing a map to be applied to document indices.
Parameters:
  logInterval - the minimum time interval between activity logs in milliseconds.
Parameters:
  tempDirName - a directory for temporary files.
throws:
  IOException -
throws:
  ConfigurationException -




toString
public String toString()



Methods inherited from java.lang.Object
native protected Object clone() throws CloneNotSupportedException
public boolean equals(Object obj)
protected void finalize() throws Throwable
final native public Class getClass()
native public int hashCode()
final native public void notify()
final native public void notifyAll()
public String toString()
final native public void wait(long timeout) throws InterruptedException
final public void wait(long timeout, int nanos) throws InterruptedException
final public void wait() throws InterruptedException