| java.lang.Object it.unimi.dsi.mg4j.index.DiskBasedIndex
DiskBasedIndex | public class DiskBasedIndex (Code) | | A static container providing facilities to load an index based on data stored on disk.
This class contains several useful static methods
such as
DiskBasedIndex.readOffsets(InputBitStream,int) and
DiskBasedIndex.readSizes(InputBitStream,int) ,
and static factory methods such as
DiskBasedIndex.getInstance(CharSequence,boolean,boolean,boolean,EnumMap) that take care of reading the properties associated with the index, identifying
the correct
it.unimi.dsi.mg4j.index.Index implementation that
should be used to load the index, and loading the necessary data into memory.
As an option, a disk-based index can be loaded into main memory (key:
Index.UriKeys.INMEMORY ), returning
an
it.unimi.dsi.mg4j.index.InMemoryIndex /
InMemoryHPIndex , or mapped into main memory (key:
Index.UriKeys.MAPPED ),
returning a
MemoryMappedIndex /
InMemoryHPIndex (note that the value assigned to the keys is irrelevant).
In both cases some insurmountable Java problems
prevent using indices whose size exceeds two gigabytes (but see
MemoryMappedIndex for
some elaboration on this topic).
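The key-based choice described above can be pictured with a minimal sketch. The enum UriKey below is a hypothetical stand-in for Index.UriKeys, the LoadMode values are invented for illustration, and the precedence given to INMEMORY when both keys are present is an assumption, not something this page states.

```java
import java.util.EnumMap;

class LoadModeSketch {
    // Hypothetical stand-in for Index.UriKeys; only the keys named in the text are modelled.
    enum UriKey { INMEMORY, MAPPED, OFFSETSTEP }

    // Invented labels for the three ways an index could be accessed.
    enum LoadMode { ON_DISK, IN_MEMORY, MEMORY_MAPPED }

    // Only the presence of a key matters; the associated value is ignored.
    static LoadMode loadMode(EnumMap<UriKey, String> queryProperties) {
        if (queryProperties == null) return LoadMode.ON_DISK;
        // Assumption: INMEMORY wins if both keys are present.
        if (queryProperties.containsKey(UriKey.INMEMORY)) return LoadMode.IN_MEMORY;
        if (queryProperties.containsKey(UriKey.MAPPED)) return LoadMode.MEMORY_MAPPED;
        return LoadMode.ON_DISK;
    }

    public static void main(String[] args) {
        EnumMap<UriKey, String> query = new EnumMap<>(UriKey.class);
        query.put(UriKey.MAPPED, "any value at all");
        System.out.println(loadMode(query));
    }
}
```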
Moreover, by default the
term-offset list is accessed using a
it.unimi.dsi.mg4j.util.SemiExternalOffsetList with a step of
DiskBasedIndex.DEFAULT_OFFSET_STEP . This behaviour can be changed using
the URI key
UriKeys.OFFSETSTEP .
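To give an intuition of what an offset step buys, here is a simplified sketch of a sampled offset list: every step-th absolute offset is kept in memory, and the remaining offsets are rebuilt by summing gaps from the nearest sample. This is an illustration of the general idea only; it is not the actual SemiExternalOffsetList code, and the plain gaps array merely stands in for whatever compressed representation the real class uses.

```java
class SampledOffsetsSketch {
    private final long[] gaps;    // gaps[i] = offsets[i] - offsets[i - 1]; stand-in for a compressed stream
    private final long[] samples; // samples[k] = offsets[k * step], the only absolute values kept
    private final int step;

    SampledOffsetsSketch(long[] offsets, int step) {
        if (step < 1) throw new IllegalArgumentException("step must be positive");
        this.step = step;
        gaps = new long[offsets.length];
        samples = new long[(offsets.length + step - 1) / step];
        long previous = 0;
        for (int i = 0; i < offsets.length; i++) {
            gaps[i] = offsets[i] - previous;
            previous = offsets[i];
            if (i % step == 0) samples[i / step] = offsets[i];
        }
    }

    // Starts from the nearest preceding sample and adds at most step - 1 gaps.
    long get(int index) {
        int block = index / step;
        long offset = samples[block];
        for (int i = block * step + 1; i <= index; i++) offset += gaps[i];
        return offset;
    }

    public static void main(String[] args) {
        long[] offsets = { 0, 5, 12, 40, 41, 100, 250 };
        SampledOffsetsSketch list = new SampledOffsetsSketch(offsets, 3);
        for (int i = 0; i < offsets.length; i++) System.out.println(i + " -> " + list.get(i));
    }
}
```

A larger step keeps fewer samples in memory at the cost of more work per lookup; a step of 1 degenerates into a fully in-memory list.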
Disk-based indices are the workhorse of MG4J. All other indices (clustered,
remote, etc.) ultimately rely on disk-based indices to provide results.
Note that not all data produced by
it.unimi.dsi.mg4j.tool.Scan and
by the other indexing utilities are actually necessary to run a disk-based
index. Usually the property file and the index file (plus the positions file,
for
high-performance indices) are sufficient. If one
needs random access, the offsets file must also be present, and if the
compression method requires document sizes or if sizes are requested explicitly,
the sizes file must also be present. A
StringMap and possibly a
PrefixMap will be fetched
automatically by
DiskBasedIndex.getInstance(CharSequence,boolean,boolean) using standard extensions.
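The file requirements just described can be summarised in a short sketch. The extension strings below are assumptions (this page names the constants but not their values), and the separatePositions flag is a hypothetical way of saying that the index keeps positions in a file of their own.

```java
import java.util.ArrayList;
import java.util.List;

class RequiredFilesSketch {
    // Assumed values for the *_EXTENSION constants; check the actual constants before relying on them.
    static final String PROPERTIES = ".properties";
    static final String INDEX = ".index";
    static final String POSITIONS = ".positions";
    static final String OFFSETS = ".offsets";
    static final String SIZES = ".sizes";

    // Lists the files that must exist next to the basename for the requested features.
    static List<String> requiredFiles(String basename, boolean separatePositions,
                                      boolean randomAccess, boolean needSizes) {
        List<String> files = new ArrayList<>();
        files.add(basename + PROPERTIES);                    // always needed
        files.add(basename + INDEX);                         // always needed
        if (separatePositions) files.add(basename + POSITIONS);
        if (randomAccess) files.add(basename + OFFSETS);     // random access needs term offsets
        if (needSizes) files.add(basename + SIZES);          // requested, or required by the compression method
        return files;
    }

    public static void main(String[] args) {
        System.out.println(requiredFiles("collection-text", false, true, false));
    }
}
```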
Thread safety
A disk-based index is thread safe as long as the offset list, the size list and
the term/prefix map are. The static factory methods provided by this class load
offsets and sizes using data structures that are thread safe. If you use a constructor
directly, instead, it is your responsibility to pass thread-safe data structures.
author: Sebastiano Vigna since: 1.1 |
Method Summary | |
public static BitStreamIndex | getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<UriKeys, String> queryProperties) Returns a new disk-based index, loading exactly the specified parts and using preloaded
Properties . | public static BitStreamIndex | getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<UriKeys, String> queryProperties) Returns a new disk-based index, using preloaded
Properties and possibly guessing reasonable term and prefix maps from the basename. | public static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<UriKeys, String> queryProperties) Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and,
in case it implements
PrefixMap , also as prefix map. | public static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps) Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and,
in case it implements
PrefixMap , also as prefix map. | public static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes) Returns a new disk-based index, guessing reasonable term and prefix maps from the basename. | public static BitStreamIndex | getInstance(CharSequence basename, boolean randomAccess) Returns a new local index, trying to guess reasonable term and prefix maps from the basename,
and loading document sizes only if it is necessary. | public static BitStreamIndex | getInstance(CharSequence basename) Returns a new local index, trying to guess reasonable term and prefix maps from the basename,
loading offsets but loading document sizes only if it is necessary. | public static PrefixMap<? extends CharSequence> | loadPrefixMap(String filename) Utility static method that loads a prefix map.
Parameters: filename - the name of the file containing the prefix map. | public static StringMap<? extends CharSequence> | loadStringMap(String filename) Utility static method that loads a term map.
Parameters: filename - the name of the file containing the term map. | public static LongList | readOffsets(InputBitStream in, int T) Utility method to load a compressed offset file into a list.
Parameters: in - the input bit stream providing the offsets (see BitStreamIndexWriter). Parameters: T - the number of terms indexed. | public static IntList | readSizes(InputBitStream in, int N) Utility method to load a compressed size file into a list.
Parameters: in - the input bit stream providing the sizes (see BitStreamIndexWriter). Parameters: N - the number of documents indexed. |
FREQUENCIES_EXTENSION | final public static String FREQUENCIES_EXTENSION(Code) | | Standard extension for the file of frequencies.
|
GLOBCOUNTS_EXTENSION | final public static String GLOBCOUNTS_EXTENSION(Code) | | Standard extension for the file of global counts.
|
INDEX_EXTENSION | final public static String INDEX_EXTENSION(Code) | | Standard extension for the index bitstream.
|
OFFSETS_EXTENSION | final public static String OFFSETS_EXTENSION(Code) | | Standard extension for the file of offsets.
|
POSITIONS_EXTENSION | final public static String POSITIONS_EXTENSION(Code) | | Standard extension for the positions bitstream of a
high-performance index.
|
PREFIXMAP_EXTENSION | final public static String PREFIXMAP_EXTENSION(Code) | | Standard extension for the prefix map.
|
PROPERTIES_EXTENSION | final public static String PROPERTIES_EXTENSION(Code) | | Standard extension for the index properties.
|
SIZES_EXTENSION | final public static String SIZES_EXTENSION(Code) | | Standard extension for the file of sizes.
|
STATS_EXTENSION | final public static String STATS_EXTENSION(Code) | | Standard extension for the stats file.
|
TERMMAP_EXTENSION | final public static String TERMMAP_EXTENSION(Code) | | Standard extension for the term map.
|
TERMS_EXTENSION | final public static String TERMS_EXTENSION(Code) | | Standard extension for the file of terms.
|
UNSORTED_TERMS_EXTENSION | final public static String UNSORTED_TERMS_EXTENSION(Code) | | Standard extension for the file of terms, unsorted.
|
getInstance | public static BitStreamIndex getInstance(CharSequence basename, Properties properties, StringMap<? extends CharSequence> termMap, PrefixMap<? extends CharSequence> prefixMap, boolean randomAccess, boolean documentSizes, EnumMap<UriKeys, String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException(Code) | | Returns a new disk-based index, loading exactly the specified parts and using preloaded
Properties .
Parameters: basename - the basename of the index. Parameters: properties - the properties obtained from the given basename. Parameters: termMap - the term map for this index, or null for no term map. Parameters: prefixMap - the prefix map for this index, or null for no prefix map. Parameters: randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index). Parameters: documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it). Parameters: queryProperties - a map containing associations between Index.UriKeys and values, or null . |
getInstance | public static BitStreamIndex getInstance(CharSequence basename, Properties properties, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<UriKeys, String> queryProperties) throws ClassNotFoundException, IOException, InstantiationException, IllegalAccessException(Code) | | Returns a new disk-based index, using preloaded
Properties and possibly guessing reasonable term and prefix maps from the basename.
Parameters: basename - the basename of the index. Parameters: properties - the properties obtained by stemming basename . Parameters: randomAccess - whether the index should be accessible randomly. Parameters: documentSizes - if true, document sizes will be loaded. Parameters: maps - if true, term and prefix maps will be guessed and loaded. Parameters: queryProperties - a map containing associations between Index.UriKeys and values, or null . throws: IllegalAccessException - throws: InstantiationException - See Also: DiskBasedIndex.getInstance(CharSequence,Properties,StringMap,PrefixMap,boolean,boolean,EnumMap) |
getInstance | public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps, EnumMap<UriKeys, String> queryProperties) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException(Code) | | Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and,
in case it implements
PrefixMap , also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap)
and, if it implements
StringMap and no term map has been found, we also use it as term map.
Parameters: basename - the basename of the index. Parameters: randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index). Parameters: documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it). Parameters: maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index). Parameters: queryProperties - a map containing associations between Index.UriKeys and values, or null . |
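The guessing of term and prefix maps can be sketched as follows. TermMapLike and PrefixMapLike are hypothetical stand-ins for StringMap and PrefixMap, and the two arguments represent whatever was deserialised from the .termmap and .prefixmap files (null when a file is missing). The sketch reads the description as meaning that a prefix map which is also a string map doubles as term map when no term map file exists; that reading is an assumption.

```java
class MapGuessSketch {
    interface TermMapLike {}   // hypothetical stand-in for StringMap
    interface PrefixMapLike {} // hypothetical stand-in for PrefixMap

    // The pair of maps eventually handed to the index; either may be null.
    static final class Maps {
        final TermMapLike termMap;
        final PrefixMapLike prefixMap;
        Maps(TermMapLike termMap, PrefixMapLike prefixMap) {
            this.termMap = termMap;
            this.prefixMap = prefixMap;
        }
    }

    static Maps guess(Object fromTermMapFile, Object fromPrefixMapFile) {
        TermMapLike termMap = null;
        PrefixMapLike prefixMap = null;
        if (fromTermMapFile instanceof TermMapLike) {
            termMap = (TermMapLike) fromTermMapFile;
            // A term map that is also a prefix map serves both purposes.
            if (fromTermMapFile instanceof PrefixMapLike) prefixMap = (PrefixMapLike) fromTermMapFile;
        }
        if (prefixMap == null && fromPrefixMapFile instanceof PrefixMapLike) {
            prefixMap = (PrefixMapLike) fromPrefixMapFile;
            // Assumed reading: with no term map file, reuse the prefix map as term map if it qualifies.
            if (termMap == null && fromPrefixMapFile instanceof TermMapLike) termMap = (TermMapLike) fromPrefixMapFile;
        }
        return new Maps(termMap, prefixMap);
    }

    public static void main(String[] args) {
        class Both implements TermMapLike, PrefixMapLike {}
        Maps maps = guess(new Both(), null);
        System.out.println(maps.termMap == maps.prefixMap);
    }
}
```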
getInstance | public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes, boolean maps) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException(Code) | | Returns a new disk-based index, possibly guessing reasonable term and prefix maps from the basename.
If there is a term map file (basename stemmed with .termmap), it is used as term map and,
in case it implements
PrefixMap , also as prefix map. Otherwise, we search for a prefix map (basename stemmed with .prefixmap)
and, if it implements
StringMap and no term map has been found, we also use it as term map.
Parameters: basename - the basename of the index. Parameters: randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index). Parameters: documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it). Parameters: maps - if true, term and prefix maps will be guessed and loaded (this feature might not be available with some kinds of index). See Also: DiskBasedIndex.getInstance(CharSequence,boolean,boolean,boolean,EnumMap) |
getInstance | public static BitStreamIndex getInstance(CharSequence basename, boolean randomAccess, boolean documentSizes) throws ConfigurationException, ClassNotFoundException, IOException, InstantiationException, IllegalAccessException(Code) | | Returns a new disk-based index, guessing reasonable term and prefix maps from the basename.
Parameters: basename - the basename of the index. Parameters: randomAccess - whether the index should be accessible randomly (e.g., if it will be possible to call IndexReader.documents(int) on the index readers returned by the index). Parameters: documentSizes - if true, document sizes will be loaded (note that sometimes document sizes might be loaded anyway because the compression method for positions requires it). |
loadStringMap | public static StringMap<? extends CharSequence> loadStringMap(String filename) throws IOException(Code) | | Utility static method that loads a term map.
Parameters: filename - the name of the file containing the term map. Returns: the map, or null if the file did not exist. throws: IOException - if some IOException (other than FileNotFoundException) occurred. |
readOffsets | public static LongList readOffsets(InputBitStream in, int T) throws IOException(Code) | | Utility method to load a compressed offset file into a list.
Parameters: in - the input bit stream providing the offsets (see BitStreamIndexWriter). Parameters: T - the number of terms indexed. Returns: a list of longs backed by an array; the list has an additional final element of index T that gives the number of bytes of the index file. |
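One practical consequence of the extra element of index T is that the extent of every term's data can be computed the same way, with no special case for the last term. The sketch below uses a plain long[] in place of the LongList returned by readOffsets, and the sample values are invented; extents come out in whatever unit the offsets are expressed in.

```java
class PostingExtentSketch {
    // Extent of the data for one term: distance from its offset to the next one.
    static long extent(long[] offsets, int term) {
        return offsets[term + 1] - offsets[term];
    }

    // Extents for all T terms, given T + 1 offsets.
    static long[] extents(long[] offsets) {
        long[] result = new long[offsets.length - 1];
        for (int term = 0; term < result.length; term++) result[term] = extent(offsets, term);
        return result;
    }

    public static void main(String[] args) {
        // Invented offsets for T = 4 terms; the fifth value marks the end of the index file.
        long[] offsets = { 0, 10, 25, 25, 40 };
        for (long e : extents(offsets)) System.out.println(e);
    }
}
```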
readSizes | public static IntList readSizes(InputBitStream in, int N) throws IOException(Code) | | Utility method to load a compressed size file into a list.
Parameters: in - the input bit stream providing the sizes (see BitStreamIndexWriter). Parameters: N - the number of documents indexed. Returns: a list of integers backed by an array. |
|
|