Lucene Search Engines
License: Apache Software License
URL: http://jakarta.apache.org/lucene/docs/index.html
Description: Jakarta Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Package Name    Comment
com.sleepycat.db
lucli Lucene Command Line Interface
net.sf.snowball Snowball system classes.
net.sf.snowball.ext Snowball generated stemmer classes.
org.apache.lucene Top-level package.
org.apache.lucene.analysis

API and code to convert text into indexable/searchable tokens. Covers {@link org.apache.lucene.analysis.Analyzer} and related classes.

Parsing? Tokenization? Analysis!

Lucene, an indexing and search library, accepts only plain text input.

Parsing

Applications that build their search capabilities upon Lucene may support documents in various formats – HTML, XML, PDF, Word – just to name a few. Lucene does not handle the parsing of these and other document formats; it is the responsibility of the application using Lucene to use an appropriate parser to convert the original format into plain text before passing that plain text to Lucene.

Tokenization

Plain text passed to Lucene for indexing goes through a process generally called tokenization – namely breaking the input text into small indexing elements – {@link org.apache.lucene.analysis.Token Tokens}. The way input text is broken into tokens largely dictates the search capabilities available over that text. For instance, sentence beginnings and endings can be identified to provide for more accurate phrase and proximity searches (though sentence identification is not provided by Lucene).

In some cases simply breaking the input text into tokens is not enough – a deeper Analysis is needed, providing for several functions, including (but not limited to):

  • Stemming – Replacing words with their stems. For instance with English stemming "bikes" is replaced by "bike"; now query "bike" can find both documents containing "bike" and those containing "bikes".
  • Stop Words Filtering – Common words like "the", "and" and "a" rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some "noise" and actually improve search quality.
  • Text Normalization – Stripping accents and other character markings can make for better searching.
  • Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.

Core Analysis

The analysis package provides the mechanism to convert Strings and Readers into tokens that can be indexed by Lucene. There are three main classes in the package from which all analysis processes are derived. These are:

  • {@link org.apache.lucene.analysis.Analyzer} – An Analyzer is responsible for building a {@link org.apache.lucene.analysis.TokenStream} which can be consumed by the indexing and searching processes. See below for more information on implementing your own Analyzer.
  • {@link org.apache.lucene.analysis.Tokenizer} – A Tokenizer is a {@link org.apache.lucene.analysis.TokenStream} and is responsible for breaking up incoming text into {@link org.apache.lucene.analysis.Token}s. In most cases, an Analyzer will use a Tokenizer as the first step in the analysis process.
  • {@link org.apache.lucene.analysis.TokenFilter} – A TokenFilter is also a {@link org.apache.lucene.analysis.TokenStream} and is responsible for modifying {@link org.apache.lucene.analysis.Token}s that have been created by the Tokenizer. Common modifications performed by a TokenFilter are: deletion, stemming, synonym injection, and down casing. Not all Analyzers require TokenFilters.

Hints, Tips and Traps

The synergy between {@link org.apache.lucene.analysis.Analyzer} and {@link org.apache.lucene.analysis.Tokenizer} is sometimes confusing. To ease this confusion, some clarifications:

  • The {@link org.apache.lucene.analysis.Analyzer} is responsible for the entire task of creating tokens out of the input text, while the {@link org.apache.lucene.analysis.Tokenizer} is only responsible for breaking the input text into tokens. Very likely, tokens created by the {@link org.apache.lucene.analysis.Tokenizer} would be modified or even omitted by the {@link org.apache.lucene.analysis.Analyzer} (via one or more {@link org.apache.lucene.analysis.TokenFilter}s) before being returned.
  • {@link org.apache.lucene.analysis.Tokenizer} is a {@link org.apache.lucene.analysis.TokenStream}, but {@link org.apache.lucene.analysis.Analyzer} is not.
  • {@link org.apache.lucene.analysis.Analyzer} is "field aware", but {@link org.apache.lucene.analysis.Tokenizer} is not.

Lucene Java provides a number of analysis capabilities, the most commonly used one being the {@link org.apache.lucene.analysis.standard.StandardAnalyzer}. Many applications will have a long and industrious life with nothing more than the StandardAnalyzer. However, there are a few other classes/packages that are worth mentioning:

  1. {@link org.apache.lucene.analysis.PerFieldAnalyzerWrapper} – Most Analyzers perform the same operation on all {@link org.apache.lucene.document.Field}s. The PerFieldAnalyzerWrapper can be used to associate a different Analyzer with different {@link org.apache.lucene.document.Field}s.
  2. The contrib/analyzers library located at the root of the Lucene distribution has a number of different Analyzer implementations to solve a variety of different problems related to searching. Many of the Analyzers are designed to analyze non-English languages.
  3. The {@link org.apache.lucene.analysis.snowball contrib/snowball library} located at the root of the Lucene distribution has Analyzer and TokenFilter implementations for a variety of Snowball stemmers. See http://snowball.tartarus.org for more information on Snowball stemmers.
  4. There are a variety of Tokenizer and TokenFilter implementations in this package. Take a look around, chances are someone has implemented what you need.

Analysis is one of the main causes of performance degradation during indexing. Simply put, the more you analyze the slower the indexing (in most cases). Perhaps your application would be just fine using the simple {@link org.apache.lucene.analysis.WhitespaceTokenizer} combined with a {@link org.apache.lucene.analysis.StopFilter}. The contrib/benchmark library can be useful for testing out the speed of the analysis process.

Invoking the Analyzer

Applications usually do not invoke analysis – Lucene does it for them:

  • At indexing, as a consequence of {@link org.apache.lucene.index.IndexWriter#addDocument(org.apache.lucene.document.Document) addDocument(doc)}, the Analyzer in effect for indexing is invoked for each indexed field of the added document.
  • At search, as a consequence of {@link org.apache.lucene.queryParser.QueryParser#parse(java.lang.String) QueryParser.parse(queryText)}, the QueryParser may invoke the Analyzer in effect. Note that for some queries analysis does not take place, e.g. wildcard queries.
However, an application might invoke analysis of any text, for testing or for any other purpose, with something like:
      Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
      TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
      Token t = ts.next();
      while (t != null) {
        System.out.println("token: " + t);
        t = ts.next();
      }
  

Indexing Analysis vs. Search Analysis

Selecting the "correct" analyzer is crucial for search quality, and can also affect indexing and search performance. The "correct" analyzer differs between applications. Lucene java's wiki page AnalysisParalysis provides some data on "analyzing your analyzer". Here are some rules of thumb:

  1. Test test test... (did we say test?)
  2. Beware of over-analysis – it might hurt indexing performance.
  3. Start with the same analyzer for indexing and search; otherwise searches will not find what they are supposed to...
  4. In some cases a different analyzer is required for indexing and search, for instance:
    • Certain searches require more stop words to be filtered (i.e. more than those that were filtered at indexing).
    • Query expansion by synonyms, acronyms, auto spell correction, etc.
    This might sometimes require a modified analyzer – see the next section on how to do that.

Implementing your own Analyzer

Creating your own Analyzer is straightforward. It usually involves either wrapping an existing Tokenizer and set of TokenFilters to create a new Analyzer or creating both the Analyzer and a Tokenizer or TokenFilter. Before pursuing this approach, you may find it worthwhile to explore the contrib/analyzers library and/or ask on the java-user@lucene.apache.org mailing list first to see if what you need already exists. If you are still committed to creating your own Analyzer or TokenStream derivation (Tokenizer or TokenFilter) have a look at the source code of any one of the many samples located in this package.

The following sections discuss some aspects of implementing your own analyzer.
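
As a starting point, here is a minimal sketch of a wrapping-style Analyzer assembled from existing parts. This is not the package's own example – the class name and stop word choice are illustrative – but each building block ({@link org.apache.lucene.analysis.WhitespaceTokenizer}, {@link org.apache.lucene.analysis.LowerCaseFilter}, {@link org.apache.lucene.analysis.StopFilter}) is from this package:

      import java.io.Reader;
      import org.apache.lucene.analysis.*;

      public class SimpleStopAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
          TokenStream ts = new WhitespaceTokenizer(reader);   // break on whitespace
          ts = new LowerCaseFilter(ts);                       // normalize case
          // StopAnalyzer.ENGLISH_STOP_WORDS is a String[] of common English stop words
          ts = new StopFilter(ts, StopAnalyzer.ENGLISH_STOP_WORDS);
          return ts;
        }
      }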

Field Section Boundaries

When {@link org.apache.lucene.document.Document#add(org.apache.lucene.document.Fieldable) document.add(field)} is called multiple times for the same field name, we could say that each such call creates a new section for that field in that document. In fact, a separate call to {@link org.apache.lucene.analysis.Analyzer#tokenStream(java.lang.String, java.io.Reader) tokenStream(field,reader)} would take place for each of these so called "sections". However, the default Analyzer behavior is to treat all these sections as one large section. This allows phrase search and proximity search to seamlessly cross boundaries between these "sections". In other words, if a certain field "f" is added like this:

      document.add(new Field("f","first ends",...);
      document.add(new Field("f","starts two",...);
      indexWriter.addDocument(document);
  
Then, a phrase search for "ends starts" would find that document. Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections", simply by overriding {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
      Analyzer myAnalyzer = new StandardAnalyzer() {
         public int getPositionIncrementGap(String fieldName) {
           return 10;
         }
      };
  

Token Position Increments

By default, all tokens created by Analyzers and Tokenizers have a {@link org.apache.lucene.analysis.Token#getPositionIncrement() position increment} of one. This means that the position stored for that token in the index would be one more than that of the previous token. Recall that phrase and proximity searches rely on position info.

If the selected analyzer filters the stop words "is" and "the", then for a document containing the string "blue is the sky", only the tokens "blue", "sky" are indexed, with position("sky") = 1 + position("blue"). Now, a phrase query "blue is the sky" would find that document, because the same analyzer filters the same stop words from that query. But also the phrase query "blue sky" would find that document.

If this behavior does not fit the application needs, a modified analyzer can be used that further increments the positions of tokens following a removed stop word, using {@link org.apache.lucene.analysis.Token#setPositionIncrement(int)}. This can be done with something like:

      public TokenStream tokenStream(final String fieldName, Reader reader) {
        final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
        TokenStream res = new TokenStream() {
          public Token next() throws IOException {
            int extraIncrement = 0;
            while (true) {
              Token t = ts.next();
              if (t!=null) {
                if (stopWords.contains(t.termText())) {
                  extraIncrement++; // filter this word
                  continue;
                } 
                if (extraIncrement>0) {
                  t.setPositionIncrement(t.getPositionIncrement()+extraIncrement);
                }
              }
              return t;
            }
          }
        };
        return res;
      }
   
Now, with this modified analyzer, the phrase query "blue sky" would find that document. But note that this is not yet a perfect solution, because any phrase query "blue w1 w2 sky" where both w1 and w2 are stop words would also match that document.

A few more use cases for modifying position increments are:

  1. Inhibiting phrase and proximity matches across sentence boundaries – for this, a tokenizer that identifies a new sentence can add 1 to the position increment of the first token of the new sentence.
  2. Injecting synonyms – here, synonyms of a token should be added after that token, with their position increment set to 0. As a result, all synonyms of a token are considered to appear in exactly the same position as that token, and that is how phrase and proximity searches see them (a sketch follows).
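
For use case 2, a hypothetical synonym-injecting TokenFilter might look like the following sketch (imports of java.io.IOException, java.util.*, and org.apache.lucene.analysis.* are assumed). The synonym map, term to String[] of synonyms, is an assumption of this example, not part of the package:

      public class SynonymFilter extends TokenFilter {
        private final Map synonyms;                 // String -> String[] (assumed)
        private final LinkedList pending = new LinkedList();

        public SynonymFilter(TokenStream in, Map synonyms) {
          super(in);
          this.synonyms = synonyms;
        }

        public Token next() throws IOException {
          if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();   // emit queued synonyms first
          }
          Token t = input.next();
          if (t != null) {
            String[] syns = (String[]) synonyms.get(t.termText());
            if (syns != null) {
              for (int i = 0; i < syns.length; i++) {
                Token syn = new Token(syns[i], t.startOffset(), t.endOffset());
                syn.setPositionIncrement(0);        // same position as the original
                pending.addLast(syn);
              }
            }
          }
          return t;
        }
      }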

org.apache.lucene.analysis.br Analyzer for Brazilian.
org.apache.lucene.analysis.cjk Analyzer for Chinese, Japanese and Korean.
org.apache.lucene.analysis.cn Analyzer for Chinese.
org.apache.lucene.analysis.cz Analyzer for Czech.
org.apache.lucene.analysis.de Analyzer for German.
org.apache.lucene.analysis.el Analyzer for Greek.
org.apache.lucene.analysis.fr Analyzer for French.
org.apache.lucene.analysis.ngram
org.apache.lucene.analysis.nl Analyzer for Dutch.
org.apache.lucene.analysis.payloads Provides various convenience classes for creating payloads on Tokens.
org.apache.lucene.analysis.ru Analyzer for Russian.
org.apache.lucene.analysis.sinks Implementations of the SinkTokenizer that might be useful.
org.apache.lucene.analysis.snowball {@link org.apache.lucene.analysis.TokenFilter} and {@link org.apache.lucene.analysis.Analyzer} implementations that use Snowball stemmers.
org.apache.lucene.analysis.standard A fast grammar-based tokenizer constructed with JFlex.
org.apache.lucene.analysis.th
org.apache.lucene.ant Ant task to create Lucene indexes.
org.apache.lucene.benchmark
org.apache.lucene.benchmark.byTask Benchmarking Lucene By Tasks.

This package provides "task based" performance benchmarking of Lucene. One can use the predefined benchmarks, or create new ones.

Contained packages:

Package Description
stats Statistics maintained when running benchmark tasks.
tasks Benchmark tasks.
feeds Sources for benchmark inputs: documents and queries.
utils Utilities used for the benchmark, and for the reports.
programmatic Sample performance test written programmatically.

Table Of Contents

  1. Benchmarking By Tasks
  2. How to use
  3. Benchmark "algorithm"
  4. Supported tasks/commands
  5. Benchmark properties
  6. Example input algorithm and the result benchmark report.
  7. Results record counting clarified

Benchmarking By Tasks

Benchmark Lucene using task primitives.

A benchmark is composed of some predefined tasks, allowing for creating an index, adding documents, optimizing, searching, generating reports, and more. A benchmark run takes an "algorithm" file that contains a description of the sequence of tasks making up the run, and some properties defining a few additional characteristics of the benchmark run.

How to use

The easiest way to run a benchmark is using the predefined ant task:

  • ant run-task
    - would run the micro-standard.alg "algorithm".
  • ant run-task -Dtask.alg=conf/compound-penalty.alg
    - would run the compound-penalty.alg "algorithm".
  • ant run-task -Dtask.alg=[full-path-to-your-alg-file]
    - would run your perf test "algorithm".
  • java org.apache.lucene.benchmark.byTask.programmatic.Sample
    - would run a performance test programmatically, without using an alg file. This is less readable and less convenient, but possible.

You may find existing tasks sufficient for defining the benchmark you need; otherwise, you can extend the framework to meet your needs, as explained herein.

Each benchmark run has a DocMaker and a QueryMaker. These two should usually match, so that "meaningful" queries are used for a certain collection. Properties set at the header of the alg file define which "makers" should be used. You can also specify your own makers, implementing the DocMaker and QueryMaker interfaces.

The benchmark .alg file contains the benchmark "algorithm". The syntax is described below. Within the algorithm, you can specify groups of commands, assign them names, specify commands that should be repeated, run commands in serial or in parallel, and also control the speed of "firing" the commands.

This allows, for instance, specifying that an index should be opened for update, documents should be added to it one by one but not faster than 20 docs a minute, and, in parallel with this, some N queries should be searched against that index, again, no more than 2 queries a second. You can have the searches all share an index reader, or have each of them open its own reader and close it afterwards.

If the commands available for use in the algorithm do not meet your needs, you can add commands by adding a new task under org.apache.lucene.benchmark.byTask.tasks - you should extend the PerfTask abstract class. Make sure that your new task class name is suffixed by Task. Assume you added the class "WonderfulTask" - doing so also enables the command "Wonderful" to be used in the algorithm.
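
For illustration, here is a sketch of such a task, assuming the PerfTask contract of a constructor taking PerfRunData and a doLogic() method returning the number of work items done (check the PerfTask source of your version for the exact signatures):

      package org.apache.lucene.benchmark.byTask.tasks;

      import org.apache.lucene.benchmark.byTask.PerfRunData;

      public class WonderfulTask extends PerfTask {
        public WonderfulTask(PerfRunData runData) {
          super(runData);
        }
        public int doLogic() throws Exception {
          // perform the actual "wonderful" work here; the returned value is
          // the record count credited to this task in the reports
          return 1;
        }
      }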

External classes: It is sometimes useful to invoke the benchmark package with your external alg file that configures the use of your own doc/query maker and/or HTML parser. You can work this out without modifying the benchmark package code, by passing your class path with the benchmark.ext.classpath property:

  • ant run-task -Dtask.alg=[full-path-to-your-alg-file] -Dbenchmark.ext.classpath=/mydir/classes -Dtask.mem=512M

Benchmark "algorithm"

The following is an informal description of the supported syntax.

  1. Measuring: When a command is executed, statistics for the elapsed execution time and memory consumption are collected. At any time, those statistics can be printed, using one of the available ReportTasks.
  2. Comments start with '#'.
  3. Serial sequences are enclosed within '{ }'.
  4. Parallel sequences are enclosed within '[ ]'.
  5. Sequence naming: To name a sequence, put '"name"' just after '{' or '['.
    Example - { "ManyAdds" AddDoc } : 1000000 - would name the sequence of 1M add docs "ManyAdds", and this name would later appear in statistic reports. If you don't specify a name for a sequence, it is given one: you can see it as the algorithm is printed just before benchmark execution starts.
  6. Repeating: To repeat sequence tasks N times, add ': N' just after the sequence closing tag - '}' or ']' or '>'.
    Example - [ AddDoc ] : 4 - would do 4 addDoc in parallel, spawning 4 threads at once.
    Example - [ AddDoc AddDoc ] : 4 - would do 8 addDoc in parallel, spawning 8 threads at once.
    Example - { AddDoc } : 30 - would do addDoc 30 times in a row.
    Example - { AddDoc AddDoc } : 30 - would do addDoc 60 times in a row.
    Exhaustive repeating: use * instead of a number to repeat exhaustively. This is sometimes useful for adding as many files as a doc maker can create, without iterating over the same file again, especially when the exact number of documents is not known in advance – for instance, TREC files extracted from a zip file. Note: when using this, you must also set doc.maker.forever to false.
    Example - { AddDoc } : * - would add docs until the doc maker is "exhausted".
  7. Command parameter: a command can optionally take a single parameter. If a command does not support a parameter, or if the parameter is of the wrong type, reading the algorithm will fail with an exception and the test will not start. Currently the following tasks take optional parameters:
    • AddDoc takes a numeric parameter, indicating the required size of the added document. Note: if the DocMaker implementation used in the test does not support makeDoc(size), an exception would be thrown and the test would fail.
    • DeleteDoc takes a numeric parameter, indicating the docid to be deleted. The latter is not very useful for loops, since the docid is fixed, so for deletion in loops it is better to use the doc.delete.step property.
    • SetProp takes a mandatory name,value parameter, with ',' used as the separator.
    • SearchTravRetTask and SearchTravTask take a numeric parameter, indicating the required traversal size.
    • SearchTravRetLoadFieldSelectorTask takes a string parameter: a comma separated list of Fields to load.

    Example - AddDoc(2000) - would add a document of size 2000 (~bytes).
    See conf/task-sample.alg for how this can be used, for instance, to check which is faster: adding many smaller documents, or fewer larger documents. Next candidates for supporting a parameter may be the Search tasks, for controlling the query size.
  8. Statistic recording elimination: a sequence can also end with '>', in which case child tasks would not store their statistics. This can be useful to avoid exploding stats data, for instance when adding 1M docs.
    Example - { "ManyAdds" AddDoc > : 1000000 - would add a million docs, measure that total, but not save stats for each addDoc.
    Notice that the granularity of System.currentTimeMillis() (which is used here) is system dependent, and on some systems an operation that takes 5 ms to complete may show 0 ms latency time in performance measurements. Therefore it is sometimes more accurate to look at the elapsed time of a larger sequence, as demonstrated here.
  9. Rate: To set a rate (ops/sec or ops/min) for a sequence, add ': N : R' just after the sequence closing tag. This would specify repetition of N with a rate of R operations/sec. Use 'R/sec' or 'R/min' to explicitly specify whether the rate is per second or per minute. The default is per second.
    Example - [ AddDoc ] : 400 : 3 - would do 400 addDoc in parallel, starting up to 3 threads per second.
    Example - { AddDoc } : 100 : 200/min - would do 100 addDoc serially, waiting before starting next add, if otherwise rate would exceed 200 adds/min.
  10. Command names: Each class "AnyNameTask" in the package org.apache.lucene.benchmark.byTask.tasks that extends PerfTask is supported as the command "AnyName", which can be used in the benchmark "algorithm" description. This makes it possible to add new commands by just adding such classes.

Supported tasks/commands

Existing tasks can be divided into a few groups: regular index/search work tasks, report tasks, and control tasks.

  1. Report tasks: There are a few Report commands for generating reports. Only task runs that were completed are reported. (The 'Report tasks' themselves are not measured and not reported.)
    • RepAll - all (completed) task runs.
    • RepSumByName - all statistics, aggregated by name. So, if AddDoc was executed 2000 times, only 1 report line would be created for it, aggregating all those 2000 statistic records.
    • RepSelectByPref   prefixWord - all records for tasks whose name starts with prefixWord.
    • RepSumByPref   prefixWord - all records for tasks whose name starts with prefixWord, aggregated by their full task name.
    • RepSumByNameRound - all statistics, aggregated by name and by Round. So, if AddDoc was executed 2000 times in each of 3 rounds, 3 report lines would be created for it, aggregating all those 2000 statistic records in each round. See more about rounds in the NewRound command description below.
    • RepSumByPrefRound   prefixWord - similar to RepSumByNameRound, just that only tasks whose name starts with prefixWord are included.
    If needed, additional reports can be added by extending the abstract class ReportTask, and by manipulating the statistics data in Points and TaskStats.
  2. Control tasks: A few of the tasks control the benchmark algorithm overall:
    • ClearStats - clears all statistics. Further reports would only include task runs that start after this call.
    • NewRound - virtually start a new round of performance test. Although this command can be placed anywhere, it mostly makes sense at the end of an outermost sequence.
      This increments a global "round counter". All task runs that would start now would record the new, updated round counter as their round number. This would appear in reports. In particular, see RepSumByNameRound above.
      An additional effect of NewRound is that numeric and boolean properties defined (at the head of the .alg file) as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, would advance (cyclically) to the next value. Note: this would also be reflected in the reports, in this case under a column that would be named "mrg".
    • ResetInputs - DocMaker and the various QueryMakers would reset their counters to the start. The way these Maker interfaces work, each call to makeDocument() or makeQuery() creates the next document or query that it "knows" to create. If that pool is "exhausted", the "maker" starts over again. The ResetInputs command therefore makes the rounds comparable, so it is useful to invoke ResetInputs together with NewRound.
    • ResetSystemErase - reset all index and input data and call gc. Does NOT reset statistics. This contains ResetInputs. All writers/readers are nullified, deleted, closed. The index is erased. The directory is erased. You would have to call CreateIndex once this was called...
    • ResetSystemSoft - reset all index and input data and call gc. Does NOT reset statistics. This contains ResetInputs. All writers/readers are nullified, closed. The index is NOT erased. The directory is NOT erased. This is useful for testing performance on an existing index, for instance if the construction of a large index took a very long time and now you would like to test its search or update performance.
  3. Other existing tasks are quite straightforward and are just briefly described here.
    • CreateIndex and OpenIndex both leave the index open for later update operations. CloseIndex would close it.
    • OpenReader, similarly, would leave an index reader open for later search operations. But this has further semantics. If a Read operation is performed and an open reader exists, it would be used. Otherwise, the read operation would open its own reader and close it when the read operation is done. This allows testing various scenarios - sharing a reader, searching with a "cold" reader, with a "warmed" reader, etc. The read operations affected by this are: Warm, Search, SearchTrav (search and traverse), and SearchTravRet (search and traverse and retrieve). Notice that each of the 3 search task types maintains its own queryMaker instance.

Benchmark properties

Properties are read from the header of the .alg file, and define several parameters of the performance test. As mentioned above for the NewRound task, numeric and boolean properties that are defined as a sequence of values, e.g. merge.factor=mrg:10:100:10:100, would advance (cyclically) to the next value when NewRound is called, and would also appear as a named column in the reports (the column name would be "mrg" in this example).

Some of the currently defined properties are:

  1. analyzer - full class name of the analyzer to use. The same analyzer would be used for the entire test.
  2. directory - which directory implementation to use for the performance test (the sample algorithm below uses FSDirectory).
  3. Index work parameters: Multi int/boolean values would be iterated with calls to NewRound. They would also be added as columns in the reports; the first string in the sequence is the column name. (Make sure it is no shorter than any value in the sequence.)
    • max.buffered
      Example: max.buffered=buf:10:10:100:100 - this would define using maxBufferedDocs of 10 in iterations 0 and 1, and 100 in iterations 2 and 3.
    • merge.factor - which merge factor to use.
    • compound - whether the index is using the compound format or not. Valid values are "true" and "false".

Here is a list of currently defined properties:

  1. Root directory for data and indexes:
    • work.dir (default is System property "benchmark.work.dir" or "work".)
  2. Docs and queries creation:
    • analyzer
    • doc.maker
    • doc.maker.forever
    • html.parser
    • doc.stored
    • doc.tokenized
    • doc.term.vector
    • doc.term.vector.positions
    • doc.term.vector.offsets
    • doc.store.body.bytes
    • docs.dir
    • query.maker
    • file.query.maker.file
    • file.query.maker.default.field
  3. Logging:
    • doc.add.log.step
    • doc.delete.log.step
    • log.queries
    • task.max.depth.log
    • doc.tokenize.log.step
  4. Index writing:
    • compound
    • merge.factor
    • max.buffered
    • directory
    • ram.flush.mb
    • autocommit
  5. Doc deletion:
    • doc.delete.step

For sample use of these properties see the *.alg files under conf.

Example input algorithm and the result benchmark report

The following example is in conf/sample.alg:

# --------------------------------------------------------
#
# Sample: what is the effect of doc size on indexing time?
#
# There are two parts in this test:
# - PopulateShort adds 2N documents of length  L
# - PopulateLong  adds  N documents of length 2L
# Which one would be faster?
# The comparison is done twice.
#
# --------------------------------------------------------

# -------------------------------------------------------------------------------------
# multi val params are iterated by NewRound's, added to reports, start with column name.
merge.factor=mrg:10:20
max.buffered=buf:100:1000
compound=true

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory

doc.stored=true
doc.tokenized=true
doc.term.vector=false
doc.add.log.step=500

docs.dir=reuters-out

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleDocMaker

query.maker=org.apache.lucene.benchmark.byTask.feeds.SimpleQueryMaker

# task at this depth or less would print when they start
task.max.depth.log=2

log.queries=false
# -------------------------------------------------------------------------------------
{

    { "PopulateShort"
        CreateIndex
        { AddDoc(4000) > : 20000
        Optimize
        CloseIndex
    >

    ResetSystemErase

    { "PopulateLong"
        CreateIndex
        { AddDoc(8000) > : 10000
        Optimize
        CloseIndex
    >

    ResetSystemErase

    NewRound

} : 2

RepSumByName
RepSelectByPref Populate

The command line for running this sample:
ant run-task -Dtask.alg=conf/sample.alg

The output report from running this test contains the following:

Operation     round mrg  buf   runCnt   recsPerRun        rec/s  elapsedSec    avgUsedMem    avgTotalMem
PopulateShort     0  10  100        1        20003        119.6      167.26    12,959,120     14,241,792
PopulateLong -  - 0  10  100 -  -   1 -  -   10003 -  -  - 74.3 -  - 134.57 -  17,085,208 -   20,635,648
PopulateShort     1  20 1000        1        20003        143.5      139.39    63,982,040     94,756,864
PopulateLong -  - 1  20 1000 -  -   1 -  -   10003 -  -  - 77.0 -  - 129.92 -  87,309,608 -  100,831,232

Results record counting clarified

Two columns in the results table indicate records counts: records-per-run and records-per-second. What does it mean?

Almost every task gets 1 in this count just for being executed. Task sequences aggregate the counts of their child tasks, plus their own count of 1. So, a task sequence containing 5 other task sequences, each running a single other task 10 times, would have a count of 1 + 5 * (1 + 10) = 56.

The traverse and retrieve tasks "count" more: a traverse task would add 1 for each traversed result (hit), and a retrieve task would additionally add 1 for each retrieved doc. So, regular Search would count 1, SearchTrav that traverses 10 hits would count 11, and a SearchTravRet task that retrieves (and traverses) 10, would count 21.

Confusing? This might help: always examine the elapsedSec column, and always compare "apples to apples", i.e. it is interesting to check how the rec/s changed for the same task (or sequence) between two different runs, but it is not very useful to know how the rec/s differs between the Search and SearchTrav tasks. For the latter, elapsedSec would bring more insight.

 
org.apache.lucene.benchmark.byTask.feeds Sources for benchmark inputs: documents and queries.
org.apache.lucene.benchmark.byTask.programmatic Sample performance test written programmatically - no algorithm file is needed here.
org.apache.lucene.benchmark.byTask.stats Statistics maintained when running benchmark tasks.
org.apache.lucene.benchmark.byTask.tasks Extendable benchmark tasks.
org.apache.lucene.benchmark.byTask.utils Utilities used for the benchmark, and for the reports.
org.apache.lucene.benchmark.quality

Search Quality Benchmarking.

This package makes it possible to benchmark the search quality of a Lucene application.

In order to use this package you should provide a set of quality queries, a judge that can tell which documents are relevant to each query, and a parser that turns quality queries into Lucene queries (all three appear in the sample below).

For benchmarking TREC collections with TREC QRels, take a look at the trec package.

Here is a sample code used to run the TREC 2006 queries 701-850 on the .Gov2 collection:

    File topicsFile = new File("topics-701-850.txt");
    File qrelsFile = new File("qrels-701-850.txt");
    Searcher searcher = new IndexSearcher("index");

    int maxResults = 1000;
    String docNameField = "docname"; 
    
    PrintWriter logger = new PrintWriter(System.out,true); 

    // use trec utilities to read trec topics into quality queries
    TrecTopicsReader qReader = new TrecTopicsReader();
    QualityQuery qqs[] = qReader.readQueries(new BufferedReader(new FileReader(topicsFile)));
    
    // prepare judge, with trec utilities that read from a QRels file
    Judge judge = new TrecJudge(new BufferedReader(new FileReader(qrelsFile)));
    
    // validate topics & judgments match each other
    judge.validateData(qqs, logger);
    
    // set the parsing of quality queries into Lucene queries.
    QualityQueryParser qqParser = new SimpleQQParser("title", "body");
    
    // run the benchmark
    QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
    SubmissionReport submitLog = null;
    QualityStats stats[] = qrun.execute(maxResults, judge, submitLog, logger);
    
    // print an average sum of the results
    QualityStats avg = QualityStats.average(stats);
    avg.log("SUMMARY",2,logger, "  ");

Some immediate ways to adapt this program to your needs are to swap in different topic and judgment files, a different QualityQueryParser, or your own Judge implementation.

org.apache.lucene.benchmark.quality.trec Utilities for Trec related quality benchmarking, feeding from Trec Topics and QRels inputs.
org.apache.lucene.benchmark.quality.utils Miscellaneous utilities for search quality benchmarking: query parsing, submission reports.
org.apache.lucene.benchmark.standard
org.apache.lucene.benchmark.stats
org.apache.lucene.benchmark.utils
org.apache.lucene.demo
org.apache.lucene.demo.html
org.apache.lucene.document

The logical representation of a {@link org.apache.lucene.document.Document} for indexing and searching.

The document package provides the user level logical representation of content to be indexed and searched. The package also provides utilities for working with {@link org.apache.lucene.document.Document}s and {@link org.apache.lucene.document.Fieldable}s.

Document and Fieldable

A {@link org.apache.lucene.document.Document} is a collection of {@link org.apache.lucene.document.Fieldable}s. A {@link org.apache.lucene.document.Fieldable} is a logical representation of a user's content that needs to be indexed or stored. {@link org.apache.lucene.document.Fieldable}s have a number of properties that tell Lucene how to treat the content (like indexed, tokenized, stored, etc.) See the {@link org.apache.lucene.document.Field} implementation of {@link org.apache.lucene.document.Fieldable} for specifics on these properties.

Note: it is common to refer to {@link org.apache.lucene.document.Document}s having {@link org.apache.lucene.document.Field}s, even though technically they have {@link org.apache.lucene.document.Fieldable}s.

Working with Documents

First and foremost, a {@link org.apache.lucene.document.Document} is something created by the user application. It is your job to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel or any other format). How this is done is completely up to you. That being said, there are many tools available in other projects that can ease the process of taking a file and converting it into a Lucene {@link org.apache.lucene.document.Document}. To see an example of this, take a look at the Lucene demo and the associated source code for extracting content from HTML.
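
For instance, once plain text has been extracted, building the Document itself is only a few lines. In this sketch, extractedTitle, extractedBody and writer (an open IndexWriter) are assumed to exist:

      Document doc = new Document();
      doc.add(new Field("title", extractedTitle, Field.Store.YES, Field.Index.TOKENIZED));
      doc.add(new Field("body", extractedBody, Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);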

The {@link org.apache.lucene.document.DateTools} and {@link org.apache.lucene.document.NumberTools} classes are utility classes to make dates, times and longs searchable (remember, Lucene only searches text).

The {@link org.apache.lucene.document.FieldSelector} class provides a mechanism to tell Lucene how to load Documents from storage. If no FieldSelector is used, all Fieldables on a Document will be loaded. As an example of the FieldSelector usage, consider the common use case of displaying search results on a web page and then having users click through to see the full document. In this scenario, it is often the case that there are many small fields and one or two large fields (containing the contents of the original file). Before the FieldSelector, the full Document had to be loaded, including the large fields, in order to display the results. Now, using the FieldSelector, one can {@link org.apache.lucene.document.FieldSelectorResult#LAZY_LOAD} the large fields, thus only loading the large fields when a user clicks on the actual link to view the original content.
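
A sketch of that scenario, using {@link org.apache.lucene.document.SetBasedFieldSelector} (the field names are illustrative; indexReader, docId, and java.util imports are assumed):

      // load the small fields eagerly, defer loading of the large "body" field
      Set eager = new HashSet(Arrays.asList(new String[] {"title", "url"}));
      Set lazy = new HashSet(Arrays.asList(new String[] {"body"}));
      FieldSelector selector = new SetBasedFieldSelector(eager, lazy);

      Document doc = indexReader.document(docId, selector);
      String title = doc.get("title");    // already loaded
      // the large field is only read from the index when actually accessed:
      String body = doc.getFieldable("body").stringValue();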

org.apache.lucene.index Code to maintain and access indices.
org.apache.lucene.index.memory High-performance single-document main memory Apache Lucene fulltext search index.
org.apache.lucene.index.store
org.apache.lucene.misc
org.apache.lucene.queryParser A simple query parser implemented with JavaCC.

Note that JavaCC defines lots of public classes, methods and fields that do not need to be public.  These clutter the documentation.  Sorry.

Note that because JavaCC defines a class named Token, org.apache.lucene.analysis.Token must always be fully qualified in source code in this package.
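
Typical usage of the generated parser is just a couple of lines (a minimal sketch; the field name and query text are illustrative):

      QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
      Query query = parser.parse("+lucene +search -spam");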

org.apache.lucene.queryParser.analyzing
org.apache.lucene.queryParser.precedence
org.apache.lucene.queryParser.surround.parser Surround parser package This package contains the QueryParser.jj source file for the Surround parser.

Parsing the text of a query results in a SrndQuery in the org.apache.lucene.queryParser.surround.query package.

org.apache.lucene.queryParser.surround.query Surround query package This package contains SrndQuery and its subclasses.

The parser in the org.apache.lucene.queryParser.surround.parser package normally generates a SrndQuery.

For searching an org.apache.lucene.search.Query is provided by the SrndQuery.makeLuceneQueryField method. For this, TermQuery, BooleanQuery and SpanQuery are used from Lucene.

org.apache.lucene.search Code to search indices.

Table Of Contents

  1. Search Basics
  2. The Query Classes
  3. Changing the Scoring

Search

Search over indices. Applications usually call {@link org.apache.lucene.search.Searcher#search(Query)} or {@link org.apache.lucene.search.Searcher#search(Query,Filter)}.

Query Classes

TermQuery

Of the various implementations of Query, the TermQuery is the easiest to understand and the most often used in applications. A TermQuery matches all the documents that contain the specified Term, which is a word that occurs in a certain Field. Thus, a TermQuery identifies and scores all Documents that have a Field with the specified string in it. Constructing a TermQuery is as simple as:

        TermQuery tq = new TermQuery(new Term("fieldName", "term"));
    
In this example, the Query identifies all Documents that have the Field named "fieldName" containing the word "term".

BooleanQuery

Things start to get interesting when one combines multiple TermQuery instances into a BooleanQuery. A BooleanQuery contains multiple BooleanClauses, where each clause contains a sub-query (Query instance) and an operator (from BooleanClause.Occur) describing how that sub-query is combined with the other clauses:

  1. SHOULD — Use this operator when a clause can occur in the result set, but is not required. If a query is made up of all SHOULD clauses, then every document in the result set matches at least one of these clauses.

  2. MUST — Use this operator when a clause is required to occur in the result set. Every document in the result set will match all such clauses.

  3. MUST NOT — Use this operator when a clause must not occur in the result set. No document in the result set will match any such clauses.

Boolean queries are constructed by adding two or more BooleanClause instances. If too many clauses are added, a TooManyClauses exception will be thrown during searching. This most often occurs when a Query is rewritten into a BooleanQuery with many TermQuery clauses, for example by WildcardQuery. The default setting for the maximum number of clauses is 1024, but this can be changed via the static method setMaxClauseCount in BooleanQuery.
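
For example, here is a sketch combining the three operators (the field and term values are illustrative):

        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term("contents", "lucene")), BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term("contents", "search")), BooleanClause.Occur.SHOULD);
        bq.add(new TermQuery(new Term("contents", "spam")), BooleanClause.Occur.MUST_NOT);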

Phrases

Another common search is to find documents containing certain phrases. This is handled two different ways:

  1. PhraseQuery — Matches a sequence of Terms. PhraseQuery uses a slop factor to determine how many positions may occur between any two terms in the phrase and still be considered a match. (A sketch follows this list.)

  2. SpanNearQuery — Matches a sequence of other SpanQuery instances. SpanNearQuery allows for much more complicated phrase queries since it is constructed from other SpanQuery instances, instead of only TermQuery instances.
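
A sketch of the PhraseQuery case (illustrative field and terms):

        // "apache lucene" with a slop of 1 also matches "apache <some word> lucene"
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("contents", "apache"));
        pq.add(new Term("contents", "lucene"));
        pq.setSlop(1);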

RangeQuery

The RangeQuery matches all documents containing terms that occur in the exclusive range between a lower Term and an upper Term. For example, one could find all documents that have terms beginning with the letters a through c. This type of Query is frequently used to find documents that occur in a specific date range.
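
A sketch (the date field and its yyyymmdd term encoding are illustrative; the boolean constructor argument controls whether the bounding terms themselves are included):

        RangeQuery rq = new RangeQuery(
            new Term("date", "20020101"),
            new Term("date", "20021231"),
            true); // true = bounds are inclusive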

PrefixQuery, WildcardQuery

While the PrefixQuery has a different implementation, it is essentially a special case of the WildcardQuery. The PrefixQuery allows an application to identify all documents with terms that begin with a certain string. The WildcardQuery generalizes this by allowing for the use of * (matches 0 or more characters) and ? (matches exactly one character) wildcards. Note that the WildcardQuery can be quite slow. Also note that WildcardQuery terms should not start with * or ?, as these are extremely slow. To remove this protection and allow a wildcard at the beginning of a term, see the method setAllowLeadingWildcard in QueryParser.
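
A sketch of both (illustrative field and patterns):

        PrefixQuery pq = new PrefixQuery(new Term("contents", "luc"));      // luc*
        WildcardQuery wq = new WildcardQuery(new Term("contents", "l*c?ne"));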

FuzzyQuery

A FuzzyQuery matches documents that contain terms similar to the specified term. Similarity is determined using Levenshtein (edit) distance. This type of query can be useful when accounting for spelling variations in the collection.
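
A sketch (the minimum similarity value of 0.5f is an illustrative choice):

        FuzzyQuery fq = new FuzzyQuery(new Term("contents", "lucene"), 0.5f);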

Changing Similarity

Chances are DefaultSimilarity is sufficient for all your searching needs. However, in some applications it may be necessary to customize your Similarity implementation. For instance, some applications do not need to distinguish between shorter and longer documents (see a "fair" similarity).

To change Similarity, one must do so for both indexing and searching, and the changes must happen before either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it just isn't well-defined what is going to happen.

To make this change, implement your own Similarity (likely you'll want to simply subclass DefaultSimilarity) and then use the new class by calling IndexWriter.setSimilarity before indexing and Searcher.setSimilarity before searching.
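
A minimal sketch of those mechanics, using the tf() override from use case 2 below as the customization (indexWriter and searcher are assumed to exist):

        Similarity flatTf = new DefaultSimilarity() {
          public float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f; // a match counts once, regardless of frequency
          }
        };
        indexWriter.setSimilarity(flatTf); // before indexing
        searcher.setSimilarity(flatTf);    // before searching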

If you are interested in use cases for changing your similarity, see the Overriding Similarity thread on the Lucene users' mailing list. In summary, here are a few use cases:

  1. SweetSpotSimilarity — gives small increases as the frequency increases a small amount and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is more significant.

  2. Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a matching term occurs. In these cases people have overridden Similarity to return 1 from the tf() method.

  3. Changing Length Normalization — By overriding lengthNorm, it is possible to discount how the length of a field contributes to a score. In DefaultSimilarity, lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be 1 / (numTerms in field), all fields will be treated "fairly".

In general, Chris Hostetter sums it up best in saying (from the Lucene users' mailing list):
[One would override the Similarity in] ... any situation where you know more about your data than just that it's "text" is a situation where it *might* make sense to override your Similarity method.

Changing Scoring — Expert Level

Changing scoring is an expert level task, so tread carefully and be prepared to share your code if you want help.

With the warning out of the way, it is possible to change a lot more than just the Similarity when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by three main classes:

  1. Query — The abstract object representation of the user's information need.
  2. Weight — The internal interface representation of the user's Query, so that Query objects may be reused.
  3. Scorer — An abstract class containing common functionality for scoring. Provides both scoring and explanation capabilities.
Details on each of these classes, and their children, can be found in the subsections below.

The Query Class

In some sense, the Query class is where it all begins. Without a Query, there would be nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it is often responsible for creating them or coordinating the functionality between them. The Query class has several methods that are important for derived classes:

  1. createWeight(Searcher searcher) — A Weight is the internal representation of the Query, so each Query implementation must provide an implementation of Weight. See the subsection on The Weight Interface below for details on implementing the Weight interface.
  2. rewrite(IndexReader reader) — Rewrites queries into primitive queries. Primitive queries are: TermQuery, BooleanQuery, OTHERS????

The Weight Interface

The Weight interface provides an internal representation of the Query so that it can be reused. Any Searcher dependent state should be stored in the Weight implementation, not in the Query class. The interface defines six methods that must be implemented:

  1. Weight#getQuery() — Pointer to the Query that this Weight represents.
  2. Weight#getValue() — The weight for this Query. For example, the TermQuery.TermWeight value is equal to idf^2 * boost * queryNorm.
  3. Weight#sumOfSquaredWeights() — The sum of squared weights. For TermQuery, this is (idf * boost)^2
  4. Weight#normalize(float) — Determine the query normalization factor. The query normalization may allow for comparing scores between queries.
  5. Weight#scorer(IndexReader) — Construct a new Scorer for this Weight. See The Scorer Class below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents given the Query.
  6. Weight#explain(IndexReader, int) — Provide a means for explaining why a given document was scored the way it was.

The Scorer Class

The Scorer abstract class provides common scoring functionality for all Scorer implementations and is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which must be implemented:

  1. Scorer#next() — Advances to the next document that matches this Query, returning true if and only if there is another document that matches.
  2. Scorer#doc() — Returns the id of the Document that contains the match. It is not valid until next() has been called at least once.
  3. Scorer#score() — Return the score of the current document. This value can be determined in any appropriate way for an application. For instance, the TermScorer returns the tf * Weight.getValue() * fieldNorm.
  4. Scorer#skipTo(int) — Skip ahead in the document matches to the document whose id is greater than or equal to the passed in value. In many instances, skipTo can be implemented more efficiently than simply looping through all the matching documents until the target document is identified.
  5. Scorer#explain(int) — Provides details on why the score came about.

Why would I want to add my own Query?

In a nutshell, you want to add your own custom Query implementation when you think that Lucene's existing ones aren't appropriate for the task that you want to do. You might be doing some cutting edge research or you may need more information back out of Lucene (similar to Doug adding SpanQuery functionality).

Examples

FILL IN HERE

org.apache.lucene.search.function
Programmatic control over document scores.
The function package provides tight control over document scores.
WARNING: The status of the search.function package is experimental. The APIs introduced here might change in the future and will not be supported anymore in such a case.
Two types of queries are available in this package:
  1. Custom Score queries - allowing you to set the score of a matching document as a mathematical expression over scores of that document by contained (sub) queries.
  2. Field score queries - allowing you to base the score of a document on numeric values of indexed fields.
 
Some possible uses of these queries:
  1. Normalizing the document scores by values indexed in a special field - for instance, experimenting with a different doc length normalization.
  2. Introducing some static scoring element into the score of a document - for instance using some topological attribute of the links to/from a document.
  3. Computing the score of a matching document as an arbitrary odd function of its score by a certain query.
Performance and Quality Considerations:
  1. When scoring by values of indexed fields, these values are loaded into memory. Unlike the regular scoring, where the required information is read from disk as necessary, here field values are loaded once and cached by Lucene in memory for further use, anticipating reuse by further queries. While all this is carefully cached with performance in mind, it is recommended to use these features only when the default Lucene scoring does not match your "special" application needs.
  2. Use only with carefully selected fields, because in most cases, search quality with regular Lucene scoring would outperform that of scoring by field values.
  3. Values of fields used for scoring should match. Do not apply on a field containing arbitrary (long) text. Do not mix values in the same field if that field is used for scoring.
  4. Smaller (shorter) field tokens mean less RAM (something always desired). When using FieldScoreQuery, select the shortest FieldScoreQuery.Type that is sufficient for the used field values.
  5. Reusing IndexReaders/IndexSearchers is essential, because the caching of field tokens is based on an IndexReader. Whenever a new IndexReader is used, values currently in the cache cannot be used and new values must be loaded from disk. So replace/refresh readers/searchers in a controlled manner.
History and Credits:
  • A large part of the code of this package was originated from Yonik's FunctionQuery code that was imported from Solr (see LUCENE-446).
  • The idea behind CustomScoreQuery is borrowed from the "Easily create queries that transform sub-query scores arbitrarily" contribution by Mike Klaas (see LUCENE-850), though the implementation and API here are different.
Code sample:

Note: code snippets here should work, but they were never really compiled... so, tests sources under TestCustomScoreQuery, TestFieldScoreQuery and TestOrdValues may also be useful.

  1. Using field (byte) values as scores:

    Indexing:

          f = new Field("score", "7", Field.Store.NO, Field.Index.UN_TOKENIZED);
          f.setOmitNorms(true);
          d1.add(f);
        

    Search:

          Query q = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
        
    Document d1 above would get a score of 7.
  2. Manipulating scores

    Dividing the original score of each document by a square root of its docid (just to demonstrate what it takes to manipulate scores this way)

          Query q = queryParser.parse("my query text");
          CustomScoreQuery customQ = new CustomScoreQuery(q) {
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
              return subQueryScore / (float) Math.sqrt(doc);
            }
          };
        

    For more informative debug info on the custom query, also override the name() method:

          CustomScoreQuery customQ = new CustomScoreQuery(q) {
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
              return subQueryScore / (float) Math.sqrt(doc);
            }
            public String name() {
              return "1/sqrt(docid)";
            }
          };
        

    Taking the square root of the original score and multiplying it by a "short field driven score", i.e., the short value that was indexed for the scored doc in a certain field:

          Query q = queryParser.parse("my query text");
          FieldScoreQuery qf = new FieldScoreQuery("shortScore", FieldScoreQuery.Type.SHORT);
          CustomScoreQuery customQ = new CustomScoreQuery(q,qf) {
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
              return (float) Math.sqrt(subQueryScore) * valSrcScore;
            }
            public String name() {
              return "shortVal*sqrt(score)";
            }
          };
        
org.apache.lucene.search.highlight The highlight package contains classes to provide "keyword in context" features, typically used to highlight search terms in the text of results pages. The Highlighter class is the central component and can be used to extract the most interesting sections of a piece of text and highlight them, with the help of the Fragmenter, FragmentScorer, and Formatter classes.

Example Usage

  //... Above, create documents with two fields, one with term vectors (tv) and one without (notv)
  IndexSearcher searcher = new IndexSearcher(directory);
  QueryParser parser = new QueryParser("notv", analyzer);
  Query query = parser.parse("million");
  //query = query.rewrite(reader); //required to expand search terms
  Hits hits = searcher.search(query);

  SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
  Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
  for (int i = 0; i < Math.min(10, hits.length()); i++) {
    String text = hits.doc(i).get("notv");
    TokenStream tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), hits.id(i), "notv", analyzer);
    TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);//highlighter.getBestFragments(tokenStream, text, 3, "...");
    for (int j = 0; j < frag.length; j++) {
      if ((frag[j] != null) && (frag[j].getScore() > 0)) {
        System.out.println((frag[j].toString()));
      }
    }
    //Term vector
    text = hits.doc(i).get("tv");
    tokenStream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), hits.id(i), "tv", analyzer);
    frag = highlighter.getBestTextFragments(tokenStream, text, false, 10);
    for (int j = 0; j < frag.length; j++) {
      if ((frag[j] != null) && (frag[j].getScore() > 0)) {
        System.out.println((frag[j].toString()));
      }
    }
    System.out.println("-------------");
  }

New features 06/02/2005

This release adds options for encoding (thanks to Nicko Cadell). An Encoder implementation such as the new SimpleHTMLEncoder class can be passed to the highlighter to encode non-XHTML-safe characters such as & into legal values. This simple class may not suffice for some languages; Commons Lang has an implementation that could be used: escapeHtml(String) in http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/lang/trunk/src/java/org/apache/commons/lang/StringEscapeUtils.java?rev=137958&view=markup
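
A minimal sketch of wiring an Encoder into the highlighter, via the constructor taking a Formatter, an Encoder, and a Scorer:

  SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
  SimpleHTMLEncoder encoder = new SimpleHTMLEncoder(); // escapes &, <, > and quotes in fragment text
  Highlighter highlighter = new Highlighter(formatter, encoder, new QueryScorer(query));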

New features 22/12/2004

This release adds some new capabilities:
  1. Faster highlighting using Term vector support
  2. New formatting options to use color intensity to show informational value
  3. Options for better summarization by using term IDF scores to influence fragment selection

The highlighter takes a TokenStream as input. Until now these streams have typically been produced using an Analyzer, but the new class TokenSources provides helper methods for obtaining TokenStreams from the new TermVector position support (see latest CVS version).
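
For example, a sketch of rebuilding a TokenStream from stored term vectors (assumes the "tv" field was indexed with term vector positions; reader and docId are illustrative):

  TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, "tv");
  TokenStream tokenStream = TokenSources.getTokenStream(tpv);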

The new class GradientFormatter can use a scale of colors to highlight terms according to their score. A subtle use of color can help emphasize the reasons for matching (useful when doing "MoreLikeThis" queries and you want to see the basis of the similarities).

The QueryScorer class has a new constructor which can use an IndexReader to derive the IDF (inverse document frequency) for each term in order to influence the score. This is useful for extracting the most significant sections of a document and for supplying scores used by the new GradientFormatter to color significant words more strongly. The QueryScorer.getMaxWeight method is useful when passed to the GradientFormatter constructor to define the top score which is associated with the top color.
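
A sketch of this combination (the max-weight accessor is named getMaxTermWeight() in recent releases; the colors are illustrative):

  QueryScorer scorer = new QueryScorer(query, reader, "notv"); // IDF-weighted term scores
  GradientFormatter formatter = new GradientFormatter(
      scorer.getMaxTermWeight(),  // top score, mapped to the strongest color
      "#000000", "#FF0000",       // foreground gradient: black to red
      null, null);                // no background gradient
  Highlighter highlighter = new Highlighter(formatter, scorer);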

org.apache.lucene.search.payloads
The payloads package provides Query mechanisms for finding and using payloads. The following Query implementations are provided:
  1. BoostingTermQuery -- Boost a term's score based on the value of the payload located at that term
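
A minimal sketch of payload-based scoring (assumes a single boost byte was written as a payload at index time; the Similarity override is illustrative):

  Query q = new BoostingTermQuery(new Term("body", "lucene"));
  searcher.setSimilarity(new DefaultSimilarity() {
    public float scorePayload(String fieldName, byte[] payload, int offset, int length) {
      return payload[offset]; // interpret the payload byte as a per-occurrence boost
    }
  });
  Hits hits = searcher.search(q);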
 
org.apache.lucene.search.regex Regular expression Query.
org.apache.lucene.search.similar Document similarity query generators.
org.apache.lucene.search.spans The calculus of spans.

A span is a <doc,startPosition,endPosition> tuple.

The following span query operators are implemented:

  • A SpanTermQuery matches all spans containing a particular Term.
  • A SpanNearQuery matches spans which occur near one another, and can be used to implement things like phrase search (when constructed from SpanTermQueries) and inter-phrase proximity (when constructed from other SpanNearQueries).
  • A SpanOrQuery merges spans from a number of other SpanQueries.
  • A SpanNotQuery removes spans matching one SpanQuery which overlap another. This can be used, e.g., to implement within-paragraph search.
  • A SpanFirstQuery matches spans matching q whose end position is less than n. This can be used to constrain matches to the first part of the document.
In all cases, output spans are minimally inclusive. In other words, a span formed by matching a span in x and y starts at the lesser of the two starts and ends at the greater of the two ends.

For example, a span query which matches "John Kerry" within ten words of "George Bush" within the first 100 words of the document could be constructed with:

SpanQuery john   = new SpanTermQuery(new Term("content", "john"));
SpanQuery kerry  = new SpanTermQuery(new Term("content", "kerry"));
SpanQuery george = new SpanTermQuery(new Term("content", "george"));
SpanQuery bush   = new SpanTermQuery(new Term("content", "bush"));

SpanQuery johnKerry =
   new SpanNearQuery(new SpanQuery[] {john, kerry}, 0, true);

SpanQuery georgeBush =
   new SpanNearQuery(new SpanQuery[] {george, bush}, 0, true);

SpanQuery johnKerryNearGeorgeBush =
   new SpanNearQuery(new SpanQuery[] {johnKerry, georgeBush}, 10, false);

SpanQuery johnKerryNearGeorgeBushAtStart =
   new SpanFirstQuery(johnKerryNearGeorgeBush, 100);

Span queries may be freely intermixed with other Lucene queries. So, for example, the above query can be restricted to documents which also use the word "iraq" with:

BooleanQuery query = new BooleanQuery();
query.add(johnKerryNearGeorgeBushAtStart, BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "iraq")), BooleanClause.Occur.MUST);
org.apache.lucene.search.spell Suggest alternate spellings for words. Also see the spell checker Wiki page.
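
For example, a minimal sketch of suggesting spellings (spellIndexDirectory and indexReader are illustrative):

  SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
  spellChecker.indexDictionary(new LuceneDictionary(indexReader, "contents")); // build from an indexed field
  String[] suggestions = spellChecker.suggestSimilar("lucen", 5);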
org.apache.lucene.store Binary i/o API, used for all index data.
org.apache.lucene.store.db
org.apache.lucene.store.je
org.apache.lucene.swing.models Decorators for JTable TableModel and JList ListModel encapsulating Lucene indexing and searching functionality.
org.apache.lucene.util Some utility classes.
org.apache.lucene.wikipedia.analysis
org.apache.lucene.wordnet WordNet Lucene Synonyms Integration. This package uses synonyms defined by WordNet to build a Lucene index that stores them, which in turn can be used for query expansion. You normally run {@link org.apache.lucene.wordnet.Syns2Index} once to build the synonym index/"database", and then call {@link org.apache.lucene.wordnet.SynExpand#expand SynExpand.expand(...)} to expand a query.

Instructions

  1. Download the WordNet prolog database, then gunzip and untar it.
  2. Invoke Syns2Index to build the synonym index. It takes two arguments: the path to wn_s.pl from the WordNet download, and the index name.
  3. Update your UI so that, where appropriate, you call SynExpand.expand(...) to expand user queries with synonyms, as sketched below.
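
A minimal sketch of the expansion call at search time (the synonym index path and analyzer are illustrative):

  Searcher synonyms = new IndexSearcher("synIndex"); // the index built by Syns2Index
  Query expanded = SynExpand.expand("big dog", synonyms, new StandardAnalyzer(), "contents", 0.9f);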
org.apache.lucene.xmlparser
org.apache.lucene.xmlparser.builders
org.apache.regexp This package exists to allow access to useful package protected data within Jakarta Regexp. This data has now been opened up with an accessor, but an official release with that change has not been made to date.