DocSearcher is a search tool.


Developer Info

This is a Java (Swing) Application.

To extract the source, unzip the source.jar file:
      jar -xf source.jar

Notes:
      The main class is DocSearch (DocSearch.java).
      The method that performs indexing is createNewIndex.
      Other classes of interest are the wrapper objects for the various file types:

  • WordProps: works with the POI HDF API
  • ExcelProps: works with the POI HSSF API
  • PdfToText: works with the PDFBox API
  • RtfToText: uses the javax.swing.text.rtf API
  • OoToText: uses the java.util.zip API to unzip StarOffice / OpenOffice documents and extract the content XML file
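
The approach described for OoToText can be sketched roughly as follows. This is not the actual OoToText source: the class name OoContentSketch and its methods are made up for illustration, and the sketch assumes the document archive holds its text in an entry named content.xml encoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; not the real OoToText class.
final class OoContentSketch {

    /** Returns the raw content.xml text, or null if the archive has no such entry. */
    static String readContentXml(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            // Assumption: the document body lives in an entry named content.xml.
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Crude text extraction: replaces every tag with a space and collapses whitespace.
     *  XML entities such as &amp; are left undecoded. */
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

A real extractor would more likely run the XML through a parser so that entities and metadata are handled properly; the regular expression is only there to keep the sketch short.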

DocSearcher creates and stores its indexes and all related files in the ".docSearcher" folder underneath the user's home directory. On a Linux system this might be:
/home/john/.docSearcher
and on a Windows system it might be something like:
C:\Documents and Settings\username\.docSearcher
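
If you need to locate that folder from your own code, something like the following should work. It is only a sketch: the class DocSearcherPaths and its method names are hypothetical, and it assumes the folder sits directly under the directory reported by the standard "user.home" system property. The indexes subfolder is the one mentioned further down this page.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of the DocSearcher source.
final class DocSearcherPaths {

    /** Builds <home>/.docSearcher for the given home directory. */
    static File workingDir(String userHome) {
        return new File(userHome, ".docSearcher");
    }

    /** Builds <home>/.docSearcher/indexes, where the Lucene indexes are kept. */
    static File indexesDir(String userHome) {
        return new File(workingDir(userHome), "indexes");
    }

    public static void main(String[] args) {
        // user.home is a standard JVM system property.
        String home = System.getProperty("user.home");
        System.out.println(workingDir(home));
        System.out.println(indexesDir(home));
    }
}
```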

DocSearcher indexes are Lucene indexes with the following fields and types:

Field | Description | Indexing Properties
author | taken from the document's meta data | text
path | file handle | unindexed
mod_date | date the document was last modified | text (Lucene DateField text object)
title | title obtained via meta data (if it exists), otherwise a grab of the first few lines or characters | text
summary | first few lines of text | text
body | text of the entire document (without meta data) | text
URL | if the index is created as a "web" index, DocSearcher will construct a URL for each file | text
keywords | taken from the document's meta data (if it exists); mostly relevant for indexed web page documents | text
size | size in bytes | keyword
type | document suffix (htm, doc, pdf, etc.) | text

If you want to construct a search JSP or servlet, the above table should be very helpful.

In addition, you may want to review the doSearch() method in DocSearch.java, which shows how dates are handled and how other meta data are searched. DocSearcher creates and stores its Lucene indexes in the $user_home/.docSearcher/indexes directory.

I hope to have an example JSP ready soon.

Another resource that may be of assistance is the standard output of DocSearch.jar when it is started from a console, i.e.:
java -jar DocSearch.jar

DocSearcher will display the Lucene search string that it builds from your GUI input so that you can see what this search string looks like to the Lucene API.
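
To give a rough idea of what such a string can look like, here is a small sketch that assembles a query in standard Lucene query syntax (field:term, quoted phrases, a leading + for required clauses) using field names from the table above. The class QueryStringSketch is hypothetical, and the exact string DocSearcher prints may well differ; the sketch also does not escape Lucene special characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not how DocSearch.java builds its query.
final class QueryStringSketch {

    /** Formats one required clause, quoting the value as a phrase if it contains a space. */
    static String clause(String field, String value) {
        String v = value.trim();
        boolean phrase = v.indexOf(' ') >= 0;
        return "+" + field + ":" + (phrase ? "\"" + v + "\"" : v);
    }

    /** Joins one clause per {field, value} pair with single spaces. */
    static String build(String[][] fieldValuePairs) {
        List<String> parts = new ArrayList<>();
        for (String[] pair : fieldValuePairs) {
            parts.add(clause(pair[0], pair[1]));
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // prints: +body:"text to find" +type:pdf
        System.out.println(build(new String[][] {{"body", "text to find"}, {"type", "pdf"}}));
    }
}
```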

If you are curious how it performs index updates, please take a look at the source code in DocSearcherIndex.java, and then at the DocSearch.java method updateIndex(docSearcherIndex di), which performs the actual updates to the Lucene indexes.

I've attempted to tune this method to scale fairly well even on large indexes, but suggestions for improvement are always welcome. ;)

Command Line Arguments

java -jar DocSearch.jar ["action"] ["index" or log file name]
... where actions can be:

update : update an index

export : export an index to a zip file

list : list the indexes

analyze_log : analyze search log data (from a servlet)

"Search:text to find" : perform a search and output the text result to the console
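
The argument syntax above could be interpreted along these lines. This is a hypothetical dispatcher written only to illustrate the syntax, not the actual main method of DocSearch.java; the class CliSketch, the returned descriptions, and the assumption that running with no arguments starts the GUI are all mine.

```java
// Hypothetical dispatcher for illustration; not the real DocSearch main method.
final class CliSketch {

    /** Describes what an argument list asks for, following the syntax above. */
    static String describe(String[] args) {
        if (args.length == 0) {
            return "start the GUI"; // assumption: no arguments means normal GUI start
        }
        String action = args[0];
        // Second argument, when present, is the index or log file name.
        String target = args.length > 1 ? args[1] : null;
        if (action.startsWith("Search:")) {
            return "search for: " + action.substring("Search:".length());
        }
        switch (action) {
            case "update":
                return "update index " + target;
            case "export":
                return "export index " + target + " to a zip file";
            case "list":
                return "list the indexes";
            case "analyze_log":
                return "analyze search log " + target;
            default:
                return "unknown action: " + action;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Note that the search action has to be quoted on the command line so that the shell passes "Search:text to find" as a single argument.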