Developer Info
This is a Java (Swing) Application.
To extract the source, unzip the source.jar file:
jar -xf source.jar
Notes:
The main class is DocSearch (DocSearch.java).
The method that performs indexing is called createNewIndex.
Other classes of interest are the wrapper objects
for the various file types:
- WordProps ; for working with POI HDF API
- ExcelProps ; for working with POI HSSF API
- PdfToText ; for working with PDFBox API
- RtfToText ; uses javax.swing.text.rtf API
- OoToText ; uses java.util.zip API to unzip StarOffice / OpenOffice documents and extract the content XML file
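As a rough illustration of the OoToText approach, the sketch below uses java.util.zip to open a document archive and read its content.xml entry. The class name OdfContentSketch, its method names, the UTF-8 decoding, and the crude tag stripping are assumptions made for this example; the real OoToText class may work quite differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; this is not the actual OoToText class.
final class OdfContentSketch {

    /** Returns the raw XML of the "content.xml" entry, or null if the archive has none. */
    static String readContentXml(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Assumption: the XML is UTF-8 encoded.
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Crude tag removal: replaces every tag with a space and collapses whitespace. */
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
    }
}
```

A caller would pass the path of a document file to readContentXml and then hand the result to stripTags (or, more robustly, to a real XML parser) to obtain indexable text.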
DocSearcher creates and stores its indexes and all related files in the ".docSearcher" folder underneath the user's home
directory. On a Linux system this might be:
/home/john/.docSearcher
and on a Windows system it might be something like:
C:\Documents and Settings\username\.docSearcher
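If you need to locate this folder from your own code, something like the following sketch may help. It resolves the folder from the standard user.home system property; the class and method names are hypothetical and are not taken from the DocSearcher source.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of the DocSearcher source.
final class DocSearcherPaths {

    /** The ".docSearcher" folder underneath the given home directory. */
    static File dataDir(String userHome) {
        return new File(userHome, ".docSearcher");
    }

    /** The "indexes" folder inside the ".docSearcher" folder. */
    static File indexesDir(String userHome) {
        return new File(dataDir(userHome), "indexes");
    }

    public static void main(String[] args) {
        // "user.home" is a standard Java system property.
        String home = System.getProperty("user.home");
        System.out.println(dataDir(home));
        System.out.println(indexesDir(home));
    }
}
```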
DocSearcher indexes are Lucene indexes with the following fields and types:
Field | Description | Indexing Properties
author | taken from the document's metadata | text
path | file handle | unindexed
mod_date | date the document was last modified | text (Lucene DateField text object)
title | title obtained from metadata (if it exists), otherwise the first few lines or characters | text
summary | first few lines of text | text
body | text of the entire document (without metadata) | text
URL | if the index is created as a "web" index, DocSearcher will construct a URL for each file | text
keywords | taken from document metadata (if it exists); mostly relevant for indexed web page documents | text
size | size in bytes | keyword
type | document suffix (htm, doc, pdf, etc.) | text
If you want to construct a search JSP or servlet, the above table should be very helpful.
In addition, you may want to review the doSearch() method in
DocSearch.java. This will show you how dates are handled and how other
metadata is searched. DocSearcher creates and stores its Lucene indexes in
the $user_home/.docSearcher/indexes directory.
I hope to have an example JSP ready soon.
Another resource that may be of assistance is
the standard output of DocSearch.jar, for example:
java -jar DocSearch.jar
DocSearcher will display the Lucene search string that it builds from
your GUI input so that you can see what this search string looks like
to the Lucene API.
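To give a feel for what such a search string can look like, here is a small sketch that builds one from field names in the table above, using the standard Lucene query parser syntax (field:term, quoted phrases, AND). The class QueryStringSketch and its methods are hypothetical; the exact string DocSearcher builds may differ, and this sketch does not escape Lucene special characters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not DocSearcher's own query builder.
final class QueryStringSketch {

    /** Builds "field:value", quoting the value as a phrase if it contains whitespace. */
    static String clause(String field, String value) {
        String v = value.trim();
        boolean phrase = v.matches(".*\\s.*");
        return field + ":" + (phrase ? "\"" + v + "\"" : v);
    }

    /** Joins one clause per map entry with AND, in the map's iteration order. */
    static String and(Map<String, String> terms) {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> e : terms.entrySet()) {
            parts.add(clause(e.getKey(), e.getValue()));
        }
        return String.join(" AND ", parts);
    }

    public static void main(String[] args) {
        Map<String, String> terms = new LinkedHashMap<>();
        terms.put("body", "annual report");
        terms.put("type", "pdf");
        // prints: body:"annual report" AND type:pdf
        System.out.println(and(terms));
    }
}
```

Comparing the output of a helper like this with what DocSearcher prints to standard output is a quick way to check that a JSP or servlet is querying the same fields.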
If you are curious how it performs index updates, please take a look at
the source code in DocSearcherIndex.java, and then look at the
DocSearch.java method updateIndex(docSearcherIndex di), which
performs the actual updates to the Lucene indexes.
I've attempted to tune this method to scale fairly well even on large
indexes, but if you have suggestions for improvement, those are always
welcome. ;)
Command Line Arguments
java -jar DocSearch.jar ["action"] ["index" or log file name]
... where actions can be:
update : updates an index
export : exports an index to a zip file
list : lists the indexes
analyze_log : analyzes search log data (from a servlet)
"Search:text to find" : performs a search and outputs the text result to the console
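The argument layout above could be interpreted along the lines of the following sketch. It is only a guess at the dispatch logic: the class CliSketch, the returned descriptions, the case-sensitive "Search:" prefix, and the assumption that no arguments means a normal GUI start are all hypothetical and not taken from DocSearch.java.

```java
// Hypothetical dispatcher for illustration; not the actual DocSearch.java logic.
final class CliSketch {

    /** Describes what a given argument array would ask for. */
    static String describe(String[] args) {
        if (args.length == 0) {
            return "start GUI"; // assumption: no arguments means a normal GUI start
        }
        String action = args[0];
        String target = args.length > 1 ? args[1] : null;

        // Assumption: the search text follows the "Search:" prefix in the same argument.
        if (action.startsWith("Search:")) {
            return "search for: " + action.substring("Search:".length());
        }
        switch (action) {
            case "list":
                return "list indexes";
            case "update":
                return target == null ? "missing index name" : "update index " + target;
            case "export":
                return target == null ? "missing index name" : "export index " + target;
            case "analyze_log":
                return target == null ? "missing log file name" : "analyze log " + target;
            default:
                return "unknown action: " + action;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```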