DocSearcher is a search tool.


Developer Info

DocSearcher is a Java (Swing) application.

Download the source archive or use git clone from Codeberg.

Notes:
      The main class is DocSearch (in DocSearch.java).
      The method that performs indexing is called createNewIndex.
      Other classes of interest are the wrapper objects for the various file types:

  • WordProps: works with the POI HDF API
  • ExcelProps: works with the POI HSSF API
  • PdfToText: works with the PDFBox API
  • RtfToText: uses the javax.swing.text.rtf API
  • OoToText: uses the java.util.zip API to unzip StarOffice / OpenOffice documents and extract the content XML file
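
The unzip-and-extract approach described for OoToText can be sketched roughly as follows. This is a minimal illustration, not DocSearcher's actual code: the class OdfContentSketch and its methods are hypothetical names, the entry name content.xml is the one used inside StarOffice / OpenOffice packages, and the tag stripping is deliberately crude (it does not decode XML entities).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not DocSearcher's OoToText class.
class OdfContentSketch {

    /** Returns the raw XML of the content.xml entry inside the given archive. */
    static String readContentXml(Path document) throws IOException {
        try (ZipFile zip = new ZipFile(document.toFile())) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("No content.xml entry in " + document);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Crude text extraction: replace tags with spaces, then collapse whitespace. */
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in archive so the sketch runs without a real document.
        Path tmp = Files.createTempFile("sketch", ".odt");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(tmp))) {
            zos.putNextEntry(new ZipEntry("content.xml"));
            zos.write("<office:body><text:p>Hello index</text:p></office:body>"
                    .getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        System.out.println(stripTags(readContentXml(tmp)));
        Files.delete(tmp);
    }
}
```

A real wrapper would more likely run the extracted XML through an XML parser rather than a regular expression, so that entities and nested markup are handled correctly.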

DocSearcher creates and stores its indexes and all related files in the .docsearcher2 folder under the user's home directory. On a Linux system this might be:
/home/<username>/.docsearcher2
and on a Windows system it might be something like:
C:\Users\<username>\.docsearcher2
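
For code that needs to locate this folder, one plausible approach is to resolve it from the user.home system property. The sketch below assumes that approach; the class IndexHomeSketch and its methods are hypothetical, and DocSearcher's own code may resolve the folder differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of DocSearcher.
class IndexHomeSketch {

    // Folder name taken from the description above.
    static final String FOLDER_NAME = ".docsearcher2";

    /** Resolves the index folder under an explicitly given home directory. */
    static Path indexHome(String userHome) {
        return Paths.get(userHome, FOLDER_NAME);
    }

    /** Resolves the index folder under the current user's home directory. */
    static Path indexHome() {
        return indexHome(System.getProperty("user.home"));
    }

    public static void main(String[] args) {
        // Prints a platform-specific path ending in .docsearcher2.
        System.out.println(indexHome());
    }
}
```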

DocSearcher indexes are Lucene indexes with the following fields and types:

Field    | Description | Indexing properties
author   | taken from the document metadata | stored, tokenized, indexed
path     | file handle | stored
mod_date | date the document was last modified | stored
title    | title obtained from the metadata (if it exists), otherwise a grab of the first few lines or characters | stored, tokenized, indexed
summary  | first few lines of text | stored, tokenized, indexed
body     | text of the entire document (without metadata) | tokenized, indexed
URL      | if the index is created as a "web" index, DocSearcher constructs a URL for each file | stored, tokenized, indexed
keywords | taken from the document metadata (if it exists); mostly relevant for indexed web page documents | stored, tokenized, indexed
size     | size in bytes | stored
type     | document suffix (htm, doc, pdf, etc.) | stored, tokenized, indexed
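
To make the three indexing properties easier to work with, the field list can be modelled as plain data. The sketch below is only an illustration: the enum IndexField and its helper methods are hypothetical and do not come from DocSearcher or Lucene. It assumes the usual Lucene reading of the terms: a stored field's value can be read back from a search hit, an indexed field can be searched, and a tokenized field is split into terms by the analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the field list above; not DocSearcher or Lucene code.
enum IndexField {
    //        name        stored tokenized indexed
    AUTHOR("author", true, true, true),
    PATH("path", true, false, false),
    MOD_DATE("mod_date", true, false, false),
    TITLE("title", true, true, true),
    SUMMARY("summary", true, true, true),
    BODY("body", false, true, true),
    URL("URL", true, true, true),
    KEYWORDS("keywords", true, true, true),
    SIZE("size", true, false, false),
    TYPE("type", true, true, true);

    final String fieldName;
    final boolean stored;
    final boolean tokenized;
    final boolean indexed;

    IndexField(String fieldName, boolean stored, boolean tokenized, boolean indexed) {
        this.fieldName = fieldName;
        this.stored = stored;
        this.tokenized = tokenized;
        this.indexed = indexed;
    }

    /** Names of fields that a query can match against. */
    static List<String> searchableNames() {
        List<String> names = new ArrayList<>();
        for (IndexField f : values()) {
            if (f.indexed) {
                names.add(f.fieldName);
            }
        }
        return names;
    }

    /** Names of fields whose values can be read back from a search hit. */
    static List<String> storedNames() {
        List<String> names = new ArrayList<>();
        for (IndexField f : values()) {
            if (f.stored) {
                names.add(f.fieldName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("searchable: " + searchableNames());
        System.out.println("stored:     " + storedNames());
    }
}
```

Read this way, body is the only field that can be searched but not read back from a hit, while path, mod_date and size can be displayed for a hit but not queried.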