DocSearcher is a search tool.

Developer Info

DocSearcher is a Java (Swing) application.

Download the source archive or use git clone from Codeberg.

Notes:
The main class is DocSearch (in DocSearch.java).
The method that performs indexing is createNewIndex.
Other classes of interest are the wrapper objects for the various file types:

WordProps: works with the POI HDF API
ExcelProps: works with the POI HSSF API
PdfToText: works with the PDFBox API
RtfToText: uses the javax.swing.text.rtf API
OoToText: uses the java.util.zip API to unzip StarOffice / OpenOffice documents and extract the content XML file
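The approach described for OoToText can be sketched roughly as follows. This is a minimal illustration, not DocSearcher's actual code: the class name OdfTextSketch and both method names are hypothetical, it assumes the content file is the archive entry named content.xml, and the tag stripping is deliberately crude (it does not decode XML entities).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical sketch; not DocSearcher's OoToText class. */
class OdfTextSketch {

    /** Reads the content.xml entry from a StarOffice / OpenOffice archive. */
    static String readContentXml(Path document) throws IOException {
        try (ZipFile zip = new ZipFile(document.toFile())) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("No content.xml entry in " + document);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Replaces every tag with a space, then collapses runs of whitespace. */
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Expects the path of a document as the only argument.
        if (args.length == 1) {
            System.out.println(stripTags(readContentXml(Path.of(args[0]))));
        }
    }
}
```

A real extractor would more likely run the XML through a parser so that entities and nested markup are handled properly; the regular expression here only keeps the sketch short.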

DocSearcher creates and stores its indexes and all related files in the .docsearcher2 folder under the user's home directory. On a Linux system this might be:
/home/<username>/.docsearcher2
and on a Windows system it might be something like:
C:\Users\<username>\.docsearcher2
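For tools that need to locate this folder, one plausible way to resolve it in Java is shown below. The class and method names are hypothetical, and the use of the user.home system property is an assumption about how the home directory is found, not a statement about DocSearcher's own code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of DocSearcher. */
class IndexDirSketch {

    /** Resolves the .docsearcher2 folder under the given home directory. */
    static Path indexDir(String userHome) {
        return Paths.get(userHome, ".docsearcher2");
    }

    public static void main(String[] args) {
        // user.home is the standard JVM property for the user's home directory.
        System.out.println(indexDir(System.getProperty("user.home")));
    }
}
```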

DocSearcher indexes are Lucene indexes with the following fields and types:

Field	Description	Indexing properties
author	taken from the document meta data	stored, tokenized, indexed
path	file handle	stored
mod_date	date document was last modified	stored
title	title obtained from the meta data (if it exists), otherwise taken from the first few lines or characters	stored, tokenized, indexed
summary	first few lines of text	stored, tokenized, indexed
body	text of entire document (without meta data)	tokenized, indexed
URL	if the index is created as a "web" index, DocSearcher constructs a URL for each file	stored, tokenized, indexed
keywords	taken from the document meta data (if it exists); mostly relevant for indexed web pages	stored, tokenized, indexed
size	size in bytes	stored
type	document suffix (htm, doc, pdf, etc.)	stored, tokenized, indexed
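
The table can be read as two independent properties per field: a stored field can be returned with a search hit, and an indexed field can be queried. The sketch below models the table in plain Java to make that distinction explicit. It is only an illustration: the enum and its method names are hypothetical, and it does not use the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the field table above; not Lucene or DocSearcher code. */
enum IndexField {
    AUTHOR("author", true, true),
    PATH("path", true, false),
    MOD_DATE("mod_date", true, false),
    TITLE("title", true, true),
    SUMMARY("summary", true, true),
    BODY("body", false, true),
    URL("URL", true, true),
    KEYWORDS("keywords", true, true),
    SIZE("size", true, false),
    TYPE("type", true, true);

    final String fieldName;
    final boolean stored;   // value can be read back from a hit
    final boolean indexed;  // value is tokenized and searchable

    IndexField(String fieldName, boolean stored, boolean indexed) {
        this.fieldName = fieldName;
        this.stored = stored;
        this.indexed = indexed;
    }

    /** Names of the fields a query can match against. */
    static List<String> searchableNames() {
        List<String> names = new ArrayList<>();
        for (IndexField f : values()) {
            if (f.indexed) {
                names.add(f.fieldName);
            }
        }
        return names;
    }

    /** Names of the fields whose values can be shown in a result list. */
    static List<String> retrievableNames() {
        List<String> names = new ArrayList<>();
        for (IndexField f : values()) {
            if (f.stored) {
                names.add(f.fieldName);
            }
        }
        return names;
    }
}
```

Read this way, body is the only field that is searchable but cannot be read back from a hit, while path, mod_date, and size can be displayed but not searched.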