A FullText search engine based on the Xapian library. Provides support for multiple concurrent indexes via the XML-RPC protocol.

Clone (read-only): https://hg.sr.ht/~hww3/fulltext
Clone (read/write): ssh://hg@hg.sr.ht/~hww3/fulltext

Getting Started

You should have the following installed:

Pike 7.8
Fins framework
Xapian full text library
Public.Xapian module

To start the server, first review the settings in the config/dev.cfg file to
make sure that everything looks good. Then set the FINS_HOME environment
variable so that it points to the location where you installed the Fins
framework. You should then be able to run the start script, which is located
in the bin/ directory:

FINS_HOME=/path/to/fins
export FINS_HOME
cd /path/to/fulltext
bin/start.sh 
 (or "bin/start.sh -d", optionally once you've verified everything's working)
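
A missing or mistyped FINS_HOME is an easy mistake to make in the steps above.
The following is a minimal sketch of a preflight check that could be run before
the start script. The helper name check_fins_home is hypothetical and is not
part of FullText or Fins; it only tests that the variable is set and names a
directory.

```shell
#!/bin/sh
# check_fins_home: hypothetical helper, not shipped with FullText.
# Succeeds only if FINS_HOME is set and points at an existing directory.
check_fins_home() {
  if [ -z "${FINS_HOME:-}" ]; then
    echo "FINS_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$FINS_HOME" ]; then
    echo "FINS_HOME does not point to a directory: $FINS_HOME" >&2
    return 1
  fi
  return 0
}

# Usage (paths are placeholders, as in the listing above):
#   FINS_HOME=/path/to/fins; export FINS_HOME
#   check_fins_home && cd /path/to/fulltext && bin/start.sh
```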

FTAdmin

Fins/Xapian provides a script that performs certain administrative functions. This
script is located within the bin directory and performs the following functions:

Create a new index:
bin/ftadmin.sh new indexname 

Grant access to an index (prints out the newly granted auth code):
bin/ftadmin.sh grant indexname

Revoke access to an index for an auth code:
bin/ftadmin.sh revoke indexname authcode

Shut down the server (optionally after a delay):
bin/ftadmin.sh shutdown [seconds] 

Note that in order for the script to work, the server must be running on the local 
host.
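
The "new" and "grant" commands are often run back to back when setting up an
index. The sketch below chains them in a small function. The name
provision_index and the FTADMIN variable are hypothetical, and the exact text
that "grant" prints is an assumption: the function simply passes through
whatever "grant" writes to standard output.

```shell
#!/bin/sh
# FTADMIN may be overridden; the default is the path used in this README.
FTADMIN="${FTADMIN:-bin/ftadmin.sh}"

# provision_index: hypothetical helper, not part of FullText.
# Creates an index, then grants access to it and prints the output of "grant".
provision_index() {
  index="$1"
  if [ -z "$index" ]; then
    echo "usage: provision_index indexname" >&2
    return 2
  fi
  # Send any output from "new" to stderr so that stdout carries only
  # what "grant" prints (assumed to contain the new auth code).
  "$FTADMIN" new "$index" >&2 || return 1
  "$FTADMIN" grant "$index"
}

# Usage, with the server running on the local host:
#   authcode=$(provision_index myFTIndex)
```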

Security

The FullText application supports two security models: standard and simplified.
You may choose either based on your particular needs; the "standard" model is
enabled by default.

When using the standard security model, there are administrative authorization codes
that are used to create new indices as well as to grant or revoke access to a given
index. The administrative authorization codes are placed in the "auth" section of the 
application configuration file, and multiple administrative authorization codes may
be enabled at one time. These codes are read at startup, and the application must
be shut down in order to flush existing codes.

In order to search or update an index while running in standard security mode, a client
must provide a valid index authorization code. A given code is specific for a particular
index and may be obtained by using the administrative client. Similarly, codes may
be revoked using the administrative client. Codes may be granted and revoked at any
time, without restarting the FullText application.

When using the simplified mode, during search or update operations, the FullText 
application simply validates the authorization code provided by a client against its 
list of administrative authorization codes. This can simplify management of
authorization codes for certain scenarios, such as development or other small-scale
installations, at the expense of giving each user "the keys to the castle".

You may enable the simplified security mechanism by setting the "use_simple_security" 
flag in the "auth" section of the application configuration file. When running in 
the simplified mode, the grant and revoke functionality is disabled.

In either case, if a valid administrative authorization code is not present in the
application configuration file on startup, one will be created and enabled. A message
containing the new administrative authorization code will be written to the
application log.
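
Because both security models are driven by the "auth" section of the
configuration file, it can be handy to print just that section when checking
which mode a server will start in. The sketch below assumes the configuration
file uses the [section] and key=value layout seen in the transform examples in
this README. The function name print_section is hypothetical, and the exact
value syntax for "use_simple_security" is not shown here, so the function only
displays what is in the file.

```shell
#!/bin/sh
# print_section FILE NAME: hypothetical helper, not part of FullText.
# Prints the non-blank lines that belong to section [NAME] in FILE.
print_section() {
  awk -v want="$2" '
    # A line such as [auth] starts a new section.
    /^\[[^]]*\][ \t]*$/ {
      name = $0
      sub(/^\[/, "", name)
      sub(/\][ \t]*$/, "", name)
      in_section = (name == want)
      next
    }
    # Print non-blank lines while inside the wanted section.
    in_section && NF { print }
  ' "$1"
}

# Usage:
#   print_section config/dev.cfg auth
```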

Client Example

import FullText;
string index = "myFTIndex";

// change to '1' if you want to create the index if it doesn't exist.
int create_if_new = 0;  
string authcode = "1234567890"; // see the security section for details on auth codes.

// if we're running the FullText application on http://localhost:8124, 
// we can use the default url.
object a = AdminClient(0, authcode);
if(!a->exists(index))
{
  a->new(index);
  werror("new auth code for index: %O\n",
    authcode = a->grant_access(index));
}
object u = UpdateClient(0, index, authcode);

// now, let's add some content
string content =  "mary had a little lamb, its fleece was white as snow.";
string title = "mary and her lamb"; // the title of the content, stored and returned with searches
string handle = "/rhymes/mary"; // a (hopefully) unique identifier for this bit of content

u->add(title, Calendar.now()->seconds(), content, handle, 0, "text/plain");

// ok, now that we've added, we can search:

object s = SearchClient(0, index, authcode);

foreach(s->search("lamb");; mapping doc)
  werror("found a hit: %O, rating: %O, handle: %O\n", doc->title, doc->score, doc->handle);



Indexing support for various file formats

The indexer has built-in support for plain text files and HTML. You may add support for
additional file formats by telling the engine about programs that can convert other formats
to HTML or plain text.

Some examples that have been successfully tested:

PDF

http://pdftohtml.sourceforge.net/

Install pdftohtml and then add the following to your FullText config file:

[transform_pdf]
type=converter
mimetype=application/pdf
command=/usr/local/bin/pdftohtml -stdout -q %f
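
Before adding a transform section, it can help to run the command by hand on a
sample file. The sketch below prints the command line that results from
replacing %f with a file name. It assumes that %f stands for the path of the
file being indexed; how the engine itself performs the substitution (quoting,
temporary files) is not documented here. The function name expand_command is
hypothetical.

```shell
#!/bin/sh
# expand_command TEMPLATE FILE: hypothetical helper, not part of FullText.
# Prints TEMPLATE with every occurrence of %f replaced by FILE.
expand_command() {
  template="$1"
  file="$2"
  result=""
  while :; do
    case "$template" in
      *%f*)
        # Text before the first %f, then the file name.
        prefix=${template%%"%f"*}
        template=${template#*"%f"}
        result="$result$prefix$file"
        ;;
      *)
        # No placeholder left: append the remainder and stop.
        result="$result$template"
        break
        ;;
    esac
  done
  printf '%s\n' "$result"
}

# Usage:
#   expand_command '/usr/local/bin/pdftohtml -stdout -q %f' /tmp/sample.pdf
```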

RTF

http://sourceforge.net/projects/rtf2html-lite/

The free tool, rtf2html, can be used to process RTF files. However, out of the box,
this tool does not behave as either a filter or converter. A simple script is included
in the extras folder which can be used to make the rtf2html utility behave in a compatible
manner.

Install rtf2html, edit the rtf2html_converter script appropriately, and then add the
following to your FullText config file:

[transform_rtf]
type=converter
mimetype=text/rtf
command=/path/to/extras/rtf2html_converter %f

DOC/DOCX/ODT/ABW

AbiWord can load various Word/OpenOffice formats and includes a tool called "AbiCommand"
that can be used to read a file and convert it into HTML format. The actual implementation
is left as an exercise for the reader; however, the following page includes almost
everything a user might need to make this happen. Hint: start with the RTF filter above.

http://www.abisource.com/wiki/AbiCommand
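
As a starting point for that exercise, here is a minimal sketch of a wrapper
for tools that can only write their result to a file. It assumes that a
converter command is expected to write HTML to standard output, as the
pdftohtml example does with -stdout. The function name to_stdout and the
"converter input-file output-file" calling convention are hypothetical; adjust
the invocation to match the real tool.

```shell
#!/bin/sh
# to_stdout CONVERTER INPUT: hypothetical wrapper, not part of FullText.
# Assumes CONVERTER is invoked as "CONVERTER input-file output-file"
# and writes HTML to output-file. The result is copied to stdout.
to_stdout() {
  converter="$1"
  input="$2"
  [ -r "$input" ] || { echo "cannot read: $input" >&2; return 1; }
  workdir=$(mktemp -d) || return 1
  if "$converter" "$input" "$workdir/out.html" >/dev/null 2>&1; then
    cat "$workdir/out.html"
    status=$?
  else
    status=1
  fi
  # Always clean up the temporary directory.
  rm -rf "$workdir"
  return "$status"
}

# Usage, from a script that receives the file name as its first argument:
#   to_stdout /path/to/converter "$1"
```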

OTHERS

Apache Tika seems like it could be a useful tool: it includes support for a number
of popular file formats and has an out-of-the-box command line utility.

Drawbacks include:

- written in Java, so not exactly nimble

If you download the Tika jar from the Apache Tika website, you can use the following
config section to handle PDF, DOC and various other formats:

[transform_tika]
type=converter
command=java -jar /home/hww3/Fins/FullText/extras/tika-app-1.0.jar -h %f
mimetype=application/pdf
mimetype=text/rtf
mimetype=application/rtf
mimetype=application/msword
mimetype=application/vnd.openxmlformats-officedocument.wordprocessingml.document
mimetype=application/vnd.oasis.opendocument.text
mimetype=application/x-vnd.oasis.opendocument.text