=pod

=head1 Introduction to Swish3

Swish is the Simple Web Indexing System for Humans. Swish
is an information retrieval tool. It is B<not> a search engine, but
can be used as an integral part of creating a search engine. Swish gathers,
parses, indexes and searches document collections. A collection can be any
set of real or virtual documents: web pages, database rows, PDFs or office
files, or anything else that can be converted to text.

Swish3 is version three of Swish.
Kevin Hughes wrote the original version in 1994. In 2000, the project was
updated and released as Swish-e version 2 (the -e is for Enhanced). Swish3
is the third phase in the evolution of the project.

In this document, the name C<Swish> will refer to the entire project,
without regard to a particular version. C<Swish-e> will refer specifically
to version 2.x. C<Swish3> will refer specifically to version three.

=head2 Anatomy of a Search Tool Chain

The following description could apply to any search system or information
retrieval project, not just Swish. First we'll look at the various
parts of the system, then at how they are implemented in Swish3.

Every search system implements the following chain of features:

=over

=item aggregator

An aggregator assembles documents into a collection. It can be as simple
as a filesystem tool like the Unix B<find> command or as sophisticated as a
web crawler. An aggregator selects documents based on various criteria:
content, MIME type or format, date, author, URL, or any other criteria
that you desire.

=item normalizer

A normalizer ensures that all documents the aggregator collects are in a format
that the analyzer can parse. For example, a binary file format like PDF is
converted to HTML or unmarked text. The same is true for all office file formats,
PostScript, etc.

=item analyzer

An analyzer examines the text supplied via the aggregator/normalizer steps.
The analyzer does several things, some of them optional:

=over

=item parsing

Separates text from any surrounding markup, optionally
remembering the context (tag) in which text was found.

=item case folding

Changes the text to all lowercase or all uppercase, to make comparisons
easier.

=item tokenizing

Splits a string of text into tokens or words.

=item stemming

Uses one of a variety of word-stemming algorithms to discover the root
C<stem> of each word.

=item customization

Many advanced analyzers offer some level of customization to apply at some
point in the analysis, whether it be synonym matching or other linguistic
logic.

=back

=item indexer

An indexer stores basic document metadata and token (word) information
in an index for fast and efficient retrieval.

=item searcher

A searcher parses a user query using the same logic used by the analyzer
when processing the original document collection,
applies some well-defined rules for matching documents in the index,
and then returns results, typically a list or iterator of matching documents.

=back

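The analyzer, indexer and searcher links of the chain can be sketched in a few
lines of Python. This is a toy illustration, not Swish3 code: the names
C<analyze>, C<build_index>, C<search> and C<stem> are hypothetical, the markup
stripping is a crude regular expression rather than a real parser, and the
stemmer only strips a few English suffixes where a real system would use an
established stemming algorithm. The aggregator and normalizer are replaced by
a small in-memory dictionary of documents.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")      # crude markup stripper, not a real HTML parser
TOKEN_RE = re.compile(r"[a-z0-9]+")  # a token is a run of ASCII letters or digits


def stem(token):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Parse, case-fold, tokenize and stem, in that order."""
    plain = TAG_RE.sub(" ", text)       # parsing: drop the markup
    folded = plain.lower()              # case folding
    tokens = TOKEN_RE.findall(folded)   # tokenizing
    return [stem(t) for t in tokens]    # stemming


def build_index(docs):
    """Indexer: map each stemmed token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Searcher: analyze the query like a document, then AND the terms together."""
    terms = analyze(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


# Stand-in for the aggregator/normalizer output: document id => HTML text.
docs = {
    "a.html": "<p>Indexing <b>web pages</b></p>",
    "b.html": "<p>Searching database rows</p>",
}
index = build_index(docs)
print(search(index, "Pages"))
```

Note that C<search> runs the query through the same C<analyze> function used at
index time. That is why a query for "Pages" can match a document containing
"pages": both are folded and stemmed to the same token before comparison.
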
Now let's look at how Swish3 implements these five features.

=head2 A Library, Not a Command

The first thing to know about Swish3 is that, unlike previous versions of
Swish, there is not a single Swish3 implementation.

That might sound confusing at first, because it is a significant
departure from earlier versions of Swish, where there was a primary
program, written in C, which handled all five links in the search chain.
Swish3 takes a different approach.

Swish3 is primarily a C library called B<libswish3>. The library has a
well-defined list of public functions and data structures that aim
to fill a particular void in the world of information retrieval tools:
analyzing HTML and XML documents.

Swish3 takes as its starting point the B<-S prog> feature of Swish-e,
where you can define your own aggregator/normalizer program, and makes that
Swish3's central feature. Swish3 extends the B<-S prog> API to include
additional header values, and adds the same MIME-type-matching feature
as the popular Apache web server.

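To make the B<-S prog> idea concrete, here is a minimal sketch of an
aggregator/normalizer that frames one document as a block of headers, a blank
line, and the document body. The header names C<Path-Name>, C<Content-Length>
and C<Last-Mtime> are the ones Swish-e's B<-S prog> format is understood to
use; treat them as an assumption and check them against the Swish-e
documentation. The additional header values that Swish3 defines are not shown,
and the function name C<format_prog_doc> is hypothetical.

```python
def format_prog_doc(path, content, mtime=None):
    """Return one document as bytes: headers, a blank line, then the body."""
    body = content.encode("utf-8")
    headers = [
        "Path-Name: %s" % path,
        "Content-Length: %d" % len(body),  # length in bytes, not characters
    ]
    if mtime is not None:
        headers.append("Last-Mtime: %d" % mtime)  # seconds since the epoch
    head = "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + body


# Example document; the path and content are made up for illustration.
doc = format_prog_doc("docs/hello.html", "<html><body>Hello</body></html>", mtime=0)
print(doc.decode("utf-8"))
```

The length is computed from the encoded bytes rather than the character count
so that a reader of the stream can tell exactly where one document ends and
the next set of headers begins.
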
Swish3 has no native indexer or searcher features [TODO: this might change
if the 2.6 BDB backend is ported]. Nor does it have any aggregator or normalizer
features. Swish3 is primarily an analyzer.

The Swish3 distribution does come with some examples of how to write Swish3
applications, including an example program for using the popular Xapian
library. There is also a Perl implementation based on the SWISH::Prog package.

=head2 So How Does It Work?

libswish3 defines hooks or callbacks where you can override the default
behaviour of the analyzer. These hooks make it easy to
plug libswish3 into the indexing chain.

Here's one example. If you wanted to index a web site, you might use an
aggregator/normalizer tool like Swish-e's B<spider.pl>. spider.pl prints its
output on stdout.

  % spider.pl your_config > spider_output

Then you could use a program like B<swish_xapian> to analyze and index the
output:

  % swish_xapian -c swish.conf - < spider_output

If you look at the source for the B<swish_xapian> program, in
the libswish3 distribution, you will see that there is a B<handler> function
defined that takes the output of the libswish3 parsing function and
adds it to a Xapian index.

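The handler pattern described above can be illustrated with a small Python
sketch. It is not the libswish3 API and not the B<swish_xapian> source: the
names C<read_headers>, C<parse_stream> and C<add_to_index> are hypothetical,
the input framing (headers, blank line, C<Content-Length> bytes of body) is an
assumption modeled on Swish-e's B<-S prog> format, and a plain dictionary
stands in for a Xapian index. The point is only the division of labor: the
parsing code knows nothing about the index, and the handler knows nothing
about the input format.

```python
import io


def read_headers(stream):
    """Read 'Name: value' lines up to the blank line that ends a header block."""
    headers = {}
    for raw in iter(stream.readline, b""):
        line = raw.decode("utf-8").strip()
        if not line:
            if headers:
                break     # blank line ends the header block
            continue      # tolerate blank lines between documents
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def parse_stream(stream, handler):
    """Call handler(headers, text) once per document; return the document count."""
    count = 0
    while True:
        headers = read_headers(stream)
        if not headers:
            return count  # end of input
        body = stream.read(int(headers["Content-Length"]))
        handler(headers, body.decode("utf-8"))
        count += 1


index = {}  # stand-in for a real index backend


def add_to_index(headers, text):
    """Hypothetical handler: store each document's words under its path."""
    index[headers["Path-Name"]] = text.lower().split()


# Two made-up documents in the assumed header-framed format.
sample = (
    b"Path-Name: a.txt\nContent-Length: 11\n\nhello world"
    b"Path-Name: b.txt\nContent-Length: 5\n\nswish"
)
parse_stream(io.BytesIO(sample), add_to_index)
print(index)
```

Swapping C<add_to_index> for a different function is all it takes to send the
same parsed documents to a different backend, which is the role the handler
plays in the example program.
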
=head2 See Also

This document provides an overview of Swish3's anatomy. You might also be
interested in these docs:

=over

=item

L<Migrating from Swish-e to Swish3|swish_migration.7>

=item

L<Perl implementation of Swish3|SWISH::Prog>

=item

L<libswish3 API|libswish3.3>

=back
