root/libswish3/trunk/doc/swish_intro.7.pod

Revision 2087, 5.4 kB (checked in by karpet, 4 months ago)

init some docs

1 =pod
2
3 =head1 Introduction to Swish3
4
5 Swish is the Simple Web Indexing System for Humans. Swish
6 is an information retrieval tool. It is B<not> a search engine, but
7 can be used as an integral part of creating a search engine. Swish gathers,
8 parses, indexes and searches document collections. A collection can be any
9 set of real or virtual documents: web pages, database rows, PDFs or office
10 files, or anything else that can be converted to text.
11
12 Swish3 is version three of Swish.
13 Kevin Hughes wrote the original version in 1994. In 2000, the project was
14 updated and released as Swish-e version 2 (the -e is for Enhanced). Swish3
15 is the third phase in the evolution of the project.
16
17 In this document, the name C<Swish> will refer to the entire project,
18 without regard to a particular version. C<Swish-e> will refer specifically
19 to version 2.x. C<Swish3> will refer specifically to version three.
20
21 =head2 Anatomy of a Search Tool Chain
22
23 The following description could apply to any search system or information
24 retrieval project, not just Swish. First we'll look at the various
25 parts of the system, then look at how they are implemented in Swish3.
26
27 Every search system implements the following chain of features:
28
29 =over
30
31 =item aggregator
32
33 An aggregator assembles documents into a collection. It can be as simple
34 as a filesystem tool like the Unix B<find> command or as sophisticated as a
35 web crawler. An aggregator selects documents based on various criteria:
36 content, MIME type or format, date, author, URL, or any other criteria
37 that you desire.
38
39 =item normalizer
40
41 A normalizer ensures that all documents the aggregator collects are in a format
42 that the analyzer can parse. For example, a binary file format like PDF is
43 converted to HTML or unmarked text. The same is true for all office file formats,
44 PostScript, etc.
45
46 =item analyzer
47
48 An analyzer examines the text supplied via the aggregator/normalizer steps.
49 The analyzer does several things, some of them optional:
50
51 =over
52
53 =item parsing
54
55 Separates text from any surrounding markup, optionally
56 remembering the context (tag) in which text was found.
57
58 =item case folding
59
60 Changes the text to all lowercase or all uppercase, to make comparisons
61 easier.
62
63 =item tokenizing
64
65 Splits a string of text into tokens or words.
66
67 =item stemming
68
69 Tries to discover the root C<stem> of each word, using one of a variety of
70 word-stemming algorithms.
71
72 =item customization
73
74 Many advanced analyzers offer some level of customization to apply at some
75 point in the analysis, whether it be synonym matching or other linguistic
76 logic.
77
78 =back
79
80 =item indexer
81
82 An indexer stores basic document metadata and token (word) information
83 in an index for fast and efficient retrieval.
84
85 =item searcher
86
87 A searcher parses a user query using the same logic used by the analyzer
88 when processing the original document collection,
89 applies some well-defined rules for matching documents in the index,
90 and then returns results, typically a list or iterator of matching documents.
91
92 =back
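
The analyzer, indexer and searcher links of the chain can be sketched in a few
lines. The following is a minimal illustration in Python, not Swish3 code: the
function names (analyze, build_index, search), the regular expressions and the
toy suffix stripper are all assumptions made for this example. A real analyzer
would use a proper markup parser and a real stemming algorithm.

```python
import re
from collections import defaultdict

# Hypothetical helpers for illustration only; none of this is the libswish3 API.
TAG_RE = re.compile(r"<[^>]+>")      # crude markup stripper, not a real parser
TOKEN_RE = re.compile(r"[a-z0-9]+")  # a token is a run of ASCII letters or digits


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Analyzer: parse (drop tags), case fold, tokenize, then stem."""
    plain = TAG_RE.sub(" ", text)
    folded = plain.lower()
    return [naive_stem(token) for token in TOKEN_RE.findall(folded)]


def build_index(docs):
    """Indexer: map each analyzed term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searcher: analyze the query like the documents, then AND the terms."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "a.html": "<p>Indexing web pages</p>",
    "b.html": "<p>Searching database rows</p>",
}
index = build_index(docs)
```

Because search() runs the query through the same analyze() function used at
index time, a query such as "Index PAGE" matches a.html even though the
document contains "Indexing" and "pages".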
93
94 Now let's look at how Swish3 implements these five features.
95
96 =head2 A Library, Not a Command
97
98 The first thing to know about Swish3 is that, unlike previous versions of
99 Swish, there is not a single Swish3 implementation.
100
101 That might sound confusing at first, because it is a significant
102 departure from earlier versions of Swish, where there was a primary
103 program, written in C, which handled all five links in the search chain.
104 Swish3 takes a different approach.
105
106 Swish3 is primarily a C library called B<libswish3>. The library has a
107 well-defined list of public functions and data structures that aim
108 to fill a particular void in the world of information retrieval tools:
109 analyzing HTML and XML documents.
110
111 Swish3 takes as its starting point the B<-S prog> feature of Swish-e,
112 where you can define your own aggregator/normalizer program, and makes that
113 Swish3's central feature. Swish3 extends the B<-S prog> API to include
114 additional header values, and adds the same MIME-type-matching feature
115 as the popular Apache web server.
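
To make the B<-S prog> idea concrete, here is a rough Python sketch of an
aggregator emitting one document as a block of headers, a blank line, and the
document body. It assumes the Swish-e header names Path-Name, Content-Length
and Last-Mtime; the additional Swish3 headers are not shown because they are
not described here. The function name format_document is hypothetical.

```python
def format_document(path, content, mtime=None):
    """Return one document as bytes: headers, a blank line, then the body.

    Assumes the Swish-e -S prog header names; check the API documentation
    for the headers your version actually accepts.
    """
    body = content.encode("utf-8")
    # Content-Length counts bytes, not characters, so measure the encoded body.
    headers = [f"Path-Name: {path}", f"Content-Length: {len(body)}"]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    head = "\n".join(headers) + "\n\n"
    return head.encode("utf-8") + body


record = format_document("doc1.html", "<p>hi</p>")
```

A program built this way would write each record to stdout, one after another,
for the analyzer to read on its standard input.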
116
117 Swish3 has no native indexer or searcher features [TODO: this might change
118 if the 2.6 BDB backend is ported]. Nor does it have any aggregator or normalizer
119 features. Swish3 is primarily an analyzer.
120
121 The Swish3 distribution does come with some examples of how to write Swish3
122 applications, including an example program for using the popular Xapian
123 library. And there is a Perl implementation based on the SWISH::Prog package.
124
125 =head2 So How Does It Work?
126
127 libswish3 defines hooks or callbacks where you can override the default
128 behaviour of the analyzer. These hooks are intended for making it easy to
129 plug libswish3 into the indexing chain.
130
131 Here's one example. If you wanted to index a web site, you might use an
132 aggregator/normalizer tool like Swish-e's B<spider.pl>. spider.pl prints its
133 output to stdout.
134
135  % spider.pl your_config > spider_output
136
137 Then you could use a program like B<swish_xapian> to analyze and index the
138 output:
139
140  % swish_xapian -c swish.conf - < spider_output
141
142 If you look at the source for the B<swish_xapian> program, in
143 the libswish3 distribution, you will see that there is a B<handler> function
144 defined that takes the output of the libswish3 parsing function and
145 adds it to a Xapian index.
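
The handler pattern can be illustrated without any real library. In the Python
sketch below, parse_stream stands in for the library's parsing function and
ToyIndex stands in for a backend such as Xapian. Every name here is
hypothetical, and the actual libswish3 handler signature differs; see the
swish_xapian source for the real thing.

```python
class ToyIndex:
    """Stand-in for an index backend; it just remembers tokens per document."""

    def __init__(self):
        self.docs = {}

    def add(self, uri, tokens):
        self.docs[uri] = list(tokens)


def parse_stream(stream_docs, handler):
    """Stand-in parser: analyze each document, then hand the result to handler."""
    count = 0
    for uri, text in stream_docs:
        tokens = text.lower().split()  # trivial analysis, for illustration only
        handler(uri, tokens)           # the application decides what happens next
        count += 1
    return count


def make_handler(index):
    """Build a handler that stores whatever the parser produces in index."""

    def handler(uri, tokens):
        index.add(uri, tokens)

    return handler


toy_index = ToyIndex()
parsed = parse_stream(
    [("http://example.com/", "Hello World")],
    make_handler(toy_index),
)
```

The parser knows nothing about the index. Swapping in a different backend only
means passing a different handler.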
146
147 =head2 See Also
148
149 This document provides an overview of Swish3's anatomy. You might also be
150 interested in these docs:
151
152 =over
153
154 =item
155
156 L<Migrating from Swish-e to Swish3|swish_migration.7>
157
158 =item
159
160 L<Perl implementation of Swish3|SWISH::Prog>
161
162 =item
163
164 L<libswish3 API|libswish3.3>
165
166 =back
167