root/libswish3/tags/10Feb2008/doc/libswish3.3.pod.in

Revision 1955, 19.3 kB (checked in by karpet, 1 year ago)

1 =pod
2
3 =head1 NAME
4
5 libswish3 - Swish3 C library
6
7 =head1 SYNOPSIS
8
9 <<libswish3.h_HERE>>
10
11 =head1 DESCRIPTION
12
13 B<libswish3> is the core C library of B<Swish3>.
14
15 B<libswish3> uses the GNOME L<Libxml2|http://xmlsoft.org/> library to parse words and metadata
16 from XML, HTML and plain text files. B<libswish3> supports full UTF-8 encoding.
17
18 B<libswish3> is a parsing tool for use with information retrieval (IR) libraries.
19 Dynamic language bindings are available in the source distribution in the C<bindings>
20 directory.
21
22 =head1 APIs
23
24 The following APIs are defined:
25
26 =head1 Parsing API
27
28 B<libswish3> provides three basic input functions:
29
30 =over
31
32 =item
33
34 swish_parse_file()
35
36 =item
37
38 swish_parse_fh()
39
40 =item
41
42 swish_parse_buffer()
43
44 =back
45
46 Each of these functions takes a C<swish_Parser> struct pointer
47 and optional I<user_data>.
48
49 In addition:
50
51 =over
52
53 =item
54
55 The swish_parse_file() function takes a file path, which must be a valid file.
56 Directories and links are not supported. The assumption is that you will use
57 your calling code to recurse through directories and handle links.
58
59 =item
60
61 swish_parse_buffer() takes a string representing the document
62 headers and the full text of the document.
63
64 =item
65
66 swish_parse_fh() takes a filehandle pointer which, if set to NULL,
67 defaults to stdin.
68
69 =back
70
71 See the L<Headers API> section for more
72 information on using swish_parse_fh() and
73 swish_parse_buffer().
74
75 See the L<The I<handler> Function> section for more information on how
76 to deal with the data extracted by each of the swish_parse_* functions.
77
78
79 =head1 Headers API
80
81 The Headers API supports and extends the Swish-e B<-S prog> feature,
82 which allows you to feed the indexer with output from another I<prog>ram.
83 The API has been extended from Swish-e's to allow for MIME types
84 and more congruence with the HTTP 1.1 specification.
85
86 See the SWISH-RUN documentation
87 in the Swish-e distribution for the Swish-e version 2 headers API.
88
89 This is the libswish3 implementation. See B<SWISH::Prog::Headers> for a simple
90 Perl-based way of generating the proper headers.
91
92 =over
93
94 =item Content-Location
95
96 B<Swish-e name:> Path-Name
97
98 The name of the document. May be any string: an ID of a record in a database,
99 a URL or a simple file name. The string is stored in the swish_DocInfo B<uri> struct member,
100 which is often used as the primary identifier of a document in an index.
101
102 This header is required.
103
104 =item Content-Length
105
106 The length in bytes of the document, starting after the blank line separating the headers
107 from the document itself.
108 The value must be exactly the length of the document, including any extra
109 line feeds or carriage returns at the end of the document.
110
111 Example:
112
113  Content-Location: foo.html
114  Content-Length: 9
115
116  The doc.\n
117  ^^^^^^^^ ^
118  12345678 9
119
120 The value is stored in the swish_DocInfo B<size> struct member.
121
122 This header is required.
123
124
125 =item Last-Modified
126
127 B<Swish-e name:> Last-Mtime
128
129 The last modification time of the document. The value must be an integer:
130 the seconds since the Epoch on your system.
131
132 If not present, the value defaults to the current time.
133
134 The value is stored in the swish_DocInfo B<mtime> struct member.
135
136 This header is not required.
137
138 =item Parser-Type
139
140 B<Swish-e name:> Document-Type
141
142 Explicitly name the parser used for the document, rather than defaulting to the MIME
143 type mapping based on B<Content-Type> and/or B<Content-Location>. The three parser types are:
144
145 =over
146
147 =item
148
149 XML
150
151 =item
152
153 HTML
154
155 =item
156
157 TXT
158
159 =back
160
161 The Swish-e values B<XML2>, B<XML*>, B<HTML2>, B<HTML*>, B<TXT2>, B<TXT*> are also
162 supported for compatibility, but they map to the three libswish3 types.
163
164 The value is stored in the swish_DocInfo B<parser> struct member.
165
166 If not present, the document parser will be automatically chosen based on the following logic:
167
168 =over
169
170 =item
171
172 If a B<Content-Type> is given, the parser mapped to that MIME type will be used. You may override
173 the default mappings in your configuration. See B<Configuration API>.
174
175 =item
176
177 If no B<Content-Type> is given, a MIME type will be guessed at based on the file extension of the
178 document's B<Content-Location>, and the parser mapped to that MIME type will be used.
179
180 =item
181
182 Finally, if a MIME type is not identified, the parser defined in B<SWISHP_CONFIG_DEFAULT_PARSER>
183 in B<libswish3.h> will be used.
184
185 =back
186
187 See also B<Content-Type> and B<Content-Location>.
188
189 This header is not required.
190
191 =item Content-Type
192
193 The MIME type of the document. The libswish3 MIME type list is based on the Apache 2.0
194 version. See L<http://www.iana.org/assignments/media-types/> for the official registry.
195
196 If not defined with B<Content-Type>, the MIME type will be guessed based on the
197 file extension in the B<Content-Location>
198 header. If the B<Content-Location> string does not contain a file extension (as might be the case
199 with a non-URL value), or the file extension has no MIME mapping, then the MIME type will default
200 to B<SWISHP_DEFAULT_MIME> as defined in B<libswish3.h>.
201
202 You may override the default extension-to-MIME mappings in your configuration. See B<Configuration API>.
203
204 The value is stored in the swish_DocInfo B<mime> struct member.
205
206 See also B<Content-Location> and B<Parser-Type>.
207
208 This header is not required.
209
210
211 =item Update-Mode
212
213 B<NOTE:> This header exists only for backwards compatibility with Swish-e's incremental
214 index feature. B<It may be deprecated in a future version of libswish3.>
215
216 =back
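
The two required headers can be generated with a few lines of C. The sketch below reproduces the B<Content-Length> example above. It assumes the document is a NUL-terminated string and that header lines end with a single line feed, as the example shows. The function name C<build_headed_doc> is hypothetical and is not part of B<libswish3>.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of libswish3.
 * Writes Content-Location, Content-Length, a blank line and the
 * document into out. Returns the number of bytes written (excluding
 * the terminating NUL), or -1 if out is too small. */
int build_headed_doc(char *out, size_t out_size,
                     const char *location, const char *doc)
{
    int n;

    assert(out != NULL && location != NULL && doc != NULL);

    /* Content-Length counts every byte after the blank line,
     * including any trailing line feeds. strlen() is only valid
     * here because doc is assumed to be NUL-terminated text. */
    n = snprintf(out, out_size,
                 "Content-Location: %s\n"
                 "Content-Length: %lu\n"
                 "\n"
                 "%s",
                 location, (unsigned long) strlen(doc), doc);

    if (n < 0 || (size_t) n >= out_size)
        return -1;              /* output was truncated or failed */

    return n;
}
```

The resulting buffer is the kind of input that swish_parse_buffer() is described as taking: the headers followed by the full text of the document.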
217
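The parser selection order described under B<Parser-Type> can be sketched as a small function. This is an illustration of the documented fallback order only, not the B<libswish3> implementation. The mapping tables, the C<DEMO_DEFAULT_PARSER> value and every function name below are assumptions made for the example; the real mappings come from B<libswish3> and your configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative tables only: the contents are assumptions. */
struct mapping { const char *key; const char *value; };

static const struct mapping ext_to_mime[] = {
    { "html", "text/html" },
    { "xml",  "application/xml" },
    { "txt",  "text/plain" }
};

static const struct mapping mime_to_parser[] = {
    { "text/html",       "HTML" },
    { "application/xml", "XML"  },
    { "text/plain",      "TXT"  }
};

/* Stand-in for SWISHP_CONFIG_DEFAULT_PARSER; the value is an assumption. */
#define DEMO_DEFAULT_PARSER "TXT"

static const char *lookup(const struct mapping *table, size_t n, const char *key)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(table[i].key, key) == 0)
            return table[i].value;
    return NULL;
}

/* Guess a MIME type from the file extension of a Content-Location value. */
static const char *mime_from_location(const char *location)
{
    const char *dot = strrchr(location, '.');

    if (dot == NULL || dot[1] == '\0')
        return NULL;            /* no extension to work with */
    return lookup(ext_to_mime, sizeof ext_to_mime / sizeof ext_to_mime[0], dot + 1);
}

/* Pass NULL for any header that is absent. location is required. */
const char *choose_parser(const char *parser_type,
                          const char *content_type,
                          const char *location)
{
    const char *mime;
    const char *parser;

    assert(location != NULL);

    if (parser_type != NULL)    /* an explicit Parser-Type wins */
        return parser_type;

    /* Content-Type first, then the Content-Location extension. */
    mime = content_type != NULL ? content_type : mime_from_location(location);
    if (mime == NULL)
        return DEMO_DEFAULT_PARSER;

    parser = lookup(mime_to_parser, sizeof mime_to_parser / sizeof mime_to_parser[0], mime);
    return parser != NULL ? parser : DEMO_DEFAULT_PARSER;
}
```

The sketch compares MIME types as exact strings, so a B<Content-Type> value carrying parameters would fall through to the default.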
218
219
220 =head1 Structures API
221
222 Writing an effective I<handler> function requires an understanding of some of the key
223 B<libswish3> data structures.
224
225 For more details on any of these structures, see the SYNOPSIS.
226
227 =head2 swish_Config
228
229 A configuration object. This object is required for initializing both a C<swish_Analyzer>
230 object and a C<swish_Parser> object.
231
232 =head2 swish_Parser
233
234 A parser object. Required for executing any of the three C<swish_parse_*> functions.
235
236 =head2 swish_ParseData
237
238 A parser data object. This object is passed around internally by the libxml2
239 SAX2 handlers, and is eventually the object passed to the I<handler> function pointer.
240 See L<The I<handler> Function>.
241
242 =head2 swish_WordList
243
244 A list of words or tokens. The object contains a linked list of swish_Word objects.
245 You can iterate over the contents of the WordList like this:
246
247  swish_debug_msg("%d words in list", list->nwords);
248  list->current = list->head;
249  while (list->current != NULL)
250  {
251         swish_debug_msg("   ---------- WORD ---------  ");
252         swish_debug_msg("word  : %s", list->current->word);
253         swish_debug_msg(" meta : %s", list->current->metaname);
254         swish_debug_msg(" context : %s", list->current->context);
255         swish_debug_msg("  pos : %d", list->current->position);
256         swish_debug_msg("soffset: %d", list->current->start_offset);
257         swish_debug_msg("eoffset: %d", list->current->end_offset);
258            
259         list->current = list->current->next;
260  }
261
262 =head2 swish_Word
263
264 An object representing one word or token in a document. The word's start and end offset,
265 position relative to other words, tag context and MetaName are all available in the object.
266
267 =head2 swish_DocInfo
268
269 An object describing metadata about the document itself: URI, MIME type, size, etc.
270
271 =head2 swish_Analyzer
272
273 The Analyzer object controls how the character content of a document is parsed: whether
274 or not a WordList is created with a tokenizer, if the words (tokens) are lowercased or
275 stemmed, etc.
276
277 =head1 The I<handler> Function
278
279 The I<handler> function pointer is the final link in the parsing chain. The function
280 pointer is set in the Parser object constructor, and is called by each of the
281 swish_parse_* functions after the entire document has been parsed and (optionally)
282 tokenized.
283
284 The I<handler> receives one argument: a swish_ParseData object containing all the metadata
285 and words in the document.
286
287 If all you wanted to do was print out a report about each document as it was parsed,
288 your I<handler> function might be as simple as:
289
290  void
291  my_handler( swish_ParseData * parse_data )
292  {
293     swish_debug_docinfo( parse_data->docinfo );
294     swish_debug_wordlist( parse_data->wordlist );
295     swish_debug_nb( parse_data->properties, "Property" );
296     swish_debug_nb( parse_data->metanames, "MetaName" );
297  }
298  
299 B<IMPORTANT:> After the I<handler> function is called, all the structures referenced
300 by the swish_ParseData object are automatically freed, so if you intend to keep any of the
301 data for storing in an index, you will need to strdup() words, properties, docinfo, etc.
302 as part of your indexing code.
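
As a minimal sketch of that copying step, the function below duplicates every word of a linked list into storage owned by the caller. The C<demo_word> struct is a hypothetical stand-in that carries only a word and a next pointer; the real swish_Word has more members (see the SYNOPSIS). C<copy_string> is a portable equivalent of strdup(), and C<keep_words> is likewise a name invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a word list node, not the real swish_Word. */
typedef struct demo_word {
    const char       *word;
    struct demo_word *next;
} demo_word;

/* Portable equivalent of strdup(): returns a malloc'd copy, or NULL. */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Copy every word into memory owned by the caller.
 * Returns a malloc'd array of *count strings, or NULL if the list is
 * empty or an allocation fails. Free each string, then the array. */
char **keep_words(const demo_word *head, size_t *count)
{
    const demo_word *w;
    char **kept;
    size_t n = 0, i = 0;

    assert(count != NULL);
    *count = 0;

    for (w = head; w != NULL; w = w->next)
        n++;
    if (n == 0)
        return NULL;

    kept = malloc(n * sizeof *kept);
    if (kept == NULL)
        return NULL;

    for (w = head; w != NULL; w = w->next) {
        kept[i] = copy_string(w->word);
        if (kept[i] == NULL) {      /* undo the partial copy */
            while (i > 0)
                free(kept[--i]);
            free(kept);
            return NULL;
        }
        i++;
    }

    *count = n;
    return kept;
}
```

A real I<handler> would apply the same idea to whichever members it needs (metaname, position, docinfo fields) before returning, since the originals are gone once the I<handler> finishes.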
303
304 See the example C<swish_lint.c> file for how to create and pass in a I<handler>
305 function pointer to the swish_Parser constructor.
306
307 =head1 Configuration API
308
309 Configuration is different with B<libswish3> than with Swish-e. The biggest change
310 is that B<libswish3> configuration files are written in XML. This is done for several
311 reasons:
312
313 =over
314
315 =item 1
316
317 Since B<libswish3> already has a powerful XML parser built-in, it's much easier to
318 parse a configuration file written in XML than to port the Swish-e config-style parser
319 to B<libswish3>.
320
321 =item 2
322
323 B<libswish3> stores index header information in an XML format nearly identical
324 to the configuration file format. So the parser needs to understand only one XML
325 schema.
326
327 =item 3
328
329 You can store UTF-8 text in your configuration file and it will be parsed correctly.
330
331 =item 4
332
333 The configuration directive list is extensible. Simple key/value configuration directives
334 can be added without any modification to the B<libswish3> config parser. They are simply
335 stored in the C<swish_Config> struct hash for your own use and amusement.
336
337 B<CAUTION:> The configuration directive names documented in the L<Directives> section below
338 are reserved for use by B<libswish3>. Some of them have special handling considerations
339 (like MetaNames and PropertyNames). So the important idea to grasp with the extensible
340 configuration feature is "simple key/value pairs."
341
342 =back
343
344 This section describes how to build a B<libswish3> configuration file.
345
346 =head2 Configuration Example
347
348 Here's an example B<libswish3> configuration file:
349
350  <swish>
351   <FollowSymLinks>yes</FollowSymLinks>
352  
353   <Meta name="foo" bias="+10" />
354   <Meta name="bar" bias="-5" />
355   <Meta name="swishtitle" bias="+50" alias="title" />
356   <Meta name="other">color size weight</Meta>
357  
358   <Prop name="foo" type="text" ignorecase="1" />
359   <Prop name="bar" type="int" />
360   <Prop name="lastmod" type="date" />
361   <Prop name="bing" comparecase="1" />
362   <Prop name="description" verbatim="1" max="10000" alias="body" length="20" />
363   <Prop name="notsorted" sort="0" />
364  
365   <Tokenize>1</Tokenize>
366  </swish>
367
368 And here's that same example, dissected:
369
370  <swish>
371
372 The top level tag.
373
374  <FollowSymLinks>yes</FollowSymLinks>
375
376 Equivalent to the Swish-e style:
377
378  FollowSymLinks yes
379
380 which simply informs whatever aggregator you are using that when confronted
381 with a symlink on the filesystem, it should be followed.
382
383 C<FollowSymLinks> is an example of a simple key/value pair (see the B<CAUTION> above).
384
385 =head3 MetaNames
386
387 Here's the first big difference from Swish-e. MetaNames, MetaNameAlias, and
388 MetaNamesRank have been combined into a single XML tag with appropriate
389 attributes.
390
391  <Meta name="foo" bias="+10" />
392
393 is the same thing as (in Swish-e style):
394
395  MetaNames foo
396  MetaNamesRank 10 foo
397
398 while:
399
400  <Meta name="swishtitle" bias="+50" alias="title" />
401
402 is equivalent to:
403
404  MetaNames swishtitle
405  MetaNameAlias swishtitle title
406  MetaNamesRank 50 swishtitle
407
408 You can see that the XML style allows for a terser, more compact expression.
409 You can still assign multiple aliases to a single MetaName:
410
411  <Meta name="other">color size weight</Meta>
412
413 is equivalent to:
414
415  MetaNames other
416  MetaNameAlias other color size weight
417
418 In addition, there are some special features intended for use with HTML documents.
419
420  <Meta name="links" html="1" alias="href" />      # same as HTMLLinksMetaName
421  <Meta name="images" html="1" alias="src" />      # same as ImageLinksMetaName
422  <Meta name="alttext" html="1" alias="alt" />     # same as IndexAltTagMetaName
423  <Meta name="as-text" html="1" alias="alt" />     # same as IndexAltTagMetaName
424
425 =head3 PropertyNames
426
427 PropertyNames, PropertyNamesCompareCase, PropertyNamesIgnoreCase, PropertyNamesNoStripChars,
428 PropertyNamesNumeric, PropertyNamesDate, PropertyNameAlias, PropertyNamesMaxLength,
429 PropertyNamesSortKeyLength, StoreDescription and PreSortedIndex
430 have all been combined into a single XML directive.
431
432 Here's the example from above with equivalent Swish-e directives annotated:
433
434  <Prop name="foo" type="text" ignorecase="1" />
435  # PropertyNamesIgnoreCase foo
436
437  <Prop name="bar" type="int" />
438  # PropertyNamesNumeric bar
439  
440  <Prop name="lastmod" type="date" />
441  # PropertyNamesDate lastmod
442  
443  <Prop name="bing" comparecase="1" />
444  # PropertyNamesCompareCase bing
445  
446  <Prop name="description" verbatim="1" max="10000" alias="body" length="20" />
447  # PropertyNamesNoStripChars description
448  # PropertyNamesMaxLength 10000 description
449  # PropertyNameAlias description body
450  # PropertyNamesSortKeyLength 20 description
451
452  <Prop name="notsorted" sort="0" />
453  # PreSortedIndex foo bar lastmod bing description
454
455 Again, the XML format greatly simplifies the syntax. You can assign attributes
456 as you need, though be aware that some attributes are inherently mismatched
457 and might generate an error or unexpected behaviour:
458
459  <Prop name="foo" ignorecase="1" type="int" />      # wrong
460  <Prop name="foo" comparecase="1" type="date" />    # wrong
461  <Prop name="foo" verbatim="1" type="int" />        # wrong
462  <Prop name="foo" sort="0" length="20" />           # wrong
463
464 =head2 Directives
465
466 The following configuration directives are currently supported.
467
468  TODO
469
470 =head1 EXAMPLES
471
472 See the C<swish_lint.c> file included in the libswish3 distribution.
473
474 =head1 FAQ
475
476 =head2 What is IR?
477
478 Information Retrieval.
479
480 =head2 How is libswish3 related to Swish-e?
481
482 libswish3 is the core parsing library for Swish-e version 3 (Swish3).
483
484 =head2 Is libswish3 a search engine?
485
486 No. libswish3 is a document parser. It might work well in or with any number of search engines,
487 but it is not in itself any kind of search tool.
488
489 =head2 So what does libswish3 DO exactly?
490
491 libswish3 reads text, HTML and XML files and extracts just the words and document
492 properties from each document. It then hands off the wordlist and properties
493 to a I<handler> function. Finally, it frees all the memory associated with the wordlist
494 and properties.
495
496 The I<handler> function can do whatever you wish, though typically a I<handler>
497 would iterate over the words in the wordlist and add each one to an index using
498 an IR library API.
499
500
501 =head1 BACKGROUND
502
503 libswish3 is part of the Swish-e project.
504 It was born out of the need for UTF-8 and incremental
505 indexing support and a desire to experiment with alternate indexing
506 libraries like Lucene, KinoSearch, Xapian and Hyperestraier.
507
508 libswish3 was developed with the idea that many quality IR libraries already exist,
509 but few if any provide an easy and fast way of preparing documents for indexing.
510 The following assumptions informed the development of libswish3.
511
512 =head2 The IR Toolchain
513
514 A decent IR toolchain requires five parts:
515
516 =over 4
517
518 =item aggregator
519
520 Collects documents from a filesystem, database, website or other sources.
521
522 =item filter
523
524 Normalizes documents to a standard format (plain text, or a delimited or markup
525 format such as YAML, HTML or XML) for indexing.
526
527 =item parser
528
529 Breaks a document into a list of words, including their context and position.
530
531 =item indexer
532
533 Writes the list of words in a storage system for quick, efficient retrieval.
534
535 =item searcher
536
537 Parses queries and fetches data from the indexer's storage system.
538
539 =back
540
541 Of course, the division between these parts is not always clean or apparent. Parsing search
542 queries, for example, will necessarily involve elements of the parser and searcher
543 components, while the indexer and searcher are of necessity intrinsically bound.
544
545 But any complete IR system will have these five parts in some combination.
546
547 =head2 Swish-e aggregators and filters are already good
548
549 The existing Swish-e document aggregators (B<DirTree.pl> and B<spider.pl>) and filtering
550 system (B<SWISH::Filter>) are good. They are all written in Perl and are easily modified,
551 and they have ample configuration options and documentation.
552
553 =head2 Why reinvent the wheel?
554
555 Several good IR libraries exist that provide an indexer and searcher. These libraries
556 do UTF-8, incremental indexing, and have search syntax on par with (or better than)
557 Swish-e 2.x. Examples include Xapian, KinoSearch and Lucene.
558 While they might be a little slower
559 than Swish-e (at least in terms of indexing speed), they make up for that with:
560
561 =over
562
563 =item
564
565 well-documented APIs
566
567 =item
568
569 bindings in a variety of programming languages
570
571 =item
572
573 active development communities
574
575 =item
576
577 the flexibility that comes with being a library instead of a fixed program
578
579 =back
580
581
582 =head2 The missing link
583
584 The piece that Swish-e provides that other IR libraries lack is a fast, stable, integrated
585 document parser. Xapian has Omega, but it does not parse XML, nor does it recognize
586 ad hoc word context (metanames).
587
588 However, the Swish-e 2.x parser does not work independently of the Swish-e indexer
589 and searcher, nor does it support UTF-8.
590
591 One piece is missing: a parser that works with the Swish-e aggregator/filter system, supports
592 UTF-8, and offers flexible options for connecting with other IR libraries.
593
594 Ergo, libswish3: a document parser compatible with the existing Swish-e -S prog API
595 and capable of generating UTF-8 wordlists for indexing with a variety of IR libraries.
596
597 =head2 Where does libswish3 fit?
598
599 libswish3 is the core C library in Swish3.
600
601 However, libswish3 may be used without the rest of Swish3.
602 The assumption is that libswish3 could fit into an IR toolchain like this:
603
604  aggregator -> filter -> libswish3 -> some IR library
605
606 You could then use the native search API of the IR library.
607
608 For example, you might use the Swish-e B<spider.pl> script to spider a website, filtering
609 documents with B<SWISH::Filter> and then handing the output to a B<libswish3>-based
610 program that will parse the documents into words and store the data in a
611 Xapian or KinoSearch index (or both!). That model is, in fact, what Swish3 does.
612
613 Or you might use the B<SWISH::Prog> Perl module (from the CPAN) to build your own
614 aggregator/filter system, then hand the output to libswish3.
615
616 =head1 AUTHOR
617
618 Peter Karman (peter@peknet.com).
619
620 =head1 CREDITS
621
622 B<libswish3> is inspired by code from
623 Swish-e (http://www.swish-e.org),
624 Libxml2 (http://www.xmlsoft.org),
625 Apache (http://www.apache.org),
626 Rahul Dhesi (http://www.tug.org/tex-archive/tools/zoo/),
627 Angel Ortega (http://www.triptico.com/software/unicode.html),
628 James Henstridge (http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html),
629 YoLinux (http://www.yolinux.com/TUTORIALS/GnomeLibXml2.html)
630 and no doubt many unnamed others.
631
632 All mistakes, errors and poor programming choices are, however, those
633 of the author.
634
635 =head1 LICENSE
636
637 B<libswish3> is licensed under the GPL.
638
639 libswish3 is free software; you can redistribute it and/or
640 modify it under the terms of the GNU Library General Public
641 License as published by the Free Software Foundation; either
642 version 2 of the License, or (at your option) any later version.
643
644 libswish3 is distributed in the hope that it will be useful,
645 but WITHOUT ANY WARRANTY; without even the implied warranty of
646 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
647 Library General Public License for more details.
648
649 You should have received a copy of the GNU Library General Public
650 License along with libswish3; see the file COPYING.  If
651 not, write to the
652
653  Free Software Foundation, Inc.
654  59 Temple Place - Suite 330
655  Boston, MA 02111-1307, USA
656
657 =head1 SEE ALSO
658
659 The project homepage: http://dev.swish-e.org/wiki/swish3
660
661 swish_lint(1), swish_isw(1), swish_words(1)
662
663 =cut