1 =pod
2
3 =head1 NAME
4
5 libswish3 - Swish3 C library
6
7 =head1 SYNOPSIS
8
9 <<libswish3.h_HERE>>
10
11 =head1 DESCRIPTION
12
13 B<libswish3> is the core C library of B<Swish3>.
14
15 B<libswish3> uses the GNOME L<Libxml2|http://xmlsoft.org/> library to parse words and metadata
16 from XML, HTML and plain text files. B<libswish3> supports full UTF-8 encoding.
17
18 B<libswish3> is a parsing tool for use with information retrieval (IR) libraries.
19 Dynamic language bindings are available in the source distribution in the C<bindings>
20 directory.
21
22 =head1 APIs
23
24 The following APIs are defined:
25
26 =head1 Parsing API
27
28 B<libswish3> provides three basic input functions:
29
30 =over
31
32 =item
33
34 swish_parse_file()
35
36 =item
37
38 swish_parse_fh()
39
40 =item
41
42 swish_parse_buffer()
43
44 =back
45
46 Each of these functions takes a C<swish_Parser> struct pointer
47 and optional I<user_data>.
48
49 In addition:
50
51 =over
52
53 =item
54
55 The swish_parse_file() function takes a file path, which must be a valid file.
56 Directories and links are not supported. The assumption is that you will use
57 your calling code to recurse through directories and handle links.
58
59 =item
60
61 swish_parse_buffer() takes a string representing the document
62 headers and the full text of the document.
63
64 =item
65
66 swish_parse_fh() takes a filehandle pointer, which if set to NULL,
67 defaults to stdin.
68
69 =back
70
71 See the L<Headers API> section for more
72 information on using swish_parse_fh() and
73 swish_parse_buffer().
74
75 See L<The I<handler> Function> for more information on how
76 to deal with the data extracted by each of the swish_parse_* functions.
77
78
79 =head1 Headers API
80
81 The Headers API supports and extends the Swish-e B<-S prog> feature,
82 which allows you to feed the indexer with output from another I<prog>ram.
83 The API has been extended from Swish-e's to allow for MIME types
84 and more congruence with the HTTP 1.1 specification.
85
86 See SWISH-RUN documentation
87 in the Swish-e distribution for the Swish-e version 2 headers API.
88
89 This is the libswish3 implementation. See B<SWISH::Prog::Headers> for a simple
90 Perl-based way of generating the proper headers.
91
92 =over
93
94 =item Content-Location
95
96 B<Swish-e name:> Path-Name
97
98 The name of the document. May be any string: an ID of a record in a database,
99 a URL or a simple file name. The string is stored in the swish_DocInfo B<uri> struct member,
100 which is often used as the primary identifier of a document in an index.
101
102 This header is required.
103
104 =item Content-Length
105
106 The length in bytes of the document, starting after the blank line separating the headers
107 from the document itself.
108 The value must be exactly the length of the document, including any extra
109 line feeds or carriage returns at the end of the document.
110
111 Example:
112
113  Content-Location: foo.html
114  Content-Length: 9
115
116  The doc.\n
117  ^^^^^^^^ ^
118  12345678 9
119
120 The value is stored in the swish_DocInfo B<size> struct member.
121
122 This header is required.
123
124
125 =item Last-Modified
126
127 B<Swish-e name:> Last-Mtime
128
129 The last modification time of the document. The value must be an integer:
130 the seconds since the Epoch on your system.
131
132 If not present, the value defaults to the current time.
133
134 The value is stored in the swish_DocInfo B<mtime> struct member.
135
136 This header is not required.
137
138 =item Parser-Type
139
140 B<Swish-e name:> Document-Type
141
142 Explicitly name the parser used for the document, rather than defaulting to the MIME
143 type mapping based on B<Content-Type> and/or B<Content-Location>. The three parser types are:
144
145 =over
146
147 =item
148
149 XML
150
151 =item
152
153 HTML
154
155 =item
156
157 TXT
158
159 =back
160
161 The Swish-e values B<XML2>, B<XML*>, B<HTML2>, B<HTML*>, B<TXT2>, B<TXT*> are also
162 supported for compatibility, but they map to the three libswish3 types.
163
164 The value is stored in the swish_DocInfo B<parser> struct member.
165
166 If not present, the document parser will be automatically chosen based on the following logic:
167
168 =over
169
170 =item
171
172 If a B<Content-Type> is given, the parser mapped to that MIME type will be used. You may override
173 the default mappings in your configuration. See B<Configuration API>.
174
175 =item
176
177 If no B<Content-Type> is given, a MIME type will be guessed at based on the file extension of the
178 document's B<Content-Location>, and the parser mapped to that MIME type will be used.
179
180 =item
181
182 Finally, if a MIME type is not identified, the parser defined in B<SWISHP_CONFIG_DEFAULT_PARSER>
183 in B<libswish3.h> will be used.
184
185 =back
186
187 See also B<Content-Type> and B<Content-Location>.
188
189 This header is not required.
190
191 =item Content-Type
192
193 The MIME type of the document. The libswish3 MIME type list is based on the Apache 2.0
194 version. See L<http://www.iana.org/assignments/media-types/> for the official registry.
195
196 If not defined with B<Content-Type>, the MIME type will be guessed based on the
197 file extension in the B<Content-Location>
198 header. If the B<Content-Location> string does not contain a file extension (as might be the case
199 with a non-URL value), or the file extension has no MIME mapping, then the MIME type will default
200 to B<SWISHP_DEFAULT_MIME> as defined in B<libswish3.h>.
201
202 You may override the default extension-to-MIME mappings in your configuration. See B<Configuration API>.
203
204 The value is stored in the swish_DocInfo B<mime> struct member.
205
206 See also B<Content-Location> and B<Parser-Type>.
207
208 This header is not required.
209
210
211 =item Update-Mode
212
213 B<NOTE:> This header exists only for backwards compatibility with Swish-e's incremental
214 index feature. B<It may be deprecated in a future version of libswish3.>
215
216 =back
217
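As a minimal sketch of the two required headers plus B<Last-Modified>, the following C function assembles a header block and body into one buffer of the kind swish_parse_buffer() is described as accepting. C<compose_doc()> is a hypothetical helper, not part of B<libswish3>, and it assumes each header line ends with a single newline and that the body is NUL-terminated text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a header block followed by the document body into `out`.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if `out` is too small.
 * compose_doc() is a hypothetical helper, not a libswish3 function.
 */
int
compose_doc(char *out, size_t outlen, const char *uri, long mtime,
            const char *body)
{
    assert(out != NULL && uri != NULL && body != NULL);

    /* Content-Length counts only the bytes after the blank line,
       including any trailing newline that belongs to the body. */
    size_t body_len = strlen(body);

    int n = snprintf(out, outlen,
                     "Content-Location: %s\n"
                     "Content-Length: %lu\n"
                     "Last-Modified: %ld\n"
                     "\n"
                     "%s",
                     uri, (unsigned long) body_len, mtime, body);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t) n >= outlen)
        return -1;
    return n;
}
```

Calling C<compose_doc(buf, sizeof buf, "foo.html", 0L, "The doc.\n")> produces a B<Content-Length> of 9, matching the example above. Binary documents would need an explicit length argument instead of strlen().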
218
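The parser selection order described under B<Parser-Type> and B<Content-Type> can be sketched roughly as follows. This is an illustration only: the lookup tables, the C<DEFAULT_MIME> and C<DEFAULT_PARSER> values, and the function names are assumptions made for the example, not the mappings or symbols shipped in B<libswish3.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the library defaults named above. */
#define DEFAULT_MIME   "text/plain"
#define DEFAULT_PARSER "TXT"

#define NELEMS(a) (sizeof(a) / sizeof((a)[0]))

/* Tiny illustrative tables; the real mappings live in libswish3
   and may be overridden in the configuration. */
static const char *ext_to_mime[][2] = {
    { "html", "text/html" },
    { "xml",  "application/xml" },
    { "txt",  "text/plain" },
};
static const char *mime_to_parser[][2] = {
    { "text/html",       "HTML" },
    { "application/xml", "XML" },
    { "text/plain",      "TXT" },
};

/* Linear search of a key/value table; returns NULL when not found. */
static const char *
lookup(const char *table[][2], size_t n, const char *key)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(table[i][0], key) == 0)
            return table[i][1];
    return NULL;
}

/* An explicit Content-Type wins; otherwise guess from the extension
   of Content-Location; otherwise fall back to the default MIME type. */
const char *
guess_mime(const char *content_type, const char *uri)
{
    assert(uri != NULL);
    if (content_type != NULL)
        return content_type;

    const char *mime = NULL;
    const char *dot = strrchr(uri, '.');
    if (dot != NULL && dot[1] != '\0')
        mime = lookup(ext_to_mime, NELEMS(ext_to_mime), dot + 1);
    return mime != NULL ? mime : DEFAULT_MIME;
}

/* An explicit Parser-Type wins; otherwise map the MIME type to a
   parser; otherwise use the default parser. */
const char *
choose_parser(const char *parser_type, const char *content_type,
              const char *uri)
{
    if (parser_type != NULL)
        return parser_type;

    const char *parser = lookup(mime_to_parser, NELEMS(mime_to_parser),
                                guess_mime(content_type, uri));
    return parser != NULL ? parser : DEFAULT_PARSER;
}
```

The extension test here is deliberately naive (everything after the last dot), so a value such as a database record ID simply falls through to the defaults.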
219
220 =head1 Structures API
221
222 Writing an effective I<handler> function requires an understanding of some of the key
223 B<libswish3> data structures.
224
225 For more details on any of these structures, see the SYNOPSIS.
226
227 =head2 swish_3
228
229 The main data structure. A swish_3 object has a swish_Config, swish_Analyzer and swish_Parser
230 object and delegates to each as appropriate.
231
232 This is typically the only object you need to create and use.
233
234 =head2 swish_Config
235
236 A configuration object. This object is required for initializing both a C<swish_Analyzer>
237 object and a C<swish_Parser> object.
238
239 =head2 swish_Parser
240
241 A parser object. Required for executing any of the three C<swish_parse_*> functions.
242
243 =head2 swish_ParserData
244
245 A parser data object. This object is passed around internally by the libxml2
246 SAX2 handlers, and is eventually the object passed to the I<handler> function pointer.
247 See L<The I<handler> Function>.
248
249 =head2 swish_WordList
250
251 A list of words or tokens. The object contains a linked list of swish_Word objects.
252 You can iterate over the contents of the WordList like this:
253
254  SWISH_DEBUG_MSG("%d words in list", list->nwords);
255  list->current = list->head;
256  while (list->current != NULL)
257  {
258         swish_debug_msg("   ---------- WORD ---------  ");
259         swish_debug_msg("word     : %s", list->current->word);
260         swish_debug_msg(" meta    : %s", list->current->metaname);
261         swish_debug_msg(" context : %s", list->current->context);
262         swish_debug_msg("  pos    : %d", list->current->position);
263         swish_debug_msg("soffset  : %d", list->current->start_offset);
264         swish_debug_msg("eoffset  : %d", list->current->end_offset);
265            
266         list->current = list->current->next;
267  }
268
269 =head2 swish_Word
270
271 An object representing one word or token. The word's start and end offset,
272 position relative to other words, tag context and MetaName are all available in the object.
273
274 =head2 swish_DocInfo
275
276 An object describing metadata about the document itself: URI, MIME type, size, etc.
277
278 =head2 swish_Analyzer
279
280 The Analyzer object controls how the character content of a document is parsed: whether
281 or not a WordList is created with a tokenizer, if the words (tokens) are lowercased or
282 stemmed, etc.
283
284 =head1 The I<handler> Function
285
286 The I<handler> function pointer is the final link in the parsing chain. The function
287 pointer is set in the swish_Parser object constructor, and is called by each of the
288 swish_parse_* functions after the entire document has been parsed and (optionally)
289 tokenized.
290
291 The I<handler> receives one argument: a swish_ParserData object containing all the metadata
292 and words in the document.
293
294 If all you wanted to do was print out a report about each document as it was parsed,
295 your I<handler> function might be as simple as:
296
297  void
298  my_handler( swish_ParserData * parse_data )
299  {
300     swish_debug_docinfo( parse_data->docinfo );
301     swish_debug_wordlist( parse_data->wordlist );
302     swish_debug_nb( parse_data->properties, "Property" );
303     swish_debug_nb( parse_data->metanames, "MetaName" );
304  }
305  
306 B<IMPORTANT:> After the I<handler> function is called, all the structures referenced
307 by the swish_ParserData object are automatically freed, so if you intend to keep any of the
308 data for storing in an index, you will need to strdup() words, properties, docinfo, etc.
309 as part of your indexing code.
310
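To illustrate the copying that the B<IMPORTANT> note calls for, here is a minimal sketch that deep-copies the words from a linked list before the list is freed. The C<my_word> and C<my_word_list> structs are simplified, hypothetical stand-ins for swish_Word and swish_WordList, and C<copy_words()> and C<free_words()> are example helpers, not B<libswish3> functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for swish_Word and swish_WordList;
   the real structs carry more members (see the SYNOPSIS). */
typedef struct my_word {
    const char     *word;
    struct my_word *next;
} my_word;

typedef struct {
    my_word *head;
    size_t   nwords;
} my_word_list;

/* Portable equivalent of POSIX strdup(). */
static char *
dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Release the first n copies and the array that holds them. */
void
free_words(char **copies, size_t n)
{
    if (copies == NULL)
        return;
    for (size_t i = 0; i < n; i++)
        free(copies[i]);
    free(copies);
}

/*
 * Deep-copy every word into a newly allocated array owned by the caller.
 * Returns NULL for an empty list or on allocation failure.
 */
char **
copy_words(const my_word_list *list)
{
    assert(list != NULL);
    if (list->nwords == 0)
        return NULL;

    char **copies = calloc(list->nwords, sizeof *copies);
    if (copies == NULL)
        return NULL;

    size_t i = 0;
    for (const my_word *w = list->head;
         w != NULL && i < list->nwords; w = w->next) {
        copies[i] = dup_string(w->word);
        if (copies[i] == NULL) {
            free_words(copies, i);   /* undo the partial copy */
            return NULL;
        }
        i++;
    }
    return copies;
}
```

A real I<handler> would do the equivalent for any properties and docinfo fields it wants to keep, then hand the copies to the indexing code, which becomes responsible for freeing them.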
311 See the example C<swish_lint.c> file for how to create and pass in a I<handler>
312 function pointer to the swish_init_swish3() constructor.
313
314 =head1 Configuration API
315
316 Configuration is different with B<libswish3> than with Swish-e. The biggest change
317 is that B<libswish3> configuration files are written in XML. This is done for several
318 reasons:
319
320 =over
321
322 =item 1
323
324 Since B<libswish3> already has a powerful XML parser built-in, it's much easier to
325 parse a configuration file written in XML than to port the Swish-e config parser
326 to B<libswish3>.
327
328 =item 2
329
330 B<libswish3> stores index header information in an XML format nearly identical
331 to the configuration file format. So the parser needs to understand only one XML
332 schema.
333
334 =item 3
335
336 You can store UTF-8 text in your configuration file and it will be parsed correctly.
337
338 =item 4
339
340 The configuration directive list is extensible. Simple key/value configuration directives
341 can be added without any modification to the B<libswish3> config parser. They are simply
342 stored in the C<swish_Config> struct hash for your own use and amusement.
343
344 B<CAUTION:> The configuration directive names documented in the L<Directives> section below
345 are reserved for use by B<libswish3>. Some of them have special handling considerations
346 (like MetaNames and PropertyNames). So the important idea to grasp with the extensible
347 configuration feature is "simple key/value pairs."
348
349 =back
350
351 This section describes how to build a B<libswish3> configuration file.
352
353 =head2 Configuration Example
354
355 Here's an example B<libswish3> configuration file:
356
357  <swish>
358   <FollowSymLinks>yes</FollowSymLinks>
359  
360   <MetaNames>
361    <foo bias="+10" />
362    <bar bias="-5" />
363    <swishtitle bias="+50" alias="title" />
364    <other>color size weight</other>
365   </MetaNames>
366  
367   <PropertyNames>
368    <foo type="text" ignorecase="1" />
369    <bar type="int" />
370    <lastmod type="date" />
371    <bing comparecase="1" />
372    <description verbatim="1" max="10000" alias="body" length="20" />
373    <notsorted sort="0" />
374   </PropertyNames>
375  
376   <Tokenize>1</Tokenize>
377  </swish>
378
379 And here's that same example, dissected:
380
381  <swish>
382
383 The top level tag.
384
385  <FollowSymLinks>yes</FollowSymLinks>
386
387 Equivalent to the Swish-e style:
388
389  FollowSymLinks yes
390
391 which simply informs whatever aggregator you are using that when confronted
392 with a symlink on the filesystem, it should be followed.
393
394 C<FollowSymLinks> is an example of a simple key/value pair (see the B<CAUTION> above).
395
396 =head3 MetaNames
397
398 Here's the first big difference from Swish-e. MetaNames, MetaNameAlias, and
399 MetaNamesRank have been combined into a single XML tag with appropriate
400 attributes.
401
402  <foo bias="10" />
403
404 is the same thing as (in Swish-e style):
405
406  MetaNames foo
407  MetaNamesRank 10 foo
408
409 while:
410
411  <swishtitle bias="50" alias="title" />
412
413 is equivalent to:
414
415  MetaNames swishtitle
416  MetaNameAlias swishtitle title
417  MetaNamesRank 50 swishtitle
418
419 You can see that the XML style allows for a terser, more compact expression.
420 You can still assign multiple aliases to a single MetaName:
421
422  <other>color size weight</other>
423
424 is equivalent to:
425
426  MetaNames other
427  MetaNameAlias other color size weight
428
429 In addition, there are some special features intended for use with HTML documents.
430
431  <links html="1" alias="href" />      # same as HTMLLinksMetaName
432  <images html="1" alias="src" />      # same as ImageLinksMetaName
433  <alttext html="1" alias="alt" />     # same as IndexAltTagMetaName
434  <as-text html="1" alias="alt" />     # same as IndexAltTagMetaName
435
436 =head3 PropertyNames
437
438 PropertyNames, PropertyNamesCompareCase, PropertyNamesIgnoreCase, PropertyNamesNoStripChars,
439 PropertyNamesNumeric, PropertyNamesDate, PropertyNameAlias, PropertyNamesMaxLength,
440 PropertyNamesSortKeyLength, StoreDescription and PreSortedIndex
441 have all been combined into a single XML directive.
442
443 Here's the example from above with equivalent Swish-e directives annotated:
444
445  <foo ignorecase="1" />
446  # PropertyNamesIgnoreCase foo
447
448  <bar type="int" />
449  # PropertyNamesNumeric bar
450  
451  <lastmod type="date" />
452  # PropertyNamesDate lastmod
453  
454  <bing comparecase="1" />
455  # PropertyNamesCompareCase bing
456  
457  <description verbatim="1" max="10000" alias="body" length="20" />
458  # PropertyNamesNoStripChars description
459  # PropertyNamesMaxLength 10000 description
460  # PropertyNameAlias description body
461  # PropertyNamesSortKeyLength 20 description
462
463  <notsorted sort="0" />
464  # PreSortedIndex foo bar lastmod bing description
465
466 Again, the XML format greatly simplifies the syntax. You can assign attributes
467 as you need, though be aware that some attributes are inherently mismatched
468 and might generate an error or unexpected behaviour:
469
470  <foo ignorecase="1" type="int" />      # wrong
471  <foo comparecase="1" type="date" />    # wrong
472  <foo verbatim="1" type="int" />        # wrong
473  <foo sort="0" length="20" />           # wrong
474
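One way a calling program might catch such mismatches before loading a configuration is sketched below. The rules are a generalization from the four "wrong" examples above (case and verbatim options treated as text-only, a sort key length treated as pointless when sorting is off). They are assumptions made for illustration, not the checks B<libswish3> itself performs. The C<prop_attrs> struct and C<prop_attrs_conflict()> are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { PROP_TEXT, PROP_INT, PROP_DATE } prop_type;

/* Hypothetical in-memory form of one PropertyNames entry. */
typedef struct {
    prop_type type;
    bool      ignorecase;
    bool      comparecase;
    bool      verbatim;
    bool      sort;     /* false corresponds to sort="0" */
    int       length;   /* sort key length; 0 means not set */
} prop_attrs;

/*
 * Returns NULL if the combination looks sane, otherwise a short reason.
 * The rules are inferred from the documented examples only.
 */
const char *
prop_attrs_conflict(const prop_attrs *p)
{
    assert(p != NULL);

    if (p->type != PROP_TEXT && (p->ignorecase || p->comparecase))
        return "case options apply only to text properties";
    if (p->type != PROP_TEXT && p->verbatim)
        return "verbatim applies only to text properties";
    if (!p->sort && p->length > 0)
        return "sort key length has no effect when sorting is off";
    return NULL;
}
```

Returning a reason string rather than a boolean lets the caller report which attribute pair was at fault.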
475 =head2 Directives
476
477 The following configuration directives are currently supported.
478
479  TODO
480
481 =head1 EXAMPLES
482
483 See the C<swish_lint.c> file included in the libswish3 distribution.
484
485 =head1 FAQ
486
487 =head2 What is IR?
488
489 Information Retrieval.
490
491 =head2 How is libswish3 related to Swish-e?
492
493 libswish3 is the core parsing library for Swish-e version 3 (Swish3).
494
495 =head2 Is libswish3 a search engine?
496
497 No. libswish3 is a document parser. It might work well in or with any number of search engines,
498 but it is not in itself any kind of search tool.
499
500 =head2 So what does libswish3 DO exactly?
501
502 libswish3 reads text, HTML and XML files and extracts just the words and document
503 properties from each document. It then hands off the wordlist and properties
504 to a I<handler> function. Finally, it frees all the memory associated with the wordlist
505 and properties.
506
507 The I<handler> function can do whatever you wish, though typically a I<handler>
508 would iterate over the words in the wordlist and add each one to an index using
509 an IR library API.
510
511
512 =head1 BACKGROUND
513
514 libswish3 is part of the Swish-e project.
515 It was born out of the need for UTF-8 and incremental
516 indexing support and a desire to experiment with alternate indexing
517 libraries like Lucene, KinoSearch, Xapian and Hyperestraier.
518
519 libswish3 was developed with the idea that many quality IR libraries already exist,
520 but few if any provide an easy and fast way of preparing documents for indexing.
521 The following assumptions informed the development of libswish3.
522
523 =head2 The IR Toolchain
524
525 A decent IR toolchain requires 5 parts:
526
527 =over 4
528
529 =item aggregator
530
531 Collects documents from a filesystem, database, website or other sources.
532
533 =item filter
534
535 Normalizes documents to a standard format (plain text or a delimited or markup format
536 like YAML, HTML or XML) for indexing.
537
538 =item parser
539
540 Breaks a document into a list of words, including their context and position.
541
542 =item indexer
543
544 Writes the list of words in a storage system for quick, efficient retrieval.
545
546 =item searcher
547
548 Parses queries and fetches data from the indexer's storage system.
549
550 =back
551
552 Of course, the division between these parts is not always clean or apparent. Parsing search
553 queries, for example, will necessarily involve elements of the parser and searcher
554 components, while the indexer and searcher are intrinsically bound.
555
556 But any complete IR system will have these five parts in some combination.
557
558 =head2 Swish-e aggregators and filters are already good
559
560 The existing Swish-e document aggregators (B<DirTree.pl> and B<spider.pl>) and filtering
561 system (B<SWISH::Filter>) are good. They are all written in Perl and are easily modified,
562 and they have ample configuration options and documentation.
563
564 =head2 Why reinvent the wheel?
565
566 Several good IR libraries exist that provide an indexer and searcher. These libraries
567 do UTF-8, incremental indexing, and have search syntax on par with (or better than)
568 Swish-e 2.x. Examples include Xapian, KinoSearch and Lucene.
569 While they might be a little slower
570 than Swish-e (at least in terms of indexing speed) they make up for that with:
571
572 =over
573
574 =item
575
576 well-documented APIs
577
578 =item
579
580 bindings in a variety of programming languages
581
582 =item
583
584 active development communities
585
586 =item
587
588 the flexibility that comes with being a library instead of a fixed program
589
590 =back
591
592
593 =head2 The missing link
594
595 The piece that Swish-e provides that other IR libraries lack is a fast, stable, integrated
596 document parser. Xapian has Omega, but it does not parse XML, nor does it recognize
597 ad hoc word context (metanames).
598
599 However, the Swish-e 2.x parser does not work independently of the Swish-e indexer
600 and searcher, nor does it support UTF-8.
601
602 One piece is missing: a parser that works with the Swish-e aggregator/filter system, supports
603 UTF-8, and offers flexible options for connecting with other IR libraries.
604
605 Ergo, libswish3: a document parser compatible with the existing Swish-e -S prog API
606 and capable of generating UTF-8 wordlists for indexing with a variety of IR libraries.
607
608 =head2 Where does libswish3 fit?
609
610 libswish3 is the core C library in Swish3.
611
612 However, libswish3 may be used without the rest of Swish3.
613 The assumption is that libswish3 could fit into an IR toolchain like this:
614
615  aggregator -> filter -> libswish3 -> some IR library
616
617 You could then use the native search API of the IR library.
618
619 For example, you might use the Swish-e B<spider.pl> script to spider a website, filtering
620 documents with B<SWISH::Filter> and then handing the output to a B<libswish3>-based
621 program that will parse the documents into words and store the data in a
622 Xapian or KinoSearch index (or both!). That model is, in fact, what Swish3 does.
623
624 Or you might use the B<SWISH::Prog> Perl module (from the CPAN) to build your own
625 aggregator/filter system, then hand the output to libswish3.
626
627 =head1 AUTHOR
628
629 Peter Karman (peter@peknet.com).
630
631 =head1 CREDITS
632
633 B<libswish3> is inspired by code from
634 Swish-e (http://www.swish-e.org),
635 Libxml2 (http://www.xmlsoft.org),
636 Apache (http://www.apache.org),
637 Rahul Dhesi (http://www.tug.org/tex-archive/tools/zoo/),
638 Angel Ortega (http://www.triptico.com/software/unicode.html),
639 James Henstridge (http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html),
640 YoLinux (http://www.yolinux.com/TUTORIALS/GnomeLibXml2.html)
641 and no doubt many unnamed others.
642
643 All mistakes, errors and poor programming choices are, however, those
644 of the author.
645
646 =head1 LICENSE
647
648 B<libswish3> is licensed under the GPL.
649
650 libswish3 is free software; you can redistribute it and/or
651 modify it under the terms of the GNU Library General Public
652 License as published by the Free Software Foundation; either
653 version 2 of the License, or (at your option) any later version.
654
655 libswish3 is distributed in the hope that it will be useful,
656 but WITHOUT ANY WARRANTY; without even the implied warranty of
657 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
658 Library General Public License for more details.
659
660 You should have received a copy of the GNU Library General Public
661 License along with libswish3; see the file COPYING.  If
662 not, write to the
663
664  Free Software Foundation, Inc.
665  59 Temple Place - Suite 330
666  Boston, MA 02111-1307, USA
667
668 =head1 SEE ALSO
669
670 The project homepage: http://dev.swish-e.org/wiki/swish3
671
672 swish_lint(1), swish_isw(1), swish_words(1)
673
674 =cut