root/swish-e/trunk/pod/CHANGES.pod

Revision 2147, 49.1 kB (checked in by karpet, 4 months ago)

update changes file and fix longstanding doc bug

1 =head1 NAME
2
3 CHANGES - List of revisions
4
5 =head1 OVERVIEW
6
7 This document contains a list of bug fixes and feature additions to Swish-e.
8
9 =head2 Version 2.4.6 - 10 March 2008
10
11 =over 4
12
13 =item MinWordLength respected in query parser
14
15 Clark Vent reported that the query parser was not respecting MinWordLength
16 settings.  See http://dev.swish-e.org/changeset/2145
17
18 =item Patch to file.c. 
19
20 The file.c patch was in response to
21 http://swish-e.org/archive/2007-03/11321.html
22 although that user never responded about that patch.
23
24 =item SWISH_DEBUG_RANK env var now enables rank debugging
25
26 Set SWISH_DEBUG_RANK to a true value to enable lots of rank debugging
27 on stderr.
28
29 =item Perl Makefile.PL patched to fix MakeMaker issue
30
31 Recent versions of ExtUtils::MakeMaker revealed a bug in Makefile.PL.
32 Patch from mschwern via RT, report by mpeters.
33
34 =item LARGEFILE support detected automatically in configure
35
36 jrobinson852@yahoo.com suggested that LARGEFILE support be auto-detected since
37 it is needed so often on Linux systems.
38
39 =item New Snowball stemmers
40
41 Trygve Falch contributed patches to update
42 the Snowball stemmers, including new Hungarian and Romanian stemmers.
43
44 =item Patched leaks
45
46 Antony Dovgal patched two leaks.  In the first, the file name was not freed
47 when a file failed to open.
48
49 In the second, SwishSetSearchLimit() was nulling the search limits when an error
50 was found in the parameters, but not freeing the existing limits.
51
52 =item Leak in SwishResetSearchLimit
53
54 Fixed a leak if a limit was set and then reset but not prepared.
55 Patch provided by Antony Dovgal.
56
57 =item New API functions added
58
59 Added SwishGetStructure() and SwishGetPhraseDelimiter() functions which return
60 relevant properties of the search object.
61 Patch provided by Antony Dovgal.
62
63
64 =back
65
66 =head2 Version 2.4.5 - 22 Jan 2007
67
68 =over 4
69
70 =item Fixed 'deflate' handling in spider.pl
71
72 spider.pl was using the wrong method to uncompress HTTP responses that were
73 'deflate' encoded.  Also decode content based on the document's charset and
74 encode back to charset before outputting.
75
76 =item re-indexing required
77
78 The magic numbers in src/swish.h were changed to require re-indexing from
79 version 2.4.4 indexes. This should have been done in 2.4.4 as well, and anytime
80 the index format changes. -- karman
81
82 =item fixed stemmer bug introduced in 2.4.4
83
84 stemmer.c had a mix up in the deprecated stemmer assignments for "Stemmer_en"
85 and "Stem". Also fixed stemmer.h so that 2.4.3 indexes can be read correctly.
86 -- karman
87
88 =item Now fork/exec to run filters
89
90 FileFilter* was using popen to run the filter, which could pass user
91 data through the shell.  Now uses fork/exec if fork is available, which
92 should be everywhere except Windows.  On Windows popen is still used, but all
93 parameters are double-quoted. -- moseley
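
As an illustration only (this is Python, not the C code in Swish-e), the sketch below shows why running a filter without a shell is safer: arguments passed as a list reach the child process verbatim, so shell metacharacters in a file name are never interpreted.  The helper name run_filter and the hostile file name are hypothetical.

```python
import subprocess
import sys


def run_filter(program, args):
    """Run a filter program directly, with no shell in between."""
    # A list of arguments is handed to the child process as-is;
    # nothing in it is parsed or expanded by a shell.
    result = subprocess.run(
        [program, *args], capture_output=True, text=True, check=True
    )
    return result.stdout


# A file name that would be dangerous if it were passed through a shell.
hostile_name = "report.pdf; echo INJECTED"

# The child here is just the Python interpreter echoing its first argument,
# standing in for a real filter program.
echo_args = ["-c", "import sys; print(sys.argv[1])", hostile_name]
output = run_filter(sys.executable, echo_args)
```

The child receives the whole string as a single argument, so the text after the semicolon is treated as part of the file name rather than as a second command.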
94
95 =item fixed signed/unsigned warnings from gcc 4.x
96
97 Cleaned up search.c to catch mismatched signedness warnings from newer GCC versions.
98 This issue pre-existed 2.4.4 but the new wildcard features in search.c made for a lot
99 more warnings. -- karman
100
101 =item Makefile.mingw included in distrib
102
103 Modified root Makefile to include the perl/Makefile.mingw file. -- karman
104
105 =back
106
107 =head2 Version 2.4.4 - 11 Oct 2006
108
109 =over 4
110
111 =item Version 2.4.4 RC1
112
113 Release Candidate 1 for 2.4.4, 2 Oct 2006.
114
115 =item quote fix for FileFilter config param
116
117 Ludovic Drolez contributed a patch to fix a quoting issue with filenames. This affects
118 non-Windows builds only.
119
120 =item SWISH::Filter now on CPAN
121
122 SWISH::Filter is now available on http://cpan.org/. The version in the distribution is
123 B<not> kept in sync with the CPAN version. Install the CPAN version if you want
124 the latest and greatest version.
125
126 =item SWISH::API updated to 0.04
127
128 Added several fixes, including:
129
130 =over
131
132 =item Perlish method names from mpeters@plusthree.com
133
134 =item switched to XSLoader with DynaLoader as fallback
135
136 =item added VERSION method to satisfy some versions of MakeMaker
137
138 =item Fuzzify() method now actually works as advertised
139
140 =back
141
142 =item added proximity feature and single character wildcard with '?' instead of '*'
143
144 Herman Knoops contributed these patches.
145 See http://swish-e.org/archive/2006-05/10543.html
146
147 Error messages were also changed to better reflect correct use of wildcards.
148
149 =item fixed bug when using DoubleMetaphone
150
151 Fixed problem reported by Andreas Völter where a query that generated a
152 two-word query with DoubleMetaphone fuzzy mode was not working.
153
154 =item fix sparc64 property issue
155
156 Sorithy Seng (pourlassi@gmail.com) submitted a patch against docprop.c to fix
157 an issue on sparc64 platforms. It is unknown whether this bug affected other 64-bit
158 architectures.
159
160 =item fixed bug when StopWords resulted in no unique words
161
162 Added check in db_native.c to check that some words exist before writing index.
163
164 =item updates to SWISH-RUN.1
165
166 Added doc for -u and -r options.
167
168 =item filename only in SWISH::Filters
169
170 added fix to SWISH::Filters::pp2html and SWISH::Filters::XLtoHTML to
171 save only filename as title without full path
172
173 =item Removed Stem and Stemmer_en
174
175 The legacy Porter stemmer was removed. This had been deprecated some time ago.
176 A warning will issue if the old stemmer is indicated in config file, and Stemmer_en1
177 will be used instead.
178
179 =item GPL'd all the source files with the new Swish-e License
180
181 After a source code review, the developers decided to put Swish-e under the GPL
182 with a special exception for linking against libswish-e. See http://swish-e.org/license.html
183 for the details.
184
185 =item Fixed Segfault with updating incremental index
186
187 Dobrica Pavlinusic reported a segfault after updating an index multiple times.
188 José provided updated worddata.c.  - April 27, 2005
189
190 =item Fixed NOT check with incremental indexes
191
192 Swish was returning results for deleted files when the NOT operator was used.
193
194 =item Fixed bug when using old parsers with zero length input
195
196 Thomas Angst reported swish consuming memory when using -S prog
197 to process large number of empty documents.
198
199 When -S prog generated a zero length file the old parsers (e.g. TXT) would
200 attempt to read in *all* content from the -S prog program into a buffer.
201 The old parser incorrectly assumed it was reading from a filter and tried to
202 read to eof().
203
204 =item Changes to ParserWarnLevel
205
206 The default value for ParserWarnLevel was changed from zero to two.
207
208 The ParserWarnLevel controls the error handling of the libxml2 parser. The higher
209 the setting, the more verbose the output. The change to the default is to report
210 when libxml2 has problems parsing a document (which often times results in processing
211 only part of a document).
212
213 To get the old behavior, either set ParserWarnLevel to zero in your config file,
214 or use the new -W command line option to set the ParserWarnLevel at run time.
215 If ParserWarnLevel is set in the config file, it will override the -W option.
216
217 Also, to see UTF-8 to 8859-1 conversion errors set ParserWarnLevel to 3 or more.  Previously,
218 these warnings were issued at a ParserWarnLevel of one.
219
220 =item Documentation changes
221
222 Removed all the target documentation (html, pdf, ps) from cvs.  There's now a separate
223 cvs module "swish_website" that is used to generate both the website and the html
224 docs.  If building swish-e from cvs please see the README.cvs file for instructions.
225
226 =item Fixed bug in pre-sorted indexes with USE_BTREE
227
228 Gunnar Mätzler reported a problem with reading the pre-sorted property index
229 tables when running with USE_BTREE (--enable-incremental).  Not all entries were
230 being written to disk.  There was/is a question if the "array" code used for
231 pre-sorted indexes with USE_BTREE would be slower.  So, added a separate
232 define USE_PRESORT_ARRAY to enable that code when USE_BTREE is set.  This allows
233 using the old integer arrays with USE_BTREE.  Gunnar reported that this is working,
234 but more testing is needed.  Need to compare speed of the array code vs. the non-array
235 code, and to verify the workings of USE_PRESORT_ARRAY code.
236
237 =item Add strcoll() usage for sorting properties
238
239 Andreas Seltenreich provided a patch to use strcoll when sorting properties.
240 strcoll is locale dependent.
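
To illustrate what locale-dependent sorting means, here is a minimal Python sketch (not Swish-e code) that sorts strings through the standard library's wrapper around strcoll.  The function name sort_properties and the sample values are made up.  Only the "C" locale is used because it is available everywhere; other locales may order the same strings differently.

```python
import functools
import locale


def sort_properties(values):
    """Sort strings using the collation rules of the current locale."""
    return sorted(values, key=functools.cmp_to_key(locale.strcoll))


# In the "C" locale, collation follows plain character codes,
# so upper case letters sort before lower case ones.
locale.setlocale(locale.LC_COLLATE, "C")
c_order = sort_properties(["zoo", "apple", "Zebra"])
```

Switching LC_COLLATE to a language-specific locale installed on the system would typically interleave upper and lower case, which is the kind of difference the patch makes visible in sorted properties.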
241
242 =item Fix incremental indexing when adding back a file
243
244 Jose fixed a problem with incremental indexing where a file could not be
245 added back to the index once removed.
246
247 Patch initially provided by Dobrica Pavlinusic:
248
249     http://swish-e.org/Discussion/archive/2004-12/8694.html
250
251
252
253 =item Documentation correction
254
255 A change in the default way the index is compressed was not documented
256 in 2.4.3.  The change resulted in larger indexes.  See CompressPositions
257 below and in SWISH-CONFIG.
258
259 =item libxml2 UTF-8 conversion failures
260
261 Fixed issue where a UTF-8 to Latin1 encoding failure would skip
262 more input than just the failed character.   Libxml2 passes swish text
263 that is not null terminated, but the libxml2 functions to skip UTF-8
264 chars expected a null-terminated string.  Replace libxml2 call with
265 fixed version.
266
267 =back
268
269 =head2 Version 2.4.3 December 9, 2004
270
271 =over 4
272
273 =item New config directive: CompressPositions
274
275 This option enables zlib compression for word data in the index.
276 Previously word data was always compressed but resulted in slower
277 wildcard searches.  The default now is to not compress the word data,
278 but results in larger index files.  Set to "YES" to get pre-2.4.3 index
279 sizes.
280
281 [This CHANGES entry was added after 2.4.3 was released]
282
283 =item Improved error messages when using incremental indexing
284
285 There was a bit of confusion on how to use incremental indexing (still
286 experimental) so added better logic for error messages.
287
288 Also fixed a logic error when setting the incremental update mode.  Caught by
289 Paul Loner.
290
291 =back
292
293 =head2 Version 2.4.3-pr1 - Wed Dec  1 09:52:50 PST 2004
294
295 =over 4
296
297 =item "Fixed" libxml2's change in UTF8Toisolat1() return value
298
299 Bernhard Weisshuhn supplied a patch to parser.c for checking the return value of
300 UTF8Toisolat1().  Seems that libxml2 now returns the number of characters converted
301 instead of zero for success.
302
303    http://bugzilla.gnome.org/show_bug.cgi?id=153937
304
305 =item Added swish-config and pkg-config
306
307 Swish now provides a swish-config script and config file for the pkg-config
308 utility.  These tools help when building programs that link with the swish-e
309 library.
310
311 The SWISH::API Makefile.PL program uses swish-config to locate the installation
312 directory of swish-e.  This should make building SWISH::API easier when swish-e
313 is installed in a non-standard location.
314
315 =item Fixed rank bias in merge
316
317 Peter van Dijk noticed that MetaNamesRank settings were not being copied to the output
318 index when merging.
319
320 =item Added SwishFuzzy function
321
322 SwishFuzzy function (SWISH::API::Fuzzy) lets you stem a word without first searching.
323 This might be helpful for playing with queries prior to the search.
324
325
326 =item Fixed translate character table
327
328 Michael Levy found an error in the table used to translate 8859-1 to
329 ascii7.  Luckily, it was an upper case translation and the table is only used on lower
330 case characters.
331
332 =item MetaNamesRank documentation
333
334 Changed the 'not yet implemented' caveat to 'implemented but experimental'.
335
336 =item Added Continuation option to config processing
337
338 You can now use continuation lines in the config file:
339
340     IgnoreWords \
341         the \
342         am \
343         is \
344         are \
345         was
346
347 There may not be any characters following the backslash.
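
The rule can be sketched in a few lines of Python.  This is an illustration of the behavior described above, not the parser used by Swish-e, and the function name join_continuations is hypothetical.  A line is continued only when the backslash is its very last character.

```python
def join_continuations(lines):
    """Merge physical config lines ending in a backslash into logical lines."""
    joined = []
    pending = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            # Drop the backslash and keep collecting the logical line.
            pending += line[:-1] + " "
        else:
            joined.append(pending + line)
            pending = ""
    if pending:
        # The input ended on a continuation line; keep what was collected.
        joined.append(pending.rstrip())
    return joined


config = [
    "IgnoreWords \\",
    "    the \\",
    "    am \\",
    "    is \\",
    "    are \\",
    "    was",
]
logical_lines = join_continuations(config)
# Collapse runs of whitespace to show the resulting directive.
directive = " ".join(logical_lines[0].split())
```

A backslash followed by a space is not treated as a continuation in this sketch, which mirrors the restriction stated above.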
348
349 =item Fixed Buzzwords (and other word lists entered in the config)
350
351 Words entered in config were not converted to lower case before storing in the index.
352
353
354 =item Fixed metaname mapping problem in Merge
355
356 Peter Karman found an error when merging indexes where the source indexes had the
357 same metanames, but listed in a different order in their config files.  Words
358 would then be indexed under the wrong metaID number in the output index.
359
360
361 =item SWISH::Filters and spider.pl updates
362
363 The web spider F<spider.pl> was updated to work better with SWISH::Filter
364 by default and also make it easier to use the spider default along with
365 a spider config file.  See spider.pl for details.
366
367 SWISH::Filter was updated.  The way filters are created has changed.
368 If you created your own filters you will need to update them.  Take a look
369 at SWISH::Filter and the filters included in the distribution.
370
371 =item Updates to Documentation
372
373 Richard Morin submitted formatting and punctuation updates to the README and
374 INSTALL docs.
375
376 =item Added -R option to support IDF word weighting in ranking. (karman)
377
378 Added Inverse Document Frequency calculation to the getrank() routine.
379 This will allow the relative frequency of a word in relationship to other
380 words in the query to impact the ranking of documents.
381
382 Example: if 'foo' is present twice as often as 'bar' in the collection as a whole,
383 a search for 'foo bar' will weight documents with 'bar' more heavily (i.e., higher
384 rank) than those with 'foo'.
385
386 The impact is greatest when OR'ing words in a query rather than
387 AND'ing them (which is the default).
388
389 Also added Rank discussion to the FAQ.
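
The 'foo bar' example can be worked through with a small Python sketch.  The exact formula used by getrank() is not given in this entry, so the textbook form log(N / df) is used here as an assumption, and the collection size and document counts are invented.

```python
import math


def idf(total_docs, docs_with_word):
    """Textbook inverse document frequency: rarer words get larger weights."""
    return math.log(total_docs / docs_with_word)


# Hypothetical collection: 'foo' appears in twice as many documents as 'bar'.
total_docs = 1000
doc_freq = {"foo": 200, "bar": 100}

weights = {word: idf(total_docs, df) for word, df in doc_freq.items()}
```

With these numbers 'bar' receives the larger weight, so a document matching 'bar' contributes more to the rank than one matching only 'foo'.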
390
391
392 =item Updates to the example scripts
393
394 Updated PhraseHighlight.pm as suggested by Bill Schell for an optimization
395 when all words in a document are highlighted.
396
397 Updated search.cgi and PhraseHighlight.pm to use the internal stemmers via
398 the SWISH::API module as suggested by Jonas Wolf.
399
400
401 =item Leak when using C library
402
403 David Windmueller found a memory leak when calling multiple searches
404 on a swish handle.  The problem was swish loading the pre-sorted
405 property index on every search, even after the table had been loaded
406 into memory.
407
408 =item Swish.cgi now kills swish-e on time out
409
410 The example script F<swish.cgi> uses an alarm (on platforms that support
411 alarm) to abort processing after some number of seconds, but it was not
412 killing the child process, swish-e.  Bill Schell submitted a patch to kill
413 the child when the alarm triggers.
414
415 =item The template search.tt was renamed to swish.tt
416
417 The template was renamed because it's used by F<swish.cgi>, not by
418 F<search.cgi>, which was confusing.
419
420 =item Updates to the search.cgi
421
422 The example script F<search.cgi> was updated to work better with mod_perl
423 and to use external template files and style sheets.
424
425
426 =item New MS Word Filter
427
428 James Job provided the SWISH::Filter::Doc2html filter that uses
429 the wvWare (http://wvware.sourceforge.net/) program for filtering
430 MS Word documents.  If both catdoc and wvWare are installed then wvWare
431 will be used.
432
433 wvWare is reported to do a good job at converting MS Word docs
434 to HTML.  In a few tests it did work well, but in other cases it
435 failed to generate correct output.  It was also much, much slower
436 than catdoc.  I tested with wvWare 0.7.3 on Debian Linux.  Testing with
437 both is recommended.
438
439 =item Change in way symbolic links are followed
440
441 John-Marc Chandonia pointed out that if a symlink is skipped
442 by FileRules, then the actual file/directory is marked as
443 "already seen" and cannot be indexed by other links or directly.
444
445 Now, files and directories are not marked "already seen" until
446 after passing FileRules (i.e., after a file is actually indexed
447 or a directory is processed).
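
The ordering change can be illustrated with a short Python sketch.  It is a model of the logic described above, not the Swish-e implementation; the function name, the rule callback, and the sample paths are all hypothetical.

```python
def select_for_indexing(entries, passes_file_rules):
    """Pick paths to index from (path, real_target) pairs.

    A target is marked as seen only after its path has passed the rules,
    so a rejected symlink does not block other routes to the same file.
    """
    seen = set()
    indexed = []
    for path, target in entries:
        if target in seen:
            continue  # already indexed through another path
        if not passes_file_rules(path):
            continue  # rejected: the target is NOT marked as seen
        seen.add(target)
        indexed.append(path)
    return indexed


entries = [
    ("skip/link-to-doc", "/data/doc.txt"),  # symlink excluded by the rules
    ("docs/doc.txt", "/data/doc.txt"),      # direct path to the same file
    ("docs/alias", "/data/doc.txt"),        # second route, now a duplicate
]
indexed = select_for_indexing(entries, lambda p: not p.startswith("skip/"))
```

Under the old behavior the first entry would have marked /data/doc.txt as seen, and the file would never have been indexed.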
448
449 =item Could not set SwishSetSort() more than once
450
451 David Windmueller found a problem when trying to set the sort
452 order more than once on an existing search object.  Memory was not
453 correctly reset after clearing the previous sort values.
454
455 =item Access MetaNames and PropertyNames from API
456
457 Patch provided by Jamie Herre to access the MetaNames and PropertyNames
458 via the C API and to test via the testlib program.  SWISH::API was also updated
459 to access this data.
460
461 =item SwishResultPropertyULong() bug fixed
462
463 David Windmueller reported that SwishResultPropertyULong() was
464 returning ULONG_MAX on all calls.  This was fixed.
465
466 =item Null written to wrong location in file.c
467
468 Bill Schell with the help of valgrind found a null written past the end of a
469 buffer in file.c in the code that supports the old parsers.  This resulted in a
470 segfault while indexing a large set of XML documents.
471
472 =item Fixed problem when indexing very large files
473
474 Steve Harris reported a problem when indexing a very large document that
475 caused an integer overflow.  José Ruiz updated the code to use unsigned integers.
476
477 =item Bump word position on block tags with HTML2 parser
478
479 Peter Karman pointed out that the libxml2 HTML parser was allowing phrase
480 matches across block level html elements.  Swish now bumps the word
481 position on these elements.
482
483
484 =back
485
486 =head2 Version 2.4.2 - March 09, 2004
487
488 =over 4
489
490 =item * UseStemming didn't take no for an answer
491
492 UseStemming was coded as an alias for FuzzyIndexingMode when Snowball was
493 compiled in (the default), but "no" doesn't always mean no when the Norwegian
494 stemmer is available.
495
496 =item * Fixed problem building incremental version
497
498 Fixed compile problem with building incremental indexing mode.  This is an
499 experimental option with swish-e to allow adding files to an index.
500 See configure --help for build option.  Incremental indexes are not
501 compatible with standard indexes.
502
503 =item * Updated build instructions in INSTALL
504
505 Added a few comments about use of CPPFLAGS and LDFLAGS.
506
507 =item * Updated the index_hypermail.pl
508
509 Updated to work with latest version of hypermail (pre-2.1.9).
510
511
512 =item * Time zone in ResultPropertyStr()
513
514 Format string for generating date did not include the time zone in location.
515 Added a strftime format string to config.h.
516
517 =item * Undefined and Blank Properties and (NULL)
518
519 Fixed a few problems with printing properties:
520
521 1) Using -p and -x showed different results if a bad property value was given:
522
523     $ swish-e -w not dkdk -p badname -H0
524     err: Unknown Display property name "badname"
525     .
526     $ swish-e -w not dkdk -x '<badname>\n' -H0
527     (NULL)
528
529 Now both return an error.
530
531 2) Fixed bug where using a "fmt" string with -x output generated (bad) output
532 if the result did not have the specified property.
533
534     $ swish-e -w not dkdk -x '<somedate>\n' -H0  # undefined value
535
536     $ swish-e -w not dkdk -x '<somedate fmt="%Y %B %d">\n' -H0
537     %Y %B 1075353525
538
539 Now nothing is printed if the property does not exist.
540
541 3) Updated SWISH::API to croak() on invalid property names, and to return
542 undefined values for missing properties.
543
544 4) Updated swish.cgi and search.cgi to not generate warnings on undefined values
545 returned as properties.  Note that swish.cgi will now die on undefined properties.
546 Previously it would just display (NULL).
547
548
549 =item * Fixed segfault when generating warnings while parsing
550
551 Parser.c was calling warning() incorrectly.
552 And -Wall was not catching this!
553
554 =item * Added check for internal property names.
555
556 Parser was not checking for use of Swish-e reserved property
557 names.
558
559    <swishrank>foo</swishrank>
560
561 This will now generate a warning.
562
563 =back
564
565 =head2 Version 2.4.1 - December 17, 2003
566
567 =over 4
568
569 =item * Added new example CGI script
570
571 search.cgi is a new skeleton CGI script that uses SWISH::API for searching.
572 It is installed in the same location as swish.cgi.
573
574 =item * Add Fuzzy access to C and Perl interfaces
575
576 Added a number of functions to the C API (and SWISH::API)
577 to access the stemmer used when indexing a given index.
578
579 =item * Commas in numbers
580
581 Added commas to summary display at end of indexing.
582
583 =item * Insert whitespace between tags
584
585 Parser.c was updated to flush the text buffer before and after
586 every (non-inline HTML) tag.
587
588 The problem was that:
589
590     foo<tag>bar</tag>baz
591
592 would index as a single word "foobarbaz".
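
A minimal Python sketch of the flushing idea follows.  It uses the standard library's HTML parser rather than the libxml2-based parser in Swish-e, and the class name, helper name, and list of inline tags are assumptions made for the example.

```python
from html.parser import HTMLParser

# Hypothetical set of inline tags that should NOT split a word.
INLINE_TAGS = {"b", "i", "em", "strong", "span"}


class WordCollector(HTMLParser):
    """Collect words, flushing the text buffer at every non-inline tag."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.buffer = ""

    def flush(self):
        # Split whatever text has accumulated into words and reset.
        self.words.extend(self.buffer.split())
        self.buffer = ""

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        self.buffer += data


def index_words(markup):
    collector = WordCollector()
    collector.feed(markup)
    collector.close()
    collector.flush()  # emit any text left after the last tag
    return collector.words


split_words = index_words("foo<tag>bar</tag>baz")
inline_words = index_words("foo<b>bar</b>baz")
```

In this sketch the first call yields three separate words, while the inline tag in the second call leaves the text joined as one word.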
593
594 =item * DirTree.pl
595
596 DirTree.pl was updated to work with SWISH::Filter and to work on Windows.
597 DirTree.pl is a program to fetch files from the file system and works with
598 the -S prog input method.
599
600 =item * Problem with --enable-incremental option
601
602 Fixed configure script to build incremental option.  Note that this is still
603 experimental.  But testers are welcome.
604
605 =item * headers.c bug
606
607 Mark Fletcher with the help of valgrind found a bug in headers.c
608 function SwishIndexHeaderNames used by the C API.
609
610 =item * Clarify documentation regarding search order
611
612 At the prompting of Doralyn Rossmann updated SEARCH.pod to
613 try and make the explanation of searching clearer, and to fix an error
614 in the description of nested searches.
615
616 =back
617
618 =head2 Version 2.4.0 - October 27, 2003
619
620 =over 4
621
622 =item * Note: Different Index Format
623
624 Swish-e version 2.4.0 has a different index file format from previous
625 versions of Swish-e.  Upgrading will B<require> reindexing -- version 2.4.0
626 cannot read indexes created with previous versions.
627
628 =back
629
630 =head2 Version 2.4.0 (Release Candidate 4)  September 26, 2003
631
632 =over 4
633
634 =item * robots.txt not closed correctly
635
636 When using -S http method robots.txt was not closed and that caused
637 the (last) .contents file to not be unlinked under Windows.  Windows
638 seems to think filenames are related to files.
639
640 =item * SWISH::Filter and locating programs on Windows
641
642 SWISH::Filter now scans $libexecdir in addition to the PATH for programs (such as catdoc and
643 pdftotext), and also checks for programs by adding the extensions ".exe" and ".bat" to the
644 program name.
645
646 =item * Install sample templates
647
648 The sample templates included with swish.cgi are now installed
649 in $pkgdatadir (typically /usr/local/share/swish-e).
650
651 =back
652
653 =head2 Version 2.4.0 (Release Candidate 3)  September 11, 2003
654
655 =over 4
656
657 =item * Fix parser bug meta=(foo*)
658
659 Fixed bug in query parser caused by rc2's (pr2) attempt to catch wildcard
660 errors.
661
662 =back
663
664 =head2 Version 2.4.0 (Release Candidate 2)  September 10, 2003
665
666 =over 4
667
668 =item * Indexing HTML title
669
670 Fixed a problem when these were used in combination:
671
672   MetaNames swishtitle
673   MetaNameAlias swishtitle title
674
675 That failed to correctly reset the metaname stack and indexed text under
676 the wrong metaID.
677
678 =item * Single Wildcards
679
680 Due to the way the query parser "works" a search of
681
682    "foo *"
683
684 would result in a search of "foo*".  Now that results in:
685
686    err: Single wildcard not allowed as word
687
688 =item * Fixed search parsing bug
689
690 Brad Miele reported that the word "andes" was not being found.  It was being
691 stemmed to "and", which was then considered an operator.  [moseley]
692
693 =item * Add new directive PropertyNamesSortKeyLength
694
695 PropertyNamesSortKeyLength sets the sort key length to use when sorting
696 string properties.  The default is 100 characters.  There was a hard-coded
697 100 char limit before, but that was a problem where people were not building
698 from source (Windows).  The value of this is questionable -- it's intended to
699 limit how much memory is used when sorting while indexing and searching. [moseley]
700
701 =item * Fixed sorting issues with multiple indexes and reverse sorting
702
703 Reworked much of the sorting code.  Still to do is setting the character sort order.
704 [moseley]
705
706 =item * Fixed minor memory leak
707
708 Fixed leak of not releasing memory of index file name and swish_handle
709 destroy, and fixed SwishStemWord to default to the Stemmer_en. [moseley]
710
711 Fixed libtest.c example program that was not cleaning up memory after an
712 error condition.
713
714 =item * Replaced Swish-e's Porter Stemmer with Snowball
715
716 Swish-e now has support for Snowball stemmers (http://snowball.tartarus.org/).
717 The stemmers are enabled for an index with FuzzyIndexingMode Stemming_* where "*" can be:
718
719   de, dk, en1, en2, es, fi, fr, it, nl, no, pt, ru, se
720
721 In addition, UseStemming yes or FuzzyIndexingMode Stemming_en will use the old stemmer.
722
723 =back
724
725 =head2 Version 2.4.0 (Release Candidate 1)  May 21, 2003
726
727 =over 4
728
729 =item * Security Fix: swish.cgi
730
731 The swish.cgi script was not correctly escaping HTML when searching by
732 the right combination of metanames and highlighting module.  This could
733 lead to cross-site scripting if indexing un-trusted documents. [moseley]
734
735 =item * Added Support for building a Debian Package
736
737 To build as a .deb unpack the distribution and chdir then run
738
739    $ fakeroot debian/build binary
740
741 Then install the generated .deb file with dpkg -i
742
743 =item * Use SWISH::Filter by default with spider.pl
744
745 spider.pl is installed in the libexecdir directory as well as the SWISH::Filter modules.
746 PDF, MS Word, MP3, and XML documents will be indexed automatically if the required helper
747 applications (e.g. catdoc, pdftotext) or scripts (e.g. MP3::Tag) are installed.
748
749 Swish also knows about libexecdir, so if you specify a relative path with -S prog
750 swish-e will look for the program in libexecdir.  This is mostly for spider.pl so
751 indexing only requires:
752
753     IndexDir spider.pl
754     SwishProgParameters default http://localhost/index.html
755
756 And swish-e will find spider.pl and SWISH::Filter will be used to convert docs.
757
758 =item * Fixed Document-Type bug
759
760 Document-Type was not being reset after being set by input from a -S prog program, causing
761 the wrong parser to be used. [moseley]
762
763 =item * New Directive: PropertyNamesNoStripChars
764
765 Swish replaces all series of low ASCII chars with a single space
766 character.  This option instructs swish to store all chars in the property. [moseley]
767
768 =item * Change HTTP access defaults
769
770 Defaults used with -S http access method were changed.
771
772
773 Delay was reduced from one minute between start of each request to five seconds
774 between requests.
775
776 MaxDepth was changed from five to zero, meaning there is no limit to depth indexed by
777 default. [moseley]
778
779 =item * swishspider location and SpiderDirectory
780
781 The swishspider program is now installed in $prefix/lib/swish-e by default.  This can
782 be changed by the --libexecdir option to configure. 
783
784 The SpiderDirectory option now defaults to the value of libexecdir instead of the current
785 directory. [moseley]
786
787
788 =item * Added libtool and automake support
789
790 Replaces the build system with Autotools.  Now builds libswish-e as
791 a shared library on systems that support shared libraries.
792 The swish-e binary links against this shared library.
793 Can also build outside the source tree on platforms with GNU make. [moseley]
794
795 =item * Updates to installation
796
797 Running "make install" now installs additional files.
798 Files include the swish-e binary, the libswish-e search library, swish-e.h
799 header, documentation files, the swishspider program, and Perl modules used for the example
800 swish.cgi search script. Directories will be created if they do not already exist.
801 Installation directories can be specified at build time.
802
803 =item * Fixed bug when searching at end of inverted index
804
805 Swish was not correctly detecting the end of the inverted index
806 when searching a wildcard word that was past the last word in the index.
807 Caught by Frank Heasley. [moseley]
808
809
810 =item * Increase sort key length from 50 to 100 characters
811
812 The setting MAX_SORT_STRING_LEN in F<src/config.h> sets the max length used
813 when sorting in swish-e.  You may reduce this number to save memory while
814 sorting, or increase it if you have very long properties to sort.
815
816 =item * Remove &quot; entity from -p output
817
818 The -p option to print properties was escaping double quotes in properties
819 with the &quot; entity.  -x does not do that, so the two were inconsistent.  -p no longer
820 converts double quotes.  The user should pick a good delimiter with -d or preferably use
821 the -x method for generating output.
822
823 =item * XML parser and Windows
824
825 The XML parser was being passed the incorrect buffer length when used on Windows
826 platform causing the parser to abort with an error.
827
828 =item * Version Numbering
829
830 SWISH-E versions starting with 2.3.4 use kernel version numbering.  Versions are
831 in the form: Major.Minor.Build.  Odd minor versions are development.  Even minor
832 versions are releases.  2.3.4 would be a development version. 
833 2.4.0 would be a release version.  2.3.20 would be the 20th build of 2.3.
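
The numbering scheme is easy to check mechanically.  The following Python sketch is only an illustration of the rule stated above; the function name describe_version is hypothetical.

```python
def describe_version(version):
    """Split a Major.Minor.Build string and classify it by the minor number."""
    major, minor, build = (int(part) for part in version.split("."))
    # Odd minor numbers are development versions, even ones are releases.
    kind = "development" if minor % 2 else "release"
    return {"major": major, "minor": minor, "build": build, "kind": kind}


dev = describe_version("2.3.4")
release = describe_version("2.4.0")
twentieth = describe_version("2.3.20")
```

Here "2.3.4" and "2.3.20" are classified as development versions and "2.4.0" as a release, matching the examples in this entry.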
834
835 =item * Added RPM support
836
837 RPMs can be built with:
838
839     ./configure
840     make dist
841
842 Copy the resulting tarball to RPM's SOURCES directory and then run as a superuser:
843
844     rpmbuild -ba rpm/swish-e.spec
845
846
847 You should have swish-e packages in your RPMS/$arch directory.  [augur]
848
849 =item * Changed default perl binary location
850
851 Most perl scripts provided with SWISH-E now use /usr/bin/perl by default.
852 Note that some scripts are generated at build time, so those will look in the
853 path for the location of the perl binary.
854
855 =item * New Feature: MetaNamesRank
856
857 MetaNamesRank can be used to adjust the ranking for words based on
858 the word's MetaName.
859
860 =item * New Swish Library API and Perl Module
861
862 The Swish-e C library interface was rewritten to provide
863 better memory management and better separation of data.
864 Most indexing related code has been removed from the library.
865 A new header file is provided for the API: swish-e.h.
866
867 The Perl module SWISHE was replaced with the SWISH::API module
868 in the Swish-e distribution.
869
870 B<Previous versions of the SWISHE module will not work with this version of Swish-e.>
871
872 If you are using the SWISHE module from a previous version of Swish then you must
873 either rewrite your code to use the new SWISH::API module (highly recommended)
874 or use the replacement SWISHE module.  The replacement SWISHE module is a thin
875 interface to the SWISH::API module.  It can be downloaded from
876
877     http://swish-e.org/Download/old/SWISHE-0.03.tar.gz
878
879 =item * NoContents not working with libxml2 parser
880
881 Corrected problem when using NoContents with binary files and the HTML2 parser.
882
883 Trying to index image file names with:
884
885     IndexOnly .gif .jpeg
886     NoContents .gif .jpeg
887
888 failed to index the path names because the default parser
889 (HTML2 when libxml2 is linked with swish-e)
890 was not finding any text in the binary files. [moseley]
891
892 =item * Updates to swish.cgi
893
894 The example/swish.cgi script can now use the SWISH::API module
895 for searching an index.  Combined with mod_perl this module
896 can improve search performance considerably.
897
898 The Perl modules used with the swish.cgi script have all been moved into
899 the SWISH::* namespace.  Hence, files in the F<modules> directory were moved
900 into the F<modules::SWISH> directory.
901
902 =back
903
904 =head2 Version 2.2.3 - December 11, 2002
905
906 Multiple -L options were ORing instead of ANDing.
907 Catch by Patrick Mouret. [moseley]
908
909 =head2 Version 2.2.2 - November 14, 2002
910
911 Pass non-text/* files on to the indexing code IF there is a FileFilter
912 associated with the *extension* of the URL.  Fixes the problem of not
913 being able to index, say, pdf files by using the FileFilter configuration
914 option.
915
916 Fixed bug where nulls were stripped when using FileFilter with -S prog.
917 Catch by Greg Fenton. [moseley]
918
919 =head2 Version 2.2.1 - September 26, 2002
920
921 =over 4
922
923 =item * NoContents with -S prog
924
925 Failed to use the correct default parser when using the No-Contents header
926 and libxml2 linked in. [moseley]
927
928 =item * Add tests for IRIX and sparc machines
929
930 8-byte alignment in mem_zones is required for these machines. [moseley]
931
932
933 =item * Fixed code when removing files
934
935 Was not correctly removing words from index when parser aborted [jmruiz]
936
937 =item * Merge segfault
938
939 Fixed segfault caused by trying to print null dates while merging
940 duplicate files. [moseley]
941
942 =item * Documentation patches
943
944 Spelling corrections to the SWISH-CONFIG pod page [Steve Eckert]
945
946 =item * Configure corrections
947
948 Fixed a zlib test error that used "==" in a test [Steve Eckert]
949
950 =item * Updates to VMS build
951
952 The VMS build was updated [Jean-François PIÉRONNE]
953
954 =item * MANIFEST corrections
955
956 Added missing filters and vms build file into MANIFEST [moseley]
957
958 =back
959
960 =head2 Version 2.2 - September 18, 2002
961
962
963 =over 4
964
965 =item * Default parser
966
967 Swish-e will now use the HTML2 (libxml2) parser by default if libxml2 is
968 installed and DefaultContents or IndexContents is not used.
969
970 =item * Selecting parsers
971
972 Allow HTML*, XML*, and TXT* to automatically select the libxml2-based parsers
973 if libxml2 is linked with Swish-e, otherwise fallback to the built-in parsers.
974
975 =item * SwishSpider and Filters
976
977 Filters (FileFilter directive) did not work correctly when spidering
978 with the -S http method.  A new filter system was developed and now
979 filtering of documents (e.g. pdf-E<gt>html or MSWord-E<gt>text) is handled
980 by the src/SwishSpider program.
981
982 When indexing with the -S http method only documents of content-type "text/*"
983 are indexed.  Other documents must be converted to text by using the filter system.
984
985 =item * Buffer overflow in xml.c
986
987 Fixed bug in xml.c reported by Rodney Barnett when very long words
988 were indexed. [moseley]
989
990 =item * configure script updates
991
992 Updated from _WIN32 checks to feature checks using autoconf [moseley, norris]
993
994 =item * updates to run on Alpha (Linux 2.4 (Debian 3.0))
995
996 Fixed a cast error when calling zlib, and fixed the calls that read/write
997 packed longs to disk. [jmruiz, moseley]
998
999 =item * COALESCE_BUFFER_MAX_SIZE
1000
1001 Some people were seeing the following error:
1002
1003     err: Buffer too short in coalesce_word_locations.
1004     Increase COALESCE_BUFFER_MAX_SIZE in config.h and rebuild.
1005
1006 This was due to indexing binary data or files with a very large number of words.
1007 The best solution is to not index binary data or files with a very large number
1008 of words.
1009
1010 Swish-e will now automatically reallocate the buffer as needed.  [jmruiz]
1011
1012
1013 =back
1014
1015 =head2 Version 2.2rc1 - August 29, 2002
1016
1017 Many large changes were made internally in the code, some for performance
1018 reasons, some for feature changes and additions, and some to prepare
1019 for new features in later versions of Swish-e.
1020
1021 =over 4
1022
1023 =item * Documentation!
1024
1025 Documentation is now included in the source distribution as .pod
1026 (perldoc) files, and as HTML files.  In addition, the distribution can now
1027 generate PDF, postscript, and unix man pages from the source .pod files.
1028 See L<README|README> for more information.
1029
1030 =item * Indexing and searching speed
1031
1032 The indexing process has been improved.  Depending on a number of
1033 factors, you may see a significant improvement in indexing speed,
1034 especially if upgrading from version 1.x.
1035
1036 Searching speed has also been improved.  Properties are not loaded until
1037 results are displayed, and properties are pre-sorted during indexing to
1038 speed up sorting results by properties while searching.
1039
1040 =item * Properties are written to a separate file
1041
1042 Swish-e now stores document properties in a separate file.  This means
1043 there are now two files that make up a Swish-e index.  The default files
1044 are C<index.swish-e> and C<index.swish-e.prop>.
1045
1046 This change frees memory while indexing, allowing larger collections to
1047 be indexed in memory.
1048
1049 =item * Internal data stored as Properties
1050
1051 Pre 2.2 some internal data was stored in fixed locations within the
1052 index, namely the file name, file size, and title.  2.2 introduced new
1053 internal data such as the last modified date, and document summaries.
1054 This data is considered I<meta data> since it is data about a document.
1055
1056 Instead of adding new data to the internal structure of the index file,
1057 it was decided to use the MetaNames and PropertyNames feature of Swish-e
1058 to store this meta information.  This allows for new meta data to be added
1059 at a later time (e.g. Content-type), and provides an easy and customizable
1060 way to print results with the C<-p> switch and the new C<-x> switch.
1061 In addition, search results can now be sorted and limited by properties.
1062
1063 For example, to sort by the rank and title:
1064
1065     swish-e -w foo -s swishrank desc swishtitle asc
1066
1067
1068 =item * The header display has been slightly reorganized.
1069
1070 If you are parsing output headers in a program then you may need to
1071 adjust your code.  There's a new switch '-H' to control the level of
1072 header output when searching.
1073
1074 =item * Results are now combined when searching more than one index.
1075
1076 Swish-e now merges (and sorts) the results from multiple indexes when
1077 using C<-f> to specify more than one index.  This change affects the way
1078 maxhits (C<-m>) works.  Here's a summary of the way it works for the
1079 different versions.
1080
1081
1082     1.3.2 - MaxHits returns first N results starting from the first index.
1083             e.g. maxhits=20; 15 hits Index1, 40 hits Index2
1084             All 15 from Index1 plus first five from Index2 = 20 hits.
1085
1086     2.0.0 - MaxHits returns first N results from each index.
1087             e.g. Maxhits=20; 15 hits Index1, 40 hits Index2
1088             All 15 from Index1 plus first 20 from Index2 = 35 hits.
1089
1090     2.2.0 - Results are merged and first N results are returned.
1091             e.g. Maxhits=20; 15 hits Index1, 40 hits Index2
1092             Results are merged from each index and sorted
1093             (rank is the default sort) and only the first
1094             20 are returned.
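
The three behaviours can be compared side by side. This is an illustrative sketch only, not Swish-e code: the function names and the sample (rank, document) data are invented, and the 2.0.0 variant takes the rule literally as "up to N hits from each index".

```python
from itertools import islice


def maxhits_1_3_2(indexes, n):
    """First n hits overall, walking the indexes in the order given."""
    return list(islice((hit for hits in indexes for hit in hits), n))


def maxhits_2_0_0(indexes, n):
    """Up to n hits from each index, concatenated."""
    return [hit for hits in indexes for hit in hits[:n]]


def maxhits_2_2_0(indexes, n):
    """Merge every index, sort by rank (highest first), keep the first n."""
    merged = [hit for hits in indexes for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:n]


# Hypothetical results: (rank, document) pairs, each list already rank-sorted.
index1 = [(1000 - 10 * i, f"index1/doc{i}") for i in range(15)]
index2 = [(995 - 10 * i, f"index2/doc{i}") for i in range(40)]

for name, func in (("1.3.2", maxhits_1_3_2),
                   ("2.0.0", maxhits_2_0_0),
                   ("2.2.0", maxhits_2_2_0)):
    print(name, len(func([index1, index2], 20)), "hits")
```

Only the merged variant interleaves documents from both indexes, which is why a program that relied on the older per-index ordering may need adjusting.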
1095
1096
1097 =item * New B<prog> document source indexing method
1098
1099 You can now use -S prog to use an external program to supply documents
1100 to Swish-e.  This external program can be used to spider web servers,
1101 index databases, or to convert any type of document into html, xml,
1102 or text, so it can be indexed by Swish-e.  Examples are given in the
1103 C<prog-bin> directory.
1104
1105 =item * The indexing parser was rewritten to be more logical.
1106
1107 TranslateCharacters now is done before WordCharacters is checked.  For example,
1108
1109     WordCharacters abcdefghijklmnopqrstuvwxyz
1110     TranslateCharacters ñ n
1111
1112 Now C<El Niño> will be indexed as El Nino (el and nino), even though C<ñ>
1113 is not listed in WordCharacters.
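
The effect of the new ordering can be sketched as follows. This is a simplified illustration rather than the Swish-e parser: the names are invented, and lowercasing the input is an assumption made so that the output matches the "el and nino" example above.

```python
# Mirrors the two configuration lines above.
WORD_CHARACTERS = set("abcdefghijklmnopqrstuvwxyz")
TRANSLATE = str.maketrans({"\u00f1": "n"})  # TranslateCharacters: n-tilde -> n


def index_words(text: str) -> list[str]:
    """Translate characters first, then split on non-WordCharacters."""
    translated = text.lower().translate(TRANSLATE)
    words, current = [], []
    for ch in translated:
        if ch in WORD_CHARACTERS:
            current.append(ch)
        elif current:
            # A character outside WordCharacters ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(index_words("El Ni\u00f1o"))
```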
1114
1115 Previously, stopwords were checked after stemming and soundex conversions,
1116 as well as most of the other word checks (WordCharacters, min/max length
1117 and so on).  This meant that the stopword list probably didn't work as
1118 expected when using stemming.
1119
1120 =item * The search parser was rewritten to be more logical
1121
1122 The search parser was rewritten to correct a number of logic errors.
1123 Swish-e did not differentiate between meta names, Swish-e operators
1124 and search words when parsing the query.  This meant, for example,
1125 that metanames might be broken up by the WordCharacters setting, and
1126 that they could be stemmed.
1127
1128 Swish-e operator characters C<"*()=> can now be searched by escaping
1129 with a backslash.  For example:
1130
1131     ./swish-e -w 'this\=odd\)word'
1132
1133 will end up searching for the word C<this=odd)word>.  To search for a
1134 backslash character, precede it with a backslash.
1135
1136 Currently, searching for:
1137
1138     ./swish-e -w 'this\*'
1139
1140 is the same as a wildcard search.  This may be fixed in the future.   
1141
1142 Searching for buzzwords with those characters will still require
1143 backslashing.  This also may change to allow some un-escaped operator
1144 characters, but some will always need to be escaped (e.g. the double-quote
1145 phrase character).
1146
1147 =item * Quotes and Backslash escapes in strings
1148
1149 A bug was fixed in the C<parse_line()> function (in F<string.c>) where
1150 backslashes were not escaping the next character.  C<parse_line()> is used
1151 to parse a string of text into tokens (words).  Normally splitting is done
1152 at whitespace.  You may use quotes (single or double) to define a string
1153 (that might include whitespace) as a single parameter.  The backslash
1154 can also be used to escape the following character when *within* quotes
1155 (e.g. to escape an embedded quote character).
1156
1157     ReplaceRules append "foo bar"   <- define "foo bar" as a single word
1158     ReplaceRules append "foo\"bar"  <- escape the quotes
1159     ReplaceRules append 'foo"bar'   <- same thing
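
The quoting rules described above can be mimicked with a small tokenizer. This is a sketch of the described behaviour, not a port of the C<parse_line()> function in F<string.c>; in particular, treating a backslash as an escape inside single quotes as well as double quotes is an assumption.

```python
def parse_line(line: str) -> list[str]:
    """Split at whitespace; quotes group text; backslash escapes within quotes."""
    tokens, current = [], []
    quote = None       # the quote character currently open, if any
    in_token = False   # True once the current token has started
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                current.append(line[i + 1])  # keep the escaped character
                i += 2
                continue
            if ch == quote:
                quote = None                 # closing quote
            else:
                current.append(ch)
        elif ch in "'\"":
            quote = ch
            in_token = True                  # "" still counts as a token
        elif ch.isspace():
            if in_token:
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if in_token:
        tokens.append("".join(current))
    return tokens


print(parse_line('ReplaceRules append "foo bar"'))
print(parse_line('ReplaceRules append "foo\\"bar"'))
print(parse_line("ReplaceRules append 'foo\"bar'"))
```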
1160
1161
1162 =item * Example C<user.config> file removed.
1163
1164 Previous versions of Swish-e included a configuration file called
1165 C<user.config> which contained examples of all directives.  This has
1166 been replaced by a series of example configuration files located in the
1167 C<conf> directory.  The configuration directives are now described in
1168 L<SWISH-CONFIG|SWISH-CONFIG>.
1169
1170 =item * Ports to Win32 and VMS
1171
1172 David Norris has included the files required to build Swish-e under
1173 Windows.  See C<src/win32>.  A self-extracting Windows version is
1174 available from the Download page of the swish-e.org web site.
1175
1176 Jean-François Piéronne has provided the files required to build Swish-e
1177 under OpenVMS.  See C<src/vms> for more information.
1178
1179 =item * String properties are concatenated
1180
1181 Multiple I<string> properties of the same name in a document are now
1182 concatenated into one property.  A space character is added between
1183 the strings if needed.  A warning will be generated if multiple numeric
1184 or date properties are found in the same document, and the additional
1185 properties will be ignored.
1186
1187 Previously, properties of the same name were added to the index, but
1188 could not be retrieved.
1189
1190 To do: remove the C<next> pointer, and allow user-defined character to
1191 place between properties.
1192
1193 =item * regex type added to ReplaceRules
1194
1195 A more general purpose pattern replacement syntax.
1196
1197
1198 =item * New Parsers
1199
1200 Swish-e's XML parser was replaced with James Clark's expat XML parser
1201 library.
1202
1203 Swish-e can now use Daniel Veillard's libxml2 library for parsing HTML and
1204 XML.  This requires installation of the library before building Swish-e.
1205 See the L<INSTALL|INSTALL> document for information.  libxml2 is not
1206 required, but is strongly recommended for parsing HTML over Swish-e's
1207 internal HTML parser, and provides more features for both HTML and
1208 XML parsing.
1209
1210 =item * Support for zlib
1211
1212 Swish-e can be compiled with zlib.  This is useful for compressing large
1213 properties.  Building Swish-e with zlib is strongly recommended if you
1214 use its C<StoreDescription> feature.
1215
1216 =item * LST type of document no longer supported
1217
1218 LST allowed indexing of files that contained multiple documents.
1219
1220 =item * Temporary files
1221
1222 To improve security Swish-e now uses the C<mkstemp(3)> function to
1223 create temporary files.  Temporary files are used while indexing only.
1224 This may result in some portability issues, but the security issues
1225 were overriding.
1226
1227 (Currently this does not apply to the -S http indexing method.)
1228
1229 C<mkstemp> opens the temporary file with O_EXCL|O_CREAT flags.  This prevents
1230 overwriting existing files.  In addition, the name of the file created
1231 is a lot harder to guess by attackers.  The temporary file is created
1232 with only owner permissions.
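
The same guarantees can be seen from a scripting language. The sketch below uses Python's C<tempfile.mkstemp>, which likewise creates the file exclusively with a hard-to-guess name and owner-only access; it is not Swish-e code. The helper name is invented, and the "swtmp" prefix is borrowed from the naming described in the next item.

```python
import os
import tempfile


def create_swish_temp(directory=None):
    """Create a private temporary file; returns (file descriptor, path)."""
    # mkstemp never reuses an existing file and picks a random name suffix.
    return tempfile.mkstemp(prefix="swtmp", dir=directory or tempfile.gettempdir())


fd, path = create_swish_temp()
try:
    with os.fdopen(fd, "w") as handle:
        handle.write("scratch data\n")
    print("wrote", os.path.basename(path))
finally:
    os.remove(path)  # temporary files must be cleaned up by the caller
```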
1233
1234 Please report any portability issues on the Swish-e discussion list.
1235
1236 =item * Temporary file locations
1237
1238 Swish-e now uses the environment variables C<TMPDIR>, C<TMP>, and
1239 C<TEMP> (in that order) to decide where to write temporary files.
1240 The configuration setting of L<TmpDir|SWISH-CONFIG/"item_TmpDir"> will
1241 be used if none of the environment variables are set.  Swish-e uses the
1242 current directory otherwise; there is no default temporary directory.
1243
1244 Since the environment variables override the configuration settings,
1245 a warning will be issued if you set L<TmpDir|SWISH-CONFIG/"item_TmpDir">
1246 in the configuration file and there's also an environment variable set.
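
The lookup order can be summarised in a few lines. This is a hedged sketch of the rules described above, not the Swish-e implementation; the function name and its return value (directory plus a "should warn" flag) are made up for illustration.

```python
def choose_temp_dir(environ, config_tmpdir=None):
    """Return (directory, warn) following the precedence described above."""
    for name in ("TMPDIR", "TMP", "TEMP"):
        value = environ.get(name)
        if value:
            # The environment wins; warn if TmpDir was also configured.
            return value, config_tmpdir is not None
    if config_tmpdir:
        return config_tmpdir, False
    return ".", False  # no default temporary directory: use the current one


print(choose_temp_dir({"TMP": "/var/tmp", "TEMP": "/scratch"}))
print(choose_temp_dir({"TMPDIR": "/var/tmp"}, config_tmpdir="/data/tmp"))
print(choose_temp_dir({}, config_tmpdir="/data/tmp"))
print(choose_temp_dir({}))
```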
1247
1248 Temporary files begin with the letters "swtmp" (which can be changed in
1249 F<config.h>), followed by two or more letters that indicate the type of
1250 temporary file, and some random characters to complete the file name.
1251 If indexing is aborted for some reason you may find these temporary
1252 files left behind.
1253
1254 =item * New Fuzzy indexing method Double Metaphone
1255
1256 Based on Lawrence Philips' Metaphone algorithm, this adds two
1257 new methods of creating a fuzzy index (in addition to Stemming and Soundex).
1258
1259
1260 =back
1261
1262 Changes to Configuration File Directives.  Please see
1263 L<SWISH-CONFIG|SWISH-CONFIG> for more info.
1264
1265 =over 4
1266
1267 =item * New directives: IndexContents and DefaultContents
1268
1269 The IndexContents directive assigns internal Swish-e document parsers
1270 to files based on their file type.  The DefaultContents directive
1271 assigns a parser to be used on files that are not assigned a parser with
1272 IndexContents.
1273
1274 =item * New directive: UndefinedMetaTags [error|ignore|index|auto]
1275
1276 This describes what to do when a meta tag is found in a document that
1277 is not listed in the MetaNames directive.
1278
1279 =item * New directive: IgnoreTags
1280
1281 Will ignore text with the listed tags.
1282
1283 =item * New directive: SwishProgParameters *list of words*
1284
1285 Passes words listed to the external Swish-e program when running with
1286 C<-S prog> document source method.
1287
1288 =item * New directive: ConvertHTMLEntities [yes|no]
1289
1290 Controls parsing and conversion of HTML entities.
1291
1292 =item * New directive: DontBumpPositionOnMetaTags
1293
1294 The word position is now bumped when a new metatag is found -- this is
1295 to prevent phrases from matching across meta tags.  This directive will
1296 disable this behavior for the listed tags.
1297
1298 This directive works for HTML and XML documents.
1299
1300 =item * Changed directive: IndexComments
1301
1302 This has been changed such that comments are not indexed by default.
1303
1304 =item * Changed directive: IgnoreWords
1305
1306 The builtin list of stopwords has been removed. Use of the SwishDefault
1307 word will generate a warning, and no stop words will be used.  You must
1308 now specify a list of stopwords, or specify a file of stopwords.
1309
1310 A sample file C<stopwords.txt> has been inclu