Filtering documents with SWISH::Filter
--------------------------------------

Swish-e knows only how to parse HTML, XML, and text files.
Other file types may be indexed with the help of filters.

SWISH::Filter is a Perl module designed to make converting
documents from one type of content to another type of content
easy. It uses a plug-in system where new filters
can be added with little effort.

SWISH::Filter and its associated plug-in filter modules do not
normally do the actual filtering. This system provides only
an interface to the programs that do the filtering.

For example, the Swish-e distribution includes a filter plug-in
called SWISH::Filters::Pdf2HTML. For this filter to work you must
install the xpdf package, which includes the pdftotext and pdfinfo
programs. SWISH::Filters::Pdf2HTML only provides a unified interface
to these programs.

The included program spider.pl uses SWISH::Filter by default.
This means that installing the programs that do the filtering is all that
is needed to start filtering documents. For example, installing the
xpdf package will enable indexing of PDF files when spidering.

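Because the filters depend on external helper programs, it can help to
confirm that the helpers are on the PATH before spidering. The following is
only a sketch: check_helpers is a hypothetical function name, not part of
Swish-e, and the helper names are the two xpdf programs mentioned above.

```shell
#!/bin/sh
# check_helpers is a hypothetical helper, not shipped with Swish-e.
# It reports whether each named program can be found on PATH and
# returns non-zero if any of them is missing.
check_helpers() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found: $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# pdftotext and pdfinfo are the xpdf programs used by the PDF filter.
check_helpers pdftotext pdfinfo || echo "install the xpdf package to enable PDF filtering"
```

The same function can be called with the helper programs of any other
filter once its documentation has been consulted for dependencies.
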
The filter modules are in the $libexecdir/perl directory. Running swish-e -h
lists the setting for $libexecdir, which is typically
/usr/local/lib/swish-e if swish-e was built from source, or /usr/lib/swish-e
if installed as a package. On Windows, $libexecdir is set at
installation time.

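As a quick way to see which of the typical locations is in use, the two
paths above can be probed directly. This is a minimal sketch:
find_filter_dir is a hypothetical function name, and the default candidates
are only the two typical paths named above. The output of swish-e -h remains
the authoritative source for $libexecdir.

```shell
#!/bin/sh
# find_filter_dir is a hypothetical helper, not shipped with Swish-e.
# It prints the first existing "<libexecdir>/perl" directory among the
# candidates. With no arguments it tries the two typical locations.
find_filter_dir() {
    [ "$#" -gt 0 ] || set -- /usr/local/lib/swish-e /usr/lib/swish-e
    for libexecdir in "$@"; do
        if [ -d "$libexecdir/perl" ]; then
            echo "$libexecdir/perl"
            return 0
        fi
    done
    return 1
}

find_filter_dir || echo "no filter directory found; check the output of swish-e -h"
```
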
Note that $libexecdir/perl is not normally part of Perl's @INC array, so to
read documentation on a specific filter you will need to either specify the
full path to the filter or set PERL5LIB. For example:

    export PERL5LIB=/usr/local/lib/swish-e/perl
    perldoc SWISH::Filter

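The export above replaces any PERL5LIB value that is already set. If other
Perl library paths need to be kept, the directory can be prepended instead.
This is a sketch only: prepend_perl5lib is a hypothetical function name, and
the path assumes a source build as in the example above.

```shell
#!/bin/sh
# prepend_perl5lib is a hypothetical helper, not shipped with Swish-e.
# It puts a directory at the front of PERL5LIB and keeps any existing value.
prepend_perl5lib() {
    PERL5LIB="$1${PERL5LIB:+:$PERL5LIB}"
    export PERL5LIB
}

# Assumed path: the typical location for a build from source.
prepend_perl5lib /usr/local/lib/swish-e/perl
echo "PERL5LIB=$PERL5LIB"
```

After this, perldoc SWISH::Filter should work as in the example above.
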
Documentation for SWISH::Filter can also be found in the html directory and
at http://swish-e.org.

Swish-e has another filter system: the FileFilter directive, which can be used
to filter documents through an external program while indexing. That system
requires a separate filter setup for each type of document. See the
SWISH-CONFIG page for information on that type of filtering.


Testing SWISH::Filter
---------------------

The program swish-filter-test is installed by default (in the same location as
the swish-e binary). This program can be used to test SWISH::Filter. For example,
run the command:

    $ swish-filter-test foo.pdf foo.txt

    Document foo.pdf was filtered.
    Document: foo.pdf
    Content-Type: text/html (initial was application/pdf)
    Parser type: HTML*

    Document foo.txt was not filtered.
    Document: foo.txt
    Content-Type: text/plain (initial was text/plain)
    Parser type: TXT*

Run the command

    $ swish-filter-test -man

for documentation.


Current filters distributed with Swish-e:
-----------------------------------------

All of these filters require installation of helper programs and/or Perl modules.
See the individual module's documentation for dependencies.

    SWISH::Filters::Doc2txt   - converts MS Word documents to text
    SWISH::Filters::Pdf2HTML  - converts PDF files to HTML with info tags as metanames
    SWISH::Filters::ID3toHTML - extracts ID3 (v1 and v2) tags from MP3 files
    SWISH::Filters::XLtoHTML  - converts MS Excel files to HTML

Filters that depend on Perl modules that are not installed will not load.
Setting the environment variable FILTER_DEBUG may report helpful errors when using
filters.

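One way to set FILTER_DEBUG for a single run, without leaving it set in the
shell, is a small wrapper. This is a sketch under assumptions:
with_filter_debug is a hypothetical function name, and the value 1 is a
guess, since the text above says only that the variable should be set.

```shell
#!/bin/sh
# with_filter_debug is a hypothetical helper, not shipped with Swish-e.
# It runs the given external command with FILTER_DEBUG set for that command only.
# The value 1 is an assumption; check perldoc SWISH::Filter for accepted values.
with_filter_debug() {
    FILTER_DEBUG=1 "$@"
}

# Example (requires swish-filter-test and a local foo.pdf):
#   with_filter_debug swish-filter-test foo.pdf
```
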
See perldoc SWISH::Filter for instructions on creating filters.
