| 68 | | See the [% link_to_page('swish3' , 'Swish3 development page' ) %]. |
|---|
| | 68 | |
|---|
| | 69 | <p> |
|---|
| | 70 | Swish-e 3.0 (abbreviated Swish3) will be a complete overhaul of the code. |
|---|
| | 71 | You can <a href="http://dev.swish-e.org/wiki/swish3">track development progress here</a>. |
|---|
| | 72 | Major feature improvements will include: |
|---|
| | 73 | |
|---|
| | 74 | <dl> |
|---|
| | 75 | <dt>Unicode support</dt> |
|---|
| | 76 | <dd>Unicode is the <a href='http://www.unicode.org/unicode/faq/'>international standard |
|---|
| | 77 | for character encodings</a>. Swish3 will implement |
|---|
| | 78 | support for the <a href='http://www.cl.cam.ac.uk/~mgk25/unicode.html'>UTF-8</a> |
|---|
| | 79 | <a href='http://czyborra.com/utf/'>character encoding</a>, |
|---|
| | 80 | which should handle all major languages in the world (the original UTF-8 design could encode up to |
|---|
| | 81 | 2,147,483,648 code points; RFC 3629 now limits it to the Unicode range ending at U+10FFFF). |
|---|
| | 82 | The Swish-e developers need input from non-English language experts. |
|---|
| | 83 | Please contribute to the discussion on the |
|---|
| | 84 | |
|---|
| | 85 | [% link_to_page('discuss' , 'Swish-e mailing list' ) %]. |
|---|
| | 86 | |
|---|
| | 87 | Some significant known issues include: |
|---|
| | 88 | <p /> |
|---|
| | 89 | <dl> |
|---|
| | 90 | <dt>lowercase vs. UPPERCASE</dt> |
|---|
| | 91 | <dd>Version 2.x uses <tt>tolower()</tt> to lowercase all characters |
|---|
| | 92 | before searching and indexing. Should the same approach be used for UTF-8? Will this have |
|---|
| | 93 | a significant impact on usability for non-English languages? |
|---|
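The lowercasing question can be made concrete with a small illustration. This is a sketch in Python, not Swish-e code: `bytes.lower()` stands in for a byte-wise `tolower()` call in the C locale (an assumption; real `tolower()` behavior depends on the active locale), and `str.casefold()` stands in for a Unicode-aware alternative. The function names are hypothetical.

```python
def lower_bytewise(utf8_bytes: bytes) -> bytes:
    """Lowercase ASCII A-Z only, one byte at a time (stand-in for C-locale tolower())."""
    return utf8_bytes.lower()


def lower_unicode(text: str) -> str:
    """Unicode-aware case folding on code points rather than bytes."""
    return text.casefold()


word = "\u00c9COLE"  # "ECOLE" with a leading E-acute (U+00C9)

# Byte-wise: the two bytes that encode U+00C9 are not ASCII letters, so they stay as-is.
bytewise = lower_bytewise(word.encode("utf-8")).decode("utf-8")

# Unicode-aware: U+00C9 is folded to U+00E9 along with the ASCII letters.
unicode_aware = lower_unicode(word)
```

With byte-wise lowercasing, a query typed in lowercase would not match the indexed form of this word, because the accented capital is left untouched.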
| | 94 | </dd> |
|---|
| | 95 | <dt>Wildcards</dt> |
|---|
| | 96 | <dd>Version 2.x uses an internal table to support wildcard searching with <tt>*</tt>. |
|---|
| | 97 | The table assumes 8-bit (non-Unicode) character encoding. That approach will likely need |
|---|
| | 98 | to be rethought for multibyte encodings such as UTF-8. |
|---|
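One possible direction, sketched here in Python purely as an illustration, is to treat trailing-wildcard search as a prefix lookup over terms stored as code-point strings, which needs no 256-entry table. This is not the Swish-e design; `prefix_matches` and the sample terms are hypothetical.

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return every term that starts with prefix, given terms sorted by code point."""
    # All terms sharing a prefix are contiguous in sorted order,
    # so jump to the first candidate and scan until the prefix stops matching.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


# Sample dictionary with non-ASCII terms, sorted by code point.
terms = sorted(["naive", "na\u00efve", "nation", "\u00fcber", "\u00fcbung", "zebra"])

umlaut_hits = prefix_matches(terms, "\u00fcb")  # query "ub*" with u-umlaut
na_hits = prefix_matches(terms, "na")           # query "na*"
```

Because the comparison works on whole code points, a multibyte character in the prefix is never split in the middle.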
| | 99 | </dd> |
|---|
| | 100 | <dt>Tokenizing</dt> |
|---|
| | 101 | <dd>Version 2.x uses five different configuration options to control how a |
|---|
| | 102 | 'word' (token) is defined. The basic assumption is that a word is defined by which characters it |
|---|
| | 103 | <i>includes</i>. That assumption is based on a manageable character set of 256 characters. |
|---|
| | 104 | However, the sheer size of the Unicode character set makes that system unworkable. Instead, some kind of |
|---|
| | 105 | regular expression library will likely be used. |
|---|
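A regular-expression tokenizer might look roughly like the following Python sketch. It is an illustration only and does not reflect any library choice made for Swish3. It assumes that a token is a run of Unicode "word" characters as defined by Python's `\w` (which also admits digits and the underscore), and it normalizes to NFC first so that a base letter followed by a combining accent is treated as one character.

```python
import re
import unicodedata

# \w on str patterns is Unicode-aware in Python 3: letters from any script,
# plus digits and the underscore.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into tokens using a Unicode-aware word pattern."""
    # Compose base letter + combining mark pairs so the pattern sees one code point.
    normalized = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(normalized)


# German, French (with a decomposed e + combining acute), and Japanese in one string.
tokens = tokenize("Gr\u00f6\u00dfe, cafe\u0301 \u6771\u4eac!")
```

The point of the sketch is that the word definition becomes a pattern over character classes instead of an explicit list of allowed characters.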
| | 106 | </dd> |
|---|
| | 107 | |
|---|
| | 108 | <dt>Stemming</dt><dd>The stemmers used will need full international support. |
|---|
| | 109 | </dd> |
|---|
| | 110 | <dt>Configuration format</dt> |
|---|
| | 111 | <dd>Because Swish-e reads StopWords, character definitions, and similar settings from a |
|---|
| | 112 | configuration file, the parser for that file must support UTF-8 as well. |
|---|
| | 113 | The current idea is to switch to XML-style configuration files and use libxml2 to parse |
|---|
| | 114 | them. |
|---|
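To show what a UTF-8 XML configuration could make possible, here is a hedged Python sketch. It uses Python's standard `xml.etree.ElementTree` rather than libxml2, and the element names (`swish_config`, `StopWords`, `word`) are invented for the example; no Swish3 configuration schema is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document, encoded as UTF-8 bytes.
CONFIG = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<swish_config>\n"
    "  <StopWords>\n"
    "    <word>the</word>\n"
    "    <word>\u00fcber</word>\n"
    "  </StopWords>\n"
    "</swish_config>\n"
).encode("utf-8")


def load_stopwords(xml_bytes):
    """Parse the config and return the stop words as a set of Unicode strings."""
    root = ET.fromstring(xml_bytes)
    return {w.text.strip() for w in root.iter("word") if w.text}


stopwords = load_stopwords(CONFIG)
```

The XML parser handles the encoding declaration, so the rest of the program only ever sees decoded Unicode strings.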
| | 115 | </dd> |
|---|
| | 116 | </dl> |
|---|
| | 117 | |
|---|
| | 118 | </dd> |
|---|
| | 119 | |
|---|
| | 120 | <dt>Incremental indexing</dt> |
|---|
| | 121 | <dd>Swish3 will support true incremental indexing. This will allow document records |
|---|
| | 122 | to be added, modified, and deleted in an existing index. This feature may or may not build |
|---|
| | 123 | on the version 2.x experimental btree/incremental feature. |
|---|
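The three operations can be illustrated with a toy in-memory inverted index. This Python sketch says nothing about how Swish3 will store its index on disk; the `TinyIndex` class and its methods are hypothetical. The key idea is that each document remembers its own terms, so it can be removed or replaced without rebuilding the whole index.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index supporting add, modify (re-add), and delete."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> set of its terms

    def add(self, doc_id, text):
        """Add a document, replacing any earlier version with the same id."""
        self.delete(doc_id)
        terms = set(text.casefold().split())
        self.docs[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document's postings; unknown ids are ignored."""
        for term in self.docs.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        """Return the ids of documents containing the term."""
        return set(self.postings.get(term.casefold(), set()))


index = TinyIndex()
index.add("a", "red apple")
index.add("b", "green apple")
```

Calling `index.add("a", "red pear")` afterwards would modify document `a` in place, and `index.delete("b")` would drop document `b`, in both cases leaving the other postings untouched.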
| | 124 | </dd> |
|---|
| | 125 | |
|---|
| | 126 | <dt>Scaling</dt> |
|---|
| | 127 | <dd>Swish3 will reliably scale to larger (multimillion) document collections. |
|---|
| | 128 | </dd> |
|---|
| | 129 | |
|---|
| | 130 | <dt>Indexing API</dt> |
|---|
| | 131 | <dd>Swish3 will include an indexing API in addition to the current searching API.</dd> |
|---|
| | 132 | |
|---|
| | 133 | <dt>Streamlined feature set</dt> |
|---|
| | 134 | <dd>Swish3 will drop several features found in the current version: |
|---|
| | 135 | <ul> |
|---|
| | 136 | <li>Expat parsers</li> |
|---|
| | 137 | <li><tt>-S http</tt> indexing method and related configuration options</li> |
|---|
| | 138 | <li>Older stemmers</li> |
|---|
| | 139 | <li>Current native index format</li> |
|---|
| | 140 | </ul> |
|---|
| | 141 | </dd> |
|---|
| | 142 | |
|---|
| | 143 | <dt>Alternate index backends</dt> |
|---|
| | 144 | <dd>Swish3 will offer alternate index backends using available open source libraries, |
|---|
| | 145 | such as <a href='http://xapian.org/'>Xapian</a>, |
|---|
| | 146 | <a href='http://hyperestraier.sourceforge.net/'>HyperEstraier</a>, |
|---|
| | 147 | <a href='http://incubator.apache.org/lucene4c/'>Lucene</a>, or |
|---|
| | 148 | <a href='http://www.lemurproject.org/'>Lemur</a>. |
|---|
| | 149 | </dd> |
|---|
| | 150 | |
|---|
| | 151 | </dl> |
|---|
| | 152 | </p> |
|---|
| | 153 | |
|---|