| 80 | | Swish-e 3.0 will be a complete re-write of the code to support UTF-8 (Unicode). |
|---|
| 81 | | Other planned features include: |
|---|
| 82 | | |
|---|
| 83 | | <ul> |
|---|
| 84 | | None at this time. |
|---|
| 85 | | </ul> |
|---|
| 86 | | |
|---|
| | 81 | Swish-e 3.0 (sometimes abbreviated Swish-3) will be a complete overhaul of the code. |
|---|
| | 82 | Major feature improvements will include: |
|---|
| | 83 | |
|---|
| | 84 | <dl> |
|---|
| | 85 | <dt>Unicode support</dt> |
|---|
| | 86 | <dd>Unicode is the <a href='http://www.unicode.org/unicode/faq/'>international standard |
|---|
| | 87 | for character encodings</a>. Swish-e will implement |
|---|
| | 88 | support for the <a href='http://www.cl.cam.ac.uk/~mgk25/unicode.html'>UTF-8</a> |
|---|
| | 89 | <a href='http://czyborra.com/utf/'>character encoding</a>, |
|---|
| | 90 | which should handle all major languages in the world (as originally designed, UTF-8 could encode up to |
|---|
| | 91 | 2,147,483,648 code points; RFC 3629 now limits it to the 1,114,112 code points of Unicode). |
|---|
| | 92 | The Swish-e developers need input from non-English language experts. |
|---|
| | 93 | Please contribute to the discussion at the |
|---|
| | 94 | |
|---|
| | 95 | [% link_to_page('discuss' , 'Swish-e mailing list' ) %]. |
|---|
| | 96 | |
|---|
| | 97 | Some significant known issues include: |
|---|
| | 98 | <p /> |
|---|
| | 99 | <dl> |
|---|
| | 100 | <dt>lowercase vs. UPPERCASE</dt> |
|---|
| | 101 | <dd>Version 2.x uses <tt>tolower()</tt> to lowercase all characters |
|---|
| | 102 | before searching and indexing. Should the same approach be used for UTF-8? Will this have |
|---|
| | 103 | significant impact on usability for non-English languages? |
|---|
| | 104 | </dd> |
|---|
| | 105 | <dt>Wildcards</dt> |
|---|
| | 106 | <dd>Version 2.x uses an internal table to support wildcard searching with <tt>*</tt>. |
|---|
| | 107 | The table assumes 8-bit (non-Unicode) character encoding. That approach will likely need |
|---|
| | 108 | to be re-thought for multibyte encodings like UTF-8. |
|---|
| | 109 | </dd> |
|---|
| | 110 | <dt>WordCharacters</dt> |
|---|
| | 111 | <dd>Version 2.x uses 5 different configuration options to control how a |
|---|
| | 112 | 'word' is defined. The basic assumption is that a word is defined by which characters it |
|---|
| | 113 | <i>includes</i>. That assumption is based on a manageable character set of 256 characters. |
|---|
| | 114 | However, the sheer size of the Unicode character set suggests that the basic assumption should be inverted: |
|---|
| | 115 | a word is defined by which characters it <i>excludes</i>. Thus, Swish-3 will likely include |
|---|
| | 116 | 3 configuration options instead of the current 5: |
|---|
| | 117 | IgnoreCharacters, IgnoreStartCharacters, and IgnoreEndCharacters. |
|---|
| | 118 | </dd> |
|---|
| | 119 | |
|---|
| | 120 | <dt>Stemming</dt><dd>The stemmers used will need full international support. |
|---|
| | 121 | </dd> |
|---|
| | 122 | <dt>Configuration format</dt> |
|---|
| | 123 | <dd>Since Swish-e depends on a configuration file for StopWords, Character |
|---|
| | 124 | definitions, etc., the parsing of the configuration file must support UTF-8 as well. |
|---|
| | 125 | The current idea is to switch to XML-style configuration files and use Libxml2 to parse |
|---|
| | 126 | them. |
|---|
| | 127 | </dd> |
|---|
| | 128 | </dl> |
|---|
| | 129 | |
|---|
| | 130 | </dd> |
|---|
| | 131 | |
|---|
| | 132 | <dt>Incremental indexing</dt> |
|---|
| | 133 | <dd>Swish-3 will support true incremental indexing. This will allow for document records |
|---|
| | 134 | to be modified, added and deleted in an existing index. This feature may or may not build |
|---|
| | 135 | on the version 2.x experimental btree/incremental feature. |
|---|
| | 136 | </dd> |
|---|
| | 137 | |
|---|
| | 138 | <dt>Scaling</dt> |
|---|
| | 139 | <dd>Swish-3 will reliably scale to larger (multimillion) document collections. |
|---|
| | 140 | </dd> |
|---|
| | 141 | |
|---|
| | 142 | <dt>Indexing API</dt> |
|---|
| | 143 | <dd>Swish-e will include an indexing API in addition to the current searching API.</dd> |
|---|
| | 144 | |
|---|
| | 145 | <dt>Streamlined feature set</dt> |
|---|
| | 146 | <dd>Swish-3 will drop several features found in the current version: |
|---|
| | 147 | <ul> |
|---|
| | 148 | <li>Expat parsers</li> |
|---|
| | 149 | <li><tt>-S http</tt> indexing method and related configuration options</li> |
|---|
| | 150 | <li>Older stemmers</li> |
|---|
| | 151 | <li>Current native index format</li> |
|---|
| | 152 | </ul> |
|---|
| | 153 | </dd> |
|---|
| | 154 | |
|---|
| | 155 | <dt>Alternate index backends</dt> |
|---|
| | 156 | <dd>Swish-3 may offer alternate index backends using available open source libraries, |
|---|
| | 157 | such as <a href='http://xapian.org/'>Xapian</a>, |
|---|
| | 158 | <a href='http://hyperestraier.sourceforge.net/'>HyperEstraier</a>, |
|---|
| | 159 | <a href='http://incubator.apache.org/lucene4c/'>Lucene</a>, or |
|---|
| | 160 | <a href='http://www.lemurproject.org/'>Lemur</a>. |
|---|
| | 161 | </dd> |
|---|
| | 162 | |
|---|
| | 163 | </dl> |
|---|
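
The inverted word definition described under WordCharacters, combined with the lowercasing question, can be illustrated with a minimal Python sketch. It is not Swish-e code. The option names IgnoreCharacters, IgnoreStartCharacters, and IgnoreEndCharacters come from the text above, but the semantics given to them here are assumptions, as are the `tokenize` function and its parameters. Unicode case folding (`str.casefold`) stands in for the byte-oriented `tolower()` purely for illustration.

```python
def tokenize(text, ignore_chars, ignore_start="", ignore_end=""):
    """Split text into words defined by the characters they exclude.

    Hypothetical semantics (assumed, not taken from Swish-e):
      ignore_chars  -- characters that can never be part of a word
                       (plays the role of IgnoreCharacters)
      ignore_start  -- characters stripped from the start of a word
                       (plays the role of IgnoreStartCharacters)
      ignore_end    -- characters stripped from the end of a word
                       (plays the role of IgnoreEndCharacters)
    """
    words = []
    current = []
    for ch in text:
        # Whitespace and excluded characters both end the current word.
        if ch.isspace() or ch in ignore_chars:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))

    result = []
    for word in words:
        word = word.lstrip(ignore_start).rstrip(ignore_end)
        if word:
            # casefold() applies Unicode case folding, which is not
            # limited to single-byte characters the way tolower() is.
            result.append(word.casefold())
    return result


if __name__ == "__main__":
    sample = "Stra\u00dfe, \u00c9COLE! 'quoted'"
    print(tokenize(sample, ignore_chars=",!", ignore_start="'", ignore_end="'"))
```

Because only the excluded characters are listed, any character from any script is accepted as part of a word without a table entry for it. The trade-off is that every separator has to be named explicitly or covered by a whitespace test.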