<a name="swish3"></a>
<h3>Features planned for 3.0</h3>

<p>
Swish-e 3.0 (sometimes abbreviated Swish-3) will be a complete overhaul of the code.
Major feature improvements will include:

<dl>
<dt>Unicode support</dt>
<dd>Unicode is the <a href='http://www.unicode.org/unicode/faq/'>international standard
for character encodings</a>. Swish-e will implement
support for the <a href='http://www.cl.cam.ac.uk/~mgk25/unicode.html'>UTF-8</a>
<a href='http://czyborra.com/utf/'>character encoding</a>,
which should handle all major languages in the world (as originally specified, UTF-8 can
encode up to 2,147,483,648 code points; the current definition in RFC 3629 restricts it
to the Unicode range U+0000 through U+10FFFF).
The Swish-e developers need input from non-English language experts.
Please contribute to the discussion on the

[% link_to_page('discuss' , 'Swish-e mailing list' ) %].

Some significant known issues include:
<p />
<dl>
<dt>lowercase vs. UPPERCASE</dt>
<dd>Version 2.x uses <tt>tolower()</tt> to lowercase all characters
before searching and indexing. Should the same approach be used for UTF-8? Will this have
a significant impact on usability for non-English languages?
</dd>
<dt>Wildcards</dt>
<dd>Version 2.x uses an internal table to support wildcard searching with <tt>*</tt>.
The table assumes an 8-bit (non-Unicode) character encoding. That approach will likely need
to be rethought for multibyte encodings like UTF-8.
</dd>
<dt>WordCharacters</dt>
<dd>Version 2.x uses five different configuration options to control how a
'word' is defined. The basic assumption is that a word is defined by which characters it
<i>includes</i>. That assumption is based on a manageable character set of 256 characters.
However, the sheer size of the Unicode character set suggests that the basic assumption
should be inverted: a word is defined by which characters it <i>excludes</i>. Thus, Swish-3
will likely include three configuration options instead of the current five:
IgnoreCharacters, IgnoreStartCharacters, and IgnoreEndCharacters.
</dd>

<dt>Stemming</dt><dd>The stemmers used will need full international support.
</dd>
<dt>Configuration format</dt>
<dd>Because Swish-e reads StopWords, character definitions, and similar settings from a
configuration file, the configuration parser must support UTF-8 as well.
The current idea is to switch to XML-style configuration files and use libxml2 to parse
them.
</dd>
</dl>
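<p>
As a rough illustration of the case-folding and word-definition questions above, the
following Python sketch contrasts byte-wise ASCII lowercasing (roughly what
<tt>tolower()</tt> does in the default C locale) with Unicode-aware case folding, and
splits text by the characters a word <i>excludes</i>. It is only a sketch under stated
assumptions: the function names and the contents of <tt>IGNORE_CHARACTERS</tt> are
hypothetical and do not reflect the Swish-3 design.
</p>

```python
import unicodedata


def ascii_lower(data: bytes) -> bytes:
    """Lowercase only ASCII A-Z, byte by byte; all other bytes pass through.

    This mimics an 8-bit, C-locale style of lowercasing applied to UTF-8 bytes.
    """
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


def unicode_fold(text: str) -> str:
    """Normalize to NFC, then apply Unicode case folding to decoded text."""
    return unicodedata.normalize("NFC", text).casefold()


# Hypothetical ignore set: a word is whatever lies between these characters.
IGNORE_CHARACTERS = set(" \t\n.,;:!?\"()[]")


def split_words(text: str, ignore=IGNORE_CHARACTERS) -> list:
    """Split text into words, defined by the characters they exclude."""
    words, current = [], []
    for ch in text:
        if ch in ignore:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```

<p>
Under these assumptions, <tt>ascii_lower</tt> leaves the two-byte UTF-8 sequence for an
accented capital letter untouched, while <tt>unicode_fold</tt> folds it. Full case folding
can also change the length of a string, which matters for anything that stores word
positions or offsets.
</p>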

</dd>

<dt>Incremental indexing</dt>
<dd>Swish-3 will support true incremental indexing. This will allow document records
to be added, modified, and deleted in an existing index. This feature may or may not build
on the version 2.x experimental btree/incremental feature.
</dd>

<dt>Scaling</dt>
<dd>Swish-3 will reliably scale to larger (multimillion-document) collections.
</dd>

<dt>Indexing API</dt>
<dd>Swish-e will include an indexing API in addition to the current searching API.</dd>

<dt>Streamlined feature set</dt>
<dd>Swish-3 will drop several features of the current version:
<ul>
<li>Expat parsers</li>
<li><tt>-S http</tt> indexing method and related configuration options</li>
<li>Older stemmers</li>
<li>Current native index format</li>
</ul>
</dd>

<dt>Alternate index backends</dt>
<dd>Swish-3 may offer alternate index backends using available open source libraries,
such as <a href='http://xapian.org/'>Xapian</a>,
<a href='http://hyperestraier.sourceforge.net/'>HyperEstraier</a>,
<a href='http://incubator.apache.org/lucene4c/'>Lucene</a>, or
<a href='http://www.lemurproject.org/'>Lemur</a>.
</dd>

</dl>
</p>

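<p>
To make the incremental indexing item above more concrete, here is a minimal Python
sketch of an in-memory inverted index that supports adding, replacing, and deleting
documents without a full rebuild. The <tt>TinyIndex</tt> class and its methods are
hypothetical and purely illustrative; they say nothing about the index format or API
that Swish-3 will actually use.
</p>

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index; hypothetical, not the Swish-e index format."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of document ids
        self.docs = {}                    # document id -> set of indexed words

    def add(self, doc_id, text):
        """Add a document, replacing any earlier version with the same id."""
        self.delete(doc_id)
        words = set(text.casefold().split())
        self.docs[doc_id] = words
        for word in words:
            self.postings[word].add(doc_id)

    def delete(self, doc_id):
        """Remove a document and drop posting lists that become empty."""
        for word in self.docs.pop(doc_id, ()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def search(self, word):
        """Return the sorted ids of documents containing the word."""
        return sorted(self.postings.get(word.casefold(), ()))
```

<p>
Keeping a per-document word list is one simple way to make deletion cheap, at the cost
of extra storage. A disk-based index for multimillion-document collections would need a
very different layout.
</p>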