Changeset 1786

Timestamp:
10/21/05 13:20:47
Author:
karman
Message:

swish-3 planning notes

Files:

  • trunk/swish_website/src/devel/index.html

    r1748 r1786  
    7575</p> 
    7676 
     77<a name="swish3"></a> 
    7778<h3>Features planned for 3.0</h3> 
    7879 
    7980<p> 
    80 Swish-e 3.0 will be a complete re-write of the code to support UTF-8 (Unicode). 
    81 Other planned features include: 
    82  
    83 <ul> 
    84 None at this time. 
    85 </ul> 
    86  
     81Swish-e 3.0 (sometimes abbreviated Swish-3) will be a complete overhaul of the code. 
     82Major feature improvements will include: 
     83 
     84<dl> 
     85 <dt>Unicode support</dt> 
     86 <dd>Unicode is the <a href='http://www.unicode.org/unicode/faq/'>international standard  
     87 for character encodings</a>. Swish-e will implement 
     88 support for the <a href='http://www.cl.cam.ac.uk/~mgk25/unicode.html'>UTF-8</a> 
     89 <a href='http://czyborra.com/utf/'>character encoding</a>, 
     90 which should handle all major languages in the world (the original UTF-8 design could encode up to  
     91 2,147,483,648 distinct values; RFC 3629 limits it to code points up to U+10FFFF). 
     92 The Swish-e developers need input from non-English language experts.  
     93 Please contribute to the discussion at the 
     94   
     95  [% link_to_page('discuss' , 'Swish-e mailing list' ) %]. 
     96   
     97 Some significant known issues include: 
     98 <p /> 
     99 <dl> 
     100  <dt>lowercase vs. UPPERCASE</dt> 
     101  <dd>Version 2.x uses <tt>tolower()</tt> to lowercase all characters 
     102  before searching and indexing. Should the same approach be used for UTF-8? Will this have 
     103  significant impact on usability for non-English languages?  
     104  </dd> 
     105  <dt>Wildcards</dt> 
     106  <dd>Version 2.x uses an internal table to support wildcard searching with <tt>*</tt>. 
     107  The table assumes 8-bit (non-Unicode) character encoding. That approach will likely need 
     108  to be re-thought for multibyte encodings like UTF-8. 
     109  </dd> 
     110  <dt>WordCharacters</dt> 
     111  <dd>Version 2.x uses 5 different configuration options to control how a  
     112  'word' is defined. The basic assumption is that a word is defined by which characters it 
     113  <i>includes</i>. That assumption is based on a manageable character set of 256 characters. 
     114  However, the sheer size of the Unicode character set suggests that the basic assumption should be inverted: 
     115  a word is defined by which characters it <i>excludes</i>. Thus, Swish-3 will likely include 
     116  3 configuration options instead of the current 5:  
     117  IgnoreCharacters, IgnoreStartCharacters, and IgnoreEndCharacters. 
     118  </dd> 
     119   
     120  <dt>Stemming</dt><dd>The stemmers used will need full international support. 
     121  </dd> 
     122  <dt>Configuration format</dt> 
     123  <dd>Since Swish-e depends on a configuration file for StopWords, Character 
     124  definitions, etc., the parsing of the configuration file must support UTF-8 as well. 
     125  The current idea is to switch to XML-style configuration files and use Libxml2 to parse 
     126  them. 
     127  </dd> 
     128 </dl> 
     129  
     130 </dd> 
     131 
     132 <dt>Incremental indexing</dt> 
     133 <dd>Swish-3 will support true incremental indexing. This will allow for document records 
     134 to be modified, added and deleted in an existing index. This feature may or may not build 
     135 on the version 2.x experimental btree/incremental feature. 
     136 </dd> 
     137  
     138 <dt>Scaling</dt> 
     139 <dd>Swish-3 will reliably scale to larger (multimillion) document collections. 
     140 </dd> 
     141  
     142 <dt>Indexing API</dt> 
     143 <dd>Swish-e will include an indexing API in addition to the current searching API.</dd> 
     144  
     145 <dt>Streamlined feature set</dt> 
     146 <dd>Swish-3 will not contain several features in the current version: 
     147 <ul> 
     148  <li>Expat parsers</li> 
     149  <li><tt>-S http</tt> indexing method and related configuration options</li> 
     150  <li>Older stemmers</li> 
     151  <li>Current native index format</li> 
     152 </ul> 
     153 </dd> 
     154  
     155 <dt>Alternate index backends</dt> 
     156 <dd>Swish-3 may offer alternate index backends using available open source libraries, 
     157 such as <a href='http://xapian.org/'>Xapian</a>,  
     158 <a href='http://hyperestraier.sourceforge.net/'>HyperEstraier</a>, 
     159 <a href='http://incubator.apache.org/lucene4c/'>Lucene</a>, or  
     160 <a href='http://www.lemurproject.org/'>Lemur</a>. 
     161 </dd> 
     162  
     163</dl> 
    87164</p> 
    88165
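
The "WordCharacters" note in the added text proposes defining a word by the characters it excludes rather than the ones it includes. A minimal sketch of that idea follows, assuming input that has already been decoded from UTF-8. The three sets are hypothetical stand-ins for the proposed IgnoreCharacters, IgnoreStartCharacters, and IgnoreEndCharacters options, and `tokenize` is an illustrative name, not part of any Swish-e API. Lowercasing uses Python's Unicode-aware `str.lower()`, which is only one possible answer to the open lowercase question raised in the notes.

```python
# Hypothetical stand-ins for the three proposed configuration options.
# The specific characters chosen here are assumptions for illustration only.
IGNORE_CHARACTERS = set(" \t\r\n.,;:!?()[]\"")   # never part of a word
IGNORE_START_CHARACTERS = set("'-")              # stripped from the start of a word
IGNORE_END_CHARACTERS = set("'-")                # stripped from the end of a word


def tokenize(text):
    """Split decoded text into lowercased words, defined by exclusion."""
    # Pass 1: any character not in IGNORE_CHARACTERS belongs to a word.
    raw_words = []
    current = []
    for ch in text:
        if ch in IGNORE_CHARACTERS:
            if current:
                raw_words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        raw_words.append("".join(current))

    # Pass 2: trim ignorable leading and trailing characters, then lowercase.
    words = []
    for word in raw_words:
        start, end = 0, len(word)
        while start < end and word[start] in IGNORE_START_CHARACTERS:
            start += 1
        while end > start and word[end - 1] in IGNORE_END_CHARACTERS:
            end -= 1
        if start < end:
            words.append(word[start:end].lower())
    return words


if __name__ == "__main__":
    data = "Füße -naïve- (日本語)!".encode("utf-8")
    print(tokenize(data.decode("utf-8")))
```

Because the exclusion sets stay small regardless of how many scripts appear in the input, no table of every permitted character is needed; characters from any language pass through unless explicitly ignored.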