Changeset 1915

Timestamp:
02/28/07 09:37:22
Author:
karpet
Message:

add swish3 page

Files:

  • swish_website/src/devel/index.html

    r1885 r1915  
    75 75 </p> 
    76 76 
    77 <a name="swish3"></a> 
    78 <h3>Features planned for 3.0</h3> 
    79  
    80 <p> 
    81 Swish-e 3.0 (sometimes abbreviated Swish-3) will be a complete overhaul of the code. 
    82 Major feature improvements will include: 
    83  
    84 <dl> 
    85  <dt>Unicode support</dt> 
    86  <dd>Unicode is the <a href='http://www.unicode.org/unicode/faq/'>international standard  
    87  for character encodings</a>. Swish-e will implement 
    88  support for the <a href='http://www.cl.cam.ac.uk/~mgk25/unicode.html'>UTF-8</a> 
    89  <a href='http://czyborra.com/utf/'>character encoding</a>, 
    90  which should handle all major languages in the world (as restricted by RFC 3629, UTF-8  
    91  encodes up to 1,114,112 code points; the original design allowed 2,147,483,648). 
    92  The Swish-e developers need input from non-English language experts.  
    93  Please contribute to the discussion at the 
    94    
    95   [% link_to_page('discuss' , 'Swish-e mailing list' ) %]. 
    96    
    97  Some significant known issues include: 
    98  <p /> 
    99  <dl> 
    100   <dt>lowercase vs. UPPERCASE</dt> 
    101   <dd>Version 2.x uses <tt>tolower()</tt> to lowercase all characters 
    102   before searching and indexing. Should the same approach be used for UTF-8? Will this have 
    103   significant impact on usability for non-English languages?  
    104   </dd> 
    105   <dt>Wildcards</dt> 
    106   <dd>Version 2.x uses an internal table to support wildcard searching with <tt>*</tt>. 
    107   The table assumes 8-bit (non-Unicode) character encoding. That approach will likely need 
    108   to be re-thought for multibyte encodings like UTF-8. 
    109   </dd> 
    110   <dt>WordCharacters</dt> 
    111   <dd>Version 2.x uses 5 different configuration options to control how a  
    112   'word' is defined. The basic assumption is that a word is defined by which characters it 
    113   <i>includes</i>. That assumption is based on a manageable character set of 256 characters. 
    114   However, the sheer size of the Unicode character set suggests that the basic assumption should be inverted: 
    115   a word is defined by which characters it <i>excludes</i>. Thus, Swish-3 will likely include 
    116   3 configuration options instead of the current 5:  
    117   IgnoreCharacters, IgnoreStartCharacters, and IgnoreEndCharacters. 
    118   </dd> 
    119    
    120   <dt>Stemming</dt><dd>The stemmers used will need full international support. 
    121   </dd> 
    122   <dt>Configuration format</dt> 
    123   <dd>Since Swish-e depends on a configuration file for StopWords, Character 
    124   definitions, etc., the parsing of the configuration file must support UTF-8 as well. 
    125   The current idea is to switch to XML-style configuration files and use Libxml2 to parse 
    126   them. 
    127   </dd> 
    128  </dl> 
    129   
    130  </dd> 
    131  
    132  <dt>Incremental indexing</dt> 
    133  <dd>Swish-3 will support true incremental indexing. This will allow for document records 
    134  to be modified, added and deleted in an existing index. This feature may or may not build 
    135  on the version 2.x experimental btree/incremental feature. 
    136  </dd> 
    137   
    138  <dt>Scaling</dt> 
    139  <dd>Swish-3 will reliably scale to larger (multimillion) document collections. 
    140  </dd> 
    141   
    142  <dt>Indexing API</dt> 
    143  <dd>Swish-e will include an indexing API in addition to the current searching API.</dd> 
    144   
    145  <dt>Streamlined feature set</dt> 
    146  <dd>Swish-3 will not contain several features in the current version: 
    147  <ul> 
    148   <li>Expat parsers</li> 
    149   <li><tt>-S http</tt> indexing method and related configuration options</li> 
    150   <li>Older stemmers</li> 
    151   <li>Current native index format</li> 
    152  </ul> 
    153  </dd> 
    154   
    155  <dt>Alternate index backends</dt> 
    156  <dd>Swish-3 may offer alternate index backends using available open source libraries, 
    157  such as <a href='http://xapian.org/'>Xapian</a>,  
    158  <a href='http://hyperestraier.sourceforge.net/'>HyperEstraier</a>, 
    159  <a href='http://incubator.apache.org/lucene4c/'>Lucene</a>, or  
    160  <a href='http://www.lemurproject.org/'>Lemur</a>. 
    161  </dd> 
    162   
    163 </dl> 
    164 </p> 
    165  
    166 77 <hr /> 
    167 78
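
The WordCharacters item in the added text proposes defining a word by the characters it excludes rather than the ones it includes. The sketch below illustrates that idea in Python. It is not Swish-3 code: the option names come from the page, but their exact semantics are not specified there, so the behaviour shown (split on IgnoreCharacters, then trim IgnoreStartCharacters and IgnoreEndCharacters from each word) and the default character sets are assumptions made for illustration.

```python
# Hypothetical stand-ins for the three proposed options. The character
# sets below are illustrative guesses, not Swish-3 defaults.
IGNORE_CHARACTERS = set(" \t\n.,;:!?\"()[]{}<>/")
IGNORE_START_CHARACTERS = set("'-")
IGNORE_END_CHARACTERS = set("'-")


def tokenize(text,
             ignore=IGNORE_CHARACTERS,
             ignore_start=IGNORE_START_CHARACTERS,
             ignore_end=IGNORE_END_CHARACTERS):
    """Split text into words using exclusion rules only.

    Any character not listed in `ignore` is treated as part of a word,
    so no table of valid word characters is needed.
    """
    # Pass 1: break the text wherever an ignored character appears.
    raw_words = []
    current = []
    for ch in text:
        if ch in ignore:
            if current:
                raw_words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        raw_words.append("".join(current))

    # Pass 2: trim characters that may not start or end a word.
    words = []
    for word in raw_words:
        start, end = 0, len(word)
        while start < end and word[start] in ignore_start:
            start += 1
        while end > start and word[end - 1] in ignore_end:
            end -= 1
        if start < end:  # drop words that were trimmed away entirely
            words.append(word[start:end])
    return words
```

Because the rules only name what to exclude, text in any script passes through without extra configuration: `tokenize("'Übung' macht den Meister - 日本語!")` keeps `Übung` and `日本語` as words, and an interior apostrophe as in `don't` is preserved because it is neither a separator nor at a word boundary.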