Stemmers


=Introduction=
Weka supports stemming algorithms in the developer version. The stemming algorithms are located in the package weka.core.stemmers. Currently, the Lovins stemmer (plus an iterated version) and support for the Snowball stemmers are included.

=Snowball stemmers=
Weka contains a wrapper class for the Snowball stemmers (which include the Porter stemmer and several other stemmers for different languages). The relevant class is weka.core.stemmers.SnowballStemmer.

The Snowball classes themselves are not included in Weka; they only have to be present in the classpath. This way, the Weka team does not have to watch out for new versions of the stemmers and update them.
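
Whether the Snowball classes are actually visible on the classpath can be checked with a few lines of Java. The following is only a minimal sketch: the class name `SnowballClasspathCheck` is made up for illustration, and the probed class name `org.tartarus.snowball.SnowballProgram` is an assumption about how the Snowball jar is packaged, so adjust it to the jar you are using.

```java
// Minimal classpath probe. The probed class name is an assumption, not taken from this page.
class SnowballClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, SnowballClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed name of a Snowball base class; pass another name as the first argument if needed.
        String probe = args.length > 0 ? args[0] : "org.tartarus.snowball.SnowballProgram";
        String verdict = isOnClasspath(probe) ? " found" : " NOT found";
        System.out.println(probe + verdict + " on the classpath");
    }
}
```

Run it with the same classpath you use for Weka; if the class is reported as missing, the jar has probably not been added yet.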

There are **three** ways of getting hold of the Snowball stemmers:
 * 1) For Weka 3.7.x, you can install an unofficial package.
 * 2) You can add the **pre-compiled [[file:snowball-20051019.jar]] archive** to your classpath and you're set.
 > (based on source code from 2005-10-19, compiled 2005-10-22)
 * 3) You can **compile the stemmers yourself** with the newest sources.
 > Just download the file, unpack it and follow the instructions in the README file (the zip contains an ANT build script for generating the jar archive).
 > **Note:** the //patch// target is specific to the source code from 2005-10-19.

=PTStemmer=
PTStemmer is a stemmer library for Portuguese developed by Pedro Oliveira.

In order to use this library, you can either:
 * 1) install the unofficial package when using Weka 3.7.x, or
 * 2) download the [[file:ptstemmer.jar]] (the actual stemmer library) and the [[file:ptstemmer-weka.jar]] (the wrapper that makes the library available within Weka) and add them to your classpath.

The source code of the wrapper project is also available.

**NB:** the source code and the resulting jars are based on version 1.0 of the PTStemmer library.

=Using stemmers=
The stemmers can be used either
 * from the command line
 * within the StringToWordVector filter
==Commandline==
All stemmers support the following options:
 * an option for displaying a brief help
 * the file to process
 * the file to output the processed data to (default //stdout//)
 * an option to use lowercase strings, i.e. the input is automatically converted to lower case
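
What these options amount to can be sketched in plain Java. This is not Weka's implementation: `StemFileSketch`, `stemLine`, `process` and the toy stemmer that merely strips a trailing "s" are all hypothetical names and logic, chosen so that the example runs without any Weka classes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringJoiner;
import java.util.function.UnaryOperator;

class StemFileSketch {

    /** Toy stand-in for a real stemmer: strips a trailing "s" from words longer than three characters. */
    static final UnaryOperator<String> TOY_STEMMER =
        w -> (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;

    /** Stems every whitespace-separated token of one line, optionally lower-casing the input first. */
    static String stemLine(String line, boolean lowercase, UnaryOperator<String> stemmer) {
        String input = lowercase ? line.toLowerCase(Locale.ROOT) : line;
        StringJoiner out = new StringJoiner(" ");
        for (String token : input.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(stemmer.apply(token));
            }
        }
        return out.toString();
    }

    /** Processes inputFile line by line; writes to outputFile, or to stdout if outputFile is null. */
    static void process(Path inputFile, Path outputFile, boolean lowercase,
                        UnaryOperator<String> stemmer) throws IOException {
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(inputFile, StandardCharsets.UTF_8)) {
            result.add(stemLine(line, lowercase, stemmer));
        }
        if (outputFile == null) {
            result.forEach(System.out::println); // default output is stdout
        } else {
            Files.write(outputFile, result, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println(stemLine("Cats chase Dogs", true, TOY_STEMMER));
        } else {
            Path out = args.length > 1 ? Path.of(args[1]) : null;
            process(Path.of(args[0]), out, true, TOY_STEMMER);
        }
    }
}
```

With the lowercase behaviour switched on, the stemmer only ever sees lower-case tokens, which matters for stemmers whose rules are written for lower-case input.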

==StringToWordVector==
Just use the GenericObjectEditor to choose the right stemmer and the desired options (if the stemmer offers these).
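
The same choice can presumably also be made programmatically. The sketch below assumes that the filter lives in `weka.filters.unsupervised.attribute.StringToWordVector`, that it offers a `setStemmer` method taking a `weka.core.stemmers.Stemmer`, and that `weka.core.stemmers.LovinsStemmer` has a no-argument constructor; none of these names are stated on this page. It uses reflection only so that it compiles and runs without weka.jar. With Weka on the classpath you would normally call the setter directly. `StemmerFilterSketch` and `configure` are hypothetical names.

```java
import java.lang.reflect.Method;

class StemmerFilterSketch {

    /**
     * Instantiates the filter and the stemmer by class name and calls filter.setStemmer(stemmer).
     * Returns the configured filter, or null if a class or the setter cannot be found.
     */
    static Object configure(String filterClass, String stemmerInterface, String stemmerClass) {
        try {
            Object filter = Class.forName(filterClass).getDeclaredConstructor().newInstance();
            Object stemmer = Class.forName(stemmerClass).getDeclaredConstructor().newInstance();
            Method setter = filter.getClass().getMethod("setStemmer", Class.forName(stemmerInterface));
            setter.invoke(filter, stemmer);
            return filter;
        } catch (ReflectiveOperationException | LinkageError e) {
            return null; // class, constructor or setter not available
        }
    }

    public static void main(String[] args) {
        // All three class names are assumptions about the Weka API.
        Object filter = configure(
            "weka.filters.unsupervised.attribute.StringToWordVector",
            "weka.core.stemmers.Stemmer",
            "weka.core.stemmers.LovinsStemmer");
        System.out.println(filter == null
            ? "Weka classes not found on the classpath"
            : "Configured " + filter.getClass().getName());
    }
}
```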

=Adding new stemmers=
You can easily add new stemmers if you follow these guidelines (for use in the GenericObjectEditor):
 * they should be located in the weka.core.stemmers package, and
 * they must implement the stemmer interface.
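
A new stemmer following these guidelines might look roughly like the sketch below. To keep it runnable without weka.jar, it declares a local stand-in interface with a single `stem(String)` method, which is assumed to mirror Weka's stemmer interface. In a real implementation you would implement Weka's own interface instead and place the class in the package named above. `NewStemmerSketch` and `SuffixStemmer` are hypothetical names, and the suffix rules are a toy, not a real stemming algorithm.

```java
import java.io.Serializable;

class NewStemmerSketch {

    /** Local stand-in for Weka's stemmer interface (assumed shape), so the sketch compiles on its own. */
    interface Stemmer extends Serializable {
        String stem(String word);
    }

    /** Hypothetical toy stemmer: strips one of a few common English suffixes. */
    static class SuffixStemmer implements Stemmer {
        private static final long serialVersionUID = 1L;
        private static final String[] SUFFIXES = {"ing", "ed", "s"};

        @Override
        public String stem(String word) {
            for (String suffix : SUFFIXES) {
                // Only strip if at least three characters of the word remain.
                if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                    return word.substring(0, word.length() - suffix.length());
                }
            }
            return word;
        }
    }

    public static void main(String[] args) {
        Stemmer stemmer = new SuffixStemmer();
        for (String w : new String[] {"walking", "walked", "walks", "is"}) {
            System.out.println(w + " -> " + stemmer.stem(w));
        }
    }
}
```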

=Links=
 * Snowball homepage
 * ANT homepage