Introduction

Weka supports stemming algorithms in the developer version. The stemming algorithms are located in the following package:
 weka.core.stemmers
Currently, the Lovins stemmer (and its iterated version) and support for the Snowball stemmers are included.

Snowball stemmers

Weka contains a wrapper class for the Snowball stemmers (containing the Porter stemmer and several other stemmers for different languages). The relevant class is weka.core.stemmers.SnowballStemmer.

The Snowball classes are not included in Weka; they only have to be present in the classpath. This way, the Weka team doesn't have to watch out for new versions of the stemmers and update them.

There are two ways of getting hold of the Snowball stemmers:
  1. You can add the pre-compiled archive to your classpath and you're set.
    (based on source code from 2005-10-19, compiled 2005-10-22)
  2. You can compile the stemmers yourself with the newest sources.
    Just download the file, unpack it and follow the instructions in the README file (the zip contains an ANT build script for generating the jar archive).
    Note: the patch target is specific to the source code from 2005-10-19.
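
Whichever route you take, it can be useful to confirm that the Snowball classes are actually visible on the classpath before starting Weka. The following is a minimal, stand-alone sketch of such a check. It does not use any Weka API, and the default class name org.tartarus.snowball.SnowballProgram is an assumption about how the Snowball archive is packaged; pass a different fully qualified name as the first argument if your archive differs.

```java
public class SnowballCheck {

    /**
     * Returns true if the named class can be located by this class's
     * class loader. The class is not initialized, only looked up.
     */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, SnowballCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed default; override with the first command-line argument.
        String name = args.length > 0
                ? args[0]
                : "org.tartarus.snowball.SnowballProgram";
        System.out.println(name + (isOnClasspath(name) ? ": found" : ": NOT found"));
    }
}
```

Run it with the same classpath you intend to use for Weka, so that the result reflects what Weka will see.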

PTStemmer

PTStemmer is a stemmer library for Portuguese developed by Pedro Oliveira.
In order to use this library, you just need to download two archives, the actual stemmer library and the wrapper that makes the library available within Weka, and add both to your classpath.
The source code of the wrapper project is also available.
NB: the source code and the resulting jars are based on version 1.0 of the PTStemmer library.

Using stemmers

The stemmers can be used either
  • from the command line
  • within the StringToWordVector filter (package weka.filters.unsupervised.attribute)

Command line

All stemmers support the following options:
  • -h
    Displays a brief help message
  • -i <input-file>
    The file to process
  • -o <output-file>
    The file to output the processed data to (default stdout)
  • -l
    Uses lowercase strings, i.e. the input is automatically converted to lower case

StringToWordVector

Just use the GenericObjectEditor to choose the right stemmer and the desired options (if the stemmer offers these).

Adding new stemmers

You can easily add new stemmers. For them to be usable in the GenericObjectEditor, follow these guidelines:
  • they should be located in the weka.core.stemmers package and
  • they must implement the interface weka.core.stemmers.Stemmer.
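
To illustrate the shape of such a class, here is a minimal sketch of a toy stemmer. To keep it compilable without Weka, it declares a local stand-in Stemmer interface with a single stem(String) method; this is an assumption, and the real weka.core.stemmers.Stemmer interface may require further methods depending on the Weka version. The class name SimpleSuffixStemmer and its suffix list are hypothetical and serve only as an example.

```java
import java.util.Locale;

public class SuffixStemmerSketch {

    /** Stand-in for weka.core.stemmers.Stemmer (assumed: one stem method). */
    interface Stemmer {
        String stem(String word);
    }

    /** Hypothetical toy stemmer that strips a few common English suffixes. */
    static class SimpleSuffixStemmer implements Stemmer {

        private static final String[] SUFFIXES = {"ing", "ed", "s"};

        @Override
        public String stem(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            for (String suffix : SUFFIXES) {
                // Strip the first matching suffix, but keep at least
                // three characters so short words stay intact.
                if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                    return w.substring(0, w.length() - suffix.length());
                }
            }
            return w;
        }
    }

    public static void main(String[] args) {
        Stemmer stemmer = new SimpleSuffixStemmer();
        for (String word : new String[] {"Walking", "jumped", "cats", "is"}) {
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}
```

In an actual Weka extension, the class would be placed in the weka.core.stemmers package and implement weka.core.stemmers.Stemmer instead of the local stand-in.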

Links