The following provides some information on how to use Weka for text categorization.

Import

Weka needs the data to be present in ARFF or XRFF format in order to perform any classification tasks.
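A minimal ARFF file for text categorization contains one STRING attribute holding the document text and a nominal class attribute. The relation name, attribute names, and data below are illustrative:

 @relation text_example
 
 @attribute text string
 @attribute class {class1,class2}
 
 @data
 'The text of the first document.',class1
 'Another document, with a different label.',class2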


Directories

One can transform text files into ARFF format with the following tools (which tool is available depends on the version of Weka you are using):

Example directory layout for TextDirectoryLoader:
 ...
 |
 +- text_example
    |
    +- class1
    |  |
    |  + file1.txt
    |  |
    |  + file2.txt
    |  |
    |  ...
    |
    +- class2
    |  |
    |  + another_file1.txt
    |  |
    |  + another_file2.txt
    |  |
    |  ...
The above directory structure can be turned into an ARFF file like this:
java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff

CSV files

CSV files can be imported into Weka easily, either via the Weka Explorer or from the command line using the CSVLoader class:
 java weka.core.converters.CSVLoader file.csv > file.arff

By default, non-numerical attributes get imported as NOMINAL attributes, which is not necessarily desired for textual data, especially if one wants to use the StringToWordVector filter. In order to change the attribute to STRING, one can run the NominalToString filter (package weka.filters.unsupervised.attribute) on the data, specifying the attribute index or range of indices that should be converted (NB: this filter does not exclude the class attribute from conversion!). In order to retain the attribute types, one needs to save the file in ARFF or XRFF format (or in the compressed version of these formats).
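For example, converting the first attribute of a CSV-imported dataset to STRING and saving the result as ARFF could look like this (the attribute index and file names are illustrative; check the filter's -h output for your version):

 java weka.filters.unsupervised.attribute.NominalToString -C 1 -i file.arff -o file_string.arff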

Third-party tools

  • TagHelper Tools, which allows one to transform texts into vectors of stemmed or unstemmed unigrams, bigrams, part-of-speech bigrams, and some user-defined features, and then saves this representation to ARFF. Currently processes English, German, and Chinese. Spanish and Portuguese are in progress.

Working with textual data

Conversion

Most classifiers in Weka cannot handle String attributes. For these learning schemes one has to process the data with appropriate filters, e.g., the StringToWordVector filter, which can perform the TF/IDF transformation.
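For example, a StringToWordVector invocation that turns on the TF and IDF transforms might look like this (the file names are illustrative; check the filter's -h output for the exact options in your version):

 java weka.filters.unsupervised.attribute.StringToWordVector -T -I -i file.arff -o vectorized.arff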

The StringToWordVector filter places the class attribute of the generated output data at the beginning. If you would like to have it as the last attribute again, you can use the Reorder filter with the following setup:
weka.filters.unsupervised.attribute.Reorder -R 2-last,first

With the MultiFilter you can also apply both filters in one go instead of sequentially, which makes things easier in the Explorer, for instance.
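A MultiFilter setup combining the two filters might look like this on the command line (file names are illustrative; the -F option takes one quoted filter specification per filter):

 java weka.filters.MultiFilter -F "weka.filters.unsupervised.attribute.StringToWordVector" -F "weka.filters.unsupervised.attribute.Reorder -R 2-last,first" -i file.arff -o vectorized.arff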


Stopwords

The StringToWordVector filter can also work with a different stopword list than the built-in one (based on the Rainbow system). One can use the -stopwords option to load an external stopwords file. The format for such a stopword file is one stopword per line; lines starting with '#' are interpreted as comments and ignored.
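An external stopwords file in this format might look like the following (the words chosen and the file name are illustrative), and is then supplied via the -stopwords option:

 # custom stopword list
 the
 a
 an

 java weka.filters.unsupervised.attribute.StringToWordVector -stopwords stopwords.txt -i file.arff -o vectorized.arff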

Note: There was a bug in Weka 3.5.6 (the version which introduced support for external stopword lists) that caused the external stopwords list to be ignored. Later versions, or snapshots from 21/07/2007 onward, work correctly.


UTF-8

In case you are working with text files containing non-ASCII characters, e.g., Arabic, you might encounter display problems under Windows. Java handles Unicode internally and can represent Arabic characters, but by default it uses code page 1252 under Windows, which garbles the display of other characters. In order to fix this, you have to modify the java command line with which you start up Weka:
  java -Dfile.encoding=utf-8 -classpath ...
The -Dfile.encoding=utf-8 tells Java to explicitly use UTF-8 encoding instead of the default CP1252.
If you start Weka via the Start menu and use a recent version (at least 3.5.8 or 3.4.13), you only have to modify the fileEncoding placeholder in RunWeka.ini accordingly.
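In RunWeka.ini, the relevant line would then look something like this:

 fileEncoding=utf-8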
