Text categorization with Weka

In the following, one can find some information on how to use Weka for [|text categorization].

=Import=
Weka needs the data to be present in ARFF or XRFF format in order to perform any classification tasks.



==Directories==
One can transform a directory of text files into ARFF format with one of the following tools (depending on the version of Weka you are using):
 * **TextDirectoryToArff** tool (3.4.x and >= 3.5.3) - this Java class transforms a directory of files into an ARFF file
 * **[|TextDirectoryLoader]** converter (> 3.5.3) - this converter is based on the //TextDirectoryToArff// tool and located in the //weka.core.converters// package

Example directory layout for **TextDirectoryLoader**:
code
...
+- text_example
|  +- class1
|  |  +- file1.txt
|  |  +- file2.txt
|  |  ...
|  +- class2
|  |  +- another_file1.txt
|  |  +- another_file2.txt
|  |  ...
code
The above directory structure can be turned into an ARFF file like this:
code format="bash"
java weka.core.converters.TextDirectoryLoader -dir text_example > text_example.arff
code
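To make the conversion concrete, here is a plain-Java sketch of roughly what such a converter does. This is a simplified, hypothetical stand-in, not the real **TextDirectoryLoader**; the //text// and //@@class@@// attribute names are modeled on its output:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Simplified, hypothetical stand-in for TextDirectoryLoader: each
// sub-directory name becomes a class label, each file one instance.
public class DirToArff {
    public static String convert(Path root) throws IOException {
        List<String> classes;
        try (Stream<Path> s = Files.list(root)) {
            classes = s.filter(Files::isDirectory)
                       .map(p -> p.getFileName().toString())
                       .sorted()
                       .collect(Collectors.toList());
        }
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(root.getFileName()).append('\n');
        sb.append("@attribute text string\n");
        sb.append("@attribute @@class@@ {").append(String.join(",", classes)).append("}\n");
        sb.append("@data\n");
        for (String cls : classes) {
            try (Stream<Path> files = Files.list(root.resolve(cls))) {
                for (Path f : files.sorted().collect(Collectors.toList())) {
                    String text = new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
                    // quote the string value and escape embedded single quotes
                    sb.append('\'')
                      .append(text.replace("'", "\\'").replace("\n", " "))
                      .append("',").append(cls).append('\n');
                }
            }
        }
        return sb.toString();
    }
}
```

Each sub-directory name becomes a class label and each contained file a single instance.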

==CSV files==
CSV files can be imported into Weka easily via the Explorer or on the command line with the //CSVLoader// class:
code format="bash"
java weka.core.converters.CSVLoader file.csv > file.arff
code

By default, non-numerical attributes get imported as //NOMINAL// attributes, which is not necessarily desired for textual data, especially if one wants to use the [|StringToWordVector] filter. In order to change such an attribute to //STRING//, one can run the appropriate conversion filter on the data, specifying the attribute index or range of indices that should be converted (NB: this filter does **not** exclude the class attribute from conversion!). In order to retain the attribute types, one needs to save the file in ARFF or XRFF format (or in the compressed version of these formats).
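For illustration, assuming a hypothetical text column named //text// with two distinct cell values, the attribute declaration in the ARFF header changes as follows:

```
% after CSV import: NOMINAL, one value per distinct cell
@attribute text {some_text,some_other_text}

% needed for StringToWordVector: STRING
@attribute text string
```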

==Third-party tools==

 * [|TagHelper Tools], which allows one to transform texts into vectors of stemmed or unstemmed unigrams, bigrams, part-of-speech bigrams, and some user-defined features, and then saves this representation to ARFF. Currently processes English, German, and Chinese. Spanish and Portuguese are in progress.

=Working with textual data=

==Conversion==
Most classifiers in Weka cannot handle //String// attributes. For these learning schemes one has to process the data with appropriate filters, e.g., the [|StringToWordVector] filter, which can perform [|TF/IDF transformation].
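As a rough illustration of what such a transformation computes, here is a sketch of one common TF-IDF variant (tf * log(N/df)) in plain Java; [|StringToWordVector] has its own options controlling the exact term-frequency and IDF transforms, so this is not a description of its precise behavior:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Minimal TF-IDF sketch (one common variant, not necessarily
// the exact formula StringToWordVector applies).
public class TfIdf {
    public static Map<String, Double> weights(String doc, List<String> corpus) {
        List<String> terms = Arrays.asList(doc.toLowerCase().split("\\s+"));
        Map<String, Double> out = new HashMap<>();
        for (String t : new HashSet<>(terms)) {
            // term frequency within this document
            long tf = terms.stream().filter(t::equals).count();
            // document frequency across the corpus
            long df = corpus.stream()
                    .filter(d -> Arrays.asList(d.toLowerCase().split("\\s+")).contains(t))
                    .count();
            out.put(t, tf * Math.log((double) corpus.size() / df));
        }
        return out;
    }
}
```

A term that occurs in every document of the corpus receives weight 0, i.e., it carries no discriminative information.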

The filter places the class attribute of the generated output data at the beginning. In case you'd like to have it as the last attribute again, you can use the [|Reorder] filter with the following setup:
code
weka.filters.unsupervised.attribute.Reorder -R 2-last,first
code

With the [|MultiFilter] you can also apply both filters in one go instead of one after the other, which makes things easier in the Explorer, for instance.
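As a sketch, a combined command-line invocation might look like the following; the file names and filter options here are illustrative only and need to be adjusted to your data (e.g., the class index):

```
java weka.filters.MultiFilter \
    -F "weka.filters.unsupervised.attribute.StringToWordVector" \
    -F "weka.filters.unsupervised.attribute.Reorder -R 2-last,first" \
    -i input.arff -o output.arff
```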



==Stopwords==
The [|StringToWordVector] filter can also work with a different stopword list than the built-in one (which is based on the Rainbow system). One can use the corresponding option to load an external stopwords file. The format for such a stopwords file is one stopword per line; lines starting with '#' are interpreted as comments and ignored.
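A minimal sketch of a parser for the stated file format follows; the lower-casing is an assumption made here for case-insensitive matching, not necessarily Weka's behavior:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Parses the stopword-file format described above:
// one word per line, '#' lines are comments, blank lines ignored.
public class StopwordFile {
    public static Set<String> parse(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) continue;
            // lower-cased here for case-insensitive matching (an assumption)
            words.add(s.toLowerCase());
        }
        return words;
    }
}
```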


 * **Note:** There was a bug in Weka 3.5.6 (the version which introduced support for external stopwords lists) that caused the external stopwords list to be ignored. Later versions, or snapshots from 21/07/2007 on, work correctly.

=UTF-8=
In case you are working with text files containing non-ASCII characters, e.g., Arabic, you might encounter some display problems under Windows. Java was designed to display [|UTF-8], which includes Arabic characters. By default, however, Java uses [|code page 1252] under Windows, which garbles the display of other characters. In order to fix this, you will have to modify the java command line with which you start up Weka (taken from [|this] post):
code format="bash"
java -Dfile.encoding=utf-8 -classpath ...
code
This tells Java to explicitly use [|UTF-8] encoding instead of the default [|CP1252]. If you are starting Weka via the start menu and you use a recent version (at least 3.5.8 or 3.4.13), then you will just have to modify the corresponding placeholder accordingly.

=Examples=
 * [[file:text_example.zip]] - contains a directory structure and example files that can be imported with the [|TextDirectoryLoader] converter.
 * [[file:TextCategorizationTest.java]] - uses the converter to turn a directory structure into a dataset, applies the [|StringToWordVector] filter, and builds a classifier with the filtered data.

=See also=
 * Batch filtering - for generating a test set with the same dictionary as the training set
 * [|All text categorization articles]

=Links=
 * Javadoc
 * [|StringToWordVector]
 * [|TextDirectoryLoader]