Writing your own Classifier (post 3.5.2)

In case you have a flash idea for a new classifier and want to write one for Weka, this HOWTO will help you develop it.

The Mindmap (produced with [|Freemind]) helps you decide which base classifier to start from and what methods need to be implemented, and gives general guidelines.

The base classifiers are all located in the following package:

code format="text"
weka.classifiers
code


 * **Note:** This is also covered in the chapter //Extending WEKA// of the WEKA manual in versions later than 3.6.1/3.7.0, or snapshots of the stable-3.6/developer version later than 10/01/2010.

= Packages =
A few comments about the different classifier sub-packages:
 * **bayes** - contains Bayesian classifiers, e.g., NaiveBayes
 * **evaluation** - classes related to evaluation, e.g., cost matrix
 * **functions** - e.g., support vector machines, regression algorithms, neural nets
 * **lazy** - no //offline// learning; that is done during runtime, e.g., k-NN
 * **meta** - meta-classifiers that use a //base// classifier as input, e.g., boosting or bagging
 * **mi** - classifiers that handle multi-instance data
 * **misc** - various classifiers that don't fit in any other category
 * **rules** - rule-based classifiers, e.g., ZeroR
 * **trees** - tree classifiers, like decision trees

= Coding =
In the following you'll find notes about certain implementation parts listed in the Mindmap which need a bit more explanation.

Random number generators
In order to get repeatable experiments, one is not allowed to use //unseeded// random number generators like java.util.Random. Instead, one has to instantiate a java.util.Random object in the buildClassifier(Instances) method with a specific seed value. The seed value can, of course, be user supplied, which all the abstract classifiers already implement.
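A minimal, self-contained sketch (plain Java; the class name is hypothetical and not part of the Weka API) of why a seeded generator gives repeatable results:

```java
import java.util.Arrays;
import java.util.Random;

// Illustration: building a random number generator from a user-supplied
// seed, the way a classifier would in buildClassifier(), so that repeated
// runs with the same seed produce identical results.
public class SeededRandomDemo {

    // Draws n random doubles from a generator seeded with 'seed'.
    public static double[] draw(long seed, int n) {
        Random rand = new Random(seed);  // seeded, hence reproducible
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = rand.nextDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        // Two runs with the same seed yield the same sequence.
        System.out.println(Arrays.equals(draw(1, 5), draw(1, 5)));  // true
    }
}
```

An unseeded `new Random()` would make the two calls differ, defeating repeatable experiments.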

Capabilities
Up to version 3.5.2, all classifiers could by default handle basically every kind of data, unless they threw an Exception (in the buildClassifier(Instances) method). Since this behavior makes it cumbersome to introduce, for instance, new attribute types (//all// classifiers that can't handle the new attribute type have to be modified!), the general Capabilities were introduced.

Base-classifier
Normal classifiers only state what kind of attributes and what kind of classes they can handle.

The getCapabilities() method of a classifier, for instance, can look like this:

code format="java"
public Capabilities getCapabilities() {
  // returns the object from weka.classifiers.Classifier
  Capabilities result = super.getCapabilities();

  // attributes
  result.enable(Capability.NOMINAL_ATTRIBUTES);
  result.enable(Capability.NUMERIC_ATTRIBUTES);
  result.enable(Capability.DATE_ATTRIBUTES);
  result.enable(Capability.MISSING_VALUES);

  // class
  result.enable(Capability.NOMINAL_CLASS);
  result.enable(Capability.MISSING_CLASS_VALUES);

  return result;
}
code

 * **Special cases:**
 * **incremental classifiers** - By default, at least 1 instance has to be in the dataset, which does not apply to incremental classifiers. They have to lower the limit to 0:
code format="java"
result.setMinimumNumberInstances(0);
code
 * **multi-instance classifiers** - The structure for multi-instance data is always fixed to //bagID,bag-data,class//. To restrict the data to multi-instance data, add the following:
code format="java"
result.enable(Capability.ONLY_MULTIINSTANCE);
code
   Multi-instance classifiers also implement the following interface, which returns the Capabilities for the bag-data, which is just a //relational// attribute (the reason why relational attributes have to be enabled):
code format="java"
weka.core.MultiInstanceCapabilitiesHandler
code
 * **clusterers** - Since clusterers don't need a class attribute like classifiers do, the following Capability has to be specified to enable datasets without a class attribute (this is already done in the abstract superclass):
code format="java"
result.enable(Capability.NO_CLASS);
code

Meta-classifier
Meta-classifiers, by default, just return the capabilities of their base classifiers - in the case of descendants of weka.classifiers.MultipleClassifiersCombiner, an **AND** over all the Capabilities of the base classifiers is returned.

Due to this behavior, the Capabilities depend (normally) only on the currently configured base classifier(s). To //soften// filtering for certain behavior, meta-classifiers also define so-called //Dependencies// on a per-Capability basis. These dependencies tell the filter that even though a certain capability is not supported right now, it is possible that it will be supported with a different base classifier. By default, all Capabilities are initialized as Dependencies.

A meta-classifier that is restricted to nominal classes, for example, disables the Dependencies for the class:

code format="java"
result.disableAllClasses();               // disable all class types
result.disableAllClassDependencies();     // no dependencies!
result.enable(Capability.NOMINAL_CLASS);  // only nominal classes allowed
code

Relevant classes

 * weka.core.Capabilities
 * weka.core.CapabilitiesHandler
 * weka.core.MultiInstanceCapabilitiesHandler (for multi-instance classifiers)

Paper reference(s)
In order to make it easy to generate a bibliography of all the algorithms in Weka, the paper references, so far located only in the Javadoc, were extracted and placed in the code.

Classes that are based on some technical paper should implement the TechnicalInformationHandler interface and return a customized TechnicalInformation instance. The format used is based on [|BibTeX], and the class can either return a plain text string via the toString() method or a real [|BibTeX] entry via the toBibTex() method. These two methods are then used to automatically update the Javadoc (see Javadoc further down) of a class.
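To illustrate the two output formats, here is a hypothetical, self-contained sketch (NOT the actual weka.core.TechnicalInformation API) that renders the same reference once as plain text and once as a BibTeX entry:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the two representations of a paper
// reference: a short plain-text string and a full BibTeX entry.
public class ReferenceDemo {

    // Short, human-readable form.
    public static String toPlainText(Map<String, String> fields) {
        return fields.get("author") + " (" + fields.get("year") + "). "
                + fields.get("title") + ".";
    }

    // Full BibTeX form, one "field = {value}" line per entry.
    public static String toBibTex(String key, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("@article{" + key + ",\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("  ").append(e.getKey()).append(" = {")
              .append(e.getValue()).append("},\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("author", "J. Doe");
        fields.put("year", "1999");
        fields.put("title", "Some Classifier");
        System.out.println(toPlainText(fields));
        System.out.println(toBibTex("Doe1999", fields));
    }
}
```

In Weka itself, both forms are produced from the single TechnicalInformation object the handler returns, so the Javadoc and the bibliography never drift apart.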

Relevant classes:

 * weka.core.TechnicalInformation
 * weka.core.TechnicalInformationHandler

Javadoc
Open-source software is only as good as its documentation; hence, correct and up-to-date documentation is vital. So far, most of the Javadoc was maintained manually, which made it hard to keep consistent: as soon as new options were added, the Javadoc had to be changed accordingly as well, and normally in several places. Over time the documentation got out of sync, which made it frustrating to determine which options were really relevant and active.

Since a lot of the documentation is already available in the code itself, the next logical step was to automate the Javadoc generation as much as possible. In the following you will see how to structure your Javadoc to reduce maintenance. For this purpose, special comment tags are used, whose content is replaced automatically by the classes listed under Relevant classes:
 * class description
 * setOptions method

The indentation of the generated Javadoc depends on the indentation of the starting comment tag.

This general layout order should be used for all classes:
 * **class description** Javadoc
   * globalinfo
   * bibtex - //if available//
   * commandline options
 * **setOptions** Javadoc
   * commandline options

General
The general description for all classes displayed in the GenericObjectEditor was already in place, with the following method:

code format="java"
globalInfo()
code

The return value can be placed in the Javadoc, surrounded by the following comment tags:

code format="html"
<!-- globalinfo-start -->
will be automatically replaced
<!-- globalinfo-end -->
code

Paper reference(s)
If available, the paper reference should also be listed in the Javadoc. Since the globalInfo() method should already return a short version of the reference, it is sufficient to list the full [|BibTeX] entry:

code format="html"
<!-- technical-bibtex-start -->
will be automatically replaced
<!-- technical-bibtex-end -->
code

In case it is necessary to list the short, plain text version, too, one can use the following tags:

code format="html"
<!-- technical-plaintext-start -->
will be automatically replaced
<!-- technical-plaintext-end -->
code

Options
To place the commandline options, use the following comment tags:

code format="html"
<!-- options-start -->
will be automatically replaced
<!-- options-end -->
code
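A sketch of how these comment tags (here assuming the globalinfo/technical-bibtex/options tag names used in the Weka source) fit together in a class description Javadoc:

```
/**
 <!-- globalinfo-start -->
 will be automatically replaced
 <!-- globalinfo-end -->
 *
 <!-- technical-bibtex-start -->
 will be automatically replaced
 <!-- technical-bibtex-end -->
 *
 <!-- options-start -->
 will be automatically replaced
 <!-- options-end -->
 */
public class MyClassifier extends Classifier {
  ...
}
```

The Javadoc-producing classes only touch the text between matching start/end tags, so the surrounding hand-written documentation is preserved.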



Relevant classes
 * weka.core.AllJavadoc - executes all Javadoc-producing classes
 * weka.core.GlobalInfoJavadoc - updates the globalInfo tags
 * weka.core.OptionHandlerJavadoc - updates the option tags
 * weka.core.TechnicalInformationHandlerJavadoc - updates the technical tags (plain text and [|BibTeX])

= Integration =
After finishing the coding stage, it's time to integrate your classifier into the Weka framework, i.e., to make it available in the Explorer, Experimenter, etc. Starting with version **3.4.4**, Weka supports automatic discovery of derived classes in your classpath, managed by the //GenericPropertiesCreator//.

The GenericObjectEditor article shows you how to tell Weka where to find your classifier and therefore display it in the //GenericObjectEditor//.

= Revisions =
As of 14/04/2007 (> 3.5.7), classifiers also implement the RevisionHandler interface. This provides the functionality of obtaining the Subversion revision from within Java. Classifiers that are not part of the official Weka distribution have to implement the getRevision() method as follows, which returns a dummy revision of //1.0//:

code format="java"
/**
 * Returns the revision string.
 *
 * @return		the revision
 */
public String getRevision() {
  return RevisionUtils.extract("$Revision: 1.0 $");
}
code
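The extraction itself simply strips the Subversion keyword wrapper around the number. A rough, self-contained approximation (a hypothetical helper, not the actual weka.core.RevisionUtils source):

```java
// Rough approximation of how a revision number can be pulled out of an
// SVN "$Revision: ... $" keyword string. Hypothetical demo class.
public class RevisionDemo {

    // Strips the "$Revision:" prefix and trailing "$", returning the bare number.
    public static String extract(String keyword) {
        return keyword.replace("$Revision:", "").replace("$", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("$Revision: 1.0 $"));  // prints "1.0"
    }
}
```

For classifiers kept in the official repository, the keyword is expanded automatically on commit, so the reported revision tracks the actual source version.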

= Testing =
Weka already provides a test framework to ensure the basic functionality of a classifier. It is essential for the classifier to pass these tests.

General
Use the CheckClassifier class to test your classifier from the commandline:

code format="bash"
weka.classifiers.CheckClassifier -W classname [-- additional parameters]
code

Only the following tests may have "no" as a result; the others must return "no (OK error message)" or "yes":
 * options
 * updateable classifier
 * weighted instances classifier
 * multi-instance classifier

Option handling
Additionally, check the **option handling** of your classifier with the following tool from the commandline:

code format="bash"
weka.core.CheckOptionHandler -W classname [-- additional parameters]
code

All tests need to return //yes//.

GenericObjectEditor (> 3.5.5)
The CheckGOE class checks whether all the properties available in the GUI have a tooltip accompanying them and whether the globalInfo() method is declared:

code format="bash"
weka.core.CheckGOE -W classname [-- additional parameters]
code

All tests, once again, need to return //yes//.

Source code (> 3.5.6)
Classifiers that implement the Sourcable interface can output Java code of their model. In order to check the generated code, one should not only compile it, but also test it with the following test class:

code format="bash"
weka.classifiers.CheckSource
code

This class takes the original Weka classifier, the generated code, and the dataset used for generating the source code as parameters. It builds the Weka classifier on the dataset and compares the predictions of the Weka classifier with the ones of the generated source code, checking whether they are the same.

Here's an example call for weka.classifiers.trees.Id3 and the generated class weka.classifiers.WekaWrapper (it wraps the actual generated code in a pseudo-classifier):

code format="bash"
java weka.classifiers.CheckSource \
    -W "weka.classifiers.trees.Id3" \
    -S weka.classifiers.WekaWrapper \
    -t data.arff \
    -c last
code

It needs to return //Tests OK!//.

Unit tests
In order to make sure that your classifier conforms to the Weka criteria, you should add it to the [|junit] unit test framework, i.e., by creating a Test class (starting with Weka versions 3.4.6 and 3.5.1, the test framework uses the CheckClassifier class, among others, to run a battery of tests).

Instructions on how to check out the unit test framework can be found here.

= See also =
 * GenericObjectEditor/GenericPropertiesCreator
 * Writing your own Classifier (up to 3.5.2)
 * HOWTO extract paper references

= Links =
 * [[file:Build_classifier_353.pdf]] - MindMap for implementing a new classifier
 * [|Weka API]
 * [|Freemind]
 * [|junit]