Writing your own Filter (default)

**Note:** This is also covered in chapter //Extending WEKA// of the WEKA manual in versions later than 3.6.1/3.7.0 or snapshots of the stable-3.6/developer version later than 10/01/2010.

= General =
For general information on writing filters, see the Writing your own Filter article as well.

= Implementation =
The following methods are of importance for the implementation of a filter and are explained in detail further down:
 * getCapabilities() - only for Weka >=3.5.4
 * setInputFormat(Instances)
 * getInputFormat()
 * setOutputFormat(Instances)
 * getOutputFormat()
 * input(Instance)
 * bufferInput(Instance)
 * push(Instance)
 * output()
 * batchFinished()
 * flushInput()
 * getRevision() - only for Weka >3.5.7, see section Revisions

However, only the following ones normally need to be modified:
 * getCapabilities() - only for Weka >=3.5.4
 * setInputFormat(Instances)
 * input(Instance)
 * batchFinished()
 * getRevision() - only for Weka >3.5.7, see section Revisions

getCapabilities()
Starting with Weka 3.5.4, filters implement the weka.core.CapabilitiesHandler interface, like the classifiers. This method returns what kind of data the filter is able to process. It needs to be adapted for each individual filter.

setInputFormat(Instances)
With this method, the user tells the filter what format, i.e., which attributes, the input data has. For Weka 3.5.4 and higher, this method also tests whether the filter can actually process this data. All older Weka versions or book branch versions need to check the data manually and throw fitting exceptions, e.g., when the filter is not able to handle String attributes.

If the output format of the filter, i.e., the new Instances header, can be determined from this information alone, then the method should set the output format via setOutputFormat(Instances) and return true; otherwise it has to return false.

getInputFormat()
This method returns an Instances object containing all currently buffered Instance objects from the input queue.

setOutputFormat(Instances)
This method defines the new Instances header for the output data. For filters that work on a row basis, there shouldn't be any changes between the input and output format. But filters that work on attributes, e.g., by removing, adding or modifying them, will affect this format. This method must be called with the appropriate Instances object as parameter, since all Instance objects being processed will rely on the output format.

getOutputFormat()
This method returns the currently set Instances object that defines the output format. In case setOutputFormat(Instances) hasn't been called yet, no output format is available from this method.

input(Instance)
The method returns true if the given Instance can be processed straight away and can be collected immediately via the output() method (after adding it to the output queue via push(Instance), of course). This is also the case if the first batch of data has been processed and the instance belongs to the second batch. Via isFirstBatchDone() one can query whether this instance is still part of the first batch or of the second.

If the Instance cannot be processed immediately, e.g., because the filter needs to collect all the data first before doing some calculations, then it needs to be buffered with bufferInput(Instance) until batchFinished() is called.

bufferInput(Instance)
In case an Instance cannot be processed immediately, one can use this method to buffer it in the input queue. All buffered Instance objects are available via the getInputFormat() method.

push(Instance)
This method adds the given Instance to the output queue.

output()
Returns the next Instance object from the output queue and removes it from there. In case there is no Instance available, this method returns null.

batchFinished()
The method signifies the end of a dataset being pushed through the filter. In case of a filter that couldn't process the data of the first batch immediately, this is the place to determine what the output format will be (and set it via setOutputFormat(Instances)) and to process the actual data. The currently available data can be retrieved with the getInputFormat() method. After processing the data, one needs to call flushInput() to remove all the pending input data.

flushInput()
This method removes all buffered Instance objects from the input queue. It must be called after all the Instance objects have been processed in the batchFinished() method.
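The interplay of the methods described above can be sketched with a small stand-alone model. This is only an illustration under stated assumptions: ToyBatchFilter is a hypothetical class, not weka.filters.Filter, rows are plain double arrays instead of Instance objects, and the "learning" step (subtracting the first batch's mean of column 0) is made up for the example. What it mirrors is the protocol: rows of the first batch are buffered, batchFinished() converts and flushes them, and rows of later batches are converted immediately.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy stand-in for the filter queues; NOT the weka.filters.Filter class. */
class ToyBatchFilter {
    private final List<double[]> inputBuffer = new ArrayList<>();   // plays the role of the input queue
    private final Queue<double[]> outputQueue = new ArrayDeque<>(); // plays the role of the output queue
    private boolean firstBatchDone = false;
    private double shift; // learned from the first batch (mean of column 0)

    /** Returns true only if the row was converted and can be collected via output(). */
    public boolean input(double[] row) {
        if (firstBatchDone) {
            push(convert(row));
            return true;
        }
        bufferInput(row);
        return false;
    }

    protected void bufferInput(double[] row) { inputBuffer.add(row.clone()); }

    protected void push(double[] row) { outputQueue.add(row); }

    /** Next converted row, or null if nothing is pending. */
    public double[] output() { return outputQueue.poll(); }

    public int numPendingOutput() { return outputQueue.size(); }

    /** End of a batch: learn from the buffered rows (first batch only), convert them, then flush. */
    public boolean batchFinished() {
        if (!firstBatchDone) {
            double sum = 0;
            for (double[] row : inputBuffer) sum += row[0];
            shift = inputBuffer.isEmpty() ? 0 : sum / inputBuffer.size();
        }
        for (double[] row : inputBuffer) push(convert(row));
        flushInput();
        firstBatchDone = true;
        return numPendingOutput() != 0;
    }

    protected void flushInput() { inputBuffer.clear(); }

    private double[] convert(double[] row) {
        double[] out = row.clone();
        out[0] -= shift;
        return out;
    }

    public static void main(String[] args) {
        ToyBatchFilter f = new ToyBatchFilter();
        System.out.println(f.input(new double[] {1.0}));  // false: buffered
        System.out.println(f.input(new double[] {3.0}));  // false: buffered
        System.out.println(f.batchFinished());            // true: two rows pending
        System.out.println(f.output()[0]);                // -1.0
        System.out.println(f.output()[0]);                // 1.0
        System.out.println(f.output() == null);           // true: queue is empty
        System.out.println(f.input(new double[] {5.0}));  // true: second batch, converted at once
        System.out.println(f.output()[0]);                // 3.0
    }
}
```

In a real filter the same roles are played by the methods of the Filter superclass, so normally only input(Instance) and batchFinished() carry filter-specific logic.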

= Option handling =
If the filter should be able to handle commandline options, then the weka.core.OptionHandler interface needs to be implemented. In addition to that, the following code should be added at the end of the setOptions(String[]) method:
code format="java"
if (getInputFormat() != null)
  setInputFormat(getInputFormat());
code
This will inform the filter about changes in the options and therefore reset it.
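Why the reset matters can be shown with a small stand-alone model. This is a sketch only: ToyOptionFilter, its attributeName option and the use of String arrays as headers are all hypothetical and are not part of WEKA. The point is that an output format computed in setInputFormat goes stale when an option changes afterwards, unless the input format is set again.

```java
import java.util.Arrays;

/** Toy model of resetting a filter when an option changes; not a WEKA class. */
class ToyOptionFilter {
    private String attributeName = "bla";  // hypothetical option value
    private String[] inputFormat;          // stands in for the input Instances header
    private String[] outputFormat;         // stands in for the output Instances header

    /** Stores the input header and derives the output header from it and the option. */
    public boolean setInputFormat(String[] header) {
        inputFormat = header.clone();
        outputFormat = Arrays.copyOf(header, header.length + 1);
        outputFormat[header.length] = attributeName;  // depends on the current option
        return true;
    }

    public String[] getInputFormat() { return inputFormat; }

    public String[] getOutputFormat() { return outputFormat; }

    /** Mirrors the tail of an option setter: apply the option, then reset the filter. */
    public void setAttributeName(String name) {
        attributeName = name;
        if (getInputFormat() != null)
            setInputFormat(getInputFormat());
    }

    public static void main(String[] args) {
        ToyOptionFilter f = new ToyOptionFilter();
        f.setInputFormat(new String[] {"a", "b"});
        f.setAttributeName("index");
        System.out.println(Arrays.toString(f.getOutputFormat()));  // [a, b, index]
    }
}
```

The null check keeps the setter safe to call before any input format has been supplied, which is the usual order when options are parsed first.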

= Examples =
The following examples illustrate the filter framework. The getCapabilities() method, as well as the import of weka.core.Capabilities.*, can be removed for filters for the book version of Weka or if the developer version is older than 3.5.4.


**Note:** unseeded random number generators should never be used, since they produce different results in each run and repeatable results are essential in machine learning.
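As a minimal illustration of the note on seeding, the following stand-alone sketch draws two sequences from java.util.Random generators created with the same seed and checks that they match. SeededRandomDemo and its draw helper are hypothetical names used only for this example.

```java
import java.util.Arrays;
import java.util.Random;

class SeededRandomDemo {
    /** Draws n ints from a generator created with the given seed. */
    static int[] draw(long seed, int n) {
        Random random = new Random(seed);
        int[] values = new int[n];
        for (int i = 0; i < n; i++)
            values[i] = random.nextInt();
        return values;
    }

    public static void main(String[] args) {
        int[] first = draw(1, 3);
        int[] second = draw(1, 3);
        // Same seed, same sequence: a filter seeded this way is repeatable.
        System.out.println(Arrays.equals(first, second));  // true
    }
}
```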

BatchFilter
This simple batch filter adds a new attribute called //bla// at the end of the dataset. The rows of this attribute contain only the row's index in the data. Since the batch filter need not see all the data before creating the output format, the setInputFormat(Instances) method sets the output format and returns true (indicating that the output format can be queried immediately). The batchFinished() method performs the processing of all the data.
code format="java"
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;

public class BatchFilter extends Filter {

  public String globalInfo() {
    return "A batch filter that adds an additional attribute 'bla' at the end "
      + "containing the index of the processed instance. The output format "
      + "can be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("bla"), outFormat.numAttributes());
    setOutputFormat(outFormat);
    return true;  // output format is immediately available
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");

    Instances inst = getInputFormat();
    Instances outFormat = getOutputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      // copy the old values and append the row index as the last value
      double[] newValues = new double[outFormat.numAttributes()];
      double[] oldValues = inst.instance(i).toDoubleArray();
      System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
      newValues[newValues.length - 1] = i;
      push(new Instance(1.0, newValues));
    }

    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    return (numPendingOutput() != 0);
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter(), args);
  }
}
code

BatchFilter2
In contrast to the first batch filter, this one cannot determine the output format immediately (the number of instances in the first batch is now part of the attribute name). This is done in the batchFinished() method.
code format="java"
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;

public class BatchFilter2 extends Filter {

  public String globalInfo() {
    return "A batch filter that adds an additional attribute 'bla' at the end "
      + "containing the index of the processed instance. The output format "
      + "cannot be collected immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");

    // output format still needs to be set (depends on first batch of data)
    if (!isFirstBatchDone()) {
      Instances outFormat = new Instances(getInputFormat(), 0);
      outFormat.insertAttributeAt(new Attribute(
        "bla-" + getInputFormat().numInstances()), outFormat.numAttributes());
      setOutputFormat(outFormat);
    }

    Instances inst = getInputFormat();
    Instances outFormat = getOutputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      // copy the old values and append the row index as the last value
      double[] newValues = new double[outFormat.numAttributes()];
      double[] oldValues = inst.instance(i).toDoubleArray();
      System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
      newValues[newValues.length - 1] = i;
      push(new Instance(1.0, newValues));
    }

    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    return (numPendingOutput() != 0);
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter2(), args);
  }
}
code

BatchFilter3
As soon as this batch filter's first batch is done, it can process Instance objects immediately in the input(Instance) method. It adds a new attribute which contains just a random number, but the random number generator being used is seeded with the number of instances from the first batch.
code format="java"
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;
import java.util.Random;

public class BatchFilter3 extends Filter {

  protected int m_Seed;
  protected Random m_Random;

  public String globalInfo() {
    return "A batch filter that adds an attribute 'bla' at the end "
      + "containing a random number. The output format cannot be collected "
      + "immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  public boolean input(Instance instance) throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    if (isNewBatch()) {
      resetQueue();
      m_NewBatch = false;
    }
    // after the first batch, instances can be converted straight away
    if (isFirstBatchDone())
      convertInstance(instance);
    else
      bufferInput(instance);
    return isFirstBatchDone();
  }

  public boolean batchFinished() throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");

    // output format still needs to be set (random number generator is seeded
    // with number of instances of first batch)
    if (!isFirstBatchDone()) {
      m_Seed = getInputFormat().numInstances();
      Instances outFormat = new Instances(getInputFormat(), 0);
      outFormat.insertAttributeAt(new Attribute(
        "bla-" + getInputFormat().numInstances()), outFormat.numAttributes());
      setOutputFormat(outFormat);
    }

    Instances inst = getInputFormat();
    for (int i = 0; i < inst.numInstances(); i++) {
      convertInstance(inst.instance(i));
    }

    flushInput();
    m_NewBatch = true;
    m_FirstBatchDone = true;
    m_Random = null;
    return (numPendingOutput() != 0);
  }

  protected void convertInstance(Instance instance) {
    if (m_Random == null)
      m_Random = new Random(m_Seed);
    double[] newValues = new double[instance.numAttributes() + 1];
    double[] oldValues = instance.toDoubleArray();
    newValues[newValues.length - 1] = m_Random.nextInt();
    System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
    push(new Instance(1.0, newValues));
  }

  public static void main(String[] args) {
    runFilter(new BatchFilter3(), args);
  }
}
code

StreamFilter
This stream filter adds a random number at the end of each instance of the input data. Since this doesn't rely on having access to the full data of the first batch, the output format is accessible immediately after using setInputFormat(Instances). All the Instance objects are processed immediately in input(Instance) via the convertInstance(Instance) method, which pushes them straight to the output queue.
code format="java"
import weka.core.*;
import weka.core.Capabilities.*;
import weka.filters.*;
import java.util.Random;

public class StreamFilter extends Filter {

  protected Random m_Random;

  public String globalInfo() {
    return "A stream filter that adds an attribute 'bla' at the end "
      + "containing a random number. The output format can be collected "
      + "immediately.";
  }

  public Capabilities getCapabilities() {
    Capabilities result = super.getCapabilities();
    result.enableAllAttributes();
    result.enableAllClasses();
    result.enable(Capability.NO_CLASS);  // filter doesn't need class to be set
    return result;
  }

  public boolean setInputFormat(Instances instanceInfo) throws Exception {
    super.setInputFormat(instanceInfo);
    Instances outFormat = new Instances(instanceInfo, 0);
    outFormat.insertAttributeAt(new Attribute("bla"), outFormat.numAttributes());
    setOutputFormat(outFormat);
    m_Random = new Random(1);  // fixed seed for repeatable results
    return true;  // output format is immediately available
  }

  public boolean input(Instance instance) throws Exception {
    if (getInputFormat() == null)
      throw new NullPointerException("No input instance format defined");
    if (isNewBatch()) {
      resetQueue();
      m_NewBatch = false;
    }
    convertInstance(instance);
    return true;  // can be immediately collected via output()
  }

  protected void convertInstance(Instance instance) {
    double[] newValues = new double[instance.numAttributes() + 1];
    double[] oldValues = instance.toDoubleArray();
    newValues[newValues.length - 1] = m_Random.nextInt();
    System.arraycopy(oldValues, 0, newValues, 0, oldValues.length);
    push(new Instance(1.0, newValues));
  }

  public static void main(String[] args) {
    runFilter(new StreamFilter(), args);
  }
}
code

= See also =
 * Writing your own Filter
 * Writing your own Filter (up to 3.5.3)
 * Writing your own Filter (post 3.5.3)