Discretizing datasets

Once in a while one has numeric data but wants to use a classifier that handles only nominal values. In that case one needs to //discretize// the data, which can be done with the following filters:
 * weka.filters.supervised.attribute.Discretize: uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion
 * weka.filters.unsupervised.attribute.Discretize: uses simple binning
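
To make "simple binning" concrete, here is a minimal, Weka-independent sketch that assumes equal-width bins. The class {{SimpleBinningSketch}} and its methods {{cutPoints}} and {{binIndex}} are hypothetical names made up for this illustration; the actual filter may choose its intervals differently.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not part of Weka. */
final class SimpleBinningSketch {

  /** Returns the (bins - 1) interior cut points of equal-width bins over [min, max]. */
  static double[] cutPoints(double[] values, int bins) {
    double min = Arrays.stream(values).min().getAsDouble();
    double max = Arrays.stream(values).max().getAsDouble();
    double width = (max - min) / bins;
    double[] cuts = new double[bins - 1];
    for (int i = 0; i < cuts.length; i++) {
      cuts[i] = min + (i + 1) * width;
    }
    return cuts;
  }

  /** Maps a value to its bin index, i.e. the number of cut points strictly below it. */
  static int binIndex(double value, double[] cuts) {
    int index = 0;
    while (index < cuts.length && value > cuts[index]) {
      index++;
    }
    return index;
  }

  public static void main(String[] args) {
    // made-up numeric attribute values
    double[] temperature = {64, 65, 68, 69, 70, 71, 72, 75, 80, 85};
    double[] cuts = cutPoints(temperature, 3);
    System.out.println("cut points: " + Arrays.toString(cuts));
    for (double t : temperature) {
      // each numeric value is replaced by a nominal bin name
      System.out.println(t + " -> bin" + binIndex(t, cuts));
    }
  }
}
```

Each numeric value ends up as one of a fixed set of nominal values (here {{bin0}} to {{bin2}}), which is what a nominal-only classifier needs.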

Because discretization depends on the data that is presented to the discretization algorithm, one can easily end up with incompatible train and test files. The following shows how to generate compatible discretized files out of a training and a test file by using the //supervised// version of the filter.

The class takes four files as arguments:
 * 1) input training file
 * 2) input test file
 * 3) output training file
 * 4) output test file

code format="java"
import java.io.*;

import weka.core.*;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

/**
 * Shows how to generate compatible train/test sets using the Discretize
 * filter.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class DiscretizeTest {

  /**
   * loads the given ARFF file and sets the class attribute as the last
   * attribute.
   *
   * @param filename    the file to load
   * @throws Exception  if something goes wrong
   */
  protected static Instances load(String filename) throws Exception {
    Instances       result;
    BufferedReader  reader;

    reader = new BufferedReader(new FileReader(filename));
    result = new Instances(reader);
    result.setClassIndex(result.numAttributes() - 1);
    reader.close();

    return result;
  }

  /**
   * saves the data to the specified file
   *
   * @param data        the data to save to a file
   * @param filename    the file to save the data to
   * @throws Exception  if something goes wrong
   */
  protected static void save(Instances data, String filename) throws Exception {
    BufferedWriter  writer;

    writer = new BufferedWriter(new FileWriter(filename));
    writer.write(data.toString());
    writer.newLine();
    writer.flush();
    writer.close();
  }

  /**
   * Takes four arguments:
   * <ol>
   *   <li>input train file</li>
   *   <li>input test file</li>
   *   <li>output train file</li>
   *   <li>output test file</li>
   * </ol>
   *
   * @param args        the commandline arguments
   * @throws Exception  if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    Instances     inputTrain;
    Instances     inputTest;
    Instances     outputTrain;
    Instances     outputTest;
    Discretize    filter;

    // load data (class attribute is assumed to be last attribute)
    inputTrain = load(args[0]);
    inputTest  = load(args[1]);

    // setup filter (initialized with the training data only)
    filter = new Discretize();
    filter.setInputFormat(inputTrain);

    // apply filter (the same filter instance is used for both sets)
    outputTrain = Filter.useFilter(inputTrain, filter);
    outputTest  = Filter.useFilter(inputTest,  filter);

    // save output
    save(outputTrain, args[2]);
    save(outputTest,  args[3]);
  }
}
code

The same can be achieved from the commandline with this command (**batch filtering**):

code format="bash"
java weka.filters.supervised.attribute.Discretize -b -i <input training file> -o <output training file> -r <input test file> -s <output test file> -c <class index>
code
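
Why discretizing the two files separately causes trouble can be seen in a small sketch that does not use Weka at all. It assumes equal-width binning, and {{FitOnceBinner}} with its methods {{fit}}, {{labels}} and {{apply}} is a hypothetical helper invented for this page, not a Weka class. The label format is only loosely modeled on interval names.

```java
import java.util.Arrays;

/** Hypothetical illustration only; none of these names are part of Weka. */
final class FitOnceBinner {

  private final double[] cuts;

  private FitOnceBinner(double[] cuts) {
    this.cuts = cuts;
  }

  /** Learns equal-width cut points from the given values. */
  static FitOnceBinner fit(double[] values, int bins) {
    double min = Arrays.stream(values).min().getAsDouble();
    double max = Arrays.stream(values).max().getAsDouble();
    double[] cuts = new double[bins - 1];
    for (int i = 0; i < cuts.length; i++) {
      cuts[i] = min + (i + 1) * (max - min) / bins;
    }
    return new FitOnceBinner(cuts);
  }

  /** Returns one nominal label per interval, as a discretized attribute would declare them. */
  String[] labels() {
    String[] result = new String[cuts.length + 1];
    for (int i = 0; i < result.length; i++) {
      boolean last = (i == cuts.length);
      String lower = (i == 0) ? "-inf" : String.valueOf(cuts[i - 1]);
      String upper = last ? "inf" : String.valueOf(cuts[i]);
      result[i] = "(" + lower + ".." + upper + (last ? ")" : "]");
    }
    return result;
  }

  /** Maps a value to the label of the interval that contains it. */
  String apply(double value) {
    int index = 0;
    while (index < cuts.length && value > cuts[index]) {
      index++;
    }
    return labels()[index];
  }

  public static void main(String[] args) {
    // made-up values of one numeric attribute in a train and a test file
    double[] train = {0, 2, 4, 6, 8, 10};
    double[] test  = {4, 5, 6, 7};

    // fitting on each file separately gives different labels (cut point 5.0 vs. 5.5)
    System.out.println("fit on train: " + Arrays.toString(fit(train, 2).labels()));
    System.out.println("fit on test:  " + Arrays.toString(fit(test, 2).labels()));

    // fitting once on the training data and reusing it keeps the labels identical
    FitOnceBinner binner = fit(train, 2);
    for (double v : test) {
      System.out.println(v + " -> " + binner.apply(v));
    }
  }
}
```

Fitting once and applying the result to both files is roughly what happens when the filter is initialized with the training data only and then used on the training and the test set.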

= See also =
 * Manual discretization (Using the MathExpression filter)
 * Batch filtering

= Downloads =
 * [[file:DiscretizeTest.java]] (book, stable-3.6, developer)

= Links =
 * Javadoc
 * Discretize (supervised)
 * Discretize (unsupervised)