XML


WEKA now supports XML (e**X**tensible **M**arkup **L**anguage) in several places:

= Command Line =
WEKA now allows Classifiers and Experiments to be started with the -xml option followed by a filename, so that the command line options are retrieved from the XML file instead of the command line.

For simple classifiers such as J48 this looks like overkill, but as soon as one uses Meta-Classifiers or Meta-Meta-Classifiers the handling gets tricky and one spends a lot of time looking for missing quotes. With the hierarchical structure of XML files it is simple to plug in other classifiers by just exchanging tags.

The DTD for the XML options is quite simple:
code format="xml"
<!DOCTYPE options
[
   <!ELEMENT options (option)*>
   <!ATTLIST options type CDATA "classifier">
   <!ATTLIST options value CDATA "">
   <!ELEMENT option (#PCDATA | options)*>
   <!ATTLIST option name CDATA #REQUIRED>
   <!ATTLIST option type (flag | single | hyphens | quotes) "single">
]
>
code
The type attribute of the option tag needs some explanation. There are currently four different types of options in WEKA:
 * **flag**
> The simplest option, which takes no arguments, e.g. the flag for inverting a selection.
code format="xml"
code
 * **single**
> The option takes exactly one parameter, directly following the option, e.g. for specifying the training file. Here the parameter value is just put between the opening and closing tag. Since single is the default value for the type attribute, we don't need to specify it explicitly.
code format="xml"
somefile.arff
code
 * **hyphens**
> Meta-Classifiers take another classifier as an option, and the options for the base classifier follow after it. And this is where the fun starts: where to put parameters for the base classifier if the Meta-Classifier itself is a base classifier for another Meta-Classifier?
> In XML, this becomes:
code format="xml"
0.001
code
> Internally, all the options enclosed by the nested options tag are pushed to the end when the XML is transformed into a command line string.
 * **quotes**
> A Meta-Classifier can take several such options, where each one encloses other options in quotes (which can itself contain a Meta-Classifier!). In XML, this becomes:
code format="xml"
code
> With the XML representation one no longer has to worry about the level of quotes in use, nor about the correct escaping (i.e. " ... \" ... \" ..."), since this is done automatically.

And if we now put it all together, we can transform this more complicated command line (with the CLASSPATH omitted):

into XML:
code format="xml"
0.001

<option name="W" type="hyphens">
<options type="classifier" value="weka.classifiers.trees.J48"/>

<option name="B" type="quotes">
<options type="classifier" value="weka.classifiers.meta.Stacking">
<option name="B" type="quotes">
<options type="classifier" value="weka.classifiers.trees.J48"/>

test/datasets/hepatitis.arff
code

**Note:** The type and value attributes of the outermost options tag are not used while reading the parameters. They are merely for documentation purposes, so that one knows which class was actually started from the command line.
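The four option types can be illustrated with a small Python sketch that flattens an options tree into command line tokens. This is not WEKA's own code. It assumes that nested options of a hyphens option are appended after a `--` separator and that quotes options escape inner quotes with a backslash. The function names and the sample document (including the option name C) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the options DTD above.
SAMPLE = """
<options type="classifier" value="weka.classifiers.meta.Stacking">
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
      <option name="W" type="hyphens">
        <options type="classifier" value="weka.classifiers.trees.J48">
          <option name="C">0.001</option>
        </options>
      </option>
    </options>
  </option>
  <option name="t">test/datasets/hepatitis.arff</option>
</options>
"""


def quote(text):
    """Wrap text in double quotes, escaping backslashes and inner quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def options_to_args(options):
    """Flatten the <option> children of an <options> element into tokens."""
    args, tail = [], []
    for opt in options.findall("option"):
        name = "-" + opt.get("name")
        kind = opt.get("type", "single")  # "single" is the DTD default
        nested = opt.find("options")
        if kind in ("hyphens", "quotes") and nested is None:
            raise ValueError("option %s needs a nested <options> tag" % name)
        if kind == "flag":
            args.append(name)
        elif kind == "single":
            args += [name, (opt.text or "").strip()]
        elif kind == "hyphens":
            # Assumption: nested options are pushed to the end, after "--".
            args += [name, nested.get("value")]
            tail += options_to_args(nested)
        elif kind == "quotes":
            inner = " ".join([nested.get("value")] + options_to_args(nested))
            args += [name, quote(inner)]
        else:
            raise ValueError("unknown option type: %s" % kind)
    if tail:
        args += ["--"] + tail
    return args


root = ET.fromstring(SAMPLE)
print(" ".join([root.get("value")] + options_to_args(root)))
```

Because `quote` is applied once per nesting level, a quotes option inside another quotes option ends up with escaped inner quotes without any manual bookkeeping.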


**Responsible Class(es):**


**Example(s):** [[file:commandline.xml]]

= Serialization of Experiments =
It is now possible to serialize the Experiments from the //WEKA Experimenter// not only in the proprietary binary format that Java serialization offers (with which one runs into problems when trying to read old experiments with a newer WEKA version, due to different SerialUIDs), but also in XML. There are currently two different ways to do this:

 * **built-in**
> The built-in serialization captures only the necessary information of an experiment and doesn't serialize anything else. Its sole purpose is to save the setup of a specific experiment; it can therefore not store any built models. Thanks to this limitation we'll never run into problems with mismatching SerialUIDs.

> This kind of serialization is always available and can be selected via a Filter (*.xml) in the Save/Open-Dialog of the Experimenter.

> The DTD is very simple and looks like this (for version 3.4.5):
code format="xml"
<!DOCTYPE object
[
   <!ELEMENT object (#PCDATA | object)*>
   <!ATTLIST object name      CDATA #REQUIRED>
   <!ATTLIST object class     CDATA #REQUIRED>
   <!ATTLIST object primitive CDATA "no">
   <!ATTLIST object array     CDATA "no">
   <!ATTLIST object null      CDATA "no">
   <!ATTLIST object version   CDATA "3.4.5">
]
>
code
> Prior to versions 3.4.5 and 3.5.0 it looked like this:
code format="xml"
<!DOCTYPE object
[
   <!ELEMENT object (#PCDATA | object)*>
   <!ATTLIST object name      CDATA #REQUIRED>
   <!ATTLIST object class     CDATA #REQUIRED>
   <!ATTLIST object primitive CDATA "yes">
   <!ATTLIST object array     CDATA "no">
]
>
code
> **Responsible Class(es):**

> **for general Serialization:**

> **Example(s):**
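As a rough illustration of the object DTD (version 3.4.5) above, the following Python sketch walks a serialized object tree and lists each object's name, class and value. The sample document, its class and field names, and the `walk` helper are made up for this sketch and are not taken from a real experiment file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the object DTD; not a real experiment.
SAMPLE = """
<object name="__root__" class="example.Experiment" version="3.4.5">
  <object name="runLower" class="int" primitive="yes">1</object>
  <object name="runUpper" class="int" primitive="yes">10</object>
  <object name="notes" class="java.lang.String" null="yes"/>
</object>
"""


def walk(node, depth=0):
    """Yield (depth, name, class, value) for node and all nested objects."""
    children = node.findall("object")
    is_null = node.get("null", "no") == "yes"  # "no" is the DTD default
    # Only leaf objects that are not null carry a textual value.
    value = None if (children or is_null) else (node.text or "").strip()
    yield depth, node.get("name"), node.get("class"), value
    for child in children:
        yield from walk(child, depth + 1)


for depth, name, cls, value in walk(ET.fromstring(SAMPLE)):
    print("  " * depth + "%s (%s) = %r" % (name, cls, value))
```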

 * **KOML**
> The Koala Object Markup Language (KOML) is published under the LGPL and is an alternative way of serializing and deserializing Java Objects in an XML file. Like the normal serialization, it serializes everything into XML via an ObjectOutputStream, including the SerialUID of each class. Even though we have the same problems with mismatching SerialUIDs, it is at least possible to edit the XML files by hand and replace the offending IDs with the new ones.

> In order to use KOML one only has to ensure that the KOML classes are in the CLASSPATH with which the Experimenter is launched. As soon as KOML is present, another Filter (*.koml) will show up in the Save/Open-Dialog.

> A local copy of the DTD for KOML is listed under Downloads.

> **Responsible Class(es):**

> **Example(s):**

The experiment class can of course read those XML files if they are passed as input or output file.

= Serialization of Classifiers =
The classifier options for the input model and for the output model now also support XML serialized files. Here we have to differentiate between two different formats:

 * **built-in**
> The built-in serialization captures only the options of a classifier but not the built model. One therefore still has to provide a training file, since only the options are retrieved from the XML file. It is possible to add more options on the command line, but no check is performed whether they collide with the ones stored in the XML file.
> The file is expected to end with .xml.

 * **KOML**
> Since the KOML serialization captures everything of a Java Object, we can use it just like the normal Java serialization.
> The file is expected to end with .koml.

The **built-in** serialization can be used in the **Experimenter** for loading/saving options from algorithms that have been added to a Simple Experiment. Unfortunately it is not possible to create a hierarchical structure like the one described in Command Line. This is because of the loss of information caused by the classifier method that returns the options: it returns only a flat String array and not a tree structure.


**Responsible Class(es):**


**Example(s):** [[file:commandline_inputmodel.xml]] [[file:commandline_inputmodel.koml]]

= Bayesian Networks =
The GraphVisualizer can save graphs in the Interchange Format for Bayesian Networks (BIF). If started from the command line with an XML filename as first parameter, and not from the Explorer, it can display the given file directly.

The DTD for BIF is this:
code format="xml"
<!DOCTYPE BIF [
   <!ELEMENT BIF ( NETWORK )*>
   <!ATTLIST BIF VERSION CDATA #REQUIRED>
   <!ELEMENT NETWORK ( NAME, ( PROPERTY | VARIABLE | DEFINITION )* )>
   <!ELEMENT NAME (#PCDATA)>
   <!ELEMENT VARIABLE ( NAME, ( OUTCOME | PROPERTY )* ) >
   <!ATTLIST VARIABLE TYPE (nature|decision|utility) "nature">
   <!ELEMENT OUTCOME (#PCDATA)>
   <!ELEMENT DEFINITION ( FOR | GIVEN | TABLE | PROPERTY )* >
   <!ELEMENT FOR (#PCDATA)>
   <!ELEMENT GIVEN (#PCDATA)>
   <!ELEMENT TABLE (#PCDATA)>
   <!ELEMENT PROPERTY (#PCDATA)>
]>
code
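To make the BIF DTD more concrete, here is a minimal Python sketch that reads the variables and probability tables of one network. The sample network (names, outcomes and probabilities) is invented for illustration, and `read_network` is a hypothetical helper, not part of WEKA.

```python
import xml.etree.ElementTree as ET

# Invented two-node network that follows the BIF DTD above.
SAMPLE = """
<BIF VERSION="0.3">
  <NETWORK>
    <NAME>toy</NAME>
    <VARIABLE TYPE="nature">
      <NAME>rain</NAME>
      <OUTCOME>yes</OUTCOME>
      <OUTCOME>no</OUTCOME>
    </VARIABLE>
    <VARIABLE TYPE="nature">
      <NAME>wet</NAME>
      <OUTCOME>yes</OUTCOME>
      <OUTCOME>no</OUTCOME>
    </VARIABLE>
    <DEFINITION>
      <FOR>rain</FOR>
      <TABLE>0.2 0.8</TABLE>
    </DEFINITION>
    <DEFINITION>
      <FOR>wet</FOR>
      <GIVEN>rain</GIVEN>
      <TABLE>0.9 0.1 0.1 0.9</TABLE>
    </DEFINITION>
  </NETWORK>
</BIF>
"""


def read_network(network):
    """Return (variables, definitions) for one NETWORK element.

    variables maps a variable name to its list of outcomes;
    definitions maps a variable name to (parent names, table values).
    """
    variables = {
        var.findtext("NAME"): [o.text for o in var.findall("OUTCOME")]
        for var in network.findall("VARIABLE")
    }
    definitions = {}
    for definition in network.findall("DEFINITION"):
        target = definition.findtext("FOR")
        parents = [g.text for g in definition.findall("GIVEN")]
        table = [float(x) for x in definition.findtext("TABLE", default="").split()]
        definitions[target] = (parents, table)
    return variables, definitions


network = ET.fromstring(SAMPLE).find("NETWORK")
variables, definitions = read_network(network)
print(network.findtext("NAME"), variables, definitions)
```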


**Responsible Class(es):**


**Example(s):** [[file:bif.xml]]

= Tools =
 * **Experimenter options**
> The XSLT script parses an XML file for the experimenter and outputs the options in two ways:
 * in an array-like fashion, i.e., each option on a separate line; the class is output first.
 * commandline-like, i.e., the class followed by all its parameters; a "\" is appended at the end of each line (works only on *nix and Cygwin).
> (A different script is needed if you want to parse an XML options file that was created via the Save options... button in the Experimenter.)
> **Usage:**
code format="bash"
xsltproc options.xsl
code
> //Note: you can use any XSLT processor, e.g., xt; xsltproc is just one.//
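The two output styles can be pictured with a short Python sketch. It is not the XSLT script itself, only an illustration of the two layouts. The helper names are hypothetical, and the sketch assumes that the trailing "\" is omitted on the last line so that the result remains a valid shell command.

```python
def array_style(class_name, options):
    """Class first, then each option token on its own line."""
    return "\n".join([class_name] + list(options))


def commandline_style(class_name, options):
    """Class followed by its parameters, with a "\\" line continuation
    at the end of every line except the last (an assumption of this sketch)."""
    return " \\\n".join([class_name] + list(options))


tokens = ["-t", "test/datasets/hepatitis.arff"]
print(array_style("weka.classifiers.trees.J48", tokens))
print(commandline_style("weka.classifiers.trees.J48", tokens))
```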

= Downloads =
 * KOML
 * [[file:koml12.dtd]] - local copy of the KOML DTD 1.2
 * [[file:koml_bin.zip]] - contains the KOML classes (basically a jar file)
 * [[file:koml_sources.zip]] - the KOML source code