XRFF

The **XRFF** (e**X**tensible attribute-**R**elation **F**ile **F**ormat) is an XML-based extension of the ARFF format.

= File extensions =
 * **.xrff**: the default extension of //XRFF// files
 * **.xrff.gz**: the extension for **gzip**-compressed //XRFF// files (see Compression for more details)

= Comparison =

== ARFF ==
The following is a snippet of the UCI dataset //iris// in **ARFF** format:
code format="text"
@relation iris
@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
...
code

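== XRFF ==
And the same dataset represented as an **XRFF** file:
code format="xml"
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE dataset
[
   <!ELEMENT dataset (header,body)>
   <!ATTLIST dataset name CDATA #REQUIRED>
   <!ATTLIST dataset version CDATA "3.5.4">
   <!ELEMENT header (notes?,attributes)>
   <!ELEMENT body (instances)>
   <!ELEMENT notes ANY>
   <!ELEMENT attributes (attribute+)>
   <!ELEMENT attribute (labels?,metadata?,attributes?)>
   <!ATTLIST attribute name CDATA #REQUIRED>
   <!ATTLIST attribute type (numeric|date|nominal|string|relational) #REQUIRED>
   <!ATTLIST attribute format CDATA #IMPLIED>
   <!ATTLIST attribute class (yes|no) "no">
   <!ELEMENT labels (label*)>
   <!ELEMENT label ANY>
   <!ELEMENT metadata (property*)>
   <!ELEMENT property ANY>
   <!ATTLIST property name CDATA #REQUIRED>
   <!ELEMENT instances (instance*)>
   <!ELEMENT instance (value*)>
   <!ATTLIST instance type (normal|sparse) "normal">
   <!ATTLIST instance weight CDATA #IMPLIED>
   <!ELEMENT value (#PCDATA|instances)*>
   <!ATTLIST value index CDATA #IMPLIED>
   <!ATTLIST value missing (yes|no) "no">
]
>
<dataset name="iris">
   <header>
      <attributes>
         <attribute name="sepallength" type="numeric"/>
         <attribute name="sepalwidth" type="numeric"/>
         <attribute name="petallength" type="numeric"/>
         <attribute name="petalwidth" type="numeric"/>
         <attribute class="yes" name="class" type="nominal">
            <labels>
               <label>Iris-setosa</label>
               <label>Iris-versicolor</label>
               <label>Iris-virginica</label>
            </labels>
         </attribute>
      </attributes>
   </header>
   <body>
      <instances>
         <instance>
            <value>5.1</value>
            <value>3.5</value>
            <value>1.4</value>
            <value>0.2</value>
            <value>Iris-setosa</value>
         </instance>
         <instance>
            <value>4.9</value>
            <value>3</value>
            <value>1.4</value>
            <value>0.2</value>
            <value>Iris-setosa</value>
         </instance>
         ...
      </instances>
   </body>
</dataset>
code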
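To make the element structure declared in the DTD concrete, here is a minimal sketch in Python that reads attribute names and dense rows from an XRFF document. It relies only on the element names from the DTD; the helper name `read_dense` is hypothetical, and this is not how Weka itself loads the format.

```python
import xml.etree.ElementTree as ET

# A shortened iris document that follows the DTD's element structure.
XRFF = """<dataset name="iris">
  <header>
    <attributes>
      <attribute name="sepallength" type="numeric"/>
      <attribute name="sepalwidth" type="numeric"/>
      <attribute name="petallength" type="numeric"/>
      <attribute name="petalwidth" type="numeric"/>
      <attribute class="yes" name="class" type="nominal">
        <labels>
          <label>Iris-setosa</label>
          <label>Iris-versicolor</label>
          <label>Iris-virginica</label>
        </labels>
      </attribute>
    </attributes>
  </header>
  <body>
    <instances>
      <instance>
        <value>5.1</value><value>3.5</value><value>1.4</value>
        <value>0.2</value><value>Iris-setosa</value>
      </instance>
    </instances>
  </body>
</dataset>"""


def read_dense(xml_text):
    """Return (attribute names, rows) for a document with normal instances."""
    root = ET.fromstring(xml_text)
    # Only top-level attributes; nested relational attributes are ignored here.
    names = [a.get("name") for a in root.findall("./header/attributes/attribute")]
    rows = [[v.text for v in inst.findall("value")]
            for inst in root.findall("./body/instances/instance")]
    return names, rows


names, rows = read_dense(XRFF)
print(names)
print(rows)
```

All values come back as strings; converting them according to each attribute's //type// is left out of this sketch.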

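= Sparse format =
The //XRFF// format also supports a sparse data representation. Even though the iris dataset does not contain sparse data, the above example is used here to illustrate the sparse format:
code format="xml"
...
<instances>
   <instance type="sparse">
      <value index="1">5.1</value>
      <value index="2">3.5</value>
      <value index="3">1.4</value>
      <value index="4">0.2</value>
      <value index="5">Iris-setosa</value>
   </instance>
   <instance type="sparse">
      <value index="1">4.9</value>
      <value index="2">3</value>
      <value index="3">1.4</value>
      <value index="4">0.2</value>
      <value index="5">Iris-setosa</value>
   </instance>
   ...
</instances>
...
code
In contrast to the //normal// data format, each sparse //instance// tag contains a //type// attribute with the value //sparse//:
code format="xml"
<instance type="sparse">
code
In addition, each //value// tag needs to specify the //index// attribute, which contains the 1-based index of this value:
code format="xml"
<value index="1">5.1</value>
code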
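The 1-based //index// attribute can be illustrated with a short Python sketch that expands one sparse //instance// into a full row. The function name `densify` and the fill value are assumptions: the text above does not say what an omitted value stands for, so the sketch follows the sparse ARFF convention of treating it as 0.

```python
import xml.etree.ElementTree as ET


def densify(instance, num_attributes, fill="0"):
    """Expand one <instance> element into a list of num_attributes values.

    fill is an assumption (omitted values read as 0, as in sparse ARFF).
    """
    values = instance.findall("value")
    if instance.get("type", "normal") != "sparse":
        # Normal instances already list every value in attribute order.
        return [v.text for v in values]
    row = [fill] * num_attributes
    for v in values:
        # The index attribute is 1-based, Python lists are 0-based.
        row[int(v.get("index")) - 1] = v.text
    return row


sparse = ET.fromstring(
    '<instance type="sparse">'
    '<value index="1">5.1</value>'
    '<value index="3">1.4</value>'
    '<value index="5">Iris-setosa</value>'
    '</instance>'
)
print(densify(sparse, 5))  # ['5.1', '0', '1.4', '0', 'Iris-setosa']
```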

= Compression =
Since the XML representation takes up considerably more space than the rather compact //ARFF// format, the data can also be compressed with **gzip**. Weka automatically treats a file as gzip-compressed if its extension is //.xrff.gz// instead of //.xrff//.

The Weka Explorer can now load and save both compressed and uncompressed XRFF files (this also applies to ARFF files).
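Outside of Weka, the same extension-based convention is easy to mirror. The following Python sketch picks `gzip.open` or the built-in `open` depending on the file extension; the function name `load_xrff` and the file names are hypothetical, and this only illustrates the convention, not Weka's own loader.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def load_xrff(path):
    """Parse an XRFF file, decompressing it first if it ends in .xrff.gz."""
    opener = gzip.open if path.endswith(".xrff.gz") else open
    with opener(path, "rb") as handle:
        return ET.parse(handle).getroot()


# Write the same tiny document once plain and once gzip-compressed.
doc = (b'<dataset name="iris"><header><attributes>'
       b'<attribute name="sepallength" type="numeric"/>'
       b'</attributes></header><body><instances/></body></dataset>')

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "iris.xrff")
    packed = os.path.join(tmp, "iris.xrff.gz")
    with open(plain, "wb") as f:
        f.write(doc)
    with gzip.open(packed, "wb") as f:
        f.write(doc)
    names = [load_xrff(p).get("name") for p in (plain, packed)]

print(names)  # ['iris', 'iris']
```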

= Additional features =
In addition to all the features of the ARFF format, the //XRFF// format offers the following:
 * class attribute specification
 * attribute weights
 * instance weights

== Class attribute specification ==
Via the //class="yes"// attribute in the attribute specification in the header, one can define which attribute should act as the class attribute. This feature can be used on the command line as well as in the Experimenter, which can now also load other data formats, and it removes the limitation that the class attribute always has to be the last one.

Snippet from the iris dataset:
code format="xml"
<attribute class="yes" name="class" type="nominal">
code
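A reader of the format can locate the class attribute by scanning the header for the //class// flag, which the DTD declares with the values //yes// and //no// and a default of //no//. The Python sketch below does this; the helper name `class_index` is hypothetical, and returning `None` when no attribute is flagged is a choice of this sketch, not documented Weka behaviour.

```python
import xml.etree.ElementTree as ET


def class_index(header):
    """Return the 0-based position of the attribute flagged class="yes"."""
    attributes = header.findall("./attributes/attribute")
    for position, attribute in enumerate(attributes):
        # The DTD default for the class flag is "no".
        if attribute.get("class", "no") == "yes":
            return position
    return None  # no attribute is flagged as the class


# The class attribute deliberately comes first to show it need not be last.
header = ET.fromstring(
    "<header><attributes>"
    '<attribute class="yes" name="class" type="nominal"/>'
    '<attribute name="sepallength" type="numeric"/>'
    '<attribute name="sepalwidth" type="numeric"/>'
    "</attributes></header>"
)
print(class_index(header))  # 0
```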

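== Attribute weights ==
Attribute weights are stored in an attribute's meta-data tag (in the //header// section). Here's an example of the //petalwidth// attribute with a weight of 0.9:
code format="xml"
<attribute name="petalwidth" type="numeric">
   <metadata>
      <property name="weight">0.9</property>
   </metadata>
</attribute>
code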
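Reading attribute weights back could look like the Python sketch below. It assumes the weight is stored in a //property// element named `weight` inside //metadata//, and that an attribute without such a property has weight 1.0; both are assumptions of this sketch, as is the helper name `attribute_weights`.

```python
import xml.etree.ElementTree as ET


def attribute_weights(header, default=1.0):
    """Map attribute names to weights read from their metadata.

    The property name "weight" and the default of 1.0 are assumptions.
    """
    weights = {}
    for attribute in header.findall("./attributes/attribute"):
        prop = attribute.find("./metadata/property[@name='weight']")
        weights[attribute.get("name")] = (
            float(prop.text) if prop is not None else default
        )
    return weights


header = ET.fromstring(
    "<header><attributes>"
    '<attribute name="petallength" type="numeric"/>'
    '<attribute name="petalwidth" type="numeric">'
    '<metadata><property name="weight">0.9</property></metadata>'
    "</attribute>"
    "</attributes></header>"
)
print(attribute_weights(header))  # {'petallength': 1.0, 'petalwidth': 0.9}
```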

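== Instance weights ==
Instance weights are defined via the //weight// attribute in each //instance// tag. By default, the weight is 1. Here's an example:
code format="xml"
<instance weight="0.75">
   <value>5.1</value>
   <value>3.5</value>
   <value>1.4</value>
   <value>0.2</value>
   <value>Iris-setosa</value>
</instance>
code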
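The default of 1 for a missing //weight// attribute can be sketched in Python as follows. The helper name `instance_weights` and the weight value in the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET


def instance_weights(instances):
    """Return one weight per <instance>, using 1 when none is given."""
    return [float(instance.get("weight", "1"))
            for instance in instances.findall("instance")]


instances = ET.fromstring(
    "<instances>"
    '<instance weight="0.5"><value>5.1</value></instance>'
    "<instance><value>4.9</value></instance>"
    "</instances>"
)
print(instance_weights(instances))  # [0.5, 1.0]
```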