Introduction
XML files are commonly used to configure various tools and
applications. While there are dedicated tools for generating XML files
through structured input forms, in many cases, it is beneficial to
organize data in tables linked by primary and foreign keys, especially
when dealing with large datasets or multiple instances of the same
entity.
The tab2xml
package provides a solution by enabling the
generation of XML files based on predefined templates
and structured tabular data stored in Excel (.xlsx) or
OpenDocument Spreadsheet (.ods) files. By leveraging relational
tables, users can efficiently handle and manage large-scale data entry,
ensuring consistency and ease of modification.
This approach simplifies XML creation, making it more accessible and
scalable when working with complex data structures.
Content
In addition to this introduction, the document includes:
An example based on the definition of an XML schema for
Mondrian, following the Mondrian
Documentation.
A breakdown of each schema element, detailing the corresponding
tables, their relationships, and the XML templates used for data
transformation.
The process of transforming elements to generate the
schema.
Example: Mondrian Schema
The starting point for this example is the Mondrian schema, based on
a schema example from the Mondrian
Documentation, shown below.
<Schema>
<Cube name="Sales">
<Table name="sales_fact_1997" />
<Dimension name="Gender" foreignKey="customer_id">
<Hierarchy name="Gender" hasAll="true" allMemberName="All Genders" primaryKey="customer_id">
<Table name="customer" />
<Level name="Gender" column="gender" type="String" uniqueMembers="true" />
</Hierarchy>
</Dimension>
<Dimension name="Time" foreignKey="time_id">
<Hierarchy name="Time" hasAll="false" primaryKey="time_id">
<Table name="time_by_day" />
<Level name="Year" column="the_year" type="Numeric" uniqueMembers="true" />
<Level name="Quarter" column="quarter" type="Numeric" uniqueMembers="false" />
<Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false" />
</Hierarchy>
</Dimension>
<Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###" />
<Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##" />
<Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00" />
<CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]">
<CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00" />
</CalculatedMember>
</Cube>
</Schema>
To define this schema, we can either edit the XML file directly or
use the Pentaho Schema Workbench tool. Another approach is to
break it down into tables and templates, work with the tables, and
generate the required versions directly, as shown below.
Scheme decomposition
Next, we will decompose the content of this schema into tables,
defined as sheets in the spreadsheets files (.xlsx or .ods) included in
the package, and templates.
Root of the schema template: schema_template.xml
Below is the root of the schema template.
<Schema>
{Cube}
</Schema>
The root can contain only one token (in this case,
the {Cube}
token). If multiple tokens or none are found, an
error is raised.
Tokens are defined using a name enclosed in {
and
}
.
Token names are case-insensitive.
Tokens reference either sheets in the spreadsheet or
columns in the sheet associated with each element.
In the example, a schema consists of one or more cubes.
Starting from the root, each element must have an associated
sheet and a template file, where the
sheet’s columns or other components are included.
Cube
Only one cube is defined in the schema; therefore, the table contains
only one row.
If we need to reference this element from another, we add a
primary key column. The name of this column
must consist of the sheet name followed by the _pk
suffix.
To reference elements from another table, we add a
foreign key column, named after the referenced
sheet with the _fk
suffix.
In this case, a cube has an associated table. However, since other
elements can also have associated tables, we add the foreign key to the
cube (as well as to each element with an associated table).
Regarding the template, we can include two types of tokens:
- Tokens referencing table columns: these will be
replaced with a value.
- Tokens referencing other elements (referenced by
the sheet name): these will be replaced with another template.
<Cube name="{name}">
{Table}
{Dimension}
{Measure}
{CalculatedMember}
</Cube>
When including tokens that reference other sheets, each token
must be placed on a separate line, as they will be replaced with
templates.
Table
Since the tables are referenced by other elements, we define a
primary key with values to allow referencing (as we did from the
cube).
1 |
sales_fact_1997 |
2 |
customer |
3 |
time_by_day |
The template only needs to include the table name using the specified
syntax.
Dimension
Each dimension is included in a cube and can also contain other
elements that reference it.
1 |
Gender |
customer_id |
1 |
2 |
Time |
time_id |
1 |
In the template, the sheet’s columns and the containing element
(which includes one or more hierarchies) are referenced.
<Dimension name="{name}" foreignKey="{foreignKey}">
{Hierarchy}
</Dimension>
Hierarchy
Hierarchies reference both the dimensions that contain them and their
associated tables.
1 |
Gender |
true |
allMemberName=“All Genders” |
customer_id |
1 |
2 |
2 |
Time |
false |
NA |
time_id |
2 |
3 |
In the template, we can see that it includes the table and the levels
that make up the hierarchy.
<Hierarchy name="{name}" hasAll="{hasAll}" {allMemberName} primaryKey="{primaryKey}">
{Table}
{Level}
</Hierarchy>
Level
Levels do not need to be referenced, so they do not include a primary
key. Instead, they reference the hierarchies they belong to.
Gender |
gender |
String |
true |
1 |
Year |
the_year |
Numeric |
true |
2 |
Quarter |
quarter |
Numeric |
false |
2 |
Month |
month_of_year |
Numeric |
false |
2 |
The template references the columns of the associated sheet.
<Level name="{name}" column="{column}" type="{type}" uniqueMembers="{uniqueMembers}" />
Measure
Measures are included in cubes and must reference them. Since they do
not need to be referenced, they do not require a primary key.
Unit Sales |
unit_sales |
sum |
#,### |
1 |
Store Sales |
store_sales |
sum |
#,###.## |
1 |
Store Cost |
store_cost |
sum |
#,###.00 |
1 |
The template includes the columns from the sheet.
<Measure name="{name}" column="{column}" aggregator="{aggregator}" formatString="{formatString}" />
CalculatedMember
Calculated members may have components that must be referenceable
(they have a primary key) and are included in cubes.
1 |
Profit |
Measures |
[Measures].[Store Sales] - [Measures].[Store Cost] |
1 |
The template shows the elements they contain.
<CalculatedMember name="{name}" dimension="{dimension}" formula="{formula}">
{CalculatedMemberProperty}
</CalculatedMember>
CalculatedMemberProperty
Finally, we have the properties of the calculated members they
reference.
FORMAT_STRING |
$#,##0.00 |
1 |
The template uses the columns from the corresponding sheet.
<CalculatedMemberProperty name="{name}" value="{value}" />
Generate the Schema
To generate the schema, tokens in each template are replaced with
corresponding values from a data source.
It handles both direct replacements from the associated sheet and
foreign key relationships.
library(tab2xml)
# Define file paths
source_xml <- system.file("extdata", "schema_template.xml", package = "tab2xml")
source_xlsx <- system.file("extdata", "schema.xlsx", package = "tab2xml")
temp_file <- tempfile(fileext = ".xml")
# Convert spreadsheet to XML
file <- sheet2xml(source_xlsx, source_xml, temp_file)
As shown in the example, we only need to specify:
The spreadsheet file.
The root template. The rest of the templates
must be located in the same folder as the root template and are
determined from the content of the templates being processed.
The name of the output XML file (if none is
specified, a file with the same name as the spreadsheet is created in
the same location).
The resulting output is shown below.
library(xml2)
xml_content <- readLines(file, warn = FALSE)
cat("```xml\n", paste(xml_content, collapse = "\n"), "\n```", sep = "")
<Schema>
<Cube name="Sales">
<Table name="sales_fact_1997" />
<Dimension name="Gender" foreignKey="customer_id">
<Hierarchy name="Gender" hasAll="true" allMemberName="All Genders" primaryKey="customer_id">
<Table name="customer" />
<Level name="Gender" column="gender" type="String" uniqueMembers="true" />
</Hierarchy>
</Dimension>
<Dimension name="Time" foreignKey="time_id">
<Hierarchy name="Time" hasAll="false" primaryKey="time_id">
<Table name="time_by_day" />
<Level name="Year" column="the_year" type="Numeric" uniqueMembers="true" />
<Level name="Quarter" column="quarter" type="Numeric" uniqueMembers="false" />
<Level name="Month" column="month_of_year" type="Numeric" uniqueMembers="false" />
</Hierarchy>
</Dimension>
<Measure name="Unit Sales" column="unit_sales" aggregator="sum" formatString="#,###" />
<Measure name="Store Sales" column="store_sales" aggregator="sum" formatString="#,###.##" />
<Measure name="Store Cost" column="store_cost" aggregator="sum" formatString="#,###.00" />
<CalculatedMember name="Profit" dimension="Measures" formula="[Measures].[Store Sales] - [Measures].[Store Cost]">
<CalculatedMemberProperty name="FORMAT_STRING" value="$#,##0.00" />
</CalculatedMember>
</Cube>
</Schema>
Conclusions
The tab2xml
package offers an efficient and scalable
method for generating XML files from tabular data. By using structured
templates and relational tables, the package facilitates the management
of complex datasets while maintaining consistency and accuracy.
This approach is particularly advantageous for scenarios requiring
the generation of large XML files or handling multiple entities with
defined relationships. It streamlines the transformation process,
reduces manual errors, and enhances maintainability by allowing users to
modify the underlying data without altering the XML structure.
The provided example, based on the Mondrian schema,
demonstrates how to apply this methodology to real-world use cases. This
framework can be extended to other domains requiring XML generation from
relational data, offering a flexible and robust solution for data-driven
XML creation.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.