Friday, October 01, 2004 - Posts

Performance - Loading XML Documents

With my focus on .NET technologies, one of the things I find myself doing very often is processing XML data in some form or other. This is an area where the .NET class library provides numerous options, and while these options are well documented, I still find many developers are not fully aware of the impact that their choice of XML processing technology can have on the scalability of the software being developed. To that end I have chosen five alternatives for loading an XML document into memory for further processing.

  1) XmlDocument class
  2) Untyped DataSet
  3) Typed DataSet
  4) XmlSerializer
  5) XmlTextReader

For each of these I benchmarked the time required to create an in-memory representation of the XML document. For the XmlDocument I used the following code:

  XmlDocument dom = new XmlDocument();
  dom.Load("c:\\orders.xml");

The untyped DataSet was equally simple:

  DataSet ds = new DataSet();
  ds.ReadXml("c:\\orders.xml");

For the typed DataSet I created an XSD that represented the XML document to be loaded. From that I generated the typed DataSet and used the following code to load it:

  DSOrders ds = new DSOrders();
  ds.ReadXml("c:\\orders.xml");

Then, for the XmlSerializer and the XmlTextReader, I needed to build a set of classes that could be populated with the data from the XML document. I created the following classes to represent the data:

using System;
using System.Xml.Serialization;

// Root of the document: a <data> element holding repeated order elements.
[Serializable(), XmlRoot("data")]
public class Data
{
  [XmlElement()]
  public Order[] orders;
}

// One order; every value is read from an XML attribute as a string.
[Serializable()]
public class Order
{
  [XmlAttribute()]
  public string OrderID;
  [XmlAttribute()]
  public string CustomerID;
  [XmlAttribute()]
  public string EmployeeID;
  [XmlAttribute()]
  public string OrderDate;
  [XmlAttribute()]
  public string RequiredDate;
  [XmlAttribute()]
  public string ShippedDate;
  [XmlAttribute()]
  public string ShipVia;
  [XmlAttribute()]
  public string Freight;
  [XmlAttribute()]
  public string ShipName;
  [XmlAttribute()]
  public string ShipAddress;
  [XmlAttribute()]
  public string ShipCity;
  [XmlAttribute()]
  public string ShipPostalCode;
  [XmlAttribute()]
  public string ShipCountry;
}

The above classes represent a collection of orders and, as you might have guessed by now, the data in the XML document is derived from the Northwind Orders table. With the classes defined, using the XmlSerializer was a simple matter, as the following code snippet shows:

  XmlSerializer xmlser = new XmlSerializer(typeof(Data));
  Data d;
  // The using block closes the file once deserialization completes.
  using (StreamReader rdr = new StreamReader("c:\\orders_small.xml"))
  {
    d = (Data)xmlser.Deserialize(rdr);
  }

And then there was the more verbose code using the XmlTextReader to load the XML data into instances of the Order class and add them to the Data class’ orders collection. For brevity I will not include that code here.
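To give a feel for what that code involves, here is a minimal sketch of the XmlTextReader approach. It is not the code that was benchmarked. It assumes the layout implied by the classes above (a data root containing orders elements that carry their values as attributes), it uses a trimmed copy of Order with only three fields so that it compiles on its own, and the names OrderReader and Load are hypothetical. It also uses the generic List, which arrived in a later version of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Trimmed, hypothetical copy of the Order class shown earlier (three fields only),
// repeated here so the sketch is self-contained.
public class Order
{
    public string OrderID;
    public string CustomerID;
    public string ShipCity;
}

public static class OrderReader
{
    // Makes a single forward-only pass over the XML and builds one Order
    // for every element named "orders" (an assumed element name).
    public static List<Order> Load(TextReader input)
    {
        List<Order> orders = new List<Order>();
        using (XmlTextReader reader = new XmlTextReader(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "orders")
                {
                    Order o = new Order();
                    // GetAttribute returns null when the attribute is absent.
                    o.OrderID = reader.GetAttribute("OrderID");
                    o.CustomerID = reader.GetAttribute("CustomerID");
                    o.ShipCity = reader.GetAttribute("ShipCity");
                    orders.Add(o);
                }
            }
        }
        return orders;
    }
}
```

Passing a StreamReader opened on the XML file to OrderReader.Load returns the orders as a list. The remaining attributes would be copied in the same way, one GetAttribute call per field.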

Once all the code was written I put it through my benchmarking application. I ran each test twice, first with an XML document containing 860 records and a second time with a smaller set of 10 records. Since the results were very similar I will only show the results for the larger set. The following table shows the median time taken for each of the techniques.

  Test               Time (Seconds)
  XmlDocument        0.193951745263714
  Untyped DataSet    0.435501439428754
  Typed DataSet      0.129763140287383
  XmlSerializer      0.087618677792848
  XmlTextReader      0.0392390653001988
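Median timings like these could be collected with a small harness along the following lines. This is a sketch only and not the benchmarking application used for this post: the names Bench, Median, and MedianSeconds are hypothetical, and it relies on the Stopwatch class and the Action delegate, both of which come from later versions of the framework. The untimed warm-up call is an assumption about how one-off setup costs might be kept out of the measurements.

```csharp
using System;
using System.Diagnostics;

public static class Bench
{
    // Returns the median of the samples; for an even count this is
    // the mean of the two middle values.
    public static double Median(double[] values)
    {
        if (values.Length == 0)
        {
            throw new ArgumentException("values must not be empty");
        }
        double[] sorted = (double[])values.Clone();
        Array.Sort(sorted);
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Calls the action once untimed (warm-up), then times it `runs` times
    // and returns the median elapsed time in seconds.
    public static double MedianSeconds(Action action, int runs)
    {
        action(); // warm-up: JIT compilation and other one-off setup
        double[] samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            samples[i] = sw.Elapsed.TotalSeconds;
        }
        return Median(samples);
    }
}
```

Each loading technique would be wrapped in a delegate and passed to Bench.MedianSeconds with the desired number of runs. Taking the median rather than the mean keeps a single slow run from skewing the figure.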

There is of course one catch with the XmlSerializer which is not depicted in the above results, and that is the initial cost of instantiation. When an instance of XmlSerializer is created for a particular type, a temporary assembly is generated that is dedicated to serializing and deserializing objects of that type. Since this is a one-off cost I ignored this overhead and, although their setup costs are less significant, I did the same for the other tests. Only the XmlTextReader solution had no initial setup costs.
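One common way to make sure that instantiation cost is paid only once is to create the serializer a single time and reuse it for every document. The sketch below illustrates the idea under some assumptions: Data and Order are trimmed copies of the classes shown earlier (one attribute only) so that the example compiles on its own, and OrderLoader and Load are hypothetical names. This is not how the tests above were structured.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Trimmed, hypothetical copies of the classes shown earlier.
[XmlRoot("data")]
public class Data
{
    [XmlElement()]
    public Order[] orders;
}

public class Order
{
    [XmlAttribute()]
    public string OrderID;
}

public static class OrderLoader
{
    // Created once, when OrderLoader is first used, and shared by every
    // later call, so the setup cost is not repeated per document.
    private static readonly XmlSerializer Serializer =
        new XmlSerializer(typeof(Data));

    public static Data Load(TextReader input)
    {
        return (Data)Serializer.Deserialize(input);
    }
}
```

The first call to OrderLoader.Load carries the setup cost and later calls only pay for the deserialization itself.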

Conclusion

For tasks where the XML schema is known upfront, I feel that the XmlSerializer gives the best balance between lines of code written and performance. While the typed DataSet is a very good performer, it is much less flexible than the XmlSerializer. The XmlDocument is far more flexible, but you pay for that in performance and memory consumption. Even though I ensured that all the tests loaded the entire XML into a memory structure, the XmlDocument was significantly more expensive in terms of memory consumption, but I leave that for a later post.

Shortcomings and all, I am a big fan of the XmlSerializer, and through clever use of the XML serialization attributes I find that I can work with most XML document structures.

For the tests I ran, I did not make any attempt to optimize the code that was executed. This applies especially to the XmlTextReader, where I took a very straightforward approach that could easily have been improved, particularly with regard to memory consumption.