April 2005 - Posts

Native XML Databases, SQL Server and the future of databases

Seeing as SQL Server 2005 is coming this year, and considering some of the more prominent new features of this release are XML related, I thought it might be worth talking about XML databases, what a Native XML database is, how SQL Server offers Native XML Database-like features and what other XML databases are out there. I'm also going to add a bit of humble speculation on the path that Microsoft may take with SQL Server.

Native XML Databases (NXDs) seem to escape any formal definition, but are generally considered to be databases capable of storing XML and querying through XPath/XQuery while retaining data fidelity. That means what goes in comes out looking pretty much the same. Storing and querying XML efficiently is no easy feat: you can't just stick it in a memo field and call it an XML Database. There's storage space considerations, and performance issues relating to how fast you can scan and index the XML.

NXDs are actually quite a new concept in the database arena: XML Databases have only started nearing the mainstream in the past few years. I think this requirement grew out of frustration than anything else - the more people used XML for communication and defining file-based documents, the more they found it a pain transforming this structure into the relatively flat relational database model. And so arose the requirement for a database that doesn't require any kind of transformation, it just takes the XML, and stores it in its original structure, and lets you use XML querying languages to get at the data.

This is different, though, to storing it in its original format. NXDs generally don't store the data in a text file. Instead, they put it into a binary format that's suitable for random access, performant updating and that’s more compact. They also maintain indices, much like indices in relational databases, except the index is on a path into the XML document rather than on a particular field in a table.

While there have been a few open-source, research-driven and commercial XML Database endeavors, such as Apache's Xindice and Modis group's Sedna, SQL Server 2005 represents Microsoft's first dip into that market.

So why would programmers want an XML database? XML is spreading like wild fire. With upcoming technologies like XAML and C-Omega Microsoft have hinted that we will be looking at much tighter integration between XML and .Net languages in the future. Cw turns an extremely popular language like C# into something that makes classes and XML interchangeable, and the thought of then being able to have scalable database persistence relatively transparent to you is quite appealing. You could argue people have been doing this for years with object databases, but I think the difference is in how rich XML is compared to standard objects, and how XML databases are all about standards, such as XQuery, XUpdate and XMLDB:API.

Personally I find it much easier to map business entities to XML than I do relational databases. It just feels more natural to consider the order details a child of the order, than having them in two tables but related by an ID. In a lot of cases parent-child relationships are the only relationship you'll need between certain entities, and IMO relational databases make that more complex to manage than it needs to be. With the advent of XQuery, you can manage one-to-many and many-to-many relationships between different areas of the hierarchy - so you get the best of both worlds. Besides, XQuery is a nice language to use, and I think it's easier to read than SQL.

SQL Server's NXD features are well integrated into its more traditional database model. Rather than have the NXD database a separate entity, you can declare a column type in a table as an XML docuent, and then store XML elements in each row. The XML elements can be full documents in themselves. You can then use Xquery to query the XML from within T-SQL, and even use the syntax in stored procedures. SQL Server's integration with the relational model has caused some stir, with some arguing that SQL Server isn't a real NXD. I think, though, by most definitions it certainly is an NXD and certainly a great start compared to other NXDs on the market.

One of the issues with XQuery 1.0 is that, while it provides a rich feature set for querying XML data, it doesn't yet support any update syntax. This of course has resulted in each database implementing update syntax as a proprietary extension to XQuery, SQL Server being one of them. Luckily, though, much of the syntax is similar across products and they're mostly based on the XUpdate protocol, which itself is quite straightforward.

SQL Server lets you do an awful lot with this XML data, including constraining the contents to a specific schema (specified in XSD format) and creating indices on the data. You can even store XML in temporary variables in your T-SQL and then run queries on them, without the XML even being in the database!I 'm impressed with what I've seen of the implementation, it's certainly an indication that Microsoft are taking NXDs seriously and that this is just the first step.

Before you drop all your tables and rush to re-implement your data model as persistent XML in SQL Server, there's a few things I've found that you should be aware of:

1. It's the first release of this technology for Microsoft. This is usually enough to scare off anyone thinking of using the technology for mission critical applications. Saying that, some basic tests do show the technology to be stable and this is promising.

2. I have heard that the XML support is lacking good concurrency support. That is, if I try to update a document in a row, the entire document will be locked for the duration of the update.

3. There's no support for top-level XPaths queries: you always have to run your XPath through a SQL statement, and even then you have to reference the XML via a regular table and column. Most other Native XML databases hold purely XML so this isn't a problem. Hopefully Microsoft will change this in the future. Either that, or perhaps someone will write a gateway that will hide this from you, and make SQL Server look like a pure NXD.

That's certainly not an exhaustive list -- I'll post more as I play with the technology. If you want to learn more about the XML features in SQL Server, you can look at this MSDN article.

Now, if you're looking for a cheaper option, and perhaps a purer approach, you could look at the countless open source XML databases out there. Some encode the data into tables in a standard RDBMS, while others store their data in a proprietary binary format. I'm currently evaluating Sedna, an NXD written from scratch by the Modis research group. What's nice about this database is that it's extremely simple to install - just copy the binaries in place and run the server from the command line. There are APIs available in C, Java, Scheme and .Net (which I contributed) and it's very easy to use, quite scalable and a good tool to test out XQuery.

Whichever database you end up using, when I look at the direction that XML is going I get the feeling that ultimately this is the database model of the future - don't get me wrong, it's not quite there just yet, but in the not-too-distant future I can see traditional relational databases being put aside in favor of native XML databases that allow you to mix hierarchical and relational data together, querying the data with Xquery and communicating that information through XML without having to worry about data transformation.

The way I see it, the more people use the XML features of SQL Server and other NXDs, the more they'll dislike having to go back to tables and relations. This is why I find it so important to look into and evaluate the technology today. In future postings I'll try to share my experience with you as I further evaluate this new database technology.

Exceptions vs. Return Codes, again...

That old chestnut has arisen once again with some new ramblings from both sides.

My thoughts.. the biggest problems with using return codes instead of exceptions are:

1. You can easily forget to check the return code of a method, and merrilly continue disregarding it failed. You can't ignore an exception without a suitable catch block.

2. Handling failure is a whole lot more complex with return codes. This is basically what structured exception handling was invented to solve. People often used to solve this with nasty nested IF statements and even GOTOs. 

3. Exceptions usually contain very descriptive information about the failure. A return code is usually a number, maybe an enumerator at best. A lot more work to figure out what's going on. Objects can contain a whole lot more.

Of course we shouldn't rely on exceptions for program flow.

We should also have validation methods that tell us whether the operation can succeed, and call these prior to the actual method. This should return true or false, or maybe even return an Exception-derived object instance to describe why it would fail. If you're not calling the validation routine, then you're not expecting it to fail validation, and that's when it should throw an exception.

String.Contains + Some Other String Improvements in VS2005

There's some things just so glaringly obvious, so innately presupposed that you just assume it's there and it's an impossible task to convince your brain otherwise.

For me, this is string.Contains. For years I've typed "name.Conta…" only to stop and realize that it's not there. I begrudgingly type in "name.IndexOf("sdfsd")>-1" instead. The next time I do the same. In fact I've been using .Net since it was in early beta and yet I still haven't come to terms with the fact string.Contains is missing.

Then today, working in VS 2005, I once again accidentally type it in - and to my disbelief it's actually there! Very cool.

Couple of other things I noticed:

1. String.Split now accepts strings. That means you can split a string like “bob and harry and mary and jane“ on “and“ and you'll get just the people names.

2. String.NullOrEmpty is a new static method that returns true if the string is null or empty (““). All those times I've used code like “if (name==null || name==““) “...

That's what I call progress.

Why testing apps in the IDE is a bad idea

There's recently been some discussion (here and here) on how JIT optimizations can quite dramatically change the behavior of the garbage collector, and thus the behavior of your application. Stephen Toub pointed out that this change in GC behavior occurs because the debugger is attached, regardless of whether you are in debug or release mode. It's interesting to note how running in a debugger can change how the JIT operates.

This just goes to show how dangerous it is to do your testing in the IDE. Running from the IDE (via F5) attaches a debugger to your app. Even when you're running code in release mode, you still won't be using the same application that your users will be running because the debugger is attached and the JIT optimizer operates differently.

The bottom line is that the best way to do effective testing is to run the application the same way your users will: from the desktop or command line.

 

Garbage Collection Quiz?

John Papa posted a quiz about garbage collection, but I'm not sure I agree / understand... perhaps he could back it up with an example?

He claims that, when running the following code:

public void foo()
{
    Class1 x = new Class1();
    Class2 y = new Class2();

    y.Width = x.GetWidth();
    y.Height = y.Width;
    y.Color = "Red";
}

... x becomes eligible for collection right after the call to GetWidth, because it's not used in the rest of the method. While I think that would be cool if possible, that's a kind of undeterministic finalization from hell -- what if that's a GDI object like a window, and it just disappears because you don't use it in the rest of the function, when you expect it to be there? What about the overhead of the JITter having to determine this - is it really worth it? Can i turn it off?

However when I tested this out, with the following code, I can't see that it's right. It doesn't get marked for collection after that call according to my tests.

class Class2 : Class1 { }

class Class1
{
    public int Width;
    public int Height;
    public string Color;
    
    public int GetWidth( )
    {
        return Width;
    }

    ~Class1( )
    {
        Console.WriteLine( "Destructor for " + this.GetType( ) .Name) ;
    }
    
    public static void foo( )
    {
        Class1 x = new Class1( ) ;
        Class2 y = new Class2( ) ;

        y.Width = x.GetWidth( ) ;
        // Console.WriteLine( "Collecting") ; GC.Collect( ) ;  // #1
        y.Height = y.Width;
        y.Color = "Red";
    }

    static void Main( string[ ] args)
    {
        foo( ) ;
        // Console.WriteLine( "Collecting") ; GC.Collect( ) ;  // #2
        Console.ReadLine( ) ;
    }
}

If I run this with #2 uncommented, then both classes get destroyed at the uncommented line, as expected.

If I recomment #2 and uncomment #1, then I just get “Collecting” and no finalizing takes place.

Maybe I'm missing something... so as I said.. I'd like to see an example that demonstrates it.

It turns out that this *is* the case, but only outside of a debugger. There's a great article about this here:

http://mtaulty.com/blog/archive/2005/02/17/1496.aspx

I've definitely learnt something here!