Saturday, March 27, 2004 - Posts

Measuring Performance

In the next few weeks I will be presenting a number of sessions on code optimization in the .NET environment. And one of the key things I will be iterating is the selection of algorithms and of course benchmarking before performing optimizations. So here is what I will be saying on one of my favorite topics ‘Algorithm Selection’.

When evaluating the efficiency of an algorithm, there are a number of factors that affect the timings obtained for an iteration of the algorithm. For this reason the efficiency of an algorithm is not measured in execution time, but on an equation that is a function of the size of the data processed by the algorithm. For example, when searching a document for the occurrences of a specific word, one algorithm would be to search character for character until there is a match on the first letter. Once the first letter is found you can proceed to compare subsequent characters against the remainder of the search string to determine if there is a word match. From this we see that if the document has N characters the algorithm execution time is a function of N. This relationship is represented as O(N). Obviously there are other factors that will affect the execution time, for example the more matches there are of the first letter, the more often the algorithm will perform an alternate sequence of tests. These factors are a function of the nature of the input data and not the algorithm. For that reason are ignored when determining the relationship between the algorithm and the input data, focusing rather on the size of the input data. This approach relates the efficiency of an algorithm in the general case. In the case where the worst-case scenario is likely more prevalent it is worth understanding how the algorithm performs in this case and deciding if this is the best algorithm for the expected data.

As another example, the well-known bubble sort routine can be evaluated. The bubble sort routine iterates the data once for each item in the data, comparing and swapping neighboring items if required. For each iteration one less element is evaluated. The bounding function here is the comparison that is performed for each item. From this we can see that the bubble sort requires N(N-1)/2 or (N2-N)/2 comparisons, where N is the number of items to be sorted. This gives us what is known as the performance equation O((N2-N)/2). Using the performance equation we can calculate that for 10 items 45 comparisons are performed and 100 items will require 4950 comparisons. Looking at the growth in the number of comparisons as the number of items increases clearly the bubble sort is useless for large datasets. Typically algorithms whose performance is N2 or worse are considered unusable. By convention O notation does not place significance on any constants other than those that affect the order of the equation, for this reason the standard representation of the bubble sort algorithm would be O(N2-N). While this does not provide the true number of comparisons performed it clearly conveys the order of the algorithm performance equation, which is what has the greatest impact on performance. I like to call this a performance factor or indicator.

To contrast the above with something more usable, we will look at the performance of the ever-popular quick sort algorithm. In the general case the quick sort has a performance factor of O(N log2 N) for the general case and a worst case performance of O(N2) if the dataset is already in sorted order. Fortunately most modern quick sort implementations guard against the worst cast scenario and reduce the likely hood of the algorithm degenerating into a O(N2) function.

When comparing two algorithms there performance equations are not the only factor to evaluate. While two algorithms might have the same order, it is possible that the work performed by one algorithm is more intensive than the other. In this case two algorithms might have the same order, but one is less process intensive than the other.