Thursday, March 4, 2010

K-way merge

The classic merge (the one used in Merge Sort) takes as input some sorted lists and at each step outputs element with next smallest key thus producing sorted list that contains all the elements of the input lists.

An instance of a list is a computer representation of the mathematical concept of a finite sequence, that is, a tuple.

It is not always practical to have whole sequence in memory because of its considerable size nor to constraint sequence to be finite as only K first elements may be needed. Thus our algorithm must produce monotonically increasing (according to some comparison logic) potentially infinite sequence.

Two-way merge is the simplest variation where two lists are merged (it is named OrderedMerge to avoid confusion with EnumerableEx.Merge).

public static IEnumerable<T> OrderedMerge<T>(
  this IEnumerable<T> first,
  IEnumerable<T> second,
  IComparer<T> comparer)
  using (var e1 = first.GetEnumerator())
    using (var e2 = second.GetEnumerator())
      var c1 = e1.MoveNext();
      var c2 = e2.MoveNext();
      while (c1 && c2)
        if (comparer.Compare(e1.Current, e2.Current) < 0)
          yield return e1.Current;
          c1 = e1.MoveNext();
          yield return e2.Current;
          c2 = e2.MoveNext();
      if (c1 || c2)
        var e = c1 ? e1 : e2;
          yield return e.Current;
        } while (e.MoveNext());

This algorithm runs in O(n) where n is a number of merged elements (as we may not measure it by the number of elements in sequences because of their potential infinity).

N-way merge is a more general algorithm that allows to merge N monotonically increasing sequences into one. It is used in external sorting. Here the most general overload signature (others are omitted as you can easily create them).

public static IEnumerable<T> OrderedMerge<T>(
  this IEnumerable<IEnumerable<T>> sources,
  IComparer<T> comparer)

Naive implementation will take advantage of existing two-way merge and composition.

public static IEnumerable<T> NaiveOrderedMerge<T>(
  this IEnumerable<IEnumerable<T>> sources,
  IComparer<T> comparer)
  return sources.Aggregate((seed, curr) => seed.OrderedMerge(curr, comparer));

Lets denote merging two sequences Si and Sj where i != j and both within [0, m) (m – is the number of sequences) with (Si, Sj). Then what the code above does is (((S0, S1), S2), … Sm). This implementation is naive because fetching next smallest element takes O(m) making total running time to O(nm). We can do better than that.

Recall that in my previous post we implemented priority queue based on binary heap that allows to get next smallest element in O(log n) where n is the size of the queue. Here is a solution sketch:

  • The queue will hold non empty sequences.
  • The priority of a sequence is its next element.
  • At each step we dequeue out of the queue sequence that has smallest next element. This element is next in the merged sequence.
  • If dequeued sequence is not empty it must be enqueued back because it may contain next smallest element in the merged  sequence.

Thus we will have queue of size that doesn’t exceed m (number of sequences) and thus making total running time O(n log m).

Other interesting aspect resource management. Each sequence has associated resources that needs to be released once merged sequence is terminated normally or abnormally. Number of sequences is not known in advance. We will use solution that is described in my previous post Disposing sequence of resources.

Now let’s do it.

// Convenience overloads are not included only most general one
public static IEnumerable<T> OrderedMerge<T>(
  this IEnumerable<IEnumerable<T>> sources,
  IComparer<T> comparer)
  // Make sure sequence of ordered sequences is not null
  Contract.Requires<ArgumentNullException>(sources != null);
  // and it doesn't contain nulls
  Contract.Requires(Contract.ForAll(sources, s => s != null));
  Contract.Requires<ArgumentNullException>(comparer != null);
  // Precondition checking is done outside of iterator because
  // of its lazy nature
  return OrderedMergeHelper(sources, comparer);

private static IEnumerable<T> OrderedMergeHelper<T>(
  IEnumerable<IEnumerable<T>> sources,
  IComparer<T> elementComparer)
  // Each sequence is expected to be ordered according to 
  // the same comparison logic as elementComparer provides
  var enumerators = sources.Select(e => e.GetEnumerator());
  // Disposing sequence of lazily acquired resources as 
  // a single resource
  using (var disposableEnumerators = enumerators.AsDisposable())
    // The code below holds the following loop invariant:
    // - Priority queue contains enumerators that positioned at 
    // sequence element
    // - The queue at the top has enumerator that positioned at 
    // the smallest element of the remaining elements of all 
    // sequences

    // Ensures that only non empty sequences participate  in merge
    var nonEmpty = disposableEnumerators.Where(e => e.MoveNext());
    // Current value of enumerator is its priority 
    var comparer = new EnumeratorComparer<T>(elementComparer);
    // Use priority queue to get enumerator with smallest 
    // priority (current value)
    var queue = new PriorityQueue<IEnumerator<T>>(nonEmpty, comparer);

    // The queue is empty when all sequences are empty
    while (queue.Count > 0)
      // Dequeue enumerator that positioned at element that 
      // is next in the merged sequence
      var min = queue.Dequeue();
      yield return min.Current;
      // Advance enumerator to next value
      if (min.MoveNext())
        // If it has value that can be merged into resulting
        // sequence put it into the queue

// Provides comparison functionality for enumerators
private class EnumeratorComparer<T> : Comparer<IEnumerator<T>>
  private readonly IComparer<T> m_comparer;

  public EnumeratorComparer(IComparer<T> comparer)
    m_comparer = comparer;

  public override int Compare(
     IEnumerator<T> x, IEnumerator<T> y)
    return m_comparer.Compare(x.Current, y.Current);

It works well with infinite sequences and cases where we need only K first elements and it fetches only bare minimum out of source sequences.

Run the thing.

// Function that generates sequence of length k of random numbers
Func<int, IEnumerable<int>> gen = k => Enumerable.Range(0, k)
  .Select(l => rnd.Next(max));
// Generate sequence of random lengths and each length project
// to a sequence of that length of random numbers
var seqs = gen(count).Select(k => gen(k)
  .OrderBy(l => l).AsEnumerable());

var p = -1;
foreach (var c in seqs.OrderedMerge(Comparer<int>.Default))
  Debug.Assert(p <= c);
  p = c;


1 comment:

Ihar Voitka said...

Nice implementation. I like it.