Tuesday, August 24, 2010

External merge sort

If data to be sorted doesn’t fit into main memory external sorting is applicable. External merge sort can be separated into two phases:

  1. Create initial runs (run is a sequence of records that are in correct relative order or in other words sorted chunk of original file).
  2. Merge created runs into single sorted file.

To implement this algorithm I will use solutions from my previous posts so it may be helpful for you to look at them:

Let’s assume that M records at the same time are allowed to be loaded into main memory. One of the ways to create initial runs is to successively read M records from original file, sort them in memory and write back to disk. However we will use approach that allows us to create longer runs. It is called replacement selection.

The core structure behind this algorithm is priority queue. Taking one by one current minimum element out of the queue forms ordered sequence. And this is exactly what run stands for. The algorithm can be described as follows:

  1. Create two priority queues (that will contain items for current and next runs respectively) with capacity of M records.
  2. Prefill current run priority queue from unsorted file with M records.
  3. Create current run if there are elements in the current run priority queue:
    1. Take minimum element out of current run priority queue and append it to current run (basically write it to disk).
    2. Take next element from unsorted file (this is the replacement part of the algorithm) and compare it with just appended element.
    3. If it is less then it cannot be part of the current run (or otherwise order will be destroyed) and thus it is queued to the next  run priority queue.
    4. Otherwise it is part of the current run and it is queued to the current run priority queue.
    5. Continue steps 1 through 4 until current run priority queue is empty.
  4. Switch current and next runs priority queues and repeat step 3.

At any given moment at most M records are loaded into main memory as single written element into current run is replaced with single element from unsorted file if any (depending on comparison it either goes into current or next run).

Next step is to merge created initial runs. For the merge step we will use simplified algorithm (more advanced algorithms work with multiple physical devices to distribute runs, take into account data locality, etc.) based on k-way merge:

  1. Append created runs into a queue.
  2. Until there are more than one run in the queue:
    1. Dequeue and merge K runs into a single run and put it into the queue.
  3. Remaining run represents sorted original file.

Yeap, it is that simple. And let’s code it.

The implementation abstracts file structure and reading/writing details making algorithm more concise and easier to understand.

abstract class ExternalSorter<T>
{
	private readonly IComparer<T> m_comparer;
	private readonly int m_capacity;
	private readonly int m_mergeCount;

	protected ExternalSorter(IComparer<T> comparer, int capacity, int mergeCount)
	{
		m_comparer = comparer;
		m_capacity = capacity;
		m_mergeCount = mergeCount;
	}

	// Sorts unsorted file and returns sorted file name
	public string Sort(string unsorted)
	{
		var runs = Distribute(unsorted);
		return Merge(runs);
	}

	// Write run to disk and return created file name
	protected abstract string Write(IEnumerable<T> run);
	// Read run from file with given name
	protected abstract IEnumerable<T> Read(string name);

	// Merge step in this implementation is simpler than 
	// the one used in polyphase merge sort - it doesn't
	// take into account distribution over devices
	private string Merge(IEnumerable<string> runs)
	{
		var queue = new Queue<string>(runs);
		var runsToMerge = new List<string>(m_mergeCount);
		// Until single run is left do merge
		while (queue.Count > 1)
		{
			// Priority queue must not contain records more than 
			// required
			var count = m_mergeCount;
			while (queue.Count > 0 && count-- > 0)
				runsToMerge.Add(queue.Dequeue());
			// Perform n-way merge on selected runs where n is 
			// equal to number of physical devices with 
			// distributed runs but in our case we do not take 
			// into account them and thus n is equal to capacity
			var merged = runsToMerge.Select(Read).OrderedMerge(m_comparer);
			queue.Enqueue(Write(merged));

			runsToMerge.Clear();
		}
		// Last run represents source file sorted
		return queue.Dequeue();
	}

	// Distributes unsorted file into several sorted chunks
	// called runs (run is a sequence of records that are 
	// in correct relative order)
	private IEnumerable<string> Distribute(string unsorted)
	{
		var source = Read(unsorted);
		using (var enumerator = source.GetEnumerator())
		{
			var curr = new PriorityQueue<T>(m_comparer);
			var next = new PriorityQueue<T>(m_comparer);
			// Prefill priority queue to capacity which is used 
			// to create runs
			while (curr.Count < m_capacity && enumerator.MoveNext())
				curr.Enqueue(enumerator.Current);
			// Until unsorted source and priority queues are 
			// exhausted
			while (curr.Count > 0)
			{
				// Create next run and write it to disk
				var sorted = CreateRun(enumerator, curr, next);
				var run = Write(sorted);

				yield return run;

				Swap(ref curr, ref next);
			}
		}
	}

	private IEnumerable<T> CreateRun(IEnumerator<T> enumerator, PriorityQueue<T> curr, PriorityQueue<T> next)
	{
		while (curr.Count > 0)
		{
			var min = curr.Dequeue();
			yield return min;
			// Trying to move run to an end enumerator will 
			// result in returning false and thus current 
			// queue will simply be emptied step by step
			if (!enumerator.MoveNext())
				continue;

			// Check if current run can be extended with 
			// next element from unsorted source
			if (m_comparer.Compare(enumerator.Current, min) < 0)
			{
				// As current element is less than min in 
				// current run it may as well be less than 
				// elements that are already in the current 
				// run and thus from this element goes into 
				// next run
				next.Enqueue(enumerator.Current);
			}
			else
			{
				// Extend current run
				curr.Enqueue(enumerator.Current);
			}
		}
	}

	private static void Swap<U>(ref U a, ref U b)
	{
		var tmp = a;
		a = b;
		b = tmp;
	}
}

In the example below I created type that sorts text files containing single number per line.

class TextFileOfNumbersExternalSorter : ExternalSorter<int>
{
	public TextFileOfNumbersExternalSorter(int capacity, int mergeCount)
		: base(Comparer<int>.Default, capacity, mergeCount)
	{
	}

	protected override string Write(IEnumerable<int> run)
	{
		var file = Path.GetTempFileName();
		using (var writer = new StreamWriter(file))
		{
			run.Run(writer.WriteLine);
		}
		return file;
	}

	protected override IEnumerable<int> Read(string name)
	{
		using (var reader = new StreamReader(name))
		{
			while (!reader.EndOfStream)
				yield return Int32.Parse(reader.ReadLine());
		}
		File.Delete(name);
	}
}

That is used like this:

// capacity, mergeCount and unsortedFileName are initialized elsewhere
var sorter = new TextFileOfNumbersExternalSorter(capacity, mergeCount);
var sortedFileName = sorter.Sort(unsortedFileName);

That’s it folks!

Saturday, August 21, 2010

Minimum window that contains all characters

Like most of life's problems, this one can be solved with bending

- Bender B.Rodrigues

Let’s bend another problem. Given set of characters P and string T find minimum window in T that contains all characters in P. Applicable solution is restricted to O(length(T)) time complexity. For example, given a string T “of characters and as” and set of characters T in a form of a string “aa s” the minimum window will be “and as”.

The problem can be broken into two parts:

  • How to select window?
  • How to check that selected window contains all characters from P?

Selecting every possible window (all unique pairs (i, j) where 0 <= i <= j < length(T)) will lead to solution worse than O(length(T)^2) because you still need to check if all characters from P are within selected window. Instead we will check every possible window ending position. Thus there are at least length(T) windows to consider.

Any feasible window has length equal to or greater than length(P). Performing recheck for any considered window will result in a solution no better than O(length(T)*length(P)). Instead we need to use check results from previous iteration.

Now we need to make sure that checking if a particular character is in P is done in an optimal way. Taking into account that a particular character may appear more than once and window thus must contain appropriate number of characters. We will use hash table to map unique characters from P to their count for fast lookup.

And now let’s tie all things together.

  • Until reached the end of T move by one current window ending position.
  • Append next character to the end of previous window which to this moment doesn’t contain all necessary characters. Char to count map is used to track the number of characters left to find. Basically if character is in P count is decremented. The number may become negative meaning that there are more than required characters.
  • If unmatched character count goes to zero the window contains all required characters. However there may be redundant characters. Thus we try to compact current window. It is ok to do this as we are looking for minimum window and any window that is extended from this one won’t be better.
  • Once window is compacted compare it with the minimum one and updated it if needed.
  • If current window contains all the characters remove from it the first one simply by moving by one starting position to make sure that at each iteration previous window doesn’t contain all the characters (there is no point in appending new characters to a window that already contains all of the required ones).

Code the thing! =)

static string FindMinWindow(string t, string p)
{
	// Create char to count mapping for fast lookup
	// as some characters may appear more than once
	var charToCount = new Dictionary<char, int>();
	foreach (var c in p)
	{
		if (!charToCount.ContainsKey(c))
			charToCount.Add(c, 0);
		charToCount[c]++;
	}

	var unmatchesCount = p.Length;
	int minWindowLength = t.Length + 1, minWindowStart = -1;
	int currWindowStart = 0, currWindowEnd = 0;
	for (; currWindowEnd < t.Length; currWindowEnd++)
	{
		var c = t[currWindowEnd];
		// Skip chars that are not in P
		if (!charToCount.ContainsKey(c))
			continue;
		// Reduce unmatched characters count
		charToCount[c]--;
		if (charToCount[c] >= 0)
			// But do this only while count is positive
			// as count may go negative which means 
			// that there are more than required characters
			unmatchesCount--;

		// No complete match, so continue searching
		if (unmatchesCount > 0)
			continue;

		// Decrease window as much as possible by removing 
		// either chars that are not in T or those that 
		// are in T but there are too many of them
		c = t[currWindowStart];
		var contains = charToCount.ContainsKey(c);
		while (!contains || charToCount[c] < 0)
		{
			if (contains)
				// Return character to P
				charToCount[c]++;

			c = t[++currWindowStart];
			contains = charToCount.ContainsKey(c);
		}

		if (minWindowLength > currWindowEnd - currWindowStart + 1)
		{
			minWindowLength = currWindowEnd - currWindowStart + 1;
			minWindowStart = currWindowStart;
		}

		// Remove last char from window - it is definitely in a 
		// window because we stopped at it during decrease phase
		charToCount[c]++;
		unmatchesCount++;
		currWindowStart++;
	}

	return minWindowStart > -1 ?
	       t.Substring(minWindowStart, minWindowLength) :
	       String.Empty;
}

Every character is examined at most twice (during appending to the end and during compaction) so the whole solution has O(length(T)) time complexity assuming hash table lookup is O(1) operation.

Tuesday, August 17, 2010

Print numbers by spiral

Recently I came across simple yet interesting coding problem. So here is the deal. You are given positive integer N. Print first N ^ 2 positive integers in matrix form in a such a way that within matrix numbers form spiral starting from its center and goring clockwise. For example, for N = 5 matrix to be printed is:

21 22 23 24 25
20 7 8 9 10
19 6 1 2 11
18 5 4 3 12
17 16 15 14 13

Optimize it for speed and space.

One way you can approach it is to create N x N matrix and fill it with numbers that form spiral and then print whole matrix row by row. But this solution will be of N ^ 2 space complexity. Let’s try to reach O(1) space complexity.

The key observation here is how matrix changes when N changes by 1.

N = 1.

1

N = 2.

1 2
4 3

N = 3.

7 8 9
6 1 2
5 4 3

N = 4.

7 8 9 10
6 1 2 11
5 4 3 12
16 15 14 13

Can you see the pattern here? At every step we extend previous matrix (P) with additional column and row (C). If N is even we extend previous matrix of size N – 1 with right column and bottom row

P C
C C

and with left column and top row if it is odd

C C
C P

This leads us to naturally recursive algorithm. We have three cases:

  1. Print whole row of the current matrix (top when N is odd or bottom when N is even).
  2. Print row from previous matrix of size N - 1 first and then print value that belongs to current matrix (when N is even).
  3. Print value that belongs to current matrix and then print row from previous matrix of size N - 1 (when N is odd).
  4. Print matrix line by line.

So basically to print a row we need to know matrix size N and row index. Here goes the solution.

static void Print(int n)
{
	for(int i = 0; i < n; i++)
	{
		PrintLine(n, i);
		Console.WriteLine();
	}
}

static void PrintLine(int n, int i)
{
	// Number of integers in current matrix
	var n2 = n*n;
	// Number of itegers in previous matrix of size n - 1
	var m2 = (n - 1)*(n - 1);

	if (n % 2 == 0)
	{
		if (i == n - 1)
		{
			// n is even and we are at the last row so just 
			// print it
			for(int k = n2; k > n2 - n; k--)
			{
				PrintNum(k);
			}
		}
		else
		{
			// Print row from previous matrix of size n - 1 
			// first and then print value that belongs to current 
			// matrix. Previous matrix is at the top left corner 
			// so no need to adjust row index
			PrintLine(n - 1, i);
			// Skip all integers from previous matrix and upper 
			// ones in this columnas integers must form clockwise 
			// spiral
			PrintNum(m2 + 1 + i);
		}
	}
	else
	{
		if (i == 0)
		{
			// n is odd and we are at the first row so just 
			// print it
			for(int k = m2 + n; k <= n2; k++)
			{
				PrintNum(k);
			}
		}
		else
		{
			// Print value that belongs to current matrix and
			// then print row from previous matrix of size n - 1
			// Skip all integers from previous matric and bottom
			// ones in this column as integers must form clockwise
			// spiral
			PrintNum(m2 + n - i);
			// Previous matrix is at the bottom right corner so
			// row index must be reduced by 1
			PrintLine(n - 1, i - 1);
		}
	}
}

static void PrintNum(int n)
{
	Console.Write("{0, -4}  ", n);
}

If stack is not considered then this solution has O(1) space complexity otherwise O(N).