Sunday, April 4, 2010

Suffix array

Find all occurrences of a pattern (of length m) in a text (of length n) is quite commonly encountered string matching problem. For example, you hit Ctrl-F in your browser and type string you want to find while browser highlights every occurrence of a typed string on a page.

The naive solution is to at each iteration “shift” pattern along the text by 1 position and check if all characters of a pattern match to corresponding characters in text. This solution has O((n – m + 1)*m) complexity.

If either pattern or text is fixed it can be preprocessed to speed up the search. For example, if pattern is fixed we can use Knuth-Morris-Pratt algorithm to preprocess it in O(m) time and make search of its occurrences complexity O(n).

Fixed text that is queried many times can also be preprocessed to support fast patterns search. One way to do this is to build suffix array. The idea behind it pretty simple. It is basically a list of sorted in lexicographical order suffixes (which starts at some position inside the string and runs till the end of the string) of the subject text. For example, for the “mississippi” string we have the following:

i
ippi
issippi
ississippi
mississippi
pi
ppi
sippi
sissippi
ssippi
ssissippi

However due to strings immutability in .NET it is not practical to represent each suffix as separate string as it requires O(n^2) space. So instead starting positions of suffixes will be sorted. But why suffixes are selected in the first place? Because searching for every occurrence of a pattern is basically searching for every suffix that starts with the pattern.

Once they are sorted we can use binary search to find lower and upper bounds that enclose all suffixes that start with the pattern. Comparison of a suffix with a pattern during binary search should take into account only m (length of the pattern) characters as we are looking for suffixes that start with the pattern.

// Suffix array represents simple text indexing mechanism.
public class SuffixArray : IEnumerable<int>
{
  private const int c_lower = 0;
  private const int c_upper = -1;

  private readonly string m_text;
  private readonly int[] m_pos;
  private readonly int m_lower;
  private readonly int m_upper;

  SuffixArray(string text, int[] pos, int lower, int upper)
  {
    m_text = text;
    m_pos = pos;
    // Inclusive lower and upper boundaries define search range.
    m_lower = lower;
    m_upper = upper;
  }

  public static SuffixArray Build(string text)
  {
    Contract.Requires<ArgumentException>(!String.IsNullOrEmpty(text));

    var length = text.Length;
    // Sort starting positions of suffixes in lexicographical 
    // order.
    var pos = Enumerable.Range(0, length).ToArray();
    Array.Sort(pos, (x, y) => String.Compare(text, x, text, y, length));
    // By default all suffixes are in search range.
    return new SuffixArray(text, pos, 0, text.Length - 1);
  }

  public SuffixArray Search(string str)
  {
    Contract.Requires<ArgumentException>(!String.IsNullOrEmpty(str));

    // Search range is empty so nothing to narrow.
    if (m_lower > m_upper)
      return this;
    // Otherwise search for boundaries that enclose all 
    // suffixes that start with supplied string.
    var lower = Search(str, c_lower);
    var upper = Search(str, c_upper);
    // Once precomputed sorted suffixes positions don't change
    // but the boundaries do so that next refinement 
    // can be done within smaller range and thus faster.
    // For example, you may narrow search range to suffixes 
    // that start with "ab" and then search within this smaller
    // search range suffixes that start with "abc".
    return new SuffixArray(m_text, m_pos, lower + 1, upper);
  }

  public IEnumerator<int> GetEnumerator()
  {
    // Enumerates starting positions of suffixes that fall 
    // into search range.
    for (var i = m_lower; i <= m_upper; i++)
      yield return m_pos[i];
  }

  IEnumerator IEnumerable.GetEnumerator()
  {
    return GetEnumerator();
  }

  private int Compare(string w, int i)
  {
    // Comparison takes into account maximum length(w) 
    // characters. For example, strings "ab" and "abc" 
    // are thus considered equal.
    return String.Compare(w, 0, m_text, m_pos[i], w.Length);
  }

  private int Search(string w, int bound)
  {
    // Depending on bound value binary search results 
    // in either lower or upper boundary.
    int x = m_lower - 1, y = m_upper + 1;
    if (Compare(w, m_lower) < 0)
      return x;
    if (Compare(w, m_upper) > 0)
      return y;
    while (y - x > 1)
    {
      var m = (x + y)/2;
      // If bound equals to 0 left boundary andvances to median 
      // only // if subject is strictly greater than median and 
      // thus search results in lower bound (position that 
      // preceeds first suffix equal to or greater than 
      // subject w). Otherwise search results in upper bound 
      // (position that preceeds fisrt suffix that is greater 
      // than subject).
      if (Compare(w, m) > bound)
        x = m;
      else
        y = m;
    }
    return x;
  }
}

This implementation is simple (it has O(n^2 log n) complexity to sort and O(m log n) to search where n stands for text length and m for pattern length) and can be improved. It doesn’t take into account the fact that suffixes not arbitrary strings are sorted. On the other hand suffixes may share common prefixes and that may be used to speed up  binary search.

Here an example of narrowing the search.

var str = ...;
var sa = SuffixArray.Build(str);
string pat;
while ((pat = Console.ReadLine()) != String.Empty)
{
  sa = sa.Search(pat);
  foreach (var pos in sa)
  {
    Console.WriteLine(str.Substring(pos));
  }
}

Happy Easter, folks!

No comments: