Tuesday, April 26, 2011

Proposal for Boyer-Moore String Pattern Search Algorithm Optimization

I first encountered the Boyer-Moore string search algorithm in mid-1990s when it was mentioned in a local computer magazine discussing virus detection mechanisms and referred to Cormen's Introduction to Algorithms for more details.

Back then, DOS was king and Microsoft was omnipotent, with computer viruses as nuances in one's computing experience, the holy grail of computer programming enthusiasts in my former university (but my college alma mater is here) was anti-virus programming.

The paradox is that to learn how to create an anti-virus you have to create a virus against which it should work. Mark Ludwig's bookwas a coveted possession but the only one I got my hands on was Ralf Burger's work and I learned how to make a functional computer virus then. To fix the problem, you have to find it (fast!); this is where virus detection becomes critical, and Boyer-Moore string search algorithm is one of the best algorithms known then for this purpose.

What I have here now are some proposals on how to optimize this wonderful algorithm and possible drawbacks (modesty aside, I'm my worst critic).

For basic illustration and comparison, Boyer-Moore search algorithm is illustrated below (we search for the string EDGEKIT as example).

This is an illustration of how Boyer-Moore looks for a string pattern
What's the limitation of the basic Boyer-Moore algorithm? Based on the example above, it took the algorithm thirty (30) comparisons as denoted by the number of colored blocks to find the match for a seven (7) character long string pattern. Based on the illustration above, if a string section (i.e. ADGEKIT) is being compared against the string pattern (EDGEKIT), the search will only fail at the seventh (7th) comparison since the mismatch on the first characted from left of the string section is found.

Illustrated below is my proposal A:

After the comparison of the first character (the rightmost string element), the leftmost string element is then checked for match and only continue comparing the other string elements if leftmost string element match is true
The limitation of checking both ends of the string section is that the mismatch may be in the middle of the string section being compared against. In comparison to the basic Boyer-Moore algorithm, this proposal took 27 comparisons to find the match. The illustration below is my proposal B:

We determine a mid-character in the string pattern (For a seven character string pattern, it would be 3, just divide the length of string by 2). After comparing the rightmost string element and finding match, we check against the character aligned with the mid-character, only if there's a match will the rest of the string elements be compared
The drawback is that having to determine the mid-character before we start searching for the string pattern. On the plus side, this proposal took only 26 comparisons.
Illustrated above is proposal C, this combines the functionality of proposal A and B. If comparison of both ends and mid-character is a positive match, only then can the comparison of the other string elements proceed.
For comparison speed, this algorithm also took 27 comparisons. The major drawback would be the complexity of implementing this proposal, not to mention the limitation inherent to both proposals A and B. If the checking of the leftmost character is to be done only on positive match of the mid-character, the comparison will be less and could be the best optimization approach to the algorithm.

Another idea that came to mind is to use a particular optimization proposal on a case to case basis. For example, based on length of string pattern to search, if the length is 8 characters, we can settle for the standard algorithm, then when the string pattern to search for is 16 characters we use proposal A, then for 32 character length string pattern, we can use proposal B and for 64 character length pattern string to be sought, we can use proposal C.

Another notable paper related to Boyer-Moore string search algorithm can be found on Strother Moore (the co-author of Boyer-Moore algorithm) and Matyas Sustik's UTCS Tech Report TR-07-62.

For current general purpose string pattern search algorithm, the fast string searching with suffix trees algorithm beats Boyer-Moore search algorithm by orders of magnitude, but it's quite complex and from 1996 until now, I'm still a humble researcher on this since I read about it in Dr. Dobb's Journal.

No comments:

Post a Comment