The point here is that the purpose of these exercises are not to reinvent the wheel. They are to understand how things work, under the hood so to speak. I, for one, like that. I feel that it is the sign of a good programmer to have that sort of curiosity, to not just know how to do things but how things work. For my own selfish reasons, I find that I never really understand something until I write about it. So, acknowledging that this post might not have a huge audience, at least I'll get something out of it.
Anyway, about Horspool's algorithm. Obviously the simplest way to do a string pattern search is just to start at the beginning of both string and pattern and do a character by character comparison. Then, if a mismatch is found in any character, shift the pattern one position forward in the string and do the comparison again. Really, this is probably efficient enough for most purposes, should a programmer actually need to implement such an algorithm for some reason. For large searches through many large strings something better might be worth it, though.
Obviously (hopefully) I'm not going to be so foolish as to post such a trivial algorithm as the simple version described above, so there is an optimization, Horspool's that is. (One day I'll have an algorithm named after me.) Firstly, Horspool's starts by aligning the string and pattern as described above. Then it does it's comparison starting at the end of the pattern instead of the start. Now, the key insight is that if a mismatch is found the pattern can be shifted more than a single character. For instance if the final character, the first checked, does not match and also is not contained anywhere in the pattern then the pattern can be shifted the whole length of the pattern forward in the string. If, on the other hand, there is a mismatch but the character does exist elsewhere in the pattern then the pattern can be shifted forward the distance necessary to align that character in the pattern with the string character just checked. Furthermore this is true of any character mismatch anywhere in the string. The exact same shifting rule applies.
The algorithm will continue in this way until either all characters are found to match or the start position of the pattern within the string is such that the end of the pattern would be beyond the end of the string. in that case a "no match" indicator is returned (-1 in my version).
Note that an algorithm like this would not be worth it in code that just does low frequency single character comparisons. The simple algorithm is plenty good enough for that. In fact the computation of the shift table might erase any potential efficiency gain. For large comparisons, though, say a grep on a set of large files this algorithm could make a substantial difference. Note that the shift table is calculated only once and so can be amortized across multiple searches. Of course, as I mentioned before, in the real world we already have grep at our disposal but I'm sure whatever greppers one uses have a similar implementation to this (or maybe even more efficient versions, which exist, although those are quite a bit more complex and might be considered not worth it even for api vendors. Evidence of this is that Horpool published his algorithm after the more efficient, but similar, version. Apparently he saw a need for a simplification).
Anyway, here is my Ruby implementation. I tried to include a good many comments to make the logic easy to follow. Hopefully I succeeded. Enjoy.
class Horspool
def findIn(str, pattern)
@shiftTable=Hash.new(nil) #will contain shift amounts for all valid chars
@pattern=pattern
calcShift #buld the shift table
i=0 #pos in str
j=pattern.size-1 #pos in pattern
shift=pattern.size #initialize shift to maximum posible
while i+pattern.size-1<str.size #Pattern doesn't extend past end of str
#Use shift table lookup to shift the precomputed amount minus dist from
#end of pattern (which is (pattern.size-1)-j), if that's less then
#previously computed shift make it the current shift
newShift=@shiftTable[str[i+j,1]]-((pattern.size-1)-j)
(newShift<shift)?shift=newShift:nil
if str[i+j]==pattern[j] #str/pattern char match
j-=1 #moving backward through pattern
if j<0 #Made it all the way through pattern, must match
return i
end
else
i+=shift #Start comparing at new post in str
j=pattern.size-1 #reset j to end of pattern
shift=pattern.size
end
end
return -1 #No match
end
private
def calcShift
#Precompute to shift whole pattern length for chars not in pattern
for i in 'a'..'z'
@shiftTable[i]=@pattern.size
end
#Shift distance from end of pattern-1 for chars in pattern
for i in 0... @pattern.size-1
@shiftTable[@pattern[i,1]]=(@pattern.size-i)-1
end
end
end
if __FILE__ == $0
print Horspool.new.findIn("ericzetterbaum","zett")
end
No comments:
Post a Comment