
2009-05-05

Horspool's Algorithm

Today, another algorithm. Admittedly, this one is not likely to be of much use to many programmers as an actual implementation. Horspool's algorithm, the subject of today's post, is nothing more than a pretty efficient way to search a string for a pattern. Obviously, unless your project is to write a new programming language API, you already have such facilities ready at hand. In fact, I'd be willing to bet that whatever API you're using has an algorithm similar to this one doing the job.

The point here is that the purpose of these exercises is not to reinvent the wheel. It is to understand how things work, under the hood so to speak. I, for one, like that. I feel that it is the sign of a good programmer to have that sort of curiosity, to know not just how to do things but how things work. For my own selfish reasons, I find that I never really understand something until I write about it. So, acknowledging that this post might not have a huge audience, at least I'll get something out of it.

Anyway, about Horspool's algorithm. Obviously the simplest way to do a string pattern search is just to start at the beginning of both string and pattern and do a character-by-character comparison. Then, if a mismatch is found at any character, shift the pattern one position forward in the string and do the comparison again. Really, this is probably efficient enough for most purposes, should a programmer actually need to implement such an algorithm for some reason. For large searches through many large strings, though, something better might be worth it.
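To make the comparison concrete, here is a minimal sketch of that brute-force search in Ruby. This is my own illustration, not part of the Horspool implementation below; the method name brute_force_find is just something I made up for this post. Like my Horspool version further down, it returns the first match position or -1.

def brute_force_find(str, pattern)
  #Slide the pattern forward one position at a time
  (0..str.size - pattern.size).each do |i|
    j = 0
    j += 1 while j < pattern.size && str[i + j] == pattern[j]
    return i if j == pattern.size #every character matched at this alignment
  end
  -1 #no match
end

#brute_force_find("ericzetterbaum", "zett") # => 4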

Obviously (hopefully) I'm not going to be so foolish as to post such a trivial algorithm as the simple version described above, so there is an optimization, Horspool's that is. (One day I'll have an algorithm named after me.) First, Horspool's starts by aligning the string and pattern as described above. Then it does its comparison starting at the end of the pattern instead of the start. Now, the key insight is that when a mismatch is found the pattern can often be shifted by more than a single character. For instance, if the final character, the first checked, does not match and also is not contained anywhere in the pattern, then the pattern can be shifted forward in the string by its whole length. If, on the other hand, there is a mismatch but the character does exist elsewhere in the pattern, then the pattern can be shifted forward just far enough to align the rightmost occurrence of that character in the pattern with the string character just checked. Furthermore, this is true of a character mismatch anywhere in the string: the exact same shifting rule applies.

The algorithm continues in this way until either all characters are found to match or the start position of the pattern within the string is such that the end of the pattern would extend beyond the end of the string. In that case a "no match" indicator is returned (-1 in my version).
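As a small worked example (my own, using the same pattern as the test at the bottom of this post), the shift table for the pattern "zett" works out like this under the rule just described:

pattern: "zett"  (length 4)
'z' -> 3   (rightmost occurrence at index 0; 4 - 0 - 1)
'e' -> 2   (index 1; 4 - 1 - 1)
't' -> 1   (rightmost occurrence among the first three characters is at index 2; 4 - 2 - 1)
any other character -> 4 (the full pattern length)

Note that the last position of the pattern is deliberately left out of the table; otherwise the final character would get a shift of 0 and the pattern would never move forward.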

Note that an algorithm like this would not be worth it in code that only does occasional, one-off searches over short strings. The simple algorithm is plenty good enough for that, and in fact the computation of the shift table might erase any potential efficiency gain. For large jobs, though, say a grep over a set of large files, this algorithm could make a substantial difference. Note that the shift table depends only on the pattern, so it can be calculated once and amortized across multiple searches. Of course, as I mentioned before, in the real world we already have grep at our disposal, but I'm sure whatever greppers one uses have an implementation similar to this, or maybe even more efficient versions, which do exist (the Boyer-Moore family), although those are quite a bit more complex and might be considered not worth it even for API vendors. Evidence of this is that Horspool published his algorithm after the more efficient, but similar, version; apparently he saw a need for a simplification.

Anyway, here is my Ruby implementation. I tried to include a good many comments to make the logic easy to follow. Hopefully I succeeded. Enjoy.
class Horspool
  def findIn(str, pattern)
    @pattern=pattern
    #Default shift is the full pattern length, for any character not in the pattern
    @shiftTable=Hash.new(pattern.size)
    calcShift #build the shift table
    i=0 #pos in str
    j=pattern.size-1 #pos in pattern
    shift=pattern.size #initialize shift to maximum possible
    while i+pattern.size-1<str.size #Pattern doesn't extend past end of str
      #Use shift table lookup to shift the precomputed amount minus dist from
      #end of pattern (which is (pattern.size-1)-j); if that's less than the
      #previously computed shift make it the current shift. Only positive
      #shifts are useful; a zero or negative value would stall the search,
      #so those are ignored.
      newShift=@shiftTable[str[i+j,1]]-((pattern.size-1)-j)
      shift=newShift if newShift>0 && newShift<shift
      if str[i+j]==pattern[j] #str/pattern char match
        j-=1 #moving backward through pattern
        if j<0 #Made it all the way through pattern, must match
          return i
        end
      else
        i+=shift #Start comparing at new pos in str
        j=pattern.size-1 #reset j to end of pattern
        shift=pattern.size
      end
    end
    return -1 #No match
  end

  private

  def calcShift
    #Shift is the distance from the end of the pattern minus 1 for chars in
    #the pattern (the last position is excluded so its character keeps a
    #non-zero shift); anything else falls back to the hash default of the
    #full pattern length.
    for i in 0...@pattern.size-1
      @shiftTable[@pattern[i,1]]=(@pattern.size-i)-1
    end
  end
end

if __FILE__ == $0
  print Horspool.new.findIn("ericzetterbaum","zett")
end
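For a quick sanity check beyond the single test above, here are a couple of calls and the results I'd expect (hand-traced, so treat them as illustrative rather than gospel):

searcher = Horspool.new
searcher.findIn("ericzetterbaum", "zett") # => 4, "zett" starts at index 4
searcher.findIn("ericzetterbaum", "xyz")  # => -1, no match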

2009-04-25

Floyd's Algorithm (Ruby)

I've got another algorithm post today. My latest is Floyd's algorithm, again in Ruby. My ultimate goal is to publish a large library of algorithms in Java, Ruby and C. Gonna have to really get moving on that project. I'll have the Java version of this guy shortly.

Floyd's algorithm is very closely related to Warshall's (which I previously presented here and here). Indeed it's sometimes called the Floyd-Warshall algorithm, since Floyd apparently got the idea for it from Warshall's algorithm. The problem it solves is also often called the all-pairs shortest-paths problem, which is the more descriptive name.

Floyd's algorithm is more commonly seen in the literature (a Google search on Warshall in fact brings up this algorithm above Warshall's own). The reason for this is that it's useful in more applications. One often wants to know how far it is to something, say a network resource, rather than just whether or not one can get there.

The way Floyd's algorithm works is this. We start with a weight matrix of a directed graph, which is like the adjacency matrix except that it contains "weights" of edges representing distances between nodes. As with Warshall's algorithm, one then iterates through intermediate nodes, call them k, to find the distance of the path between each node pair i,j through k. If it is smaller than the distance already calculated through a previous k, and smaller than the weight (distance) of any edge connecting i to j, then the current distance in the distance matrix is replaced with the new, shorter, distance. Note that in my version I initialize the distance matrix with the weight matrix, so we always compare the path distances with, at least, the direct distance between the nodes. We might call this k=0 (or maybe k=-1 would be better given my 0-indexed arrays).

One thing I didn't note in my discussion of Warshall's algorithm is that both these algorithms are examples of dynamic programming. In dynamic programming one builds a final result from intermediate results. In this case each intermediate result corresponds to an intermediate node, k. Since we iterate through each k, we build up successively all possible paths from all nodes using all possible intermediate nodes, thus comparing all paths between all nodes to find the smallest.
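Written out as the dynamic-programming recurrence (my notation, with D being the distance matrix updated in place), the update that the inner loop performs for each intermediate node k is just:

D[i][j] = min( D[i][j], D[i][k] + D[k][j] )   for every pair i, j

Each pass over k answers the question: does routing through k beat the best path found so far using only the previously considered intermediate nodes?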

OK, so here's the algorithm. Enjoy. (Oh, I included a randomizer again, so it's easy to generate lots of examples.)
class Floyd #aka all-pairs shortest paths algorithm
  def initialize(weightMatrix=nil,oo=256) #256 is just an arbitrary "big enough" default
    @weightMatrix=weightMatrix
    @oo=oo #needed to set a weight considered "infinite"
  end

  attr_accessor :oo

  attr_accessor :weightMatrix

  def getDistanceMatrix
    numNodes=@weightMatrix[0].length
    #Deep-copy the rows so the caller's weight matrix isn't modified in place
    distanceMatrix=@weightMatrix.map{|row| row.dup}

    for k in 0...numNodes
      for i in 0...numNodes
        #Optimization: if no path from i to k, no need to test k to j
        if distanceMatrix[i][k]<@oo then
          for j in 0...numNodes
            distance = distanceMatrix[i][k] + distanceMatrix[k][j]
            if distance<distanceMatrix[i][j] then
              distanceMatrix[i][j]=distance
            end
          end
        end
      end
    end
    distanceMatrix
  end

  def showDistanceMatrix
    puts "weight"
    @weightMatrix.each{|c| print c.to_s.gsub(@oo.to_s,"oo"),"\n"}
    puts "distance"
    getDistanceMatrix.each{|c| print c.to_s.gsub(@oo.to_s,"oo"),"\n"}
  end

end

#tests:
if __FILE__ == $0
  floyd=Floyd.new
  oo=floyd.oo
  floyd.weightMatrix=[[0,oo,3,oo],[2,0,oo,oo],[oo,7,0,1],[6,oo,oo,0]]
  floyd.showDistanceMatrix
  dim=10 #randomize out a big graph
  weightMatrix=Array.new(dim){Array.new(dim)}
  for i in 0...dim do
    for j in 0...dim do
      if i==j then
        weightMatrix[i][j]=0
      elsif rand(dim/2)==0 then
        weightMatrix[i][j]=rand(9)+1
      else
        weightMatrix[i][j]=oo
      end
    end
  end
  floyd.weightMatrix=weightMatrix
  floyd.showDistanceMatrix
end
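For what it's worth, I worked the small hand-built graph above out by hand; the distance matrix that should come out of it (ignoring the exact print formatting) is:

weight                     distance
[ 0, oo,  3, oo]           [ 0, 10,  3,  4]
[ 2,  0, oo, oo]           [ 2,  0,  5,  6]
[oo,  7,  0,  1]           [ 7,  7,  0,  1]
[ 6, oo, oo,  0]           [ 6, 16,  9,  0]

For example, the 10 in row 0, column 1 is the path 0 -> 2 -> 1 (3 + 7), and the 7 in row 2, column 0 is 2 -> 3 -> 0 (1 + 6).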

2009-04-19

Warshall's Algorithm in Ruby

Among my current projects is to learn Ruby. Although my background is almost entirely in statically typed languages, I feel that to be a complete programmer I also need to be comfortable in a dynamically typed language. Ruby is definitely interesting. I'm not sure I'm sold on the value of dynamic typing. It seems to deny one the opportunity of letting the language itself help with the programming task, not to mention being much less amenable to static analysis and the like. Still, one might make the case that most of the problems static typing is designed to prevent are ones good programmers shouldn't make very often anyway, and that one can simply code faster in dynamically typed languages since one doesn't need to worry about such minor issues as... knowing the type at design time. Ruby definitely allows one to be very terse with the syntax and do complex things simply, and hence program fast. And we all want to program fast, I'm sure.

Anyway, this post wasn't really meant to be a discussion of the relative merits of static vs. dynamic typing, or even of why Ruby is cool (although my example below does illustrate the terseness of Ruby relative to Java). It was meant to present my first program in Ruby, a translation of my Java Warshall's algorithm implementation. Actually it's not a direct translation, since I added a small optimization. I realized that one doesn't need to bother with the j loop if (i,k)=0: if you can't get from i to k at all, you definitely can't get from i to j through k. This needs to be added to the Java implementation, which I'll get around to eventually. Note that this improvement changes the algorithm's best-case time complexity (a completely unconnected graph) to Θ(n²). Average and worst cases remain Θ(n³).

I don't have a test case for this since I haven't yet learned how to do test cases in Ruby. OK, I'm sure I could rig one up myself, but I don't see the point. Instead I just have a script at the bottom that displays the path matrix for various adjacency matrices. I also added a little random adjacency matrix generator so that one can see more examples of the algorithm at work. One of the interesting parts of that was choosing reasonable edge densities to get a nice number of paths (using n/2 as the randomizer's cutoff seems to work well).

So, here's the algorithm:

class Warshall
  def initialize(adjacencyMatrix)
    @adjacencyMatrix=adjacencyMatrix
  end

  def getPathMatrix
    numNodes=@adjacencyMatrix[0].length
    #Deep-copy the rows so the caller's adjacency matrix isn't modified in place
    pathMatrix=@adjacencyMatrix.map{|row| row.dup}
    for k in 0...numNodes
      for i in 0...numNodes
        #Optimization: if no path from i to k, no need to test k to j
        if pathMatrix[i][k]==1 then
          for j in 0...numNodes
            if pathMatrix[k][j]==1 then
              pathMatrix[i][j]=1
            end
          end
        end
      end
    end
    return pathMatrix
  end

  def showPathMatrix
    puts "adjacency"
    @adjacencyMatrix.each{|c| print c,"\n"}
    puts "path"
    getPathMatrix.each{|c| print c,"\n"}
  end

end
#tests:
if __FILE__ == $0
  adjMatrix = [[0,1],[1,0]]
  Warshall.new(adjMatrix).showPathMatrix
  adjMatrix = [[0,1,0],[0,0,1],[0,0,0]]
  Warshall.new(adjMatrix).showPathMatrix
  adjMatrix = [[0,0,0,1],[1,0,1,1],[1,0,0,1],[1,0,1,0]]
  Warshall.new(adjMatrix).showPathMatrix
  dim=10 #randomize out a big graph
  adjMatrix=Array.new(dim){Array.new(dim)}
  for i in 0...dim do
    for j in 0...dim do
      if rand(dim/2)==0 then
        adjMatrix[i][j]=1
      else
        adjMatrix[i][j]=0
      end
    end
  end
  Warshall.new(adjMatrix).showPathMatrix
end #tests

2009-04-10

Warshall's Algorithm in Java

For fun (I have a nerdy definition of fun, I admit) I decided the other day, hanging out at Peet's Coffee, to bang out Warshall's algorithm in Java. Ah, the life of an unemployed programmer.

Before I present it, I suppose I should explain a little bit about what Warshall's algorithm does, in case there are some readers who have never heard of it (like me less than a year ago). I'm not going to go deep, it being a well established and easily Googleable subject. For instance, this page does a better job than I'm likely to do in any case.

Here it is in a nutshell. Given a directed graph, find out which nodes have paths between them. This is also known as the transitive closure of the graph, for those who prefer the more impressive-sounding mathematical terminology. For Warshall's algorithm one starts with the adjacency matrix of the graph, which is a matrix in which every pair of nodes with a direct connection (an edge) between them is marked with a 1, and every other pair of nodes with a 0. Warshall's algorithm generates from this data a new matrix with a 1 for all nodes that are connected by a path, 0 for all others. By "path" I mean that by traveling from node to adjacent node (those with an edge between them) one can arrive at the other node in a pair of nodes. Since this is a directed graph one must follow the direction of the edge in question. Think of it as a game of catch where each player only throws to certain other players. The adjacency matrix would identify which players throw to specific other players: a 1 in row 1, column 2 if player 1 throws to player 2, a 0 otherwise. It's quite possible, based on this scenario, that some players never get the ball thrown to them. A fair PE teacher could implement Warshall's algorithm to identify that situation. If only more PE teachers understood algorithms, the world would be a more equitable place. (One interesting side note: row 1, column 1 will always be 0 in the adjacency matrix, since a player never throws the ball to himself, but it could be 1 in the path matrix since hopefully the player can get the ball back... or maybe not!)

The way Warshall's algorithm works is like this. Enumerating through each node numbered by k, starting with k=0 (i.e. node #0), decide which node pairs i,j have paths through node k. This can be determined by checking whether or not a path from i to k and a path from k to j already exist in the path matrix, which was initialized as the adjacency matrix. If both do, then the path from i to j can be added to the path matrix. Our paths will thus grow organically as we iterate through k. When we're done we will have every path identified in our matrix, since every path with every intermediate node will have been identified.
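As a tiny worked example (it's the same 3-node graph used in the second test case below): with edges 0 -> 1 and 1 -> 2, the adjacency matrix on the left produces the path matrix on the right, the only new entry being the 1 that records the path 0 -> 1 -> 2 discovered when k=1:

adjacency          path
[0, 1, 0]          [0, 1, 1]
[0, 0, 1]          [0, 0, 1]
[0, 0, 0]          [0, 0, 0]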

Anyway, now that I've probably confused everybody (Google "directed graph" to become un-befuddled) here's the Java code. I've also included a test case class which I think helps to understand it. So, without further ado:

package com.blogspot.ezetter.algorithms;

public class Warshall {
  public byte[][] getPathMatrix(byte [][] adjacencyMatrix) {
    int numberOfNodes=adjacencyMatrix[0].length;
    byte [][] pathMatrix = new byte[numberOfNodes][numberOfNodes];
    // Copy row contents (not just row references) so the caller's
    // adjacency matrix isn't modified as the path matrix is filled in.
    for(int i=0;i<numberOfNodes; i++) {
      System.arraycopy(adjacencyMatrix[i],0,pathMatrix[i],0,numberOfNodes);
    }
    for(int k=0;k<numberOfNodes; k++) {
      for(int i=0;i<numberOfNodes; i++) {
        if (pathMatrix[i][k]==0){
          continue;
        }
        for(int j=0;j<numberOfNodes; j++) {
          if(pathMatrix[i][k]==1 &&
              pathMatrix[k][j]==1) {
            pathMatrix[i][j]=1;
          }
        }
      }
    }
    return pathMatrix;
  }
}
And the test case:
package com.blogspot.ezetter.algorithms;
import junit.framework.TestCase;

public class WarshallTests extends TestCase {
  private byte[][] adjacencyMatrix =
    {{0,0,0,1},{1,0,1,1},{1,0,0,1},{1,0,1,0}};
  private byte[][] expectedPathMatrix =
    {{1,0,1,1},{1,0,1,1},{1,0,1,1},{1,0,1,1}};
  private byte[][] adjacencyMatrix2 =
    {{0,1,0},{0,0,1},{0,0,0}};
  private byte[][] expectedPathMatrix2 =
    {{0,1,1},{0,0,1},{0,0,0}};

  public void testGetPathMatrix() {
    byte[][] pathMatrix = new Warshall().getPathMatrix(adjacencyMatrix);
    for(int i=0;i<adjacencyMatrix.length;i++) {
      for(int j=0;j<adjacencyMatrix.length;j++) {
        assertEquals("Path matrix at i=" + i + ", j=" + j + ":",
          expectedPathMatrix[i][j],pathMatrix[i][j]);
      }
    }
    pathMatrix = new Warshall().getPathMatrix(adjacencyMatrix2);
    for(int i=0;i<adjacencyMatrix2.length;i++) {
      for(int j=0;j<adjacencyMatrix2.length;j++) {
        assertEquals("Path matrix at i=" + i + ", j=" + j + ":",
          expectedPathMatrix2[i][j],pathMatrix[i][j]);
      }
    }
  }
}

2009-04-05

Halting "Problem" Redux

A while back I posted, as an April Fools joke, a Java version of the Turing halting problem. I recently learned that for Turing, halting wasn't a "problem" at all, quite the opposite. The term was rather added by a later commentator (whose name I will post as soon as I look it up) who wanted to express the theorem in modern-day computing terms, where infinite loops are generally considered bad things. For Turing, whose purpose was to determine the limits of decidability, that was no problem, and in fact programs that didn't halt were considered "good" programs. If an algorithm can keep indefinitely chugging away at a problem and give an answer, assuming infinite time, then said problem is "decidable". Computing pi, the square root of 2, 1/3... those are all decidable problems. The reason we can't get an exact answer is simply not having enough time (i.e. not having infinite time) to do so. It's not any fundamental problem with the calculation. The halting problem, on the other hand, is inherently "undecidable" since it embodies a paradox.

When Turing first posited the Universal Machine it was in order to construct an example of an undecidable problem: a Universal Machine can't determine whether or not it will halt. His purpose at the time was not to invent the computer. That use of the formulation was applied later (only a little later) by many commentators, including Turing himself. So the invention of the Universal Turing Machine, of which all computers and programming languages are examples, was, like so many other great discoveries, an accident.

An accident that changed everything.

*Actually Turing's presentation was even more esoteric than that. He postulated that every machine could be designated by a number that described it (not so far-fetched, since real computer programs are stored as binary numbers: computer programs are ultimately just numbers). So Turing was proving that no number can exist that determines whether a number is itself "satisfactory", i.e. whether or not it can be computed. Remember, this was a mathematical, not a practical, problem. As fate would have it, math problems have a tendency to become practical.

2009-04-03

Proof of Euclid's GCD Algorithm

In an advanced algorithms course I recently took, the first algorithm covered, as I imagine is the case in many such courses, was Euclid's greatest common divisor method. Euclid's gcd is often considered the earliest known algorithm. Does that make Euclid the first programmer? (Note: Euclid's method, invented over 2000 years ago, is still very efficient even by today's standards and is enormously more powerful than the method you probably learned in middle school.) Euclid's gcd is based on the following observation:

gcd(m,n) = gcd(n, m mod n)

This equality is repeatedly applied until a final condition, gcd(m, 0), is reached, in which case m is taken to be the gcd of the original m and n.
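Since the rest of the posts here come with code, here is a minimal Ruby sketch of the method (my own illustration; the name euclid_gcd is just something I made up for this post):

def euclid_gcd(m, n)
  #Repeatedly apply gcd(m, n) = gcd(n, m mod n) until n reaches 0
  while n != 0
    m, n = n, m % n
  end
  m
end

#euclid_gcd(252, 105)
#  252 mod 105 = 42, 105 mod 42 = 21, 42 mod 21 = 0  =>  21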

Now, although we weren't asked to prove this for my course, at the time I was inspired to attempt it. After lengthy pondering I wasn't able to and eventually gave up, disappointed. Recently I revisited the problem and was able to see the solution pretty quickly. I guess my math skills are improving (or so I like to think). Below is my proof. You will have to take my word for it that I didn't Google it (I didn't, I swear).

Starting with the final condition, which is pretty trivial: if m mod n = 0 then by definition m is divisible by n, so n is the gcd. Next we need to show that we actually get to that final condition, i.e. that we don't get stuck at m mod n != 0. Again, this is pretty simple: m mod n < n, necessarily. Since we keep feeding m mod n into the procedure as the new n, n gets smaller each iteration. Since m mod 1 = 0 for all m, we must reach 0 eventually.

The hard part for me before was proving the general equality, gcd(m,n) = gcd(n, m mod n). I now find this pretty simple too and am even a bit embarrassed that I didn't come up with it before. Observe that if m and n are divisible by x then m-n is also divisible by x: (m-n)/x = (m/x)-(n/x). Likewise, m-an is also divisible by x for any a. Since m mod n is by definition m-an for the largest a where m-an >= 0, it follows that any common divisor of m and n is also a common divisor of n and m mod n. Conversely, since m = an + (m mod n), any common divisor of n and m mod n also divides m. The pairs (m, n) and (n, m mod n) therefore have exactly the same common divisors, so in particular they have the same greatest common divisor.

QED

If any reader sees a flaw in my reasoning, please reply. I really want to be sure of this.

2009-04-02

My April Fools Program

I meant to post this yesterday but, oldest excuse in the book, got busy. Anyway, not wanting to waste the opportunity and not willing to wait a year I'm posting it now. It is an example using a new feature of Java reflection. Here goes:
package april.fools;

public class Thp {

  Thp(Object input) {
    while(Class.doesHalt(this.getClass(),input));
  }

  public static void main(String [] args) {
    try {
      new Thp(Class.forName("april.fools.Thp"));
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
    }
  }
}
I hope you appreciate it.