Suppose you are given two sequences (or lists), l1 and l2, and now you would like to compare how much l1 is different from l2. How would you compare them, quantitatively?
There are many metrics to compare two sequences like l1 and l2, though the focuses are not always the same. Some compare the bag of elements contained by each sequence. Others may care more about the oscillation pattern within each sequence. There may even be someone simply compares the length difference of the two sequences.
We will, in the second half of this post, put our explanation in a scenario of natural language processing. However, comparing difference between discrete sequences (instead of vectors) could be a broader headache. You may have compared your Monday course schedule with your friends in another class, and noticed that they are different. Or you may have compared your recent phone call history with some other person’s, and noticed that they are, of course, different. Or you may also have compared sequence of book arrangement on the top level of your bookshelf with some other person’s book arrangement, and noticed that they are different. Many such sequences can be compared, and difference can be seen. But, how do you measure the difference?
Suppose you have had your books arranged in the manner of l1 on the top level of your bookshelf. From left to right, they look like below:
l1 = [A, 3, 8, Y, U, &, 5, 5, H]
Yes, you have only nine books, and among them there is a duplicated ‘5’. These nine books have fully occupied your narrow bookshelf’s top level, and each of them is with the same thickness. This means you could only put nine books on this level.
One day, your friend Lucy visited your study, and noticed your arrangement l1. Lucy was particularly attracted (… well somehow…) by l1. She went back her own home and planned to duplicate l1 in her study. Lucy has an infinitely wide bookshelf top level, but she only wanted to utilize nine slots, because she was to duplicate l1. She took a look at her top level bookshelf, there are some books already arranged there, in sequence l2. Though not necessarily different from l1, l2 looks like this:
l2 = [G, 3, U, Y, Y, H, T, #, 5, F, L, T, 0, K]
Lucy has 14 books on her top level.
Lucy wants to modify her l2 to make it an l1. What should she do?
There are some books {A, 8, &} missing, so she has to buy them beforehand. Particularly, she also needs one more ‘5’, as there are two occurrence of ‘5’ in l1.
Therefore, Lucy bought a bag of books {&:1, 8:1, 5:1, A:1} and leave the bag on her desk. Then she realized that besides collecting missing books, she needs to remove the excessive books {G, Y, T, #, F, L, T, 0, K} from her top level of shelf.
After the above preparation, Lucy started to rearrange the books.
How many steps would she need to rearrange l2 to l1? Or what could be the fastest, most straightforward way? The answer is to take all books off the shelf and put each one back on according to l1. However, by doing that, Lucy would need a desk large enough to accommodate all her 18 books (14 original one + 4 newly bought). She did not want to do that, as she just wanted to use a tiny space on her desk. (please, please don’t ask me why she does not use her ‘infinitely wide bookshelf’ :) This is because she decided to only utilize nine slots, as you had l1.)
Thus, Lucy’s rearranging strategy was to take off as few books as possible, and put on the correct ones back. Therefore, she may need a few more steps to reach the final arrangement same as l1.
Did she make it?
Yes, of course, and there are many solutions. Some take more steps, some take not as many. Some lead 10 books scattered on the desk, some may only require 9 to be there. But, none of the solutions will use fewer than 9 units of space on desk, as after all Lucy has only nine books on the shelf eventually.
We are coming to the end of Lucy’s story with her bookshelf.
Though she made it, and successfully changed l2 to something same as l1, we may want to find her a best solution, that uses least steps, and occupies least space on her desk. This optimal solution indicates the least cumbersome way to change l2 to l1. By looking for this optimal solution, we can say that we found a definition of ‘distance from l2 to l1.’ So it is with ‘distance from l1 to l2’ if we reverse the steps of this optimal solution.
Now, we found a way to measure the distance between two sequences. The pattern is that: the more steps used, the larger the distance would be; the more space used, the larger the distance would be. (However, how to integrate time and space is another issue, we may simply multiply one with another, or we can use sum of space at each time step.)
Hold on, how would we help Lucy to find an optimal solution? Is there a deterministic algorithm?
See you in the next section ><