Ever wonder what a subset is, how it is different than a list, or why they are important and how we can teach computers to look for subsets quickly?
Sub what?
Subsets are very easy to understand, and we all use them every day, even if you don’t know that you do. However, before we can talk about what a subset is, it is important to know what a normal set and a list are first.
A set is just a collection of things that has no intended order. For instance, a recipe has a set of ingredients that you need, your child may have a set of action figures, or you may have just purchased a dinnerware set on Amazon for your new kitchen. All these things have 2 things in common. One, they are made up of stuff (ingredients, action figures, plates, bowls and cups) and two, they don’t have any sort of order. You probably don’t care which plate your grab when you want to eat the last piece of apple pie.
A list is just like a set in that it is a collection of things. However, a list does have order. For instance, a recipe has a list of instructions, you may have an appointment to-do list on Saturday, or you may have a list of your favorite movies. These things are not sets because the order in which they come matters. You can’t start cooking your food until you’ve mixed in all the ingredients or go to your 3:00 PM appointment before your 11:00 AM appointment.
A subset is just a set that is a part of another set: a set that has the same things in it as another set. For instance, if have a set of students in a class such as “Fred”, “Craig” and “Anastasia”, then the set of students that have the letter “A” in their name (“Craig” and “Anastasia”) would be a subset of the whole class.
Why Do I Care?
Sets and subsets are important to us and computers in many ways. Almost every computer program or website you use has sets in it in some shape or form. A set could be the choices you have in reacting to a post on Facebook to the current YouTube channel subscriptions you are signed up for. Subsets are often filtered down or searched for results, and we need them to help find what we are looking for. When searching for socks on Amazon, we probably don’t want to have to look through all the pants as well.
However, getting a subset from a set is a lot easier than checking to see if a set is a subset of another set. For instance, getting a list of You Tube videos based on your subscriptions is a lot easier than trying to find all the videos whose tags match what you are searching for.
Another example of subset checking in action is with a thesaurus. Let’s say you are trying to create your own thesaurus, but you have multiple people working on it at the same time. You want make sure that you don’t have duplicate entries in your thesaurus, so you need to check whenever one of your workers adds a new entry in the thesaurus whether that entry has already been added. You could check each thesaurus entry to see if it has all the words already in the new entry, but that would mean that you would need to check if every thesaurus entry contained every word from your new entry. Also, at a rudimentary level, you would need to see if each word was the same by checking to see if each letter was the same! That means if you have a thesaurus with 100,000 entries in it, and on average 5 words in each thesaurus entry and 7 letters in each word, it would take you 17,000,000 calculations just to add a new entry to your thesaurus (100,000 entries * 5 words in each entry * 5 words in new entry * 7 letters in each word).
A Bird’s Eye Solution
There is a way to check for subsets much faster than this. First, assign a number to each word in our thesaurus. For example, “cool” = 1, “cold” = 2, “chilly” = 3, “warm” = 4, “hot” = 5. In this set, I think that everyone can see that col, cold and chilly are synonyms and warm and hot are synonyms. After you give every word a number, get the product for each set. Cool * cold * chilly = 6 and warm * hot = 20. Then, check the products to see if one divides the other (e.g. 3 divides 6, but not 3 does not divide 7 as 7 / 3 has a remainder of 1). If they don’t divide each other, then they are not subsets. 6 does not divide 20 and 20 does not divide 6, so we know that one is not a subset of the other.
Computers are fast at multiplying and dividing, so by doing this, instead of having to do seventeen million calculations in the example above, we would only need to do one hundred thousand calculations!
The Technical Solution
For those of you who are programmers out there, you know that there are already ways to check for subsets. The most obvious which is outlined in the code below.
Click here to view code.
This works fairly well since we are using a Hash Set. The “contains” method is an O(1) order operation, but we still need to loop through each element at least once in the “set” variable making it an 0(N) operation. This can take some time if we have a list of 100,000 sets and we want to remove any duplicates (sets that are equal or subset). This becomes an O(N*M^2) operation.
Click here to view code.
Now, we can make this quite a bit better by sorting the list of sets first, and only checking against previously checked sets. This becomes an O(N*M*Log(M)) operation.
Click here to view code.
We can make this even better by creating a new Hash Set data structure specifically designed to check for subsets. It can perform the IsSubset check in O(1) time instead of O(N)! It does this by taking the product of the first 9 to 12 bits of the Hash Code of each element split across three long variables. Then, when checking if set A is a subset of set B, it checks to see if the products of A divide the products of B. If it doesn’t we know that A is not a subset of B. If it does divide, then we still need to check each element as we are only taking the first 9 to 12 bits of the Hash Code and there could be collisions. Since most subset checks result in false, this provides a substantial speed up for equality checks that take a bit of time like strings.
There are a few limitations to this implementation. Unsigned long variables are not infinite in size, so the worst-case scenario means that we can only have 21 elements in our set before the products overflow. If overflow happens, then the product can’t be used to check subsets. On the other hand, 9 to 12 bits of each element only holds so much entropy. At most we can only get 4096 possible combinations of elements. The more entropy you get for each element, the less number of elements can exist in a set before overflow and vice versa. When creating a Hash Set with this implementation, you may choose an average set size. If possible, choose the smallest average size as it will provide more entropy and less collisions.
Click here to view code.
_______________
Joseph Van Den Berg, Interstates Application Developer