Algorithmic Analysis of an IP Address to ASN Map Function for Hadoop
Posted on Wednesday, January 19, 2011 by Adam Pridgen
This blog post is a departure from the traditional security theme, and it ventures into a little bit of applied algorithm analysis for efficiency gains. Given the lessons we learned from it, we felt it could be a benefit to others and motivate why algorithm analysis should be performed from time to time. Our lesson stems from some work I am doing with Hadoop over the weekend. I ran into an interesting problem mapping IP addresses to Autonomous System Numbers (ASN). The problem I ran into entailed searching through 170K+ ASN records to correctly associate an IP Address from my dataset to the right ASN. The problem was poor performance on each search, but after some basic algorithmic analysis, I reduced the search space for the ASN down to 32 iterations per IP address. This blog post will show some of my code, discuss some basic algorithmic analysis, and explain how I got the improved performance.
First lets show the original implementation in Java pseudo-code:
// ASNInfo class
class ASNInfo {
private Integer asn;
private SubnetInfo subnetInfo;
public ASNInfo(String cidr, String asn) {
subnetInfo = new SubnetUtils(cidr).getInfo();
this.asn = new Integer(asn);
}
Boolean inAsn(String ipaddress) {
return subnetInfo.isInRange(ipaddress);
}
Integer getAsn() {
return asn;
}
}
// 1. Bad data structure choice
// Declared somewhere in my code as a global variable
ArrayList ASN_INFOS = new ArrayList();
// function in question
public static Integer getASN(String ipAddr) {
Iterator asnIter = ASN_INFOS.iterator();
Integer result = new Integer(-1);
// 2. everytime this function is called
// we iterate over every value in the array list
// until we find the correct value
while (asnIter.hasNext()) {
ASNInfo x = asnIter.next();
// 3. Each iteration makes a call.
if (x.inAsn(ipAddr)) {
result = x.getAsn();
break;
}
}
return result;
}
The ASNInfo object is used to map a host IP address to an ASN. While running this, the Map-Reduce process was exceptionally slow. After some rudimentary algorithmic analysis, the effect of the problem should become very apparent. The primary cause of this problem is the choice of a container class. A container class is basically a programming construct that is used to hold references to instantiated objects, and in this case, the container class is an ArrayList. The reason for the poor design choice falls squarely on the intent to get a very basic implementation up and running, no matter how naive. ArrayList‘s are a poor choice because there is no efficient way to look-up elements in these objects without iterating over the entire list. Since there is no efficient search built into the ArrayList, a bad run-time performance arises because (2) we are iterating over thousands of records on each call to getASN. In addition to this iteration, (3) we are making a subsequent call to see if the IP address is in the ASN’s subnet.
Now lets talk about Big-Oh[1] (or run-time analysis) and quantify how egregious the problem becomes. The above function will be called roughly 5 million times for each input file, and we will have roughly 170K ASNInfo objects initialized. If we only had 1 record in the ASN_INFOS ArrayList, we would run in the above loop at least O(5*10^6). One record would be our best case, and we could only achieve this by either having 1 record or finding the right match on the first try every time. This expectation, in this case is not reasonable. Now, realistically, we have around 1.7*10^5 elements in our ASN list, which means we will get 1.7*10^5 elements in the ArrayList. In the worst case, we will need to search this list 1.7*10^5 times, because the worst case means our element is at the very end of the ArrayList. Given our current input file, this places our run-time analysis at O(5*1.7*10^11). 10^11 is an incredible number, and this huge number is bad!
To overcome this issue, lets take a step back and reevaluate the criteria and requirements for the problem. First, given the number of records, a more efficient method of searching or looking-up ASNs needs to be identified. The set back in this regard is that there is no way to directly map an IP address onto an ASN. Reevaluating the problem, the objective is to map a network to an IP address, so why don’t we apply a systematic mask to the IP address and attempt to map that result to an ASN. To do this, we simply mask the int/long representation of an IP address value with a sliding mask until we arrive at a valid network address. As an example, think about how routes are chosen for IPs. Approaching the problem in this manner will produce a more efficient search routine, and there is a hard upper bound on the number of searches performed to get a result, which turns out to be 24 iterations in the worst case. After looking over the ASN file, I would expect the average case to be less than 8-13 iterations, because a majority of the networks in my ASN file fall between the /24 and /19 network addresses. Now let’s revise the code a little.
// 0. Declared somewhere in my code as a global variable
HashMap ASN_INFOS = new HashMap();
// The improved function question
public static Integer getASN(String ipAddr) {
// 1. Convert the IP address to an int/long for easy manipulation
long ipInt = ipToInt(ipAddr);
long mask = 0x0FFFFFFFF,
// start with the Serial network addresses
smask = 0x0FFFFFFFE;
// 2. If the sliding mask or ipInt is 0, we hit the fixed point
while(ipInt != 0 && smask != 0x0FE000000 ){
// 3. Bitwise AND the sliding mask to the IP to get a
// potential subnet
ipInt &= smask;
// 4. if the masked int/long is in the ASN_INFOS HashMap
// then we found the potential
if (ASN_INFOS.containsKey(new Integer((int)ipInt))){
return new Integer(
// grr. blog software
ASN_INFOS.get(new Integer((int)ipInt)).getAsn()
);
}
// 5. shift the mask
smask = (smask << 1) & mask;
}
return new Integer(-1);
}
Above is the code resulting from a more proper approach. Starting at (0), the primary container is changed to a HashMap, where the keys are the Integer representation of the network address, and the items are ASNInfo instances, as shown earlier. A HashMap object is a hash table implementation, mapping a key (e.g. an int) to a specific value (e.g. ASNInfo). Keys must be unique in a hash table, but the values in the table can be arbitrary. Now, (1) whenever the getASN method is called, String representation of the IP address is converted into an int (ipInt), and this will be the basis for my key on a look-up. Another variable worth mentioning is the sliding mask (smask). This mask value is adjusted each iteration with a bit shift left, and the mask will be used to create a probable network address. At (2), there is a check to see if ipInt is zero or smask is 0x0FE000000. The minimum network address is /8 or 0x0FF000000 which is a class A. This condition indicates the search was exhausted without finding a valid result (e.g. 10.0.0.0 should fail), meaning there is no corresponding mapping of a network address for the given IP address. After that check, (3) the IP address and smask are bit wise AND’ed, which will produce a new possible network address. Next, (4) ASN_INFOS is checked to see if it contains the ipInt value. If not, (5) the sliding-bitmask is left-shifted by one bit and the left most bit is truncated off the bit string.
After its all said and done, the loop only needs execute 23 times and the ASN look-up is O(2.3*5*10^7) versus O(5*1.7*10^11) for worst case, which is an optimization by 4 orders of magnitude. Of course, other optimizations can be made in other places and the numbers applied here are conservative estimates. The only other major caveat is that the IP addresses used by the ASN records follow CIDR notation. I am not sure how this will hold up against IPv6 addresses, but I will burn that bridge when I cross it.
Citations:
1. F. Swartz. “Algorithms: Big-Oh Notation.” [Online]. Available HTTP: http://leepoint.net/notes-java/algorithms/big-oh/bigoh.html.

