Saturday, September 12, 2009

Purpose of hashCode

Have you ever thought how hashing is implemented in Java. When we call put(key, value) for storage and get(key) for reterival, this get and put API's are from HashMap/HashTable Java classes. Dive one more level, these are classical data structure, where main advantage of using them is will get O(1) or O(constant) time complexity during storage and retreival. So how it works, mapping (indexing) your key to a bucket (storage area) which stores value. Now how this will be done in an Object oriented programing language like Java where your keys can itself be object. While the Java language does not provide direct support for associative arrays -- arrays that can take object as an index. :). The presence of the hashCode() method in the root Object class clearly anticipates the use of HashMap/HashTable. This supports hashing directly in the object model, facilitates the development and use of hash-based containers.

Consider default hashCode() which is provided by Object class in Java, this method uses the 32-bit internal JVM address of the object as hashCode. But hashCode methods returns integer, on JVM it is stored as hex. Now consider HashMap as array of arrays (or combination of arrays and linked list) . When we store a key/value pair for a class a specific subarray id is utilized. Those arrays are pointed by these addresses which are in hex form and when we access any element JVM uses the same hashCode or hex to refer where the values are stored. In hashMap, hashTable keys are used to compute hashCode.

We can conclude that all objects are bind to a memory location on heap, so JVM will create 32-bit hashCode for all objects and returns to us in the form interger memory location.
We will see more by diving. Currently I will put some thoughts around, which will help to understand more why we have to overrride hashCode from Object class.

1) Is hashCode a unique value for all Objects. If its a memory address ?
2) What happens to the hashCode, in case garbage collection shifts or swaps memroy addresses.

HashCode is not a unique identifier, as there can be two objects of same class which are equal and they should return same hashCode. This kind of restriciton is forced when we Override hashCode and equals to support efficient insertion and efficient retreival. We can also return a constant value from hashCode implementaiton. We will see this by an example further.

If the object is moved in memory during garbage collection, hashCode stays constant, this default hashCode is not very useful since to lookup an object in a hashMap we need the exact same key Object by which the key/value pair orignally filled. Normally, when we go to look up we don't have original key Object itself, we have just have same data for a key.

So if we have key as String or Wrapper classes we don't have to implement hashCode, because hashCode implementation for these classes returns thier exact value in hashCode (check thier implementation to get a hang on it), If lookup (retreival) passes equals checks it will hash to correct bucket, where these values are stored.

Define now the purpose of hashCode:-
This method is used to get the hash value(an integer) of an object. What is the purpose of this hashcode/number ? This number is used to place/retrieve the object in its corresponding bucket (Used in hashing algorithms used by HashTable & Hashmap datastructures) .

What happens if hashcode () returns different value for the same object on the same run. Object will be placed in the hashtable and it is not possible to find it. Because hashcode of the key is used to place the object in the bucket. If while placing hashcode is 1 it will be placed in bucket_1. If it returns 2 while retrieval search for an object will be in bucket_2 instead of bucket_1 where it is actually placed. And hence search will result in empty(or null).

But this is not a problem between runs. First time object's hashcode would be 10 in the first run and 67 in the second run. This difference doesn't affect. But during same run the value should be always same.

No comments:

Post a Comment