hashCode()
in a little bit more detail than I had before.Consider the following Java class.
public class Person {
private String firstName;
private String surname;
public Person(String firstName, String surname) {
this.firstName = firstName;
this.surname = surname;
}
public boolean equals(Object other) {
/* proper equals checking firstName and surname */
}
public int hashCode() {
int code = surname.hashCode();
code = 31 * code + firstName.hashCode();
return code;
}
public String getFirstName() { /* getter */ }
public void setFirstName(String firstName) { /* setter */ }
public String getSurname() { /* getter */ }
public void setSurname(String surname) { /* setter */ }
}
What is wrong with the above? Ignore that the hash code may not be well designed given that names are usually A-Z only and the prime factor may not be the most efficient algorithm for calculating hash codes for our data set.
OK, so I have cheated... the problem is the setters that I glossed over.
Our hash code is not calculated on the basis of immutable values.
Let's try using it with a
HashSet
...
Person joe = new Person("Joe", "Bloggs");
Setpeople = new HashSet ();
people.add(joe);
assert people.contains(joe); // this works
assert people.contains(new Person("Joe", "Bloggs")); // this works
joe.setSurname("Smith");
assert !people.contains(new Person("Joe", "Bloggs")); // this works
assert people.contains(new Person("Joe", "Smith")); // this fails!!!
assert people.contains(joe); // this fails!!!
boolean found = false;
for (Person person: people) {
if (person == joe) {
found = true;
break;
}
}
assert found; // this works!!!
So what is going on?
When we put
joe
into the HashSet
, the HashSet
selects a bucket based on the hash code for joe
. When we ask the HashSet
if it contains a Person
, it computes the hash code of the object that it's looking for and searches only in that bucket.As our
hashCode()
is based on non-final fields, the hash code can change behind the scenes of our HashSet
(and it won't find out until it has too few buckets and needs some more to optimise searching). Thus completely invalidating the basic assumption of the HashSet
.A correct implementation for the
Person
class would either have the names as final
, or have hashCode
like so
public int hashCode() {
return 0; // we have no non-final identity fields to calculate the hashCode from
}
Finding Number of Duplicates using Hashtable and HashSet.
ReplyDeleteList list = new ArrayList();
list.add("sanya");
list.add("sanya");
list.add("sanya");
list.add("ansh");
list.add("ansh");
list.add("ansh");
list.add("bhawna");
list.add("bhawna");
list.add("sanya");
list.add("sanya");
final HashSet hashSet = new HashSet(list.size());
Hashtable dupPersonnelHash = new Hashtable();
int initialCount = 1;
for (Iterator iterator = list.iterator(); iterator.hasNext();) {
String personHash = (String) iterator.next();
if(!hashSet.add(personHash)){
if ((!dupPersonnelHash.isEmpty()) && (dupPersonnelHash.containsKey(personHash))){
//duplicate value
Integer dupVal = (Integer)dupPersonnelHash.get(personHash);
//another duplicate, add 1 to the value
dupPersonnelHash.put(personHash, new Integer(dupVal.intValue() + 1));
} else //found a new duplicate
{
dupPersonnelHash.put(personHash, new Integer(initialCount + 1));
}
}
}
Iterator itr = dupPersonnelHash.keySet().iterator();
while (itr.hasNext()) {
String key = (String)itr.next();
Integer dupCount = (Integer) dupPersonnelHash.get(key);
System.out.println("#############################");
System.out.println("******** Person " + key + " dupCount " + dupCount);
}
I'm not sure what relevance this code snippit of yours has to the issue I was illustrating.
ReplyDeleteThe issue I was illustrating is that the hashcode must be calculated based only on immutable fields of the object.
Equals must be based on at least the fields that hashcode is based on plus possibly some mutable fields.
If you let a mutable fields slip into the hashcode calculation, "bad things" can happen
Excellent post!
ReplyDeleteI just had this problem with a Hibernate persisted Set.
Many thanks!
Watch out on using a static number for hashCode() as you have put all objects into the same bucket. Lookups now must traverse the list of items in the bucket - so you have essentially removed all benefits of using a HashSet and turned it into a simple List lookup.
ReplyDelete