Identity and equivalence — 2007-10-27

Comparison is an extremely thorny subject, one that is very hard to get right or even to understand. Worse, even when we understand what is happening in one aspect of our system, we can become so blinded by details that we fail to see the same problem elsewhere.

Reg is having trouble working out how to compare things in Java. Robert Fischer admits to the same. I'm not going to try to tell them how object comparison in Java should work. I'll just try to frame the issue so that the real problem is clear to the rest of us and, hopefully, so that it is clear which questions are the important ones.

The short (and far too glib) explanation for the root of the problem is that we get confused about whether we should be using object semantics or value semantics when we compare things. The hard part is knowing which to use in which circumstance, and even then we might only get a partial answer.

Object semantics

Objects have something called identity. The simplest way to think about identity is that two objects, otherwise exactly the same, are distinct if they cannot occupy the same location at the same time.

In computing terms, this means that they have distinct locations in memory. If two objects are in different locations they are different; if they are in the same location they are the same¹ [1: This is also the basis on which a legal alibi works. Two people at different locations at the same time are clearly not the same person. The law uses object semantics when allowing an alibi.].

The only things we need to look at to perform an object comparison are the current locations of the objects.
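In Java, the closest thing to this is the == operator applied to references. The sketch below is only an illustration: the class and variable names are made up, and Java never shows us a real memory address, so "location" here means "which object the reference denotes".

```java
final class IdentityDemo {
    /** Object semantics: true only when both references denote the same object. */
    static boolean sameObject(Object a, Object b) {
        return a == b; // looks at "where", never at "what"
    }

    public static void main(String[] args) {
        Object first = new Object();
        Object second = new Object(); // a distinct object, however similar
        Object alias = first;         // a second name for the same object

        System.out.println(sameObject(first, second)); // prints false
        System.out.println(sameObject(first, alias));  // prints true
    }
}
```

Nothing about the contents of the objects is consulted, which is exactly what makes this comparison simple.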

Value semantics

Values are compared through an equivalence relationship. That is, we define a function that inspects two things and declares them either to be the same or not.

It turns out we can define lots of equivalence relationships that we might use (for example, case-sensitive or case-insensitive character comparison). If we are comparing values, the hard part is often determining what the right equivalence relationship is.
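To make the point concrete, here is a small Java sketch of two equivalence relationships over the same strings. The class name Equivalences and the constant names are invented for this example; the only library pieces used are String.equals, String.equalsIgnoreCase and java.util.function.BiPredicate.

```java
import java.util.function.BiPredicate;

final class Equivalences {
    // Two of the many equivalence relationships we could define over strings.
    static final BiPredicate<String, String> CASE_SENSITIVE = String::equals;
    static final BiPredicate<String, String> CASE_INSENSITIVE = String::equalsIgnoreCase;

    public static void main(String[] args) {
        System.out.println(CASE_SENSITIVE.test("Java", "JAVA"));   // prints false
        System.out.println(CASE_INSENSITIVE.test("Java", "JAVA")); // prints true
    }
}
```

Neither function is the "correct" one. Which of them we want depends entirely on what the strings are supposed to mean.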

Objects in a value world

In value-based systems like Haskell or relational databases we can have object semantics by internalising the location metadata that gives us the identity, or by choosing a unique key to represent the object. This is a very complicated way of saying we add an identity column or some other primary key. We then have one of our equivalence relationships compare only that attribute.
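The same idea can be sketched in Java (used here only to keep the examples in one language). The Person class, its id field and the two comparison functions are all hypothetical. The point is that identity becomes just another attribute, and one equivalence relationship looks at nothing else.

```java
final class Person {
    final long id;     // hypothetical identity column / primary key
    final String name; // an ordinary attribute that may change or repeat

    Person(long id, String name) {
        this.id = id;
        this.name = name;
    }

    /** Object semantics rebuilt from values: compare only the key. */
    static boolean sameIdentity(Person a, Person b) {
        return a.id == b.id;
    }

    /** A different equivalence relationship over the same rows. */
    static boolean sameName(Person a, Person b) {
        return a.name.equals(b.name);
    }
}
```

Two rows with the same id but different names are the same person before and after a name change. Two rows with the same name but different ids are two people who happen to share a name.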

The same thing happens in real life. As a Norwegian I have a personnummer (personal number), which is the identity column Norway adds to its subjects. In the United States citizens get a social security number.

Of course, what happens then is that we confuse the identity column with the thing identified, and we not only devalue ourselves² [2: Something that was already a cliché in 1967 CE when The Prisoner was made ("I am not a number! I am a free man!") and is certainly no less of a problem now.] but also open ourselves up to all sorts of new crimes like identity theft. It isn't just software developers who struggle with this!

Values in an object world

Going the other way is much harder, and in truth, most OO languages fail to add any clarity to the issue. The problem is that most object-oriented systems have everything as an object, or only allow a very small set of built-in value types. This means that many of the objects in an OO system are really values, and this obviously leads to a huge amount of confusion.

Because most OO languages have a very poor concept of value types, they tend to tie themselves into knots when attempting comparison. This is why so much gets written about it when discussing Smalltalk, Ruby, Java etc. An equality test like Java's .equals() defaults to object semantics, yet is defined in terms of value semantics, and gets its hands dirty by worrying about the value representation. Ouch!
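A short Java sketch shows both faces of .equals(). PlainPoint and ValuePoint are made-up classes: the first inherits Object.equals and so compares identity, the second overrides equals and hashCode to compare its fields.

```java
import java.util.Objects;

final class PlainPoint {
    final int x, y;

    PlainPoint(int x, int y) { this.x = x; this.y = y; }
    // No equals override: Object.equals is used, which is reference identity.
}

final class ValuePoint {
    final int x, y;

    ValuePoint(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;                    // same object
        if (!(other instanceof ValuePoint)) return false;  // wrong type or null
        ValuePoint that = (ValuePoint) other;
        return x == that.x && y == that.y;                 // inspect the representation
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // must agree with equals
    }
}
```

With these definitions new PlainPoint(1, 2).equals(new PlainPoint(1, 2)) is false, while new ValuePoint(1, 2).equals(new ValuePoint(1, 2)) is true. The method name is the same and the semantics are not, and nothing at the call site tells us which one we are getting.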

As Reg has found, the answer is generally to delegate the equivalence relationship to another class, the implementation of which is more or less difficult depending on the features of the message dispatcher you have at your disposal and how you look at the problem.
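One possible shape for that delegation, sketched in Java, follows. This is not Reg's code. The Equivalence interface, the IgnoringCase class and the Search.contains helper are all hypothetical names chosen for the illustration.

```java
import java.util.List;

/** A comparison policy that lives outside the class being compared. */
interface Equivalence<T> {
    boolean equivalent(T a, T b);
}

/** One policy among many: strings are the same if they differ only in case. */
final class IgnoringCase implements Equivalence<String> {
    @Override
    public boolean equivalent(String a, String b) {
        return a.equalsIgnoreCase(b);
    }
}

final class Search {
    /** The caller, not the element type, decides what "the same" means. */
    static <T> boolean contains(List<T> items, T wanted, Equivalence<T> eq) {
        for (T item : items) {
            if (eq.equivalent(item, wanted)) {
                return true;
            }
        }
        return false;
    }
}
```

The String class is left alone. A different caller could pass a different Equivalence over the same list and legitimately get a different answer.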

Comparing apples and oranges

Value comparison is hard, and it gets even harder the more you look at it. Compare the following strings:

THE ALGEBRAIST
The Algebraist
Algebraist, The
ALGEBRAIST, THE

Equivalence depends on whether we think case is important or not and whether we see these as plain strings or book titles³ [3: I'm not even sure I want to raise the subject of normal forms in Unicode.].
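Here is a rough Java sketch of the two readings. The Titles class and its rule for a trailing ", The" are assumptions made up for this example. Real cataloguing rules are far messier, and the sketch ignores Unicode normalisation entirely.

```java
import java.util.Locale;

final class Titles {
    /** Plain-string reading: every character matters. */
    static boolean sameString(String a, String b) {
        return a.equals(b);
    }

    /** Book-title reading: ignore case and treat "X, The" as "The X". */
    static boolean sameTitle(String a, String b) {
        return canonical(a).equals(canonical(b));
    }

    // Reduce a title to one agreed representation before comparing.
    static String canonical(String title) {
        String t = title.trim().toLowerCase(Locale.ROOT);
        String suffix = ", the";
        if (t.endsWith(suffix)) {
            t = "the " + t.substring(0, t.length() - suffix.length());
        }
        return t;
    }
}
```

Under sameString all four strings are different. Under sameTitle all four are the same book. A case-insensitive string comparison sits in between, pairing the first two and the last two but no others.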

Here's a parting thought: the numbers 273, 0 and 32 are not the same in any way, but something happens when we add units: 273K (273.15K to be precise), 0°C and 32°F all describe the same temperature. This raises an important point about the difference between a value and its representation, which rears its ugly head even where we think it should be simple. A proper discussion about that will have to wait until I finish a more detailed article on the subject.
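In the meantime, a minimal Java sketch of the temperature example follows. The Temperature class and its method names are invented here. It assumes we store one canonical representation (kelvin) and compare within a tolerance chosen by the caller.

```java
final class Temperature {
    private final double kelvin; // the single internal representation

    private Temperature(double kelvin) {
        this.kelvin = kelvin;
    }

    static Temperature ofKelvin(double k) {
        return new Temperature(k);
    }

    static Temperature ofCelsius(double c) {
        return new Temperature(c + 273.15);
    }

    static Temperature ofFahrenheit(double f) {
        return new Temperature((f - 32.0) * 5.0 / 9.0 + 273.15);
    }

    /** Same value if the difference is within the given tolerance in kelvin. */
    boolean sameTemperature(Temperature other, double toleranceKelvin) {
        return Math.abs(kelvin - other.kelvin) <= toleranceKelvin;
    }
}
```

With this sketch 0°C and 32°F compare as the same even with a tiny tolerance, while 273K matches them only if we accept a tolerance of at least 0.15K. Note also that "within a tolerance" is not transitive, so strictly speaking it is not an equivalence relationship at all, which is one more sign of how slippery this gets.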