Hugo Kornelis (3/4/2014)
For SELECT DISTINCT, allowing ORDER BY to operate on columns not in the SELECT list would produce erratic and unpredictable behaviour. Consider this table:
Col1 | Col2
1 | 1
2 | 2
3 | 1
Now what do you expect to get returned if you allow "SELECT DISTINCT Col2 FROM TheTableAbove ORDER BY Col1;"?
Maybe I do note get something, Hugo.
IMO no matter how you slice it -- meaning ORDER or other clauses --, Col2 still has only two distinct values, right?
Tes, but which comes first with order by Col1? The Col1 values where Col2 is 1 are 3 and 1, and the Col1 value where Col2 is 2 is 2; so does 1<2 rule or does 2<3 rule? Yes, one could define the effect - several different ways, each of which would probably be unacceptable to the majority of people, so it was best to say "you can't do it if you specify distinct". Without distinct it's easy - you return as many rows as there are in the table, ordered as specified. With distinct, you can't return all the rows because there are fewer distinct Col2 values than there are rows.
Hmm... I am not a computer scientist and even less a programming language designer; however, I see that there could be a new rule introduced that would say that if you are selecting distinct values based on a column which is not in the select list, you present the first value encountered in the (implicit) full SELECT.
Unfortunately, our rather small example does not make my point obvious, but please consider -
Col1 | Col2
1 | 2
2 | 3
3 | 1
4 | 2
That would give you 2,3,1 right?
I could live with this.
Edit: making the sequence more obvious, IMO
Yes, that's one of the many approaches that would work, and in my view it's one of the better ones because it doesn't require any colision resolution mechanism. But would it be accepted?
Some people would suggest computing the average position in the full list for each value, and ordering by that (perhaps falling back on your method to resolve collisions). Others might suggest looking at whether the number of value A - value B pairs in the full list is where the value A element is before the value B element is greater than or less than the number of pairs where the opposite is the case (again resolving collisions somehow). These are just two of the methods of getting a good representation of the "average" relative position. You can complicate them by weighting elements or pairs of elements depending on how distant from the middle of the unpruned list they are, or when working with pairs by how far apart the two components are in the unpruned list.
I remember a discussion about this particular ordering problem - not in the context of databases, something completely different, way back in the 60s, where the concensus ended up being that the best procedure would be to use a random (ideally not pseudo-random, but in practice true random wasn't really going to be possible) selection of which item with each given value to retain (that would be done by looking at the items in true random order and retaining the part specified for the distinct value and the part designated for ordering for the first item with each distinct value for the specified part, sort the retained sub-items including the (unwanted) ordering part, and then throw away the unwanted part. Like your suggested method, that doesn't require any collision resolution; of course it might be difficult to live with in a relational database context, as it's certainly not deterministic.