Finding the few non-unique entries in an almost unique column

  • Hello All!

    I have a small table dealing with about 2 million student test scores:

    CREATE TABLE pupils (

    pupil INT IDENTITY (1,1) PRIMARY KEY CLUSTERED,

    firstName NVARCHAR(25) NOT NULL,

    lastName NVARCHAR(25) NOT NULL,

    pupilID CHAR(12) NOT NULL )

    Now, pupilID should be unique, but occasionally duplicate data is entered and I need to track that down. Currently I find them using:

    SELECT pupilID

    FROM pupils

    GROUP BY pupilID

    HAVING COUNT(*)>1

    ...which works, typically returning less than 100 rows, but taking nearly 10 seconds. Is there a better way? pupilID is indexed, and presumably buried in the bowels of the index it knows whether a given pupilID points to more than 1 row in the table, but I don't know any way to get at that information...

  • I can't think offhand of a better way. Can you post all the index definitions for the table please.

    You may want to consider making the index on PupilID unique. That way duplicate data can't get into the table in the first place.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for that - I probably should have made pupilID unique, but there are legitimate reasons why a school might try entering a duplicate pupilID (e.g. if a pupil moves between schools) and I decided at the time to have a small occasional clean-up job instead of dealing with it when the data was read in. Its a side-effect of inheriting an old flat-file system, then someone saying "you know, this really should be in a database. Here's a server - go work it out in your spare time" :w00t:.

    As it stands there are just 2 indexes on this table, the clustered PK on Pupil and a non-clustered, non-unique index on PupilID:

    CREATE NONCLUSTERED INDEX [IX_pupils_pupilID] ON [dbo].[pupils] ([pupilID] ASC)

    It is actually part of a larger pupils table including sex, date of birth, etc. and a second school-based ID that can be null, but the stripped-down version worked fine for working out how to do it.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply