Posted By: Terrapin
What is the best way to remove duplicate rows from a fairly large
SQL Server table (i.e. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the existence of the
RowID identity field.
RowID int not null identity(1,1) primary key, Col1 varchar(20) not null, Col2 varchar(2048) not null, Col3 tinyint not null
Assuming no nulls, you
GROUP BY the unique columns, and
MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn’t have a row id:
DELETE FROM MyTable LEFT OUTER JOIN ( SELECT MIN(RowId) as RowId, Col1, Col2, Col3 FROM MyTable GROUP BY Col1, Col2, Col3 ) as KeepRows ON MyTable.RowId = KeepRows.RowId WHERE KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))