Joined: 12 Jan 2004 Posts: 240 Topics: 48 Location: Maryland
Posted: Fri Jan 30, 2004 9:05 am Post subject: Delete Duplicates
Hi ,
Can anyone tell me what will be the SQL like to Delete alll but One Duplicate Records in a Table ? If not SQL , then is there any other method to achieve the same ?
If following is the Data in a Table for Col1, Col2, Col3
x, y, z
x ,y, z
x, y, z
a, b, c
a, b, c
l, m, n
After the Delete it should be
x, y, z
a, b, c
l, m, n
Joined: 26 Nov 2002 Posts: 12369 Topics: 75 Location: San Jose
Posted: Fri Jan 30, 2004 9:50 am Post subject:
Vini,
Deleting the duplicates retaining the first record is not that easy with sql. However there are many other ways to delete the rows.
1. Unload the table and run a utility(SORT) to remove duplicate records and reload it again
2. code a cobol pgm which will open cursor and perform the logic of determing the duplicate and delete the record.
3. Generate sql statements using easytrieve or any other utitlity to delete the duplicate records.
Joined: 26 Nov 2002 Posts: 12369 Topics: 75 Location: San Jose
Posted: Fri Jan 30, 2004 10:17 am Post subject:
Vini,
Try this sql
Code:
DELETE
FROM TABLE A
WHERE (A.COL1,A.COL2,A.COL3) IN (SELECT B.COL1,B.COL2,B.COL3
FROM TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.COL3 = B.COL3
GROUP BY B.COL1, B.COL2, B.COL3
HAVING COUNT(*) > 1);
Joined: 12 Jan 2004 Posts: 240 Topics: 48 Location: Maryland
Posted: Mon Feb 02, 2004 5:16 pm Post subject:
kolusu,
I wanted to say that I think "l,m,n" would be the result of the Query as suggested by you. The output I want is still as the first post.
However , I could not run that Query for real as I could not find a Table I could easily create some test data in without affecting the other developers testing or involving the dba. On second thoughts I think I can use the DEPT sample db2 table . Most probably noone should have problem with that.
That must work if you have run it at your end. Its just that when I pondered on it .. I could not figure out how that could get the desired result because
HAVING COUNT(*) > 1
should return all Rows which are Duplicates and the Delete would then eliminate all of these... that would leave behind only those rows which originally had no duplciates.
Your first query did give the correct result. But I need some explanation on how it works.
Code:
I created the VINI Table:
COL1 COL2 COL3
---- ---- ----
X Y Z
X Y Z
X Y Z
A B C
A B C
L M N
I ran the SELECT command to see which rows will be deleted prior to issue the DELETE command:
SELECT * FROM VINI A
WHERE (A.COL1,A.COL2,A.COL3) IN (SELECT B.COL1,B.COL2,B.COL3
FROM VINI B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.COL3 = B.COL3
GROUP BY B.COL1, B.COL2, B.COL3
HAVING COUNT(*) > 1);
I got these rows:
COL1 COL2 COL3
---- ---- ----
X Y Z
X Y Z
X Y Z
A B C
A B C
When I change SELECT to DELETE, I expected the above 5 rows be deleted.
DELETE FROM VINI A
WHERE (A.COL1,A.COL2,A.COL3) IN (SELECT B.COL1,B.COL2,B.COL3
FROM VINI B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.COL3 = B.COL3
GROUP BY B.COL1, B.COL2, B.COL3
HAVING COUNT(*) > 1);
3 rows (not 5 rows I expected) are deleted to give the correct result:
COL1 COL2 COL3
---- ---- ----
X Y Z
A B C
L M N
I have tested your query given in Post-3(You have mentioned here that the query will keep one row for each group of duplicates).But it is deleting all the duplicate rows.
can u guide us to retain one row for each group of duplicates.....
Sorry, I understood the question wrongly.
I tested the query by Kolusu and am getting the same result cp is getting. That is, it is deleting all the duplicate rows. I don't know what is wrong here.
Joined: 30 Sep 2004 Posts: 16 Topics: 4 Location: Chennai
Posted: Mon Dec 19, 2005 4:58 pm Post subject: Delete is a setlevel operation even with correlation....
Hi Kolusu,
The reason i think all the duplicate rows get deleted even in the case of a correlated query is that the delete operation performs at a set level.
Taking the same data example
x, y, z
x, y, z
x, y, z
a, b, c
a, b, c
l, m, n
The query you posted in the post 3
Code:
DELETE
FROM TABLE A
WHERE (A.COL1,A.COL2,A.COL3) IN (SELECT B.COL1,B.COL2,B.COL3
FROM TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.COL3 = B.COL3
GROUP BY B.COL1, B.COL2, B.COL3
HAVING COUNT(*) > 1);
looks like it should work fine but when it is being processed,
1) First of all db2 picks up the x,y,z record and the correlated subquery returns that the count is 3.
2) The count is > 1 so db2 decides to delete the rows with col1=x, col2=y, and col3=z.
3) As this delete happens, DB2 deletes at a set level all the rows with col1=x, col2=y and col3=z.
4) So the next record db2 would be pointing to would be a,b,c and no longer the second occurrence of x,y,z as it is already deleted.
5) As the row a,b,c is having more than 1 occurence, it deletes all the rows of col1=a, col2=b, and col3=c at a set level.
6) Finally the row left is l,m,n. this row has a count of 1 and as it is not >1, db2 is not deleting the row and finally we are ending up with only the row l,m,n.
Note : If you insert one more l,m,n to the table and retry the query, it will delete even all the occurrences of l,m,n row also.
Only in case of selects, correlation happens at record level ad DB2 virtually processes at row level (I mean DB2 executes the correlated subquery for every row from the main query).
DB2 allows to alter table and add column so you can try the below solution.
Code:
create table prob1(col1 char(1), col2 char(1), col3 char(1));
insert into prob1 values ( 'x', 'y', 'z');
insert into prob1 values ( 'x', 'y', 'z');
insert into prob1 values ( 'x', 'y', 'z');
insert into prob1 values ( 'a', 'b', 'c');
insert into prob1 values ( 'a', 'b', 'c');
insert into prob1 values ( 'l', 'm', 'n');
alter table prob1 add rcount integer;
insert into prob1
select col1, col2, col3, count(*) from prob1 group by col1,col2,col3
;
select * from prob1;
delete from prob1 where rcount is null;
After the execution, you will be having only one occurrence of every row but the problem is we can not drop a column in db2 this maddens me. so you will be left with the rcount field for ever in the table. unless you drop and recreate the table.
Technically, the best and the fastest solution is unload, sort with xsum and reload. _________________ I dont think I would ever stop learning. - Seshu.
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum