Posted: Fri Jan 07, 2005 4:15 am Post subject: remove dup but keep mapping
I want remove duplicate reocord in a data set, but i want to keep the mapping of removed record. 1 input => 2 outputs, one is from input without duplication and seq number, the other is seqnum of original file and seq num in new file.
for example:
Joined: 07 Jan 2003 Posts: 1056 Topics: 91 Location: The Blue Planet
Posted: Fri Jan 07, 2005 6:56 am Post subject:
Kolusu,
The links you posted actually prints the no. of occurences of each duplicate key but Videlord is not looking for that. He wants to insert seqnum in the file and then eliminate the duplicates on the key portion and write to a file - say SORTOUT. Later, in the other file SORTXSUM which contains the eliminated records he wants to map each duplicate key to the seqnum of its first occurance (ie. the seqnum in SORTOUT).
The output2 is the file with the same number of records as input.
The first field is the seq# of input, the 2nd field the seq# in output1.
We can use table to describe my problem again:
input - TableA (COL1, SEQNO)
output1 - TableB (COL1, SEQNO)
output2 - TableC (SEQNO1,SEQNO2)
SEQNO in TableA and TableB are generated by database
TableB Data:
Insert into tableB (COL1)
(select distinct COL1 from TableA)
TableC Data:
Insert into tableC
(select TableA.SEQNO, TableB.SEQNO from TableA, TableB
where TableA.COL1 = TableB.COL2)
I hope it described my problem clearly.
I have one idea,
Step 1. copy input1 add seqno.
2. sort input1 add seqno keepnodups
3. Splice 2 & 1 generate the mapping
It seems some complex and not efficient.
Did you have any suggestion?
Joined: 07 Jan 2003 Posts: 1056 Topics: 91 Location: The Blue Planet
Posted: Sat Jan 08, 2005 5:58 am Post subject:
videlord,
I think you already gave the solution. I don't think this could be achieved using the traditional features of sort without using SPLICE. As you said, you can get the result in 1 Step - 2 Pass using ICETOOL / SYNCTOOL.
Code:
Pass 1:
a. Use inrec and insert Seqnums
b. Sort on key and remove dups using SUM FIELDS=NONE & XSUM or DISCARD(dd)
Pass 2:
a. Use SPLICE to match the key & seqnums.
Kolusu / Frank,
your thoughts on this ? Do you feel that this could be done in a better way ?
Joined: 02 Dec 2002 Posts: 1618 Topics: 31 Location: San Jose
Posted: Sat Jan 08, 2005 11:32 am Post subject:
Quote:
The output2 is the file with the same number of records as input.
The first field is the seq# of input, the 2nd field the seq# in output1.
Charlie,
This doesn't seem to match what you actually show for output2.
Breaking it down:
>The first field is the seq# of input
Code:
Input
AAA 01
BBB 02
AAA 03
CCC 04
BBB 05
>output2 is the file with the same number of records as input.
>the 2nd field the seq# in output1.
Code:
OUTPUT1:
AAA 01
BBB 02
CCC 03
So I would expect the following for OUTPUT2:
Code:
01 01 01 from first AAA in input, 01 from AAA in output1
02 02 02 from first BBB in input, 02 from BBB in output1
03 01 03 from second AAA in input, 01 from AAA in output1
04 03 04 from CCC in input, 03 from CCC in output1
05 02 05 from second BBB in input, 02 from BBB in output1
What am I misunderstanding here? Can you please break down the input and output in the same way I did here to show how you want it done. _________________ Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at:
www.ibm.com/storage/dfsort
Joined: 02 Dec 2002 Posts: 1618 Topics: 31 Location: San Jose
Posted: Mon Jan 10, 2005 11:11 am Post subject:
Quote:
Oh, sorry, i think i make a mistake when i type the INPUT file, the 4th record of input should be AAA instead of CCC.
Charlie,
That still doesn't make sense. Please start over and show the CORRECT input, output1 and output2 broken down in the same way I did it, to show how you want it done. Remember: garbage in, garbage out. _________________ Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at:
www.ibm.com/storage/dfsort
Joined: 02 Dec 2002 Posts: 1618 Topics: 31 Location: San Jose
Posted: Tue Jan 11, 2005 11:07 am Post subject:
Ok, now it makes sense. But it will require multiple passes. I don't see any easier way to do it. _________________ Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at:
www.ibm.com/storage/dfsort
Last edited by Frank Yaeger on Tue Jan 11, 2005 4:47 pm; edited 1 time in total
Joined: 07 Jan 2003 Posts: 1056 Topics: 91 Location: The Blue Planet
Posted: Wed Jan 12, 2005 1:07 am Post subject:
Kolusu,
Excellent Solution. Gr8 job.
Code:
The only difference here is that we do not have the Header. So we create a header for every record using sections and header3 parm.
I was thinking of the same - using headers, but could not guess a way to insert a header ! This logic could be used in many cases. I have a similar situation - matching a unique key against a file with dup keys. This logic will come in handy in those cases.
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum