Joined: 16 May 2007 Posts: 6 Topics: 2 Location: United Kingdom
Posted: Thu Jul 05, 2007 7:07 am Post subject: Parse in DFSORT
I have a number of requirements for parsing, on the mainframe, comma-separated value (CSV) files sourced from other platforms, and vice versa.
The Smart DFSORT Tricks page documents "Deconstruct and reconstruct CSV records", so I am now trying to evaluate whether this is a good option as opposed to SELCOPY or PL/1. COBOL is not an option.
The latter approaches (SELCOPY, PL/1) would mean developing utilities (CSV -> formatted, formatted -> CSV) that are supplied with data structures (to support multi-record files). Implementing a new transformation requirement would then simply mean supplying the data structure(s).
Using the DFSORT option would mean creating new DFSORT statements to parse each file in question, but without the one-off utility development overhead.
I have tried some simple parsing, but have hit an issue that is not blindingly obvious from the DFSORT documentation.
The basic rules of the files are:
Quotes will surround data values that contain commas or quotes.
Quotes in data values will be doubled up, but count as one.
The example I have tried looks like this, where the first two bytes are the record type:
01,"THIS,IS RECORD ONE"
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""ADDED"""
in an attempt to get this
01THIS,IS RECORD ONE
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"
The statements I tried are along the following lines (excuse the format, which looks fine in the post window).
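A sketch of the kind of thing I mean (the FIXLEN=30 lengths are just guesses at something big enough for each quoted value):
Code:
* Split on commas, but treat "..." as a single value (PAIR=QUOTE),
* so the comma inside "THIS,IS RECORD ONE" stays part of the value.
  OPTION COPY
  INREC PARSE=(%01=(ENDBEFR=C',',FIXLEN=2),
               %02=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=30),
               %03=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=30)),
        BUILD=(%01,%02,X,%03)

That splits the values at the right commas, but the delimiting quotes and the doubled-up quotes are still sitting in the parsed values, which is the bit I am stuck on.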
Joined: 16 May 2007 Posts: 6 Topics: 2 Location: United Kingdom
Posted: Thu Jul 05, 2007 8:14 am Post subject:
Kolusu,
Pairs of quotes (double quote ", as opposed to apostrophe ') are removed, to be replaced by one.
A doubled quote represents one quote of real data when the value is delimited by a pair of quotes. Delimiting quotes are required when the value can have a comma in it (which makes CSV files easy to parse and explain!)
i.e.
from this
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""ADDED"""
to this (as the quote is real data, not a form of delimiter)
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"
not this
02THIS,IS RECORD TWO WITH EXTRA DATA""ADDED"" _________________ Gordon Taylor
Joined: 20 Oct 2006 Posts: 1411 Topics: 26 Location: germany
Posted: Thu Jul 05, 2007 9:52 am Post subject:
gordon,
ok, real problem is that the input files are not going to change. thus my suggestion was dumb, even if it would work.
how about parsing on both INREC and OUTREC?
your real problem is that you are parsing for commas outside double-quote pairs and then need to turn double double-quotes into single double-quotes.
It should be possible without having to identify the record type.
No, I have no idea how to do it with DFSORT. Is REXX not an allowable solution?
_________________
Dick Brenholtz
American living in Varel, Germany
Joined: 02 Dec 2002 Posts: 1618 Topics: 31 Location: San Jose
Posted: Thu Jul 05, 2007 10:30 am Post subject:
Gordon,
The two example lines you gave are not much to go on for discerning all of the variations you need to handle or the "rules" to follow.
I assume that the record contains one or two strings enclosed in quotes. Can there be more than two strings in a record?
Does the 01 mean there's only one string?
Does the 02 mean there's two strings?
Can there be an 03, etc?
It appears that string2 can have a value enclosed in quotes within it - can it have more than one value enclosed in quotes? Can string1 have one or more values enclosed in quotes?
It would help if you would give an example with more of the possible variations and explain the rules for what can be in these records.
Otherwise anything we come up with might not work for all of your variations and so would be a waste of time.
_________________
Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at:
www.ibm.com/storage/dfsort
Joined: 16 May 2007 Posts: 6 Topics: 2 Location: United Kingdom
Posted: Fri Jul 06, 2007 5:43 am Post subject:
Thanks for feedback guys, I appreciate your effort.
Maybe a little bit more background is appropriate here, even if my opening gambit was quite big as it was!! So....
The problem is I have a number of files (30+ and rising) going between systems on different platforms.
Given that
- mainframe files are typically fixed in format so that they can be mapped in PL/1, COBOL or whatever, with ease
- the files from/to the other platforms are comma-separated values and the records by their nature vary in length
there is a need for a transformation process.
These files will often have multiple record types, although each record of a given type will have a predefined number of fields.
My initial thoughts were to have a csv->fixed and fixed->csv utility built that could handle multiple record types. It would use file structures, in the way products such as File Manager or File/Aid do, to know what to map the CSV values to (and from). To implement a new file would then simply mean defining the file structure to the utility. This involves a one-off development cost, plus the file definition and a bit of testing per file.
I tended to think that this is becoming a common problem and something must be out there in the big wide world, when I came across the Smart DFSORT tricks.
It was in trying the DFSORT approach that I came across the issues outlined above, and wondered if there was an easy way out. The current issue is that this could appear:
field1 value,field2 value,........,"String with, and "" in it",......
The string with quotes is to appear in the formatted file as
String with, and " in it
padded with spaces to the required string length.
Any number of these strings can appear anywhere in a record. I can probably request that all strings that can potentially have a " or , in them are always surrounded by quotes, so simplifying the rules (a bit).
I suspect INREC plus OUTREC is the approach: the INREC to parse the fields between commas (where the commas are not in quotes), then the OUTREC removing the leading and trailing quote, and somehow transforming the double quotes in the string into one.
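One idea for that cleanup (only a sketch; it assumes FINDREP will accept a null OUT constant to remove a string, and that X'01' never occurs in the data so it can act as a temporary placeholder):
Code:
* 1. protect the real data quotes:      ""    -> X'01'
* 2. remove the remaining (delimiting) quotes
* 3. restore the protected quotes:      X'01' -> "
  OUTREC IFTHEN=(WHEN=INIT,FINDREP=(IN=C'""',OUT=X'01')),
         IFTHEN=(WHEN=INIT,FINDREP=(IN=C'"',OUT=C'')),
         IFTHEN=(WHEN=INIT,FINDREP=(IN=X'01',OUT=C'"'))

Removing a character shifts everything to its right one byte left, so the columns of any later fixed fields would still have to be sorted out (or each FINDREP limited to one field with STARTPOS/ENDPOS).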
As I say, I haven't looked at converting things like -123.45 into packed decimal yet.
I think I am probably talking myself out of DFSORT, unless the solution is really simple. I say that because implementing a new file transformation could be quite a technical task in terms of DFSORT statements.
I would appreciate your advice on this, but please do not invest too much time on it!
Joined: 16 May 2007 Posts: 6 Topics: 2 Location: United Kingdom
Posted: Fri Jul 06, 2007 9:19 am Post subject:
dbzTHEdinosauer,
Not quite, consider the following:
00,20070706
01,-123.45,Text with no punctuation,789,"Text with ""lots"", of stuff"
01,-77.90,Text with nowt,1,"Text with ""some"" stuff"
99,4
This is a file of four records, of three different types:
Code:
00 - header   2 fields (record type id)
01 - detail   5 fields
99 - trailer  2 fields (type, record count)
So looking at record type 01, the map, and therefore the values, may look like this:
Code:
REC_TYPE  CHAR(2)         01
AMOUNT1   FIXED DEC(9,2)  123.45 (-ve sign)
TEXT_1    CHAR(30)        Text with no punctuation
AMOUNT2   FIXED(5)        789
TEXT_2    CHAR(50)        Text with "lots", of stuff
(having trouble with the format again!)
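(For what it's worth, assuming AMOUNT1 packs to 5 bytes and AMOUNT2 to 3, the fixed-format 01 record would be 2 + 5 + 30 + 3 + 50 = 90 bytes.)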
So quotes are not the primary delimiter, the comma is. (There will not always be ",".)
If a comma or quote can be in a value, quotes are round the whole string.
A quote in a value has to be doubled in the csv file.
To reiterate, I think the INREC statement in my first example with PAIR=QUOTE does the main bit: parse values between commas (allowing for commas in values).
But the next bit is to get rid of any leading and trailing quote,
then replace a doubled quote in a value with one quote,
and shift that value left one space (if a leading quote was found) without losing embedded spaces.
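So for the 01 records I have something like this in mind (a sketch only: the FIXLEN sizes are guesses, the quote tidy-up above would still have to follow for TEXT_2, and the 00/99 records would need IFTHEN clauses of their own):
Code:
* Reformat the 01 detail records: PAIR=QUOTE keeps the quoted comma
* in TEXT_2 together with its value, and SFF/UFF pick the digits and
* sign out of the free-form numbers for conversion to packed decimal:
*   %01  REC_TYPE  CHAR(2)
*   %02  AMOUNT1   SFF -> PL5   (-123.45 becomes -12345)
*   %03  TEXT_1    CHAR(30)
*   %04  AMOUNT2   UFF -> PL3
*   %05  TEXT_2    CHAR(56), delimiting/doubled quotes still in it
  OPTION COPY
  INREC IFTHEN=(WHEN=(1,2,CH,EQ,C'01'),
          PARSE=(%01=(ENDBEFR=C',',FIXLEN=2),
                 %02=(ENDBEFR=C',',FIXLEN=12),
                 %03=(ENDBEFR=C',',FIXLEN=30),
                 %04=(ENDBEFR=C',',FIXLEN=8),
                 %05=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=56)),
          BUILD=(%01,
                 %02,SFF,TO=PD,LENGTH=5,
                 %03,
                 %04,UFF,TO=PD,LENGTH=3,
                 %05))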