MVSFORUMS.com

gordonstaylor · Posted: Thu Jul 05, 2007 7:07 am Post subject: Parse in DFSORT

I have a number of requirements for parsing comma separated value(CSV) files sourced from other platforms on the mainframe, and vice versa.

The Smart DFSORT Tricks documents Deconstruct and reconstruct csv records, so I am now trying to evaluate if this is a good option as opposed to SELCOPY or PL/1. COBOL is not an option.

The latter approaches (SELCOPY, PL/1) would mean developing utilities (CSV->formatted, formatted-> CSV), such that they are supplied data structures (to support multi record files). Implementing a new transformation requirement would then mean simply supplying the data structure(s).

Using the DFSORT option would mean creating new DFSORT statements to parse the file in question, but without the once off utility development overhead.

I have tried some simple parsing, but have hit an issue that is not blindingly obvious from the DFSORT documentation.

The basic rules of the files are
Quotes will surround data values that contain commas or quotes.
Quotes in data values will be doubled up, but count as one.

The example I have tried looks like this, where the first two bytes are the record type

01,"THIS,IS RECORD ONE"
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""ADDED"""

in an attempt to get this

01THIS,IS RECORD ONE
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"

These statements (excuse the format, which looks fine in the post window)

kolusu · Posted: Thu Jul 05, 2007 7:40 am Post subject:

gordonstaylor,

Do you just want the double quotes to be removed?

Kolusu
_________________
Kolusu
www.linkedin.com/in/kolusu

gordonstaylor · Posted: Thu Jul 05, 2007 8:14 am Post subject:

Kolusu,

Pairs of quotes removed (", as opposed to apostrophe ') , to be replaced by one.

A double quote represents one quote of real data when they are delimited by a pair of quotes. Delimiting quotes are required when the value can have a comma in it. (which makes csv files easy to parse and explain Smile

!)
i.e.
from this
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""ADDED"""
to this (as the quote is real data, not a form of delimeter)
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"
not this
02THIS,IS RECORD TWO WITH EXTRA DATA""ADDED""
_________________
Gordon Taylor

dbzTHEdinosauer · Posted: Thu Jul 05, 2007 8:21 am Post subject:

looks to me, that the record has been built incorrectly. If you wanted this

gordonstaylor · Posted: Thu Jul 05, 2007 8:39 am Post subject:

dbzTHEdinosauer,

The value "ADDED" is not a separate field.

The field value is
WITH EXTRA DATA"ADDED"

So in a csv file, in case it has commas in the value, quotes in the value are double up hence
"WITH EXTRA DATA""ADDED"""

Thanks
_________________
Gordon Taylor

dbzTHEdinosauer · Posted: Thu Jul 05, 2007 9:52 am Post subject:

gordon,

ok, real problem is that the input files are not going to change. thus my suggestion was dumb, even if it would work.

how about parsing on both INREC and OUTREC?

your real problem is that you are parsing for commas outside double-quote pairs and then are parsing double-double-quotes to become single-double-quotes.

It should be possible without having to identify the record type.

No, I have no idea how to do it with DFSORT. Is REXX not an allowable solution?
_________________
Dick Brenholtz
American living in Varel, Germany

Frank Yaeger · Posted: Thu Jul 05, 2007 10:30 am Post subject:

Gordon,

The two example lines you gave is not much to go on in discerning all of the variations you need to handle or the "rules" to follow.

I assume that the record contains one or two strings enclosed in quotes. Can there be more than two strings in a record?

Does the 01 mean there's only one string?
Does the 02 mean there's two strings?
Can there be an 03, etc?

It appears that string2 can have a value enclosed in quotes within it - can it have more than one value enclosed in quotes? Can string1 have one or more values enclosed in quotes?

It would help if you would give an example with more of the possible variations and explain the rules for what can be in these records.

Otherwise anything we come up with might not work for all of your variations and so would be a waste of time.
_________________
Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at:
www.ibm.com/storage/dfsort

gordonstaylor · Posted: Fri Jul 06, 2007 5:43 am Post subject:

Thanks for feedback guys, I appreciate your effort.

Maybe a little bit more background is appropriate here. If the opening gambit of mine was quite big as it was!! So....

The problem is I have a number of files (30+ and rising) going between systems on different platforms.

Given that
- mainframe files are typically fixed in format so that they can be mapped in PL/1, COBOL or whatever, with ease
- the files from/to the other platforms are comma separated values and records by their nature are any size in length
there is a need for transformation process.

These files will often have mutliple records, although each record of a given type will have a predefined number of fields.

My initial thoughts were to have a csv->fixed and fixed->csv utility built, that could handle multiple record types. It would use file structures like products such as File Manager or File/Aid to know what to map the csv values to(/from). To implement a new file would then simply mean defining the file structure to the utility. This involves a one off development cost, plus the file definition and a bit of testing per file.

I tended to think that this is becoming a common problem and something must be out there in the big wide world, when I came across the Smart DFSORT tricks.

It was in trying the DFSORT approach that I was coming across the issues outlined above, and wondered if there was an easy way out. The current issue is that this could appear

field1 value,field2 value,........,"String with, and "" in it",......

The string with quotes is to appear in the formatted file as
String with, and " in it
padded with spaces to the required string length.

Any number of these strings can appear anywhere in a record. I can probably request can all strings that can potentially have a " or , in it are always surrounded by quotes so simplifying the rules (a bit).

I suspect the INREC OUTREC is the approach. The INREC to parse the fields between commas (where the commas are not in quotes), then the OUTREC removing the leading and trailing quote, and somehow the double quotes in the string being transformed into one.

As I say I haven't looked at converting things like -123.45 into packed decimal yet.

I think I am probably talking myself out of DFSORT, unless the solution is really simple. I say that because implementing a new file transformation could be quite a technical task in terms of DFSORT statements.

I would appreciate your advice on this, but please do not invest too much time on it!

Now if only File Manager could do this!

Many Thanks
_________________
Gordon Taylor

dbzTHEdinosauer · Posted: Fri Jul 06, 2007 6:52 am Post subject:

The CSV to ? rules seem to be:

The first and last double-quotes are a record demarcation.
a double-quote comma double-quote combination is a fields separator.
- replace the 3 char combo with 1 space - sqz
after that
- any double double-quote is a single double-quote.
- any comma is treated as a char - i.e. left alone

a breeze in REXX or C, fun with COBOL though cumbersome and without record layouts, and doable in SORT without needing to deal with the first 2 char.

the ? to CSV, possibly the reverse but I doubt it.

I love BBCode Twisted Evil

_________________
Dick Brenholtz
American living in Varel, Germany

gordonstaylor · Posted: Fri Jul 06, 2007 9:19 am Post subject:

dbzTHEdinosauer,

Not quite, consider the following

00,20070706
01,-123.45,Text with no punctuation,789,"Text with ""lots"", of stuff"
01,-77.90,Text with nowt,1,"Text with ""some"" stuff"
99,4
This is a file of four records, of three different types,

dbzTHEdinosauer · Posted: Fri Jul 06, 2007 9:27 am Post subject:

gordon,

I understand now, and it seems that you have a leg on solving the problem.
_________________
Dick Brenholtz
American living in Varel, Germany