MVSFORUMS.com Forum Index MVSFORUMS.com
A Community of and for MVS Professionals
 
 FAQFAQ   SearchSearch   Quick Manuals   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Parse in DFSORT

 
Post new topic   Reply to topic   printer-friendly view    MVSFORUMS.com Forum Index -> Utilities
View previous topic :: View next topic  
Author Message
gordonstaylor
Beginner


Joined: 16 May 2007
Posts: 6
Topics: 2
Location: United Kingdom

PostPosted: Thu Jul 05, 2007 7:07 am    Post subject: Parse in DFSORT Reply with quote

I have a number of requirements for parsing comma separated value(CSV) files sourced from other platforms on the mainframe, and vice versa.

The Smart DFSORT Tricks documents Deconstruct and reconstruct csv records, so I am now trying to evaluate if this is a good option as opposed to SELCOPY or PL/1. COBOL is not an option.

The latter approaches (SELCOPY, PL/1) would mean developing utilities (CSV->formatted, formatted-> CSV), such that they are supplied data structures (to support multi record files). Implementing a new transformation requirement would then mean simply supplying the data structure(s).

Using the DFSORT option would mean creating new DFSORT statements to parse the file in question, but without the once off utility development overhead.

I have tried some simple parsing, but have hit an issue that is not blindingly obvious from the DFSORT documentation.

The basic rules of the files are
Quotes will surround data values that contain commas or quotes.
Quotes in data values will be doubled up, but count as one.

The example I have tried looks like this, where the first two bytes are the record type

01,"THIS,IS RECORD ONE"
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""ADDED"""


in an attempt to get this

01THIS,IS RECORD ONE
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"


These statements (excuse the format, which looks fine in the post window)

Code:

 INREC IFTHEN=(WHEN=(1,2,CH,EQ,C'01'),                       
               PARSE=(%00=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=2),       
                            %01=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=30)),     
               BUILD=(%00,%01)),                                   
           IFTHEN=(WHEN=(1,2,CH,EQ,C'02'),                       
               PARSE=(%10=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=2),       
                            %11=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=30),     
                            %12=(ENDBEFR=C',',PAIR=QUOTE,FIXLEN=30)),     
               BUILD=(%10,%11,%12))                       


Gives me
01"THIS,IS RECORD ONE"
02"THIS,IS RECORD TWO" "WITH EXTRA DATA""ADDED"""


i.e. the quotes remain

A different approach, more specific field by field and without the PAIR=QUOTE

Code:

INREC IFTHEN=(WHEN=(1,2,CH,EQ,C'01'),           
              PARSE=(%00=(ENDBEFR=C',"',FIXLEN=2),   
                           %01=(ENDBEFR=C'" ',FIXLEN=30)), 
              BUILD=(%00,%01)),                       
          IFTHEN=(WHEN=(1,2,CH,EQ,C'02'),           
             PARSE=(%10=(ENDBEFR=C',"',FIXLEN=2),   
                          %11=(ENDBEFR=C'","',FIXLEN=30), 
                          %12=(ENDBEFR=C'" ',FIXLEN=30)), 
             BUILD=(%10,%11,%12))                   


Gives me
01THIS,IS RECORD ONE
02THIS,IS RECORD TWO WITH EXTRA DATA""ADDED""


So I have solved part of the problem, the quotes around the data value have now gone, but am left with the double quote that I want to become one!

Am I pushing my luck/expectations in this regard, given I haven't looked at creating packed decimal values yet?

Any guidance would be much appreciated.
_________________
Gordon Taylor
Back to top
View user's profile Send private message
kolusu
Site Admin
Site Admin


Joined: 26 Nov 2002
Posts: 12378
Topics: 75
Location: San Jose

PostPosted: Thu Jul 05, 2007 7:40 am    Post subject: Reply with quote

gordonstaylor,

Do you just want the double quotes to be removed?

Kolusu
_________________
Kolusu
www.linkedin.com/in/kolusu
Back to top
View user's profile Send private message Send e-mail Visit poster's website
gordonstaylor
Beginner


Joined: 16 May 2007
Posts: 6
Topics: 2
Location: United Kingdom

PostPosted: Thu Jul 05, 2007 8:14 am    Post subject: Reply with quote

Kolusu,

Pairs of quotes removed (", as opposed to apostrophe ') , to be replaced by one.

A double quote represents one quote of real data when they are delimited by a pair of quotes. Delimiting quotes are required when the value can have a comma in it. (which makes csv files easy to parse and explain Smile !)
i.e.
from this
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""ADDED"""
to this (as the quote is real data, not a form of delimeter)
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"
not this
02THIS,IS RECORD TWO WITH EXTRA DATA""ADDED""
_________________
Gordon Taylor
Back to top
View user's profile Send private message
dbzTHEdinosauer
Supermod


Joined: 20 Oct 2006
Posts: 1411
Topics: 26
Location: germany

PostPosted: Thu Jul 05, 2007 8:21 am    Post subject: Reply with quote

looks to me, that the record has been built incorrectly. If you wanted this
Code:
02THIS,IS RECORD TWO WITH EXTRA DATA"ADDED"

it should have been coded as this
Code:
02,"THIS,IS RECORD TWO","WITH EXTRA DATA""""ADDED"""

_________________
Dick Brenholtz
American living in Varel, Germany
Back to top
View user's profile Send private message
gordonstaylor
Beginner


Joined: 16 May 2007
Posts: 6
Topics: 2
Location: United Kingdom

PostPosted: Thu Jul 05, 2007 8:39 am    Post subject: Reply with quote

dbzTHEdinosauer,

The value "ADDED" is not a separate field.

The field value is
WITH EXTRA DATA"ADDED"

So in a csv file, in case it has commas in the value, quotes in the value are double up hence
"WITH EXTRA DATA""ADDED"""

Thanks
_________________
Gordon Taylor
Back to top
View user's profile Send private message
dbzTHEdinosauer
Supermod


Joined: 20 Oct 2006
Posts: 1411
Topics: 26
Location: germany

PostPosted: Thu Jul 05, 2007 9:52 am    Post subject: Reply with quote

gordon,

ok, real problem is that the input files are not going to change. thus my suggestion was dumb, even if it would work.

how about parsing on both INREC and OUTREC?

your real problem is that you are parsing for commas outside double-quote pairs and then are parsing double-double-quotes to become single-double-quotes.

It should be possible without having to identify the record type.

No, I have no idea how to do it with DFSORT. Is REXX not an allowable solution?
_________________
Dick Brenholtz
American living in Varel, Germany
Back to top
View user's profile Send private message
Frank Yaeger
Sort Forum Moderator
Sort Forum Moderator


Joined: 02 Dec 2002
Posts: 1618
Topics: 31
Location: San Jose

PostPosted: Thu Jul 05, 2007 10:30 am    Post subject: Reply with quote

Gordon,

The two example lines you gave is not much to go on in discerning all of the variations you need to handle or the "rules" to follow.

I assume that the record contains one or two strings enclosed in quotes. Can there be more than two strings in a record?

Does the 01 mean there's only one string?
Does the 02 mean there's two strings?
Can there be an 03, etc?

It appears that string2 can have a value enclosed in quotes within it - can it have more than one value enclosed in quotes? Can string1 have one or more values enclosed in quotes?

It would help if you would give an example with more of the possible variations and explain the rules for what can be in these records.

Otherwise anything we come up with might not work for all of your variations and so would be a waste of time.
_________________
Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at:
www.ibm.com/storage/dfsort
Back to top
View user's profile Send private message Send e-mail Visit poster's website
gordonstaylor
Beginner


Joined: 16 May 2007
Posts: 6
Topics: 2
Location: United Kingdom

PostPosted: Fri Jul 06, 2007 5:43 am    Post subject: Reply with quote

Thanks for feedback guys, I appreciate your effort.

Maybe a little bit more background is appropriate here. If the opening gambit of mine was quite big as it was!! So....

The problem is I have a number of files (30+ and rising) going between systems on different platforms.

Given that
- mainframe files are typically fixed in format so that they can be mapped in PL/1, COBOL or whatever, with ease
- the files from/to the other platforms are comma separated values and records by their nature are any size in length
there is a need for transformation process.

These files will often have mutliple records, although each record of a given type will have a predefined number of fields.

My initial thoughts were to have a csv->fixed and fixed->csv utility built, that could handle multiple record types. It would use file structures like products such as File Manager or File/Aid to know what to map the csv values to(/from). To implement a new file would then simply mean defining the file structure to the utility. This involves a one off development cost, plus the file definition and a bit of testing per file.

I tended to think that this is becoming a common problem and something must be out there in the big wide world, when I came across the Smart DFSORT tricks.

It was in trying the DFSORT approach that I was coming across the issues outlined above, and wondered if there was an easy way out. The current issue is that this could appear

field1 value,field2 value,........,"String with, and "" in it",......

The string with quotes is to appear in the formatted file as
String with, and " in it
padded with spaces to the required string length.

Any number of these strings can appear anywhere in a record. I can probably request can all strings that can potentially have a " or , in it are always surrounded by quotes so simplifying the rules (a bit).

I suspect the INREC OUTREC is the approach. The INREC to parse the fields between commas (where the commas are not in quotes), then the OUTREC removing the leading and trailing quote, and somehow the double quotes in the string being transformed into one.

As I say I haven't looked at converting things like -123.45 into packed decimal yet.

I think I am probably talking myself out of DFSORT, unless the solution is really simple. I say that because implementing a new file transformation could be quite a technical task in terms of DFSORT statements.

I would appreciate your advice on this, but please do not invest too much time on it!

Now if only File Manager could do this!

Many Thanks
_________________
Gordon Taylor
Back to top
View user's profile Send private message
dbzTHEdinosauer
Supermod


Joined: 20 Oct 2006
Posts: 1411
Topics: 26
Location: germany

PostPosted: Fri Jul 06, 2007 6:52 am    Post subject: Reply with quote

The CSV to ? rules seem to be:
  1. The first and last double-quotes are a record demarcation.
      remove them- sqz
  2. a double-quote comma double-quote combination is a fields separator.
      There will always be a field before the 3-char separator and a field after.
    • replace the 3 char combo with 1 space - sqz
  3. after that
    • any double double-quote is a single double-quote.
        remove 1 double-quote - sqz
    • any comma is treated as a char - i.e. left alone


a breeze in REXX or C, fun with COBOL though cumbersome and without record layouts, and doable in SORT without needing to deal with the first 2 char.

the ? to CSV, possibly the reverse but I doubt it.

I love BBCode Twisted Evil
_________________
Dick Brenholtz
American living in Varel, Germany
Back to top
View user's profile Send private message
gordonstaylor
Beginner


Joined: 16 May 2007
Posts: 6
Topics: 2
Location: United Kingdom

PostPosted: Fri Jul 06, 2007 9:19 am    Post subject: Reply with quote

dbzTHEdinosauer,

Not quite, consider the following

00,20070706
01,-123.45,Text with no punctuation,789,"Text with ""lots"", of stuff"
01,-77.90,Text with nowt,1,"Text with ""some"" stuff"
99,4
This is a file of four records, of three different types,
Code:
00-header   2 fields (record type id)
01-detail     5 fields
99-trailer    2 field (type,record count)


So looking at the record type 01. The map,and therefore values may look like this

Code:
REC_TYPE     CHAR (2)                 01
AMOUNT1      FIXED DEC(9,2)       123.45 (-ve sign)
TEXT_1         CHAR(30)                Text with no punctuation
AMOUNT2      FIXED(5)                 789
TEXT_2         CHAR (50)               Text with "lots", of stuff


(having trouble with the format again!)

So quotes are not the primary delimeter, the comma is. (There will not always be ",")
If a comma or quote can be in a value, quotes are round the whole string.
A quote in a value has to be doubled in the csv file.

To reiterate, I think the INREC statement in my first example with PAIR=QUOTE does the main bit, parse values between commas (allowing for commas in values)

But the next bit is to get rid of any leading and trailing quote,
Then replacing double quote in a value with one quote.
Shift that value left 1 space (if leading quote found) without losing embeded spaces.

Thanks again
_________________
Gordon Taylor
Back to top
View user's profile Send private message
dbzTHEdinosauer
Supermod


Joined: 20 Oct 2006
Posts: 1411
Topics: 26
Location: germany

PostPosted: Fri Jul 06, 2007 9:27 am    Post subject: Reply with quote

gordon,

I understand now, and it seems that you have a leg on solving the problem.
_________________
Dick Brenholtz
American living in Varel, Germany
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic   Reply to topic   printer-friendly view    MVSFORUMS.com Forum Index -> Utilities All times are GMT - 5 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


MVSFORUMS
Powered by phpBB © 2001, 2005 phpBB Group