Looking For The Right Way With Regular Expression With Groups In Different Order

February 26, 2024

I am trying to parse many COBOL copybooks using Python. I have this regular expression, which I modified from the one provided in cobol.py: ^(?P\d{2})\s+(?P\S

Solution 1:

PyParsing (https://github.com/pyparsing/pyparsing) is a good module for building grammars easily. You can build a basic Copybook grammar and parse it using PyParsing. You would then have to post-process the result to retain the tree-like structure that is represented by the two-digit level fields.

Also take a look at the Copybook package (https://github.com/zalmane/copybook) which uses PyParsing.
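The post-processing step mentioned above can be sketched without any parser library. This is only a minimal sketch, not code from PyParsing or the Copybook package: it assumes the parser has already produced (level, name) pairs in source order, and Node, build_tree and dump are hypothetical helper names. The sample fields are borrowed from the Ams-Vendor copybook under Solution 2, and special levels (66, 77, 88) are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    level: int
    name: str
    children: list = field(default_factory=list)


def build_tree(fields):
    """Nest (level, name) pairs by level number, using a stack of open groups."""
    root = Node(0, "ROOT")  # synthetic root; lower than any real level number
    stack = [root]
    for level, name in fields:
        # Close every open group whose level is >= the new item's level.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(level, name)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def dump(node, depth=0):
    """Return the tree below `node` as a list of indented lines."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + f"{child.level:02d} {child.name}")
        lines.extend(dump(child, depth + 1))
    return lines


# Assumed parser output: (level, name) pairs in the order they appear.
fields = [
    (1, "Ams-Vendor"),
    (3, "Brand"),
    (3, "Location-details"),
    (5, "Location-Number"),
    (5, "Location-Type"),
    (3, "Location-Active"),
]
tree = build_tree(fields)
print("\n".join(dump(tree)))
```

The stack always holds the chain of currently open groups, so an item is attached to the nearest preceding item with a lower level number, which is how COBOL nests group items.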

Solution 2:

cb2xml

You should look at cb2xml. It will parse a Cobol Copybook and create an XML file. You can then process the XML in Python or any other language. The cb2xml package has basic examples of processing the XML in Python and other languages.

Cobol:

01 Ams-Vendor.
       03 Brand               Pic x(3).
       03 Location-details.
          05 Location-Number  Pic 9(4).
          05 Location-Type    Pic XX.
          05 Location-Name    Pic X(35).
       03 Address-Details.
          05 actual-address.
             10 Address-1     Pic X(40).
             10 Address-2     Pic X(40).
             10 Address-3     Pic X(35).
          05 Postcode         Pic 9(4).
          05 Empty            pic x(6).
          05 State            Pic XXX.
       03 Location-Active     Pic X.

Output from cb2xml:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<copybook filename="cbl2xml_Test110.cbl">
  <item display-length="173" level="01" name="Ams-Vendor" position="1" storage-length="173">
    <item display-length="3" level="03" name="Brand" picture="x(3)" position="1" storage-length="3"/>
    <item display-length="41" level="03" name="Location-details" position="4" storage-length="41">
      <item display-length="4" level="05" name="Location-Number" numeric="true" picture="9(4)" position="4" storage-length="4"/>
      <item display-length="2" level="05" name="Location-Type" picture="XX" position="8" storage-length="2"/>
      <item display-length="35" level="05" name="Location-Name" picture="X(35)" position="10" storage-length="35"/>
    </item>
    <item display-length="128" level="03" name="Address-Details" position="45" storage-length="128">
      <item display-length="115" level="05" name="actual-address" position="45" storage-length="115">
        <item display-length="40" level="10" name="Address-1" picture="X(40)" position="45" storage-length="40"/>
        <item display-length="40" level="10" name="Address-2" picture="X(40)" position="85" storage-length="40"/>
        <item display-length="35" level="10" name="Address-3" picture="X(35)" position="125" storage-length="35"/>
      </item>
      <item display-length="4" level="05" name="Postcode" numeric="true" picture="9(4)" position="160" storage-length="4"/>
      <item display-length="6" level="05" name="Empty" picture="x(6)" position="164" storage-length="6"/>
      <item display-length="3" level="05" name="State" picture="XXX" position="170" storage-length="3"/>
    </item>
    <item display-length="1" level="03" name="Location-Active" picture="X" position="173" storage-length="1"/>
  </item>
</copybook>
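As a rough illustration of processing that XML in Python, the sketch below uses only the standard library. It is not one of the examples shipped with cb2xml. It assumes the element and attribute names shown in the output above (item, name, picture, position, storage-length), no XML namespace, and a plain-text record with no Comp-3 or binary fields. The helper names leaf_fields and split_record and the sample record are made up for the example, and the XML is a trimmed copy of the first few items.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cb2xml output; the 01-level lengths still describe the full record.
XML = """\
<copybook filename="cbl2xml_Test110.cbl">
  <item display-length="173" level="01" name="Ams-Vendor" position="1" storage-length="173">
    <item display-length="3" level="03" name="Brand" picture="x(3)" position="1" storage-length="3"/>
    <item display-length="41" level="03" name="Location-details" position="4" storage-length="41">
      <item display-length="4" level="05" name="Location-Number" numeric="true" picture="9(4)" position="4" storage-length="4"/>
      <item display-length="2" level="05" name="Location-Type" picture="XX" position="8" storage-length="2"/>
      <item display-length="35" level="05" name="Location-Name" picture="X(35)" position="10" storage-length="35"/>
    </item>
  </item>
</copybook>
"""


def leaf_fields(root):
    """Yield (name, start, length) for every item that has a picture attribute."""
    for item in root.iter("item"):
        if "picture" in item.attrib:
            start = int(item.get("position")) - 1  # positions in the XML are 1-based
            yield item.get("name"), start, int(item.get("storage-length"))


def split_record(record, fields):
    """Slice one fixed-width text record into a dict keyed by field name."""
    return {name: record[start:start + length] for name, start, length in fields}


fields = list(leaf_fields(ET.fromstring(XML)))

# Made-up sample record covering the four leaf fields above.
record = "TAR" + "5015" + "ST" + "Example Store".ljust(35)
print(split_record(record, fields))
```

Group items such as Location-details carry no picture attribute, so filtering on it keeps only the elementary fields that actually hold data.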

An interesting application of cb2xml is described in "Dynamically Reading COBOL Redefines with C#".

CobolToCsv

The CobolToCsv package will convert a Cobol-Data-File to a Csv file. Limitations:

Redefines / Multi-Record files are not handled.
Only a fairly limited range of Cobol compilers is supported (Mainframe, Gnu Cobol, Fujitsu-Cobol).

Cobol2Csv should be able to handle text files (+ Comp-3). It may handle some of your files.

Solution 3:

Although an actual parser like PLY or Parsley would be best for this, if you have to use a regex, can't you just add another OCCURS group with a different key? For example:

import re

COPYBOOK = """
03  AMOUNT-BREAKDOWN        PICTURE 9(8)V99  VALUE ZEROES.
03  AMOUNT-BREAKDOWN-X REDEFINES AMOUNT-BREAKDOWN.
05  FILLER              PICTURE X(3)     VALUE "DEC".
03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES.
03  SUB                 PICTURE 99    VALUE 0.
03  NUMBER-HOLD.
05  NUMB-HOLD       PICTURE X  OCCURS 11 TIMES.
05  FILLER              PICTURE X(5)     VALUE "TEN".
03  DIGIT-TAB2 REDEFINES DIGIT-TAB1.
05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).
03  WK-TEN-MILLION          PICTURE X(5)     VALUE SPACES.
"""

# Adjacent raw-string literals are concatenated into a single pattern.
PATTERN = re.compile(
    r"^(?P<level>\d{2})\s+(?P<name>\S+).*?"
    r"(\s+INDEXED BY\s+(?P<indexed_by>\S+))?.*?"
    r"(\s+REDEFINES\s+(?P<redefines>\S+))?.*?"
    r"(\s+OCCURS\s+(?P<occurs1>\d+).?( TIMES)?)?.*?"  # <-- occurs1: OCCURS before PIC
    r"(\s+PIC(TURE)?\s+(?P<pic>\S+))?.*?"
    r"(\s+OCCURS\s+(?P<occurs>\d+).?( TIMES)?)?.*?"
    r"((?P<comp>)\s+COMP\S+)?.*?"
    r"(\s+VALUE\s+(?P<value>\S+).*)?"
    r"\.$"
)

for line in COPYBOOK.split("\n"):
    if len(line) < 1:
        continue  # skip blank lines
    m = PATTERN.match(line)
    if m:
        print(m.groups())


Sample output:

('03', 'AMOUNT-BREAKDOWN', None, None, None, None, None, None, None, '        PICTURE 9(8)V99', 'TURE', '9(8)V99', None, None, None, None, None, '  VALUE ZEROES', 'ZEROES')
('03', 'AMOUNT-BREAKDOWN-X', None, None, ' REDEFINES AMOUNT-BREAKDOWN', 'AMOUNT-BREAKDOWN', None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'FILLER', None, None, None, None, None, None, None, '              PICTURE X(3)', 'TURE', 'X(3)', None, None, None, None, None, '     VALUE "DEC"', '"DEC"')
('03', 'MONTH', None, None, ' REDEFINES MONTH-TAB', 'MONTH-TAB', None, None, None, '  PICTURE X(3)', 'TURE', 'X(3)', ' OCCURS 12 ', '12', None, None, None, None, None)
('03', 'SUB', None, None, None, None, None, None, None, '                 PICTURE 99', 'TURE', '99', None, None, None, None, None, '    VALUE 0', '0')
('03', 'NUMBER-HOLD', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'NUMB-HOLD', None, None, None, None, None, None, None, '       PICTURE X', 'TURE', 'X', '  OCCURS 11 ', '11', None, None, None, None, None)
('05', 'FILLER', None, None, None, None, None, None, None, '              PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, '     VALUE "TEN"', '"TEN"')
('03', 'DIGIT-TAB2', None, None, ' REDEFINES DIGIT-TAB1', 'DIGIT-TAB1', None, None, None, None, None, None, None, None, None, None, None, None, None)
('05', 'DIGIT-TABLE', None, None, None, None, '         OCCURS 10 ', '10', None, '  PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, None, None)
('03', 'WK-TEN-MILLION', None, None, None, None, None, None, None, '          PICTURE X(5)', 'TURE', 'X(5)', None, None, None, None, None, '     VALUE SPACES', 'SPACES')
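If the clauses can appear in more orders than one extra OCCURS group can cover, a different approach from the single pattern above is to match the level and name once and then search for each clause on its own, so that order stops mattering. The following is only a sketch of that idea under some assumptions: each entry sits on one line, keywords never appear inside quoted VALUE literals, and only the first token of a VALUE is captured. HEAD, CLAUSES and parse_line are hypothetical names.

```python
import re

# Level number and data name always come first.
HEAD = re.compile(r"^\s*(?P<level>\d{2})\s+(?P<name>\S+)")

# One small pattern per optional clause. Each is searched independently,
# so the order of the clauses on the line does not matter.
CLAUSES = {
    "redefines": re.compile(r"(?<=\s)REDEFINES\s+(\S+)"),
    "occurs": re.compile(r"(?<=\s)OCCURS\s+(\d+)"),
    "pic": re.compile(r"(?<=\s)PIC(?:TURE)?\s+(\S+)"),
    "value": re.compile(r"(?<=\s)VALUE\s+(\S+)"),
}


def parse_line(line):
    """Return a dict of clause values for one copybook line, or None if it is not an entry."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    head = HEAD.match(body)
    if head is None:
        return None
    result = head.groupdict()
    rest = body[head.end():]
    for key, pattern in CLAUSES.items():
        found = pattern.search(rest)
        result[key] = found.group(1) if found else None
    return result


for sample in (
    "03  MONTH REDEFINES MONTH-TAB  PICTURE X(3) OCCURS 12 TIMES.",
    "05  DIGIT-TABLE         OCCURS 10   PICTURE X(5).",
):
    print(parse_line(sample))
```

The trade-off is that nothing checks the line as a whole, so malformed entries are accepted silently; a real grammar (PLY, Parsley, PyParsing) remains the more robust choice.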
