Parsing the USPTO Bulk XML Files Using Python
import xml.etree.ElementTree as ET
import csv
import re
import codecs
import io

xml = open('ipa110106.xml')
line_num = 0
f = open('workfile.xml', 'w')
for line in xml:
    line_nu
Solution 1:
The current USPTO bulk files are valid XML once you split them at each XML declaration and process each publication separately. I would expect parsing the whole file at once to use a very large amount of memory. Either way, the replacements you are doing aren't needed.
My solution was to create a class that owns the zip file (for those who might not know, the bulk data is a zip file containing a single file made of concatenated XML documents) and has a method that yields each XML document in turn. I then use ET.XML() to process each document.
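A minimal sketch of that approach follows. It is not the original class: the names BulkXmlArchive and documents() are made up for illustration, and it assumes that every publication starts with an `<?xml` declaration at the beginning of a line and that the archive holds exactly one member.

```python
import xml.etree.ElementTree as ET
import zipfile


class BulkXmlArchive:
    """Hypothetical wrapper around a bulk zip file (name is illustrative)."""

    def __init__(self, path):
        # path may be a filename or a binary file-like object
        self._zip = zipfile.ZipFile(path)

    def close(self):
        self._zip.close()

    def documents(self):
        """Yield each concatenated XML document as a bytes object."""
        # Assumption: the archive contains a single member holding all documents.
        name = self._zip.namelist()[0]
        with self._zip.open(name) as raw:
            buffer = []
            for line in raw:
                # Assumption: each document begins with an XML declaration
                # at the start of a line, so that is where we split.
                if line.startswith(b"<?xml") and buffer:
                    yield b"".join(buffer)
                    buffer = []
                buffer.append(line)
            if buffer:
                yield b"".join(buffer)


def iter_roots(path):
    """Parse one publication at a time so only one tree is in memory."""
    archive = BulkXmlArchive(path)
    try:
        for doc in archive.documents():
            yield ET.XML(doc)
    finally:
        archive.close()
```

Looping over iter_roots() with the path to the downloaded zip gives one parsed root element per publication, so memory use stays bounded by the largest single document rather than by the whole file. If a document uses entities that are only defined in the external DTD, ET.XML() may raise a ParseError, and that case would need separate handling.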