I was recently given a 256MB XML file at work to parse and extract details from. The idea is to use it to prepare data that we'd like to send to our backend API.
The XML format looks like this:
<DataKind1>
<thing1>hello</thing1>
<thing2>world</thing2>
</DataKind1>
<DataKind2>
<other1>hello</other1>
<other2>hello</other2>
</DataKind2>
<DataKind3>
<other3>hello</other3>
<other4>hello</other4>
</DataKind3>
So you can see that it's an XML file containing different types/kinds of data, each identified by its top-level node name. These kinds of data repeat themselves throughout the file for 256MB. I want to use my data analysis skills to slice and chop up the specific data I want (specifically, grouping some of it) before sending it up.
So what I wanted to do was get each kind of data into a data frame. That meant parsing the XML into one dictionary of field values per record, then building a data frame for each kind from those records, with one column for each field in the data kind. I'm probably not explaining it very well, so maybe this code is more insightful:
import pandas as pd
from lxml import etree
import time

start = time.time()
holding_file = "HoldingsSummary.xml"
program_name = "AdjustHolding.py"

def print_el(el):
    print(etree.tostring(el, pretty_print=True))

def el_to_dict(el):
    d = {}
    for child in el:
        d[child.tag] = child.text
    return d

all_top_level_elements = etree.parse(holding_file).getroot()
dataKinds = {}
headersFor = {}
for element in all_top_level_elements:
    el_dict = el_to_dict(element)
    if element.tag not in dataKinds:
        # First time we see this kind: record its column headers.
        dataKinds[element.tag] = []
        headersFor[element.tag] = list(el_dict.keys())
    # Append every record, including the first one of each kind.
    dataKinds[element.tag].append(tuple(el_dict.values()))

dfs = {}
for name, rows in dataKinds.items():
    if len(rows) > 0:
        dfs[name] = pd.DataFrame(rows, columns=headersFor[name])

# Each kind of data can now be dealt with by fetching the created data frame for it.
dk1_df = dfs['DataKind1']
print(dk1_df.head())
I guess the key thing for me is that I can now work with pandas data frames instead of raw XML for each data kind, making higher-level data analysis techniques available to me, such as setting the series datatypes, running aggregations and grouping.
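To make that concrete, here's a small sketch of the kind of thing I mean. The column names ("account", "units") and values are made up here, since the real file's contents aren't shown:

```python
import pandas as pd

# A small frame shaped like one of the dfs above; the field names and
# values are hypothetical stand-ins for the real data.
dk_df = pd.DataFrame(
    [("acme", "10"), ("acme", "5"), ("globex", "7")],
    columns=["account", "units"],
)

# Everything parsed out of XML text nodes arrives as strings, so set
# the series dtype before aggregating.
dk_df["units"] = dk_df["units"].astype(int)

# Now grouping and aggregation work as expected.
totals = dk_df.groupby("account")["units"].sum()
print(totals)
```

None of that would be pleasant to do against the raw element tree.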
There are limitations, of course. I don't need to parse multi-level XML nodes; however, if I needed to do that, I'd re-define el_to_dict(element) to do more work than just expecting flat children under the top-level node, recursing into the remaining child tree and effectively flattening it, like the code above does for a single level.
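A sketch of what that re-definition might look like, joining nested tag names with dots (I'm using the standard library's ElementTree here for a self-contained example; lxml elements iterate the same way, so either works):

```python
import xml.etree.ElementTree as ET

def el_to_dict_flat(el, prefix=""):
    """Flatten a nested element into one dict, e.g. {'nested.inner': 'world'}."""
    d = {}
    for child in el:
        key = prefix + child.tag
        if len(child):  # the child has children of its own: recurse
            d.update(el_to_dict_flat(child, key + "."))
        else:
            d[key] = child.text
    return d

sample = ET.fromstring(
    "<DataKind1><thing1>hello</thing1>"
    "<nested><inner>world</inner></nested></DataKind1>"
)
print(el_to_dict_flat(sample))
# {'thing1': 'hello', 'nested.inner': 'world'}
```

The flattened keys then become data frame column names, exactly as in the single-level case.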
Well, the next piece of work starts on Monday, when I'll group by a specific field in the data kind and prepare some API requests for sending that data up to the cloud.
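That part will probably look something like this. The grouping field ("account") and the payload shape are assumptions on my part, since the real field and API aren't pinned down yet:

```python
import pandas as pd

# Hypothetical data kind: group by a field and turn each group into a
# request body ready to serialise as JSON and POST to the backend.
dk_df = pd.DataFrame(
    [("acme", "a1"), ("acme", "a2"), ("globex", "g1")],
    columns=["account", "holding_id"],
)

payloads = {
    account: group.to_dict("records")
    for account, group in dk_df.groupby("account")
}
# Each value is a list of row dicts, one request per group.
print(payloads["acme"])
```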