Parsing big XML files efficiently
I'm dealing with XML files that are 4 GB+ in size and I'm wondering how best to parse them. Right now I run into memory issues, so I'm looking for a way to avoid loading the whole file into memory and perhaps process it in batches instead.
The current code uses lxml and iterates over the repeating elements. Namespaces are stripped up front:
from lxml import etree, objectify
import pandas as pd

file = 'some_huge_file.xml'

if file.lower().endswith('.xml'):
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(file, parser)
    root = tree.getroot()

    #### strip namespace prefixes from every tag
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'):
            continue  # skip comments and processing instructions
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)
    ####

    data = [{
        'Element1': tp.findtext('element1'),
        'Element2': tp.findtext('element2'),
        'Element3': tp.findtext('element3'),
    } for tp in tree.xpath('//mainelement/subelement')]

    df = pd.DataFrame(data)
    print(df)
Furthermore, I need to split the values of the elements, as they are space-separated. However, I only need specific values, so I'm wondering whether I can do this during parsing rather than splitting the columns on spaces afterwards.
XML example:
<mainelement>
<subelement tc="00:00:00:000" ms="0">
<element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
<element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
<element3>-0.001501 0.008416 0.000098 0.005241 0.005301</element3>
</subelement>
<subelement tc="00:00:00:001" ms="1">
<element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
<element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
<element3>-0.000000 -0.000000 -0.000000 -0.005217 0.007849</element3>
</subelement>
<subelement tc="00:00:00:002" ms="2">
<element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
<element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
<element3>-0.010139 0.001048 0.026847 -0.010139 0.001048</element3>
</subelement>
<subelement tc="00:00:00:003" ms="3">
<element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
<element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
<element3>0.001048 0.026847 -0.010139 0.001048 0.035846</element3>
</subelement>
</mainelement>
python lxml
asked Nov 22 at 12:13
Chrisvdberge
I don't know about Python XML handling, but in the Java world there are two types: SAX and DOM. DOM stands for "Document Object Model" and means the whole XML file gets loaded into memory so that queries can be run afterwards. Once loaded it is very quick, but it can consume huge amounts of memory. SAX, on the other hand, runs over your XML and fires an event whenever a certain tag, attribute, or piece of content is reached. This can take quite a while, but it uses almost no memory. So when you say "efficient", do you mean speed or memory efficiency?
– Dominique
Nov 22 at 12:29
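In Python/lxml both models are available: etree.parse() builds the whole document tree in memory (DOM-style), while etree.iterparse() emits elements as they finish parsing (closer to SAX) and lets you discard them immediately. A minimal streaming sketch, assuming the subelement structure from the question and no namespaces:

from lxml import etree

def stream_subelements(path):
    # 'end' events fire once an element is fully parsed, so memory use
    # stays roughly constant regardless of file size.
    for event, elem in etree.iterparse(path, events=('end',), tag='subelement'):
        yield elem.get('tc'), elem.findtext('element1')
        elem.clear()  # release the element's children once processed

# Hypothetical usage:
# for tc, element1_text in stream_subelements('some_huge_file.xml'):
#     print(tc, element1_text)

If the file declares a default namespace, the tag filter would need the '{namespace}subelement' form instead.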
Speed is not important at this point, I'd say. I just need to get the data into a database for now, and if I try to read all the elements and values I need, Python just crashes or errors out (and macOS starts force-quitting applications ;) ).
– Chrisvdberge
Nov 22 at 12:37
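Since the goal is to get the data into a database, one way to keep memory bounded is to flush parsed rows in batches with pandas' to_sql. This is only a sketch; the SQLite file, table name, and batch size are assumptions, not something from this thread:

import sqlite3
import pandas as pd
from lxml import etree

con = sqlite3.connect('frames.db')   # hypothetical output database
batch, BATCH_SIZE = [], 50000

for event, elem in etree.iterparse('some_huge_file.xml', tag='subelement'):
    batch.append({'tc': elem.get('tc'), 'element1': elem.findtext('element1')})
    elem.clear()
    if len(batch) >= BATCH_SIZE:
        # Write the batch and drop it so only BATCH_SIZE rows are held in memory.
        pd.DataFrame(batch).to_sql('frames', con, if_exists='append', index=False)
        batch = []

if batch:
    pd.DataFrame(batch).to_sql('frames', con, if_exists='append', index=False)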
Did you check this?
– Andersson
Nov 22 at 12:41
Possible duplicate of xml parsing in python for big data
– stovfl
Nov 22 at 13:31
Those links were quite helpful. The possible duplicate pointed in the right direction but didn't provide a clear, concrete answer, so I added the code I came up with as an answer to this question for clarity.
– Chrisvdberge
Nov 23 at 12:05
1 Answer
Based on the links posted in the comments, I came up with the following, which iterates and splits more efficiently and works fine:
from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    # split() without an argument tolerates leading/extra spaces
                    split_data = child.text.split()
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split()
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
            elem.clear()  # free the element once its children have been read

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})
print(df)
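One caveat worth adding (my observation about lxml's iterparse behaviour, not something covered above): elem.clear() empties each subelement, but the root element still keeps a reference to every cleared child, so the tree can keep growing over a 4 GB file. Deleting already-processed siblings as you go avoids that. A minimal sketch, with the per-element handling reduced to collecting the tc attribute:

from lxml import etree

file = 'some_huge_file.xml'
times = []

for event, elem in etree.iterparse(file, tag='subelement'):
    times.append(elem.get('tc'))   # per-element processing goes here
    elem.clear()
    # Drop earlier siblings so the root does not accumulate empty <subelement>s.
    while elem.getprevious() is not None:
        del elem.getparent()[0]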
answered Nov 23 at 12:04
Chrisvdberge