Parsing big XML files efficiently











I'm dealing with XML files that are 4GB+ in size and wondering how best to parse them. Right now I run into memory issues, so I'm looking for a way to avoid loading the whole file into memory, perhaps by going through it in batches?



Current code is using lxml and iterating over the repeating elements. Namespaces are cleared up front:



from lxml import etree, objectify
import pandas as pd

file = 'some_huge_file.xml'
if file.lower().endswith('.xml'):

    # etree.parse builds the complete tree in memory
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(file, parser)
    root = tree.getroot()

    #### strip the namespace prefix from every tag
    for elem in root.iter():
        if not hasattr(elem.tag, 'find'): continue  # (1) skip comments and processing instructions
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)
    ####
    data = [{
        'Element1': tp.findtext('element1'),
        'Element2': tp.findtext('element2'),
        'Element3': tp.findtext('element3'),
    } for tp in tree.xpath('//mainelement/subelement')]

    df = pd.DataFrame(data)
    print(df)


Furthermore, I need to split the values of the elements, as they are space-separated. However, I only need specific values, so I'm wondering whether I can do this during parsing instead of splitting the columns on spaces afterwards?



XML example:

<mainelement>
    <subelement tc="00:00:00:000" ms="0">
        <element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
        <element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
        <element3>-0.001501 0.008416 0.000098 0.005241 0.005301</element3>
    </subelement>
    <subelement tc="00:00:00:001" ms="1">
        <element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
        <element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
        <element3>-0.000000 -0.000000 -0.000000 -0.005217 0.007849</element3>
    </subelement>
    <subelement tc="00:00:00:002" ms="2">
        <element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
        <element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
        <element3>-0.010139 0.001048 0.026847 -0.010139 0.001048</element3>
    </subelement>
    <subelement tc="00:00:00:003" ms="3">
        <element1>0.856813 0.494664 0.003921 0.023356 0.868762</element1>
        <element2>-0.030966 0.059466 1.388910 -0.152636 -0.008650</element2>
        <element3>0.001048 0.026847 -0.010139 0.001048 0.035846</element3>
    </subelement>
</mainelement>









  • I don't know about Python XML handling, but in the Java world there are two types of parser: SAX and DOM. DOM stands for "Document Object Model": the whole XML file is loaded into memory so that it can be queried afterwards. Once loaded, queries are very quick, but the tree can consume huge amounts of memory. SAX, by contrast, streams through the XML and fires an event whenever a tag, attribute, or piece of content is reached. This may take quite a while, but it takes almost no memory. So, when you say "efficient", do you mean speed or memory efficiency?
    – Dominique
    Nov 22 at 12:29

  • Speed is not important at this point, I'd say. I just need to get the data into a database for now, and if I try to read all the elements and values I need, Python just crashes or errors (and macOS starts force quitting applications ;) ).
    – Chrisvdberge
    Nov 22 at 12:37

  • Did you check this?
    – Andersson
    Nov 22 at 12:41

  • Possible duplicate of xml parsing in python for big data
    – stovfl
    Nov 22 at 13:31

  • Those links were quite helpful. The possible duplicate pointed in the right direction but didn't provide a clear, concrete answer, so I added the code I came up with as an answer to this question for clarity.
    – Chrisvdberge
    Nov 23 at 12:05
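
For readers wondering what the event-driven (SAX) style from the first comment looks like in Python, here is a minimal sketch using the standard library's xml.sax module rather than lxml. The class name SubelementHandler and its wanted parameter are made up for illustration, and the tag names are taken from the XML example in the question. It still collects rows in a list, so for a really large file you would probably hand each row to the database instead of keeping it.

```python
import io
import xml.sax


class SubelementHandler(xml.sax.ContentHandler):
    """Collects one row per <subelement> without building a document tree.

    The class name and the `wanted` parameter are hypothetical.
    """

    def __init__(self, wanted=("element1", "element2", "element3")):
        super().__init__()
        self.wanted = set(wanted)
        self.rows = []
        self._row = None     # row being filled, or None outside <subelement>
        self._field = None   # name of the wanted child that is currently open
        self._chunks = []    # text pieces received for the current child

    def startElement(self, name, attrs):
        if name == "subelement":
            self._row = {"tc": attrs.get("tc")}
        elif self._row is not None and name in self.wanted:
            self._field = name
            self._chunks = []

    def characters(self, content):
        # The parser may deliver a single text node in several pieces.
        if self._field is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if self._field is not None and name == self._field:
            # split() without arguments tolerates leading or repeated spaces.
            self._row[name] = [float(v) for v in "".join(self._chunks).split()]
            self._field = None
        elif name == "subelement" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


sample = b"""<mainelement>
<subelement tc="00:00:00:000" ms="0">
<element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
<element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
</subelement>
</mainelement>"""

handler = SubelementHandler()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.rows[0]["tc"], handler.rows[0]["element1"][:2])
```

xml.sax.parse also accepts a file name, so the in-memory sample can be swapped for the real file.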















python lxml

asked Nov 22 at 12:13
Chrisvdberge












1 Answer

accepted










Based on the links posted in the comments, I came up with the following, which iterates over the file and splits the values more efficiently, and it works fine:



from lxml import etree
import pandas as pd

file = 'some_huge_file.xml'

# One list per output column
time = []
data1_element1_x = []
data1_element1_y = []
data1_element2_x = []
data1_element2_y = []
data2_element1_x = []
data2_element1_y = []
data2_element2_x = []
data2_element2_y = []

if file.lower().endswith('.xml'):
    # iterparse yields each element once its closing tag has been read
    for event, elem in etree.iterparse(file):
        if elem.tag == "subelement":
            time.append(elem.get('tc'))
            for child in elem:
                if child.tag == "element1":
                    # split() without arguments tolerates leading or repeated spaces
                    split_data = child.text.split()
                    data1_element1_x.append(float(split_data[0]))
                    data1_element1_y.append(float(split_data[1]))
                    data2_element1_x.append(float(split_data[2]))
                    data2_element1_y.append(float(split_data[3]))
                elif child.tag == "element2":
                    split_data = child.text.split()
                    data1_element2_x.append(float(split_data[0]))
                    data1_element2_y.append(float(split_data[1]))
                    data2_element2_x.append(float(split_data[2]))
                    data2_element2_y.append(float(split_data[3]))
            # Clear only after the whole subelement has been processed;
            # clearing the children earlier would discard their text.
            elem.clear()

df = pd.DataFrame({
    'Time': time,
    'Data1_element1_x': data1_element1_x,
    'Data1_element1_y': data1_element1_y,
    'Data1_element2_x': data1_element2_x,
    'Data1_element2_y': data1_element2_y,
    'Data2_element1_x': data2_element1_x,
    'Data2_element1_y': data2_element1_y,
    'Data2_element2_x': data2_element2_x,
    'Data2_element2_y': data2_element2_y
})

print(df)
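
Since the question also asked about going through the file in batches and keeping only specific values, the same idea can be written as a generator. This is only a sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree.iterparse instead of lxml so that it runs without extra packages, the names iter_rows, batched and wanted are made up for illustration, and the choice of value positions is arbitrary. With lxml, a similar pattern is commonly written with elem.clear() plus deleting the already-processed siblings.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, wanted=None):
    """Yield one dict per <subelement>, keeping only the requested value positions."""
    # Hypothetical default: positions to keep from each child element.
    wanted = wanted or {"element1": (0, 1), "element2": (0, 1)}
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "subelement":
            row = {"tc": elem.get("tc")}
            for child in elem:
                if child.tag in wanted:
                    values = (child.text or "").split()
                    for i in wanted[child.tag]:
                        row[f"{child.tag}_{i}"] = float(values[i])
            yield row
            root.clear()  # drop processed subelements so the tree stays small


def batched(rows, size):
    """Group an iterator of rows into lists of at most `size` items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


sample = b"""<mainelement>
<subelement tc="00:00:00:000" ms="0">
<element1>0.861668 0.496888 0.000000 0.000000 0.867815</element1>
<element2>0.043423 0.509801 -0.111990 -0.070212 0.126711</element2>
</subelement>
<subelement tc="00:00:00:001" ms="1">
<element1>-0.503814 0.005664 -0.070326 -0.860926 -0.503720</element1>
<element2>-0.044658 0.046381 0.909291 -0.033390 0.049348</element2>
</subelement>
<subelement tc="00:00:00:002" ms="2">
<element1> -0.861173 0.503578 -0.007163 0.056031 0.862101</element1>
<element2>0.371398 1.325794 -0.030966 0.059466 1.388910</element2>
</subelement>
</mainelement>"""

batches = list(batched(iter_rows(io.BytesIO(sample)), size=2))
for batch in batches:
    print(len(batch), batch[0]["tc"])
```

Each batch is a plain list of dicts, so it can be passed to pd.DataFrame(batch) or written to the database before the next one is read. That way only one batch has to be held in memory at a time.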





answered Nov 23 at 12:04
Chrisvdberge
