How to extract text from between HTML tags?











I have some HTML elements from which I want to extract the text. The HTML looks like this:



<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>

</pre>


where I want to extract the text as



ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()


I found an answer to that issue here, but it does not work for me. Here is the complete example code:



from bs4 import BeautifulSoup as BSHTML

bs = BSHTML("""<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>""")
print bs.font.contents[0].strip()


where I get the following error:



Traceback (most recent call last):
  File "invest.py", line 13, in <module>
    print bs.font.contents[0].strip()
AttributeError: 'NoneType' object has no attribute 'contents'


Am I missing anything? BeautifulSoup version: 4.6.0










      python html beautifulsoup






      asked Nov 22 at 10:41









      Alex

          2 Answers
          accepted






          Do you want all the text content of that pre block?



          print(bs.pre.text)


          Returns:



          ZeroDivisionErrorTraceback (most recent call last)
          <ipython-input-2-0f9f90da76dc> in <module>()
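
For anyone on Python 3, here is a minimal self-contained sketch of the same idea. The names `HTML`, `soup` and `pre_text` are my own, and passing "html.parser" explicitly is an assumption on my part, since the question does not say which parser was in use.

```python
from bs4 import BeautifulSoup

# The markup from the question, unchanged.
HTML = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

# Naming the parser explicitly avoids the "no parser was specified" warning.
soup = BeautifulSoup(HTML, "html.parser")

# get_text() concatenates every string inside <pre>, including the text
# of the nested <span> tags; entities such as &lt; are decoded.
pre_text = soup.pre.get_text()
print(pre_text)
```

The `.text` attribute used above is a property backed by `get_text()`, so either spelling should give the same string.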





          answered Nov 22 at 10:55









          drec4s

          • Ah perfect! Thanks
            – Alex
            Nov 22 at 10:55










          • But it is not a general solution for getting ALL the text, no matter what tags I have...
            – Alex
            Nov 22 at 10:59










          • What do you mean?
            – drec4s
            Nov 22 at 11:00










          • If I did not have the pre tags, your approach would not work. I do not care which tags are around the text, whether it is enclosed in pre, font, or anything else. I just want the enclosed text...
            – Alex
            Nov 22 at 11:03










          • Then just do bs.text
            – drec4s
            Nov 22 at 11:04
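
To illustrate the point from the comments, here is a small sketch showing that calling `get_text()` on the whole soup does not depend on which tags wrap the text. The helper `extract_text` and the sample snippets are hypothetical, made up purely for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup


def extract_text(markup):
    """Return all text in the markup, regardless of the enclosing tags."""
    return BeautifulSoup(markup, "html.parser").get_text()


# Three made-up fragments that wrap the same words in different tags.
samples = [
    "<pre><span>ZeroDivisionError</span>Traceback</pre>",
    "<font>ZeroDivisionError<b>Traceback</b></font>",
    "ZeroDivisionError<i>Traceback</i>",
]

results = [extract_text(s) for s in samples]
for r in results:
    print(r)
```

Each fragment should come back as the same plain string, since only the text nodes are collected and the tag names are never consulted.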


















          The .font in your code sample refers to the HTML tag <font>. Since you instead want all the text from your document, you can use something like this:



          contents = bs.find_all(text=True)
          for c in contents:
              print(c)  # replace this with whatever you're trying to do


          Output:



          ZeroDivisionError
          Traceback (most recent call last)

          <ipython-input-2-0f9f90da76dc>
          in
          <module>
          ()


          Currently bs.font is None because you are parsing a document that doesn't contain any <font> tags.
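
A quick sketch of why the original line fails, plus one possible way to guard against it. The shortened markup and the names `font_tag`, `first_tag` and `first_text` are my own, not from the question, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# A shortened version of the question's markup: it contains no <font> tag.
soup = BeautifulSoup(
    '<pre><span class="ansi-red-fg">ZeroDivisionError</span></pre>',
    "html.parser",
)

# Attribute access is shorthand for soup.find("font"), which returns None
# when no such tag exists. Calling .contents on that None raises the
# AttributeError seen in the question.
font_tag = soup.font
print(font_tag is None)

# One possible guard: fall back to the first tag of any name.
first_tag = font_tag if font_tag is not None else soup.find(True)
first_text = first_tag.get_text() if first_tag is not None else ""
print(first_text)
```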



          If you just want the contents as one long string, you can get that by just using bs.text



          '\nZeroDivisionErrorTraceback (most recent call last)\n<ipython-input-2-0f9f90da76dc> in <module>()\n'
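
As a rough check of how the two approaches relate, the sketch below joins the individual strings and compares the result with the single string, then splits it into non-empty lines so the original line breaks are kept. The names `pieces`, `joined`, `same` and `lines` are my own, "html.parser" is an assumed parser choice, and `string=True` is the newer spelling of `text=True` (assuming Beautiful Soup 4.4.0 or later).

```python
from bs4 import BeautifulSoup

# The markup from the question, unchanged.
HTML = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

soup = BeautifulSoup(HTML, "html.parser")

# Every text node in document order, as separate strings.
pieces = soup.find_all(string=True)

# Joining the pieces without a separator should reproduce soup.get_text().
joined = "".join(pieces)
same = joined == soup.get_text()
print(same)

# Split on the newlines that were in the markup and drop the blank lines.
lines = [line for line in soup.get_text().splitlines() if line.strip()]
for line in lines:
    print(line)
```

Printing the pieces one by one puts each on its own line, which is why the loop output looks as if the line breaks were lost; the joined string keeps them exactly where the markup had them.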





          edited Nov 22 at 11:01

























          answered Nov 22 at 10:50









          johnpaton

          • Does not quite work. I do not get the phrase Traceback (most recent call last)...
            – Alex
            Nov 22 at 10:51










          • Also, it does not preserve new lines!
            – Alex
            Nov 22 at 10:52










          • Ah, I see I misinterpreted your document. Edited to grab everything (and also retain newlines).
            – johnpaton
            Nov 22 at 10:57



















