How to extract text from between html tags?


I have some HTML elements from which I want to extract the text. The HTML looks like this:

<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>

where I want to extract the text as

ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()

I found an answer to that issue here, but it does not work for me. Complete example code:

from bs4 import BeautifulSoup as BSHTML

bs = BSHTML("""<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>""")

print bs.font.contents[0].strip()

where I get the following error:

Traceback (most recent call last):
  File "invest.py", line 13, in <module>
    print bs.font.contents[0].strip()
AttributeError: 'NoneType' object has no attribute 'contents'

Am I missing anything? BeautifulSoup version: 4.6.0

asked Nov 22 at 10:41

Alex

python html beautifulsoup


2 Answers

Accepted answer (2 votes)

Do you want all the text content of that pre block?

print(bs.pre.text)

Returns:

ZeroDivisionErrorTraceback (most recent call last)

<ipython-input-2-0f9f90da76dc> in <module>()
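As a minimal sketch of the same idea, the snippet below reads the text from the pre element and from the whole document. The variable names (html, soup, pre_text, all_text) and the explicit "html.parser" argument are assumptions for illustration, not part of the original post; get_text() is the method form of the .text attribute.

```python
from bs4 import BeautifulSoup

html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

# Naming the parser explicitly (an assumption here) avoids the
# "no parser was explicitly specified" warning.
soup = BeautifulSoup(html, "html.parser")

pre_text = soup.pre.get_text()  # text inside the <pre> element only
all_text = soup.get_text()      # text of the whole document, whatever the tags

print(pre_text)
```

Because this document contains nothing outside the pre element, both calls give the same text here; on a larger page soup.get_text() would also include text from every other element.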

answered Nov 22 at 10:55

drec4s


Ah perfect! Thanks
– Alex
Nov 22 at 10:55

But it is not a general solution for getting ALL the text, no matter what tags I have...
– Alex
Nov 22 at 10:59

What do you mean?
– drec4s
Nov 22 at 11:00

If I did not have the pre tags, your approach would not work. I do not care about the tags; I just want the enclosed text, whether it is enclosed in pre, in font, or in anything else.
– Alex
Nov 22 at 11:03

Then just do bs.text
– drec4s
Nov 22 at 11:04


Answer (0 votes)

The .font in your code sample refers to the HTML tag <font>. Since you instead want all the text from your document, you can use something like this:

contents = bs.find_all(text=True)
for c in contents:
    print(c)  # replace this with whatever you're trying to do

Output:

ZeroDivisionError
Traceback (most recent call last)

<ipython-input-2-0f9f90da76dc>
 in
<module>
()

Currently bs.font is None because you are parsing a document that doesn't contain any <font> tags.
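A short sketch of how that None can be handled, assuming you want to fall back to the whole document when the tag is missing. The names soup, font, and text, and the shortened HTML string, are illustrative only:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<pre><span>ZeroDivisionError</span></pre>", "html.parser")

# soup.font is shorthand for soup.find("font"); both give None when
# the document has no <font> element, which is what caused the AttributeError.
font = soup.font

if font is None:
    text = soup.get_text()   # fall back to the text of the whole document
else:
    text = font.get_text()   # text of the first <font> element

print(text)
```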

If you just want the contents as one long string, you can get that by just using bs.text:

'\nZeroDivisionErrorTraceback (most recent call last)\n<ipython-input-2-0f9f90da76dc> in <module>()\n'
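To illustrate how the two approaches relate, here is a hedged sketch that joins the individual strings back together and then splits the result into lines, so the original line breaks are kept. The names html, soup, pieces, joined, and lines are assumptions, and string=True is the newer spelling of the text=True argument used above.

```python
from bs4 import BeautifulSoup

html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

soup = BeautifulSoup(html, "html.parser")

# Every text node in document order, regardless of the enclosing tag.
pieces = soup.find_all(string=True)

# Joining without a separator keeps the newlines that are in the source.
joined = "".join(pieces)

# Drop the empty lines at the start and end, keep one entry per text line.
lines = [line for line in joined.splitlines() if line.strip()]

for line in lines:
    print(line)
```

For this particular document the joined string is the same as soup.get_text(); the two can differ on pages that contain comments, since find_all(string=True) also returns comment nodes.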

edited Nov 22 at 11:01

answered Nov 22 at 10:50

johnpaton


Does not quite work. I do not get the phrase Traceback (most recent call last)...
– Alex
Nov 22 at 10:51

Also, it does not preserve new lines!
– Alex
Nov 22 at 10:52

Ah, I see I misinterpreted your document. Edited to grab everything (and also retain newlines).
– johnpaton
Nov 22 at 10:57
