How to extract text from between html tags?
I have some HTML elements from which I want to extract the text. The HTML looks like this:
<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg"><ipython-input-2-0f9f90da76dc></span> in <span class="ansi-cyan-fg"><module></span><span class="ansi-blue-fg">()</span>
</pre>
where I want to extract the text as
ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in<module>()
I found an answer to that issue here, but it does not work for me. Complete example code:
from bs4 import BeautifulSoup as BSHTML
bs = BSHTML("""<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg"><ipython-input-2-0f9f90da76dc></span> in <span class="ansi-cyan-fg"><module></span><span class="ansi-blue-fg">()</span>
</pre>""")
print bs.font.contents[0].strip()
where I get the following error:
Traceback (most recent call last):
File "invest.py", line 13, in <module>
print bs.font.contents[0].strip()
AttributeError: 'NoneType' object has no attribute 'contents'
Am I missing anything? BeautifulSoup version: 4.6.0
python html beautifulsoup
asked Nov 22 at 10:41
Alex
2 Answers
2 votes, accepted
Do you want all the text content of that pre block?
print bs.pre.text
Returns:
ZeroDivisionErrorTraceback (most recent call last)
<ipython-input-2-0f9f90da76dc> in <module>()
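As a rough illustration of the answer above, here is a minimal Python 3 sketch. It assumes that the angle brackets inside the spans are entity-escaped in the real markup (&lt; and &gt;), and it names the parser explicitly as "html.parser". It calls get_text(), which is the method behind the .text property. The names soup and pre_text are only for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: in the real page the inner angle brackets are escaped as entities,
# otherwise the parser would treat <module> as a tag of its own.
html = """<pre>
<span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)
<span class="ansi-green-fg">&lt;ipython-input-2-0f9f90da76dc&gt;</span> in <span class="ansi-cyan-fg">&lt;module&gt;</span><span class="ansi-blue-fg">()</span>
</pre>"""

soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every text node below <pre>, keeping the newlines.
pre_text = soup.pre.get_text()
print(pre_text.strip())
```

Stripping the result only removes the blank space at the start and end; the line break between the two lines is kept.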
answered Nov 22 at 10:55
drec4s
Ah perfect! Thanks
– Alex
Nov 22 at 10:55
But it is not a general solution of getting EVERY text, no matter what tags I have...
– Alex
Nov 22 at 10:59
What do you mean?
– drec4s
Nov 22 at 11:00
When I would not have the pre tags your approach would not work. I do not care about the tag, I just want the enclosed text. If they are enclosed in pre, in font or whatever. I do not care about what tags are around. I just want the enclosed text...
– Alex
Nov 22 at 11:03
Then just do bs.text
– drec4s
Nov 22 at 11:04
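The tag-agnostic approach from this comment thread could be sketched roughly as follows. The helper name extract_text and the two sample fragments are hypothetical and only serve to show that the enclosing tag does not matter when get_text() is called on the whole document.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return all text in an HTML fragment, whatever tags enclose it."""
    # Calling get_text() on the soup itself avoids naming any particular tag.
    return BeautifulSoup(html, "html.parser").get_text()


# Hypothetical fragments: the same text wrapped in different tags.
in_pre = extract_text("<pre><span>ZeroDivisionError</span>Traceback</pre>")
in_font = extract_text("<font><b>ZeroDivisionError</b>Traceback</font>")
print(in_pre)
print(in_font)
```

Both calls return the same string, since only the text nodes are collected.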
0 votes
The .font in your code sample refers to the HTML tag <font>. Since you are instead looking to get all the text from your document, you can use something like this:
contents = bs.find_all(text=True)
for c in contents:
    print(c)  # replace this with whatever you're trying to do
Output:
ZeroDivisionError
Traceback (most recent call last)
<ipython-input-2-0f9f90da76dc>
in
<module>
()
Currently bs.font is None because you are parsing a document that doesn't contain any <font> tags.
If you just want the contents as one long string, you can get that by just using bs.text
'\nZeroDivisionErrorTraceback (most recent call last)\n<ipython-input-2-0f9f90da76dc> in <module>()\n'
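To make the two points of this answer concrete, here is a small Python 3 sketch under a few assumptions: the markup is shortened to one line for illustration, the parser is named explicitly as "html.parser", and string=True is used, which is the newer name (since Beautiful Soup 4.4.0) for the text=True argument shown above. The variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative markup: note that it contains no <font> tag.
html = '<pre><span class="ansi-red-fg">ZeroDivisionError</span>Traceback (most recent call last)</pre>'
soup = BeautifulSoup(html, "html.parser")

# soup.font is a shortcut for soup.find("font"); without a <font> tag it is None,
# so accessing .contents on it raises the AttributeError from the question.
font = soup.font
font_text = font.get_text() if font is not None else None

# All text nodes in document order, regardless of the enclosing tag.
strings = soup.find_all(string=True)
joined = "".join(strings)

print(font_text)
print(list(strings))
print(joined)
```

Checking for None before touching the tag is one way to avoid the error; joining the text nodes gives the same single string as get_text() for this fragment.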
edited Nov 22 at 11:01
answered Nov 22 at 10:50
johnpaton
Does not quite work. I do not get the phrase Traceback (most recent call last)...
– Alex
Nov 22 at 10:51
Also, it does not preserve new lines!
– Alex
Nov 22 at 10:52
Ah, I see I misinterpreted your document. Edited to grab everything (and also retain newlines).
– johnpaton
Nov 22 at 10:57