Convert unicode with utf-8 string as content to str
I'm using pyquery to parse a page:
dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()
but what I get in content is a unicode string with UTF-8 encoded content:
u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'
How can I convert it to str without losing the content?
To make it clear, I want:
content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
not:
content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
python utf-8 python-2.x mojibake pyquery
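For readers wondering where a string like this comes from: it is what you get when UTF-8 bytes are decoded with Latin-1. A minimal Python 3 sketch of the corruption (the mechanics are identical in Python 2, with unicode in place of str):

```python
# UTF-8 bytes for "层叠样式表" ("Cascading Style Sheets")
raw = '层叠样式表'.encode('utf-8')
# Decoding those bytes as Latin-1 maps each byte to one codepoint in
# U+0000..U+00FF, producing exactly the mojibake string in the question.
mojibake = raw.decode('latin-1')
print(repr(mojibake))
```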
You can usually treat unicode strings like normal strings. Is there any reason why you want to convert it?
– Markus Unterwaditzer
Jan 26 '13 at 17:59
Also, for more information about Unicode, ASCII and the like I recommend: nedbatchelder.com/text/unipain.html
– Markus Unterwaditzer
Jan 26 '13 at 18:00
@MarkusUnterwaditzer if I print content, I just get some strange strings
– wong2
Jan 26 '13 at 18:01
Would content.encode('utf-8') do the trick? Also I think Wikipedia has a proper API to query articles, no need to scrape the website.
– Markus Unterwaditzer
Jan 26 '13 at 18:03
@aychedee: No it won't, that would double encode the UTF-8 data.
– Martijn Pieters♦
Jan 26 '13 at 18:25
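Martijn's double-encoding warning can be checked directly. A Python 3 sketch, assuming the mojibake value from the question:

```python
# Start from the mojibake: UTF-8 bytes mis-decoded as Latin-1.
mojibake = '层叠样式表'.encode('utf-8').decode('latin-1')
# .encode('utf-8') encodes the wrong codepoints (U+00E5, U+00B1, ...),
# yielding doubly-encoded bytes rather than the original data.
double = mojibake.encode('utf-8')
# .encode('latin-1') maps each codepoint back to its byte, recovering
# the genuine UTF-8 data.
original = mojibake.encode('latin-1')
print(original.decode('utf-8'))  # 层叠样式表
```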
asked Jan 26 '13 at 17:55 by wong2; edited Sep 8 at 14:03 by Martijn Pieters♦
1 Answer
Accepted answer (25 votes):
If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':
content = content.encode('latin1')
because the Unicode codepoints U+0000 to U+00FF all map one-to-one to the Latin-1 encoding; this encoding thus interprets your data as literal bytes.
For your example this gives me:
>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
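For anyone hitting this on Python 3, where the mojibake arrives in a str rather than a unicode value, the same Latin-1 round trip applies. A sketch, not part of the original answer:

```python
# The mojibake as a Python 3 str: UTF-8 bytes smuggled in as codepoints.
content = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
raw = content.encode('latin1')   # bytes: the literal UTF-8 data
text = raw.decode('utf8')        # decode it as the UTF-8 it really is
print(text)  # 层叠样式表
```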
PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, uses the .text attribute of the response. This auto-decodes the response data based on the encoding set in the Content-Type header alone; if that information is not available, it falls back to latin-1 for text responses (and HTML is a text response). You can override this by passing in an encoding argument:
dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')
at which point you'd not have to re-encode at all.
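The fallback behaviour described above can be reproduced with the codecs alone (a sketch; requests inherits the latin-1 default from RFC 2616's charset rule for text/* responses):

```python
body = '层叠样式表'.encode('utf-8')   # bytes the server actually sends
# With no charset in the Content-Type header, .text decodes a text/*
# body as ISO-8859-1 (latin-1), which is where the mojibake comes from.
fallback = body.decode('latin-1')
# With encoding='utf8' the body is decoded correctly instead.
correct = body.decode('utf-8')
print(fallback == correct)  # False
```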
I had the same problem, but your solution only works from the REPL, not from a script. I had to change it to be like this: content.encode('latin1').decode('utf8').encode('utf8')
– spatel
Mar 7 '13 at 23:53
Encoding to UTF-8 is fine if that is what you need in the end. But you can skip the decode then too!
– Martijn Pieters♦
Mar 7 '13 at 23:54
Well I'll be, I should have tried that so I could save myself some self-inflicted trauma to the head. I have to admit though, it still confuses me.
– spatel
Mar 8 '13 at 4:02
Thanks! Been tortured by the same issue for one day!
– Jacky
Jan 21 '16 at 11:07
Thanks a lot for this workaround. I was able to convert Tamil Unicode to readable format.
– Rajasankar
Sep 8 at 3:27
answered Jan 26 '13 at 18:18 by Martijn Pieters♦; edited Sep 8 at 14:08