Convert unicode with utf-8 string as content to str

I'm using pyquery to parse a page:

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

but what I get in content is a unicode string with UTF-8 encoded content:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

  • You can usually treat unicode strings like normal strings. Is there any reason why you want to convert it?
    – Markus Unterwaditzer
    Jan 26 '13 at 17:59

  • Also, for more information about Unicode, ASCII and the like, I recommend: nedbatchelder.com/text/unipain.html
    – Markus Unterwaditzer
    Jan 26 '13 at 18:00

  • @MarkusUnterwaditzer if I print content, I just get some strange strings
    – wong2
    Jan 26 '13 at 18:01

  • Would content.encode('utf-8') do the trick? Also, I think Wikipedia has a proper API to query articles, no need to scrape the website.
    – Markus Unterwaditzer
    Jan 26 '13 at 18:03

  • @aychedee: No it won't, that would double encode the UTF-8 data.
    – Martijn Pieters
    Jan 26 '13 at 18:25

python utf-8 python-2.x mojibake pyquery

asked Jan 26 '13 at 17:55 by wong2
edited Sep 8 at 14:03 by Martijn Pieters

1 Answer

If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')

because the Unicode codepoints U+0000 to U+00FF all map one-to-one to bytes in the Latin-1 encoding; this encoding thus interprets your data as literal bytes.

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
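
If you need this in more than one place, the round trip above could be wrapped in a small helper. This is only a sketch: `repair_mojibake` is a made-up name (it is not part of pyquery or requests), and it assumes the text was UTF-8 data wrongly decoded as Latin-1. It is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-


def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 data (hypothetical helper).

    Returns the input unchanged when it cannot be such a mis-decoded string.
    """
    try:
        # Latin-1 maps code points U+0000..U+00FF straight back to bytes 0x00..0xFF.
        raw = text.encode('latin1')
        # Those bytes are the real UTF-8 payload, so decode them properly.
        return raw.decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains code points above U+00FF (so it was not
        # produced by a Latin-1 decode), or the bytes are not valid UTF-8.
        return text


garbled = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = repair_mojibake(garbled)
print(fixed == u'\u5c42\u53e0\u6837\u5f0f\u8868')
```

Already-correct text such as the repaired result passes through unchanged, because it cannot be encoded to Latin-1. The helper cannot tell genuine Latin-1 range text that happens to form valid UTF-8 apart from mojibake, so it is best applied only to data you know went through the wrong decode.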


PyQuery uses either requests or urllib to retrieve the HTML, and in the case of requests, uses the .text attribute of the response. This auto-decodes the response data based solely on the encoding set in the Content-Type header or, if that information is not available, falls back to Latin-1 (for text responses, and HTML is a text response). You can override this by passing in an encoding argument after the query parameters:

dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')

at which point you'd not have to re-encode at all.
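
To see why the wrong fallback codec produces exactly the string from the question, here is a small offline sketch. It uses no network and no pyquery; the variable names are only illustrative, and `body` stands in for the raw response bytes.

```python
# -*- coding: utf-8 -*-
text = u'\u5c42\u53e0\u6837\u5f0f\u8868'  # the text the page really contains

# What travels over the wire: the text encoded as UTF-8 bytes.
body = text.encode('utf8')

# Decoding with the Latin-1 fallback never fails, because every byte value
# 0x00..0xFF is a valid Latin-1 character; each byte becomes one character.
wrong = body.decode('latin1')

# Decoding with the correct codec yields the original text directly.
right = body.decode('utf8')

print('%d bytes, %d wrong characters, %d right characters'
      % (len(body), len(wrong), len(right)))
print(wrong == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8')
print(right == text)
```

Because the wrong decode is lossless (one character per byte), encoding back to Latin-1 recovers the original bytes, which is why the workaround works at all.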






  • I had the same problem, but your solution only works from the REPL, not from a script. I had to change it to be like this: content.encode('latin1').decode('utf8').encode('utf8')
    – spatel
    Mar 7 '13 at 23:53

  • Encoding to UTF-8 is fine if that is what you need in the end. But you can skip the decode then too!
    – Martijn Pieters
    Mar 7 '13 at 23:54

  • Well I'll be, I should have tried that so I could save myself some self-inflicted trauma to the head. I have to admit though, it still confuses me.
    – spatel
    Mar 8 '13 at 4:02

  • Thanks! Been tortured by the same issue for one day!
    – Jacky
    Jan 21 '16 at 11:07

  • Thanks a lot for this workaround. I was able to convert Tamil unicode to readable format.
    – Rajasankar
    Sep 8 at 3:27
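
The point made in the first two comments above (that the decode/encode pair can be skipped when UTF-8 bytes are what you want in the end) can be checked with a short sketch, assuming the data really is valid UTF-8 underneath:

```python
# -*- coding: utf-8 -*-
content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

# The long version: recover the bytes, decode to text, encode back to UTF-8.
long_way = content.encode('latin1').decode('utf8').encode('utf8')

# The short version: the first step already yields the UTF-8 bytes.
short_way = content.encode('latin1')

print(long_way == short_way)
```

The only thing the extra decode adds is a validity check: it raises UnicodeDecodeError if the recovered bytes are not valid UTF-8.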











answered Jan 26 '13 at 18:18 by Martijn Pieters
edited Sep 8 at 14:08