Convert unicode with utf-8 string as content to str

I'm using pyquery to parse a page:

from pyquery import PyQuery

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()

but what I get in content is a unicode string whose code points are really UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To make it clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

edited Sep 8 at 14:03

Martijn Pieters♦

asked Jan 26 '13 at 17:55

wong2

You can usually treat unicode strings like normal strings. Is there any reason why you want to convert it?
– Markus Unterwaditzer
Jan 26 '13 at 17:59

Also, for more information about Unicode, ASCII and the like, I recommend: nedbatchelder.com/text/unipain.html
– Markus Unterwaditzer
Jan 26 '13 at 18:00

@MarkusUnterwaditzer if I print content, I just get some strange strings
– wong2
Jan 26 '13 at 18:01

Would content.encode('utf-8') do the trick? Also, I think Wikipedia has a proper API for querying articles, so there is no need to scrape the website.
– Markus Unterwaditzer
Jan 26 '13 at 18:03

@aychedee: No it won't, that would double encode the UTF-8 data.
– Martijn Pieters♦
Jan 26 '13 at 18:25

python utf-8 python-2.x mojibake pyquery

1 Answer

accepted

If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')

because the Unicode code points U+0000 to U+00FF map one-to-one to the bytes 0x00 to 0xFF in the Latin-1 encoding; encoding to Latin-1 thus turns each code point back into the byte with the same value.
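The one-to-one mapping is easy to check directly. The snippet below is a minimal sketch in Python 3 syntax (an assumption; the question itself is about Python 2), where `str` is Unicode text and `bytes` is the byte string. The names `all_bytes`, `round_tripped` and `outside_range_fails` are only illustrative.

```python
# Every code point U+0000..U+00FF encodes to the single byte of the same value.
for code_point in range(256):
    assert chr(code_point).encode('latin-1') == bytes([code_point])

# The reverse also holds: any byte string survives a Latin-1 round trip.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode('latin-1').encode('latin-1')

# Code points above U+00FF have no Latin-1 representation and raise an error.
try:
    '\u5c42'.encode('latin-1')
    outside_range_fails = False
except UnicodeEncodeError:
    outside_range_fails = True
```

This is why the trick only applies to strings made up entirely of code points below U+0100.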

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
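If you need the same repair in more than one place, it could be wrapped in a small helper. This is only a sketch in Python 3 syntax; `repair_mojibake` is a hypothetical name, not part of PyQuery or the standard library. It returns the input unchanged when the text cannot be UTF-8 that was decoded as Latin-1.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Returns the input unchanged if it does not fit that pattern.
    """
    try:
        # Recover the original bytes; fails if any code point is above U+00FF.
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        return text
    try:
        # Decode the bytes the way they should have been decoded at first.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return text


garbled = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
repaired = repair_mojibake(garbled)
print(ascii(repaired))  # '\u5c42\u53e0\u6837\u5f0f\u8868'
```

A caveat on this design: genuine Latin-1 text that happens to form valid UTF-8 byte sequences would also be rewritten, so it is safer to apply the helper only to data known to be damaged this way.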

PyQuery uses either requests or urllib to retrieve the HTML and, in the case of requests, uses the .text attribute of the response. That attribute auto-decodes the response data based only on the encoding set in the Content-Type header; if the header names no encoding, requests falls back to Latin-1 for text responses (and HTML is a text response). You can override this by passing in an encoding argument:

dom = PyQuery('http://zh.wikipedia.org/w/index.php',
              {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'},
              encoding='utf8')

at which point you'd not have to re-encode at all.
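To see why the encoding choice matters, here is a network-free sketch in Python 3 syntax. `decode_body` is a hypothetical stand-in for the client's decoding step, not a real PyQuery or requests function; it only mimics "use the given encoding, else fall back to Latin-1".

```python
# The page text as the server would send it: encoded as UTF-8 bytes.
page_text = '\u5c42\u53e0\u6837\u5f0f\u8868'
body = page_text.encode('utf-8')


def decode_body(body, encoding=None):
    # Hypothetical stand-in: use the caller's encoding if one is given,
    # otherwise fall back to Latin-1, as described above.
    return body.decode(encoding or 'latin-1')


fallback = decode_body(body)                   # one code point per byte
explicit = decode_body(body, encoding='utf8')  # the intended text
```

With the fallback, every UTF-8 byte becomes its own code point, which is exactly the kind of string shown in the question; with the explicit encoding the text comes out right and no repair step is needed.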

edited Sep 8 at 14:08

answered Jan 26 '13 at 18:18

Martijn Pieters♦

I had the same problem, but your solution only works from the REPL, not from a script. I had to change it to be like this: content.encode('latin1').decode('utf8').encode('utf8')
– spatel
Mar 7 '13 at 23:53

Encoding to UTF-8 is fine if that is what you need in the end. But you can skip the decode then too!
– Martijn Pieters♦
Mar 7 '13 at 23:54

Well I'll be, I should have tried that so I could save myself some self inflicted trauma to the head. I have to admit though, it still confuses me.
– spatel
Mar 8 '13 at 4:02

Thanks! Been tortured by the same issue for one day!
– Jacky
Jan 21 '16 at 11:07

Thanks a lot for this workaround. I was able to convert Tamil Unicode text to a readable format.
– Rajasankar
Sep 8 at 3:27
