I ran into an encoding error while writing a Python crawler.
The error was:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0'
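After some searching, it turned out that \xa0 is the non-breaking space (the &nbsp; entity) used in HTML page source, and the gbk codec has no mapping for it.
Solution
Replace the character with an ordinary space: replace(u'\xa0', u' ')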
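As a minimal sketch of the fix above: the sample string `text` is a made-up stand-in for scraped content, and the second option (`errors="replace"`) is an alternative not mentioned in the original note, shown only for comparison.

```python
# Hypothetical scraped text containing a non-breaking space (U+00A0).
text = "price:\xa0100"

# GBK has no mapping for U+00A0, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("gbk")
    failed = False
except UnicodeEncodeError:
    failed = True

# Option 1 (the fix from this note): swap the non-breaking space for a normal space.
cleaned = text.replace("\xa0", " ")
encoded = cleaned.encode("gbk")

# Option 2 (an assumption, not from the note): let the codec substitute "?"
# for anything it cannot encode. This loses information.
lossy = text.encode("gbk", errors="replace")

print(failed, encoded, lossy)
```

If the error comes from writing to a file rather than from an explicit `encode` call, passing `encoding="utf-8"` to `open()` may avoid the problem without changing the text at all.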
Below are some common HTML symbols:
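Char | Hex code | Numeric | HTML entity |
" | \x22 | &#34; | &quot; |
& | \x26 | &#38; | &amp; |
< | \x3C | &#60; | &lt; |
> | \x3E | &#62; | &gt; |
(non-breaking space) | \xA0 | &#160; | &nbsp; |
¡ | \xA1 | &#161; | &iexcl; |
¢ | \xA2 | &#162; | &cent; |
£ | \xA3 | &#163; | &pound; |
¤ | \xA4 | &#164; | &curren; |
¥ | \xA5 | &#165; | &yen; |
¦ | \xA6 | &#166; | &brvbar; |
§ | \xA7 | &#167; | &sect; |
¨ | \xA8 | &#168; | &uml; |
© | \xA9 | &#169; | &copy; |
ª | \xAA | &#170; | &ordf; |
« | \xAB | &#171; | &laquo; |
¬ | \xAC | &#172; | &not; |
(soft hyphen) | \xAD | &#173; | &shy; |
® | \xAE | &#174; | &reg; |
¯ | \xAF | &#175; | &macr; |
° | \xB0 | &#176; | &deg; |
± | \xB1 | &#177; | &plusmn; |
² | \xB2 | &#178; | &sup2; |
³ | \xB3 | &#179; | &sup3; |
´ | \xB4 | &#180; | &acute; |
µ | \xB5 | &#181; | &micro; |
¶ | \xB6 | &#182; | &para; |
· | \xB7 | &#183; | &middot; |
¸ | \xB8 | &#184; | &cedil; |
¹ | \xB9 | &#185; | &sup1; |
º | \xBA | &#186; | &ordm; |
» | \xBB | &#187; | &raquo; |
¼ | \xBC | &#188; | &frac14; |
½ | \xBD | &#189; | &frac12; |
¾ | \xBE | &#190; | &frac34; |
¿ | \xBF | &#191; | &iquest; |
× | \xD7 | &#215; | &times; |
÷ | \xF7 | &#247; | &divide; |
ƒ | \u0192 | &#402; | &fnof; |
ˆ | \u02C6 | &#710; | &circ; |
˜ | \u02DC | &#732; | &tilde; |
(en space) | \u2002 | &#8194; | &ensp; |
(em space) | \u2003 | &#8195; | &emsp; |
(thin space) | \u2009 | &#8201; | &thinsp; |
(zero-width non-joiner) | \u200C | &#8204; | &zwnj; |
(zero-width joiner) | \u200D | &#8205; | &zwj; |
(left-to-right mark) | \u200E | &#8206; | &lrm; |
(right-to-left mark) | \u200F | &#8207; | &rlm; |
– | \u2013 | &#8211; | &ndash; |
— | \u2014 | &#8212; | &mdash; |
‘ | \u2018 | &#8216; | &lsquo; |
’ | \u2019 | &#8217; | &rsquo; |
‚ | \u201A | &#8218; | &sbquo; |
“ | \u201C | &#8220; | &ldquo; |
” | \u201D | &#8221; | &rdquo; |
„ | \u201E | &#8222; | &bdquo; |
† | \u2020 | &#8224; | &dagger; |
‡ | \u2021 | &#8225; | &Dagger; |
• | \u2022 | &#8226; | &bull; |
… | \u2026 | &#8230; | &hellip; |
‰ | \u2030 | &#8240; | &permil; |
′ | \u2032 | &#8242; | &prime; |
″ | \u2033 | &#8243; | &Prime; |
‹ | \u2039 | &#8249; | &lsaquo; |
› | \u203A | &#8250; | &rsaquo; |
‾ | \u203E | &#8254; | &oline; |
⁄ | \u2044 | &#8260; | &frasl; |
€ | \u20AC | &#8364; | &euro; |
ℑ | \u2111 | &#8465; | &image; |
ℓ | \u2113 | &#8467; | |
№ | \u2116 | &#8470; | |
℘ | \u2118 | &#8472; | &weierp; |
ℜ | \u211C | &#8476; | &real; |
™ | \u2122 | &#8482; | &trade; |
ℵ | \u2135 | &#8501; | &alefsym; |
← | \u2190 | &#8592; | &larr; |
↑ | \u2191 | &#8593; | &uarr; |
→ | \u2192 | &#8594; | &rarr; |
↓ | \u2193 | &#8595; | &darr; |
↔ | \u2194 | &#8596; | &harr; |
↵ | \u21B5 | &#8629; | &crarr; |
⇐ | \u21D0 | &#8656; | &lArr; |
⇑ | \u21D1 | &#8657; | &uArr; |
⇒ | \u21D2 | &#8658; | &rArr; |
⇓ | \u21D3 | &#8659; | &dArr; |
⇔ | \u21D4 | &#8660; | &hArr; |
∀ | \u2200 | &#8704; | &forall; |
∂ | \u2202 | &#8706; | &part; |
∃ | \u2203 | &#8707; | &exist; |
∅ | \u2205 | &#8709; | &empty; |
∇ | \u2207 | &#8711; | &nabla; |
∈ | \u2208 | &#8712; | &isin; |
∉ | \u2209 | &#8713; | &notin; |
∋ | \u220B | &#8715; | &ni; |
∏ | \u220F | &#8719; | &prod; |
∑ | \u2211 | &#8721; | &sum; |
− | \u2212 | &#8722; | &minus; |
∗ | \u2217 | &#8727; | &lowast; |
√ | \u221A | &#8730; | &radic; |
∝ | \u221D | &#8733; | &prop; |
∞ | \u221E | &#8734; | &infin; |
∠ | \u2220 | &#8736; | &ang; |
∧ | \u2227 | &#8743; | &and; |
∨ | \u2228 | &#8744; | &or; |
∩ | \u2229 | &#8745; | &cap; |
∪ | \u222A | &#8746; | &cup; |
∫ | \u222B | &#8747; | &int; |
∴ | \u2234 | &#8756; | &there4; |
∼ | \u223C | &#8764; | &sim; |
≅ | \u2245 | &#8773; | &cong; |
≈ | \u2248 | &#8776; | &asymp; |
≠ | \u2260 | &#8800; | &ne; |
≡ | \u2261 | &#8801; | &equiv; |
≤ | \u2264 | &#8804; | &le; |
≥ | \u2265 | &#8805; | &ge; |
⊂ | \u2282 | &#8834; | &sub; |
⊃ | \u2283 | &#8835; | &sup; |
⊄ | \u2284 | &#8836; | &nsub; |
⊆ | \u2286 | &#8838; | &sube; |
⊇ | \u2287 | &#8839; | &supe; |
⊕ | \u2295 | &#8853; | &oplus; |
⊗ | \u2297 | &#8855; | &otimes; |
⊥ | \u22A5 | &#8869; | &perp; |
⋅ | \u22C5 | &#8901; | &sdot; |
⌈ | \u2308 | &#8968; | &lceil; |
⌉ | \u2309 | &#8969; | &rceil; |
⌊ | \u230A | &#8970; | &lfloor; |
⌋ | \u230B | &#8971; | &rfloor; |
〈 | \u2329 | &#9001; | &lang; |
〉 | \u232A | &#9002; | &rang; |
◊ | \u25CA | &#9674; | &loz; |
♠ | \u2660 | &#9824; | &spades; |
♣ | \u2663 | &#9827; | &clubs; |
♥ | \u2665 | &#9829; | &hearts; |
♦ | \u2666 | &#9830; | &diams; |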
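
The mappings in this table do not have to be typed by hand. As a rough sketch, Python's standard `html` module can convert between entities and characters. The sample strings below are invented for illustration, and this is only one possible way to handle entities in scraped text.

```python
import html
from html.entities import name2codepoint

# Decode named and numeric entities into the characters they stand for.
decoded = html.unescape("a&nbsp;b &amp; c &#169; &copy;")

# Look up the code point behind a named entity (the name has no "&" or ";").
nbsp_code = name2codepoint["nbsp"]

# Go the other way for the characters that are special in HTML markup.
escaped = html.escape('<a & "b">')

print(repr(decoded), hex(nbsp_code), escaped)
```

Note that `html.unescape` turns `&nbsp;` into `'\xa0'`, so text decoded this way can still trigger the gbk error described at the top unless the character is replaced afterwards.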