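UTF-8 and GB2312 Encoding Conversion
When Chinese text is sent in a POST request, it is percent-encoded from its UTF-8 bytes. Log in to Kaixin001 (开心网) with "飞@email.com" and capture the POST data; the PostData looks like this: "url=%2Fhome%2F&email=%E9%A3%9E@email.com&password=". Strip the % signs from the value %E9%A3%9E after email= and you get "E9A39E". Anyone familiar with encodings will recognize this at once as the UTF-8 encoding of a Chinese character. A recent project of mine had to be written in VB, and unfortunately VB has no built-in way to turn UTF-8 bytes into the corresponding string, which is a real pity compared with other languages. So the first step is to understand how UTF-8 relates to Unicode code points; after that, converting to the corresponding string is easy.
Wikipedia's description of UTF-8 (link: http://zh.wikipedia.org/w/index.php?title=UTF-8&variant=zh-cn):
The UTF-8 encoding scheme
UTF-8 is a variable-length encoding of Unicode (Unicode is commonly stored as two bytes, meaning UCS-2). It was created by Ken Thompson in 1992 and is now standardized as RFC 3629. UTF-8 encodes UCS in 8-bit units and has no big-endian or little-endian variants. In every character stored as UTF-8, each byte after the first begins with the two bits "10", which lets text processors quickly find where each character starts.
To stay compatible with the older ASCII (one byte per character), UTF-8 stores Unicode in a variable number of bytes: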
Conversion between Unicode and UTF-8
UCS-4 range                UTF-8 byte sequence
U-00000000 – U-0000007F    0xxxxxxx
U-00000080 – U-000007FF    110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
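* Within the ASCII range a character takes one byte; beyond it, a character takes multiple bytes, which gives the UTF-8 forms shown above. The benefit is that when a Unicode file contains only ASCII characters, every character is stored in one byte, so the file is identical to a plain ASCII file and is read the same way. This is what keeps UTF-8 compatible with existing ASCII files.
* For characters above ASCII, the leading bits of the first byte give the length of the sequence: the first three bits of 110xxxxx mark a 2-byte character, 1110xxxx marks a 3-byte character, and so on. The x positions are filled with the bits of the character's code point in binary, with the rightmost x being the least significant bit. Only the shortest multi-byte sequence that can represent a code point may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence.
ASCII letters still take 1 byte; accented letters, Greek, Cyrillic and similar scripts take 2 bytes; common Chinese characters take 3 bytes; supplementary-plane characters take 4 bytes. RFC 3629 caps UTF-8 at U+10FFFF, so the 5- and 6-byte forms in the table belong to the original design and are no longer valid.
A UTF-8 file often begins with the character U+FEFF (EF BB BF in UTF-8) to signal that the text file is UTF-8 encoded.
From the description above it is clear that once you know a character's Unicode code point, its UTF-8 encoding follows easily, and UTF-8 bytes convert back to the code point just as easily. The VB code follows.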
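To make the table concrete, here is a minimal VB6 sketch that turns a code point into its UTF-8 bytes, written as space-separated hex. The names CodePointToUtf8Hex and HexByte are hypothetical helpers introduced only for this illustration; they are not part of the article's code. The sketch follows the RFC 3629 limit, so it stops at four bytes.

```vb
' Hypothetical helper, for illustration only: two-digit hex of one byte value.
Private Function HexByte(ByVal b As Long) As String
    HexByte = Right$("0" & Hex$(b), 2)
End Function

' Hypothetical helper: UTF-8 bytes of a code point as space-separated hex.
' Pass literals above &H7FFF with a trailing "&" so that VB treats them as Long.
Public Function CodePointToUtf8Hex(ByVal cp As Long) As String
    If cp < 0 Or cp > &H10FFFF Then
        Err.Raise 5                       ' outside the Unicode range
    ElseIf cp >= &HD800& And cp <= &HDFFF& Then
        Err.Raise 5                       ' surrogates are not characters
    ElseIf cp < &H80 Then
        ' 0xxxxxxx
        CodePointToUtf8Hex = HexByte(cp)
    ElseIf cp < &H800 Then
        ' 110xxxxx 10xxxxxx
        CodePointToUtf8Hex = HexByte(&HC0 Or (cp \ &H40)) & " " & _
                             HexByte(&H80 Or (cp And &H3F))
    ElseIf cp < &H10000 Then
        ' 1110xxxx 10xxxxxx 10xxxxxx
        CodePointToUtf8Hex = HexByte(&HE0 Or (cp \ &H1000)) & " " & _
                             HexByte(&H80 Or ((cp \ &H40) And &H3F)) & " " & _
                             HexByte(&H80 Or (cp And &H3F))
    Else
        ' 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        CodePointToUtf8Hex = HexByte(&HF0 Or (cp \ &H40000)) & " " & _
                             HexByte(&H80 Or ((cp \ &H1000) And &H3F)) & " " & _
                             HexByte(&H80 Or ((cp \ &H40) And &H3F)) & " " & _
                             HexByte(&H80 Or (cp And &H3F))
    End If
End Function
```

For the character 飞 (U+98DE), CodePointToUtf8Hex(&H98DE&) gives "E9 A3 9E", which matches the %E9%A3%9E seen in the captured POST data.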
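Checking for that marker takes only a few lines. The function below is a hypothetical helper, not part of the article's code, and it assumes the caller has already read the file into an initialised Byte array.

```vb
' Hypothetical helper: True when the byte array starts with EF BB BF.
' Assumes data() has been dimensioned; an unallocated array raises an error.
Public Function HasUtf8Bom(ByRef data() As Byte) As Boolean
    Dim first As Long
    first = LBound(data)
    If UBound(data) - first >= 2 Then
        HasUtf8Bom = (data(first) = &HEF) And _
                     (data(first + 1) = &HBB) And _
                     (data(first + 2) = &HBF)
    End If
End Function

' Usage sketch: a BOM followed by the letter "A".
Public Sub DemoHasUtf8Bom()
    Dim sample(0 To 3) As Byte
    sample(0) = &HEF: sample(1) = &HBB: sample(2) = &HBF: sample(3) = &H41
    Debug.Print HasUtf8Bom(sample)        ' prints True
End Sub
```

A missing BOM does not mean the file is not UTF-8, so this check can only confirm the encoding and never rule it out.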
'UTF-8 encoding
Public Function UTF8Encode(ByVal szInput As String) As String
    Dim wch As String
    Dim uch As String
    Dim szRet As String
    Dim x As Long
    Dim inputLen As Long
    Dim nAsc As Long

    If szInput = "" Then
        UTF8Encode = szInput
        Exit Function
    End If
    inputLen = Len(szInput)
    For x = 1 To inputLen
        'Get each character
        wch = Mid(szInput, x, 1)
        'Get its Unicode (UTF-16) code unit
        nAsc = AscW(wch)
        'AscW returns a signed Integer, so negative values need 65536 added
        If nAsc < 0 Then nAsc = nAsc + 65536
        If nAsc < &H80 Then
            'ASCII characters below 128 are passed through unchanged
            szRet = szRet & wch
        ElseIf nAsc < &H800 Then
            'Second tier, 000080 - 0007FF: two bytes, 110xxxxx 10xxxxxx
            uch = "%" & Hex((nAsc \ 2 ^ 6) Or &HC0) & "%" & _
                  Hex(nAsc And &H3F Or &H80)
            szRet = szRet & uch
        Else
            'Third tier, 00000800 - 0000FFFF: three bytes
            'Unicode assigns no characters to D800-DFFF; the Basic Multilingual Plane
            'reserves that range for UTF-16 surrogates (two UTF-16 units per
            'supplementary-plane character). This function encodes each surrogate
            'on its own, so such characters do not become valid four-byte UTF-8.
            'First byte: the top 4 bits ORed with 11100000
            'Middle byte: the top 10 bits ANDed with 111111 to keep their last 6 bits, then ORed with 10000000
            'Last byte: the low 6 bits (AND 111111), then ORed with 10000000
            uch = "%" & Hex((nAsc \ 2 ^ 12) Or &HE0) & "%" & _
                  Hex((nAsc \ 2 ^ 6) And &H3F Or &H80) & "%" & _
                  Hex(nAsc And &H3F Or &H80)
            szRet = szRet & uch
        End If
    Next
    UTF8Encode = szRet
End Function
'UTF-8 decoding (revised 2-25: recursive, decodes a whole string; shown only to demonstrate the algorithm, please do not call it casually)
'Input looks like department=%E4%B9%B3%E8%85%BA%E5%A4%96%E7%A7%91
Public Function UTF8BadDecode(ByVal code As String) As String
    Dim tmp As String
    Dim decodeStr As String
    Dim codelen As Long
    Dim leftStr As String
    Dim lead As String

    If code = "" Then Exit Function
    codelen = Len(code)
    leftStr = Left(code, 1)
    If leftStr <> "%" Then
        'Literal character: keep it and decode the rest
        UTF8BadDecode = leftStr & UTF8BadDecode(Right(code, codelen - 1))
    Else
        lead = UCase(Mid(code, 2, 1))
        If lead = "C" Or lead = "D" Then
            'Two-byte sequence 110xxxxx 10xxxxxx: mask off the marker bits
            decodeStr = Replace(Mid(code, 1, 6), "%", "")
            tmp = c10ton(Val("&H" & Hex(Val("&H" & decodeStr) And &H1F3F)))
            tmp = String(16 - Len(tmp), "0") & tmp
            UTF8BadDecode = ChrW(Val("&H" & c2to16(Mid(tmp, 3, 4)) & c2to16(Mid(tmp, 7, 2) & Mid(tmp, 11, 2)) & Right(decodeStr, 1))) & UTF8BadDecode(Right(code, codelen - 6))
        ElseIf lead = "E" Then
            'Three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx
            'Pad the masked value to 5 hex digits so leading zeros are not lost
            decodeStr = Replace(Mid(code, 1, 9), "%", "")
            tmp = c10ton(Val("&H" & Mid(Right("00000" & Hex(Val("&H" & decodeStr) And &HF3F3F), 5), 2, 3)))
            tmp = String(10 - Len(tmp), "0") & tmp
            UTF8BadDecode = ChrW(Val("&H" & Mid(decodeStr, 2, 1) & c2to16(Mid(tmp, 1, 4)) & c2to16(Mid(tmp, 5, 2) & Right(tmp, 2)) & Right(decodeStr, 1))) & UTF8BadDecode(Right(code, codelen - 9))
        Else
            'Any other escape, e.g. %2F: decode it as a single byte
            UTF8BadDecode = Chr(Val("&H" & Mid(code, 2, 2))) & UTF8BadDecode(Right(code, codelen - 3))
        End If
    End If
End Function
'UTF-8 decoding (revised 3-12: decodes a whole string; suitable for normal use)
'Handles one-, two- and three-byte sequences; four-byte sequences are not supported
Public Function UTF8Decode(ByVal code As String) As String
    Dim tmp As String
    Dim decodeStr As String
    Dim codelen As Long
    Dim leftStr As String
    Dim lead As String

    While (code <> "")
        codelen = Len(code)
        leftStr = Left(code, 1)
        If leftStr = "%" And codelen >= 3 Then
            lead = UCase(Mid(code, 2, 1))
            If (lead = "C" Or lead = "D") And codelen >= 6 Then
                'Two-byte sequence 110xxxxx 10xxxxxx
                decodeStr = Replace(Mid(code, 1, 6), "%", "")
                tmp = c10ton(Val("&H" & Hex(Val("&H" & decodeStr) And &H1F3F)))
                tmp = String(16 - Len(tmp), "0") & tmp
                UTF8Decode = UTF8Decode & ChrW(Val("&H" & c2to16(Mid(tmp, 3, 4)) & c2to16(Mid(tmp, 7, 2) & Mid(tmp, 11, 2)) & Right(decodeStr, 1)))
                code = Right(code, codelen - 6)
            ElseIf lead = "E" And codelen >= 9 Then
                'Three-byte sequence 1110xxxx 10xxxxxx 10xxxxxx
                'Pad the masked value to 5 hex digits so leading zeros are not lost
                decodeStr = Replace(Mid(code, 1, 9), "%", "")
                tmp = c10ton(Val("&H" & Mid(Right("00000" & Hex(Val("&H" & decodeStr) And &HF3F3F), 5), 2, 3)))
                tmp = String(10 - Len(tmp), "0") & tmp
                UTF8Decode = UTF8Decode & ChrW(Val("&H" & Mid(decodeStr, 2, 1) & c2to16(Mid(tmp, 1, 4)) & c2to16(Mid(tmp, 5, 2) & Right(tmp, 2)) & Right(decodeStr, 1)))
                code = Right(code, codelen - 9)
            Else
                'Any other escape, e.g. %2F: decode it as a single byte
                UTF8Decode = UTF8Decode & Chr(Val("&H" & Mid(code, 2, 2)))
                code = Right(code, codelen - 3)
            End If
        Else
            'Literal character (or a stray % too close to the end to be an escape)
            UTF8Decode = UTF8Decode & leftStr
            code = Right(code, codelen - 1)
        End If
    Wend
End Function
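The decoder above works through binary strings. As a cross-check, the same bit extraction can be done with plain arithmetic. The sketch below is a hypothetical helper, not the article's method: it takes the hex digits of one UTF-8 sequence with the % signs already removed, and returns the code point. It assumes well-formed input and infers the sequence length from the number of hex digits instead of validating the lead byte.

```vb
' Hypothetical helper: code point of one UTF-8 sequence given as hex digits,
' e.g. "E9A39E". Assumes well-formed input of 1 to 4 bytes.
Public Function Utf8HexToCodePoint(ByVal hexBytes As String) As Long
    Dim n As Long
    Dim i As Long
    Dim b As Long
    Dim cp As Long

    n = Len(hexBytes) \ 2
    If n < 1 Or n > 4 Then Err.Raise 5

    ' The lead byte carries 7, 5, 4 or 3 payload bits depending on the length.
    b = Val("&H" & Mid$(hexBytes, 1, 2))
    Select Case n
        Case 1: cp = b And &H7F
        Case 2: cp = b And &H1F
        Case 3: cp = b And &HF
        Case 4: cp = b And &H7
    End Select

    ' Each continuation byte 10xxxxxx contributes its low 6 bits.
    For i = 2 To n
        b = Val("&H" & Mid$(hexBytes, i * 2 - 1, 2))
        cp = cp * &H40 + (b And &H3F)
    Next i

    Utf8HexToCodePoint = cp
End Function
```

Utf8HexToCodePoint("E9A39E") returns 39134, which is &H98DE, the code point of 飞. For code points up to &HFFFF the result can be passed straight to ChrW.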
'GB2312 encoding
Public Function GBKEncode(ByVal szInput As String) As String
    Dim i As Long
    Dim x() As Byte

    If szInput = "" Then Exit Function
    'StrConv uses the system ANSI code page, which must be 936 (GBK) for this to work
    x = StrConv(szInput, vbFromUnicode)
    For i = LBound(x) To UBound(x)
        'Pad to two hex digits so that bytes below &H10 stay well-formed
        GBKEncode = GBKEncode & "%" & Right("0" & Hex(x(i)), 2)
    Next
End Function
'GB2312 decoding
'Expects every byte to be written as %XX, as produced by GBKEncode
Public Function GBKDecode(ByVal code As String) As String
    Dim bytes() As Byte
    Dim index As Long
    Dim count As Long

    code = Replace(code, "%", "")
    count = Len(code) \ 2
    If count = 0 Then Exit Function
    'Collect all the bytes first, so single-byte and double-byte characters can be mixed
    ReDim bytes(count - 1)
    For index = 0 To count - 1
        bytes(index) = Val("&H" & Mid(code, index * 2 + 1, 2))
    Next index
    GBKDecode = StrConv(bytes, vbUnicode)
End Function
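The GB2312 routines depend on the system's ANSI code page. StrConv also accepts an optional locale ID as its third argument, so one way to loosen that dependency is to pass the Chinese (PRC) locale explicitly. The sketch below is a hypothetical variant, not the article's code, and it assumes that code page 936 is installed on the machine.

```vb
' Hypothetical variant: percent-encode every byte of s using the Chinese (PRC)
' locale instead of the system default. Assumes code page 936 is installed.
Public Function GbkPercentEncode(ByVal s As String) As String
    Const LCID_ZH_CN As Long = &H804      ' Chinese (PRC)
    Dim b() As Byte
    Dim i As Long

    If Len(s) = 0 Then Exit Function
    b = StrConv(s, vbFromUnicode, LCID_ZH_CN)
    For i = LBound(b) To UBound(b)
        GbkPercentEncode = GbkPercentEncode & "%" & Right$("0" & Hex$(b(i)), 2)
    Next i
End Function
```

ASCII characters come out as a single escape (GbkPercentEncode("A") gives "%41"), and each Chinese character in GB2312 comes out as two escapes.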
'Binary string to hexadecimal string, four bits at a time
Public Function c2to16(ByVal x As String) As String
    Dim i As Long
    For i = 1 To Len(x) Step 4
        c2to16 = c2to16 & Hex(c2to10(Mid(x, i, 4)))
    Next
End Function
'Binary string to decimal value
Public Function c2to10(ByVal x As String) As Long
    Dim i As Long
    c2to10 = 0
    If x = "0" Then Exit Function
    For i = 0 To Len(x) - 1
        'The rightmost digit is bit 0
        If Mid(x, Len(x) - i, 1) = "1" Then c2to10 = c2to10 + 2 ^ i
    Next
End Function
'Decimal to base n (default 2)
Public Function c10ton(ByVal x As Integer, Optional ByVal n As Integer = 2) As String
    Dim i As Integer
    i = x \ n
    If i > 0 Then
        'Convert the higher digits first, then append the current digit
        If x Mod n >= 10 Then
            'Digits 10 and above become the letters A, B, C...
            c10ton = c10ton(i, n) & Chr(x Mod n + 55)
        Else
            c10ton = c10ton(i, n) & CStr(x Mod n)
        End If
    Else
        If x >= 10 Then
            c10ton = Chr(x + 55)
        Else
            c10ton = CStr(x)
        End If
    End If
End Function
Source: Hacklog【Hacklog】
License: This site follows the Attribution-NonCommercial-ShareAlike 3.0 license. When reposting, please credit Hacklog【荒野无灯weblog】
Permalink: http://ihacklog.com/?p=6






