Is IsTextUnicode reliable?
RB Smissaert - 20 Jun 2008 19:49 GMT
It looks the API IsTextUnicode is not reliable even when it repeatedly runs on the same string. So, it can give False or True on the same string. Am I doing something wrong or is this API indeed not reliable?

Private Declare Function IsTextUnicode Lib "advapi32" _
    (lpBuffer As Any, _
    ByVal cb As Long, _
    lpi As Long) As Long

Public Function IsUnicodeStr(sBuffer As String) As Boolean

    Const IS_TEXT_UNICODE_UNICODE_MASK = &HF

    'Returns True if sBuffer evaluates to a Unicode string
    Dim dwRtnFlags As Long

    'note we need a variable dwRtnFlags here as dwRtnFlags is [in] [out]
    '-------------------------------------------------------------------
    dwRtnFlags = IS_TEXT_UNICODE_UNICODE_MASK
    IsUnicodeStr = IsTextUnicode(StrPtr(sBuffer), Len(sBuffer), dwRtnFlags)

End Function
RBS
MikeD - 20 Jun 2008 20:08 GMT
> It looks the API IsTextUnicode is not reliable even when it repeatedly runs on the same string.
> So, it can give False or True on the same string. Am I doing something wrong or is this API indeed
[quoted text clipped - 18 lines]
>
> End Function

Just a guess (having never used this particular function before and not bothering to look at its docs), try changing the declaration to this:

Private Declare Function IsTextUnicode Lib "advapi32" _
    (ByVal lpBuffer As Long, _
    ByVal cb As Long, _
    lpi As Long) As Long
However, I'd think that would ALWAYS return 1 because internally all strings in VB are unicode and you're passing a string pointer. This is just a guess, though.
Did you try googling on that function name to see if anybody's already posted an example?
 Signature Mike Microsoft Visual Basic MVP
RB Smissaert - 20 Jun 2008 20:51 GMT
I think the trouble was somewhere else as I can't reproduce this when I call it plain and simple:
Dim str As String
str = "test"
MsgBox IsUnicodeStr(str), , "IsTextUnicode"
This is always giving False.
RBS
>> It looks the API IsTextUnicode is not reliable even when it repeatedly
>> runs on the same string.
[quoted text clipped - 35 lines]
> Did you try googling on that function name to see if anybody's already
> posted an example?

MikeD - 20 Jun 2008 23:11 GMT
>I think the trouble was somewhere else as I can't reproduce this when I
>call it plain and simple:
[quoted text clipped - 6 lines]
>
> This is always giving False.

You didn't mention if you changed your declaration to what I suggested (which Scott's post also claims would be correct). It's the ByVal in the declaration that's important. I just made the data type Long because you're passing a pointer, which is a Long. It should "work" whether you use Any or Long as the data type in the declaration provided you're passing it ByVal.
Scott and I are also both in agreement that this is a pointless call to make because strings in VB are unicode. Why you're getting a return value of False, to me, only indicates you're not calling the API function correctly OR this particular API function simply won't work properly in VB due to the way VB handles strings.
The only other thing I can suggest is to use the StrConv function to explicitly convert the string to Unicode and compare the number of bytes to the original string. If the number of bytes is double, the original string is ANSI. Now, I don't know how reliable this would be for certain code pages, if reliable at all, but that's the best alternative I can think of since I just don't believe that this API function is going to be reliable in VB (if called properly, my expectation would be that it should *always* indicate the string is Unicode). But again, I've never used this function in VB, let alone read the docs on it. So I still suggest that you do a Google search, as I'd bet dollars to doughnuts that this function has been discussed (and even after I suggested it once, you STILL never said if you bothered to search and what you found, assuming you did search).
 Signature Mike Microsoft MVP Visual Basic
Scott Seligman - 20 Jun 2008 21:59 GMT
>It looks the API IsTextUnicode is not reliable even when it repeatedly
>runs on the same string. So, it can give False or True on the same
>string. Am I doing something wrong or is this API indeed not reliable?

You need to pass the pointer ByVal and double the count of characters since the function needs a count of bytes.
That said, it's a pointless call, since VB strings are Unicode.
 Signature --------- Scott Seligman <scott at <firstname> and michelle dot net> --------- There are fewer great satisfactions than that of self. -- Calhoun in Star Trek: New Frontier: Being Human by Peter David
RB Smissaert - 20 Jun 2008 23:06 GMT
It sure is confusing this. I was just playing with the code posted here: http://vbnet.mvps.org/index.html?code/shell/undocshelldlgs.htm
You say VB strings are Unicode, and I take it that applies to plain and simple strings, but maybe it is a different matter if code is involved, as on the above site.
Your suggestions seem to make it work fine now:
Option Explicit

Private Declare Function IsTextUnicode Lib "advapi32" _
    (lpBuffer As Long, _
    ByVal cb As Long, _
    lpi As Long) As Long

Function ShowByteArray(ByteArray() As Byte) As String

    Dim i As Long
    Dim LB As Long
    Dim UB As Long

    LB = LBound(ByteArray)
    UB = UBound(ByteArray)

    ShowByteArray = ByteArray(LB)

    If UBound(ByteArray) > LB Then
        For i = LB + 1 To UB
            ShowByteArray = ShowByteArray & vbCrLf & ByteArray(i)
        Next i
    End If

End Function

Sub testIsUnicodeStr()

    Dim str As String

    str = "test"

    MsgBox ShowStringAsBytes(str), , "original"
    MsgBox IsUnicodeStr(str), , "IsTextUnicode"

    str = StrConv(str, vbFromUnicode)

    MsgBox ShowStringAsBytes(str), , "ANSI"
    MsgBox IsUnicodeStr(str), , "IsTextUnicode"

    str = StrConv(str, vbUnicode)

    MsgBox ShowStringAsBytes(str), , "Unicode"
    MsgBox IsUnicodeStr(str), , "IsTextUnicode"

End Sub

Public Function IsUnicodeStr(sBuffer As String) As Boolean

    Const IS_TEXT_UNICODE_UNICODE_MASK = &HF

    'Returns True if sBuffer evaluates to a Unicode string
    Dim dwRtnFlags As Long

    'note we need a variable dwRtnFlags here as dwRtnFlags is [in] [out]
    '-------------------------------------------------------------------
    dwRtnFlags = IS_TEXT_UNICODE_UNICODE_MASK
    IsUnicodeStr = IsTextUnicode(ByVal StrPtr(sBuffer), Len(sBuffer) * 2, dwRtnFlags)

End Function
RBS
>>It looks the API IsTextUnicode is not reliable even when it repeatedly
>>runs on the same string. So, it can give False or True on the same
[quoted text clipped - 4 lines]
>
> That said, it's a pointless call, since VB strings are Unicode.

Thorsten Albers - 20 Jun 2008 22:49 GMT
RB Smissaert <bartsmissaert@blueyonder.co.uk> wrote in message <OkcdjZw0IHA.6096@TK2MSFTNGP06.phx.gbl>...
> IsUnicodeStr = IsTextUnicode(StrPtr(sBuffer), Len(sBuffer), dwRtnFlags)

- 'lpBuffer' here has to be passed 'ByVal' since a pointer to the first character of the string is passed ('ByRef' would be a pointer to a pointer to the string).
- In 'cb' has to be passed 'LenB(sBuffer)', i.e. the count of bytes, not the count of characters.
- If used correctly with a VB string this procedure should return always TRUE since VB strings are always Unicode. So it doesn't make much sense to call this procedure with a VB string as the first argument...
In general you shouldn't rely on IsTextUnicode(). Instead you should use the 'byte order mark' (FEFFh / FFFEh) to check for a Unicode text, and/or you should let the user select ANSI or Unicode character processing.
 Signature ---------------------------------------------------------------------- Thorsten Albers albers(a)uni-freiburg.de ----------------------------------------------------------------------
RB Smissaert - 20 Jun 2008 23:29 GMT
Thanks for the tips.

> Instead you should use the 'byte order mark' (FEFFh / FFFEh) to check for
> a Unicode text

How would that work?
RBS
> RB Smissaert <bartsmissaert@blueyonder.co.uk> wrote in message
> <OkcdjZw0IHA.6096@TK2MSFTNGP06.phx.gbl>...
[quoted text clipped - 13 lines]
> the 'byte order mark' (FEFFh / FFFEh) to check for a Unicode text, and/or
> you should let the user select ANSI or Unicode character processing.

Jim Mack - 21 Jun 2008 03:07 GMT
> Thanks for the tips.
>
>> Instead you should use the 'byte order mark' (FEFFh / FFFEh) to
>> check for a Unicode text
>
> How would that work?

"Compatible" Unicode text files begin with a BOM, which is a 16-bit value not otherwise used as a Unicode character (reserved). If the first character in the file is FEFF or FFFE, you have a Unicode text file... further, which of those you see tells you if the characters are in big- or little-endian order.
-- Jim Mack MicroDexterity Inc www.microdexterity.com
> RBS
>
[quoted text clipped - 23 lines]
>> albers(a)uni-freiburg.de
>> ----------------------------------------------------------------------

RB Smissaert - 21 Jun 2008 07:25 GMT
OK, so would a function like this pick it up:
Function FileUnicode(strFile As String) As String

    'Bytes         Encoding Form
    '----------------------------------
    '00 00 FE FF   UTF-32, big-endian
    'FF FE 00 00   UTF-32, little-endian
    'FE FF         UTF-16, big-endian
    'FF FE         UTF-16, little-endian
    'EF BB BF      UTF-8

    Dim hFile As Long
    Dim A As Byte
    Dim B As Byte
    Dim C As Byte
    Dim D As Byte

    On Error GoTo ERROROUT

    hFile = FreeFile

    Open strFile For Binary As #hFile

    Get #hFile, 1, A

    Select Case A
        Case 0
            Get #hFile, 2, B
            If B = 0 Then
                Get #hFile, 3, C
                If C = 254 Then
                    Get #hFile, 4, D
                    If D = 255 Then
                        '00 00 FE FF
                        FileUnicode = "UTF-32, big-endian"
                    End If
                End If
            End If
        Case 239 'EF
            Get #hFile, 2, B
            If B = 187 Then
                Get #hFile, 3, C
                If C = 191 Then
                    'EF BB BF
                    FileUnicode = "UTF-8"
                End If
            End If
        Case 254 'FE
            Get #hFile, 2, B
            If B = 255 Then
                'FE FF
                FileUnicode = "UTF-16, big-endian"
            End If
        Case 255 'FF
            Get #hFile, 2, B
            If B = 254 Then
                Get #hFile, 3, C
                'the fourth byte is at position 4, not 3
                Get #hFile, 4, D
                If C = 0 And D = 0 Then
                    'FF FE 00 00
                    FileUnicode = "UTF-32, little-endian"
                Else
                    'FF FE followed by anything else
                    FileUnicode = "UTF-16, little-endian"
                End If
            End If
    End Select

ERROROUT:
    Close #hFile

End Function
I don't really need this, but was just playing with this to get the feel of ANSI <> Unicode. Did Google, but couldn't find VB code to pick this BOM up, so put together the above.
RBS
>> Thanks for the tips.
>>
[quoted text clipped - 43 lines]
>>> -------------------------------------------------------------------
> ---

Jim Mack - 21 Jun 2008 13:28 GMT
> OK, so would a function like this pick it up:

I didn't parse your code, but it's the right idea. Of the ones you list, UTF-16 is the only one I ever see 'in the wild', so a simpler test is just to read the first integer to see if it's -2 or -257.
Note that it's an affirmative test only: you can be sure it's UTF-16 if you see the BOM, but absence of a BOM means nothing.
-- Jim
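A minimal sketch of the integer test described above, in VB6. The function name Utf16BomKind and its return strings are hypothetical, not from the thread. It assumes the file is read with Get into a 16-bit Integer, which VB stores little-endian, so the bytes FF FE read as -257 and the bytes FE FF read as -2.

```vb
'Hypothetical helper, not from the thread: reports a UTF-16 BOM by
'reading the first two bytes of a file as a signed 16-bit Integer.
'Returns an empty string when no UTF-16 BOM is found.
Public Function Utf16BomKind(ByVal sFile As String) As String

    Dim hFile As Integer
    Dim iFirst As Integer

    On Error GoTo CleanUp

    'Plain Open For Binary creates a missing file, so check it exists first
    If Len(sFile) = 0 Then Exit Function
    If Len(Dir$(sFile)) = 0 Then Exit Function

    hFile = FreeFile
    Open sFile For Binary Access Read As #hFile

    If LOF(hFile) >= 2 Then
        'Bytes FF FE arrive as &HFEFF (-257), bytes FE FF as &HFFFE (-2)
        Get #hFile, 1, iFirst
        Select Case iFirst
            Case -257
                Utf16BomKind = "UTF-16, little-endian"
            Case -2
                Utf16BomKind = "UTF-16, big-endian"
        End Select
    End If

CleanUp:
    If hFile <> 0 Then Close #hFile

End Function
```

As noted above, an empty result only means no BOM was seen; it does not show that the file is not UTF-16.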
> Function FileUnicode(strFile As String) As String
>
[quoted text clipped - 124 lines]
>>>> -------------------------------------------------------------------
>> ---

RB Smissaert - 21 Jun 2008 18:01 GMT
Thanks again for the tips; got this all now.
RBS
>> OK, so would a function like this pick it up:
>
[quoted text clipped - 138 lines]
> --
>>> ---

Dean Earley - 23 Jun 2008 08:46 GMT
> It looks the API IsTextUnicode is not reliable even when it repeatedly
> runs on the same string.
> So, it can give False or True on the same string. Am I doing something
> wrong or is this API indeed
> not reliable?

It is not:
http://blogs.msdn.com/michkap/archive/2005/01/30/363308.aspx
http://blogs.msdn.com/oldnewthing/archive/2007/04/17/2158334.aspx
http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
The function has to guess to decide one way or the other.
 Signature Dean Earley (dean.earley@icode.co.uk) i-Catcher Development Team
iCode Systems
Tony Proctor - 23 Jun 2008 11:23 GMT
As just about everyone else has said <grin>, VB Strings are always held in Unicode and so the call isn't very helpful.
However, I would like to add that if you've imported some textual data and you're trying to determine whether the encoding is Unicode or some other SBCS/DBCS, then it should not be imported into String variables. Putting non-Unicode data into String variables breaks several rules and could cause run-time errors. Such data should be imported into Byte arrays, and then calling IsTextUnicode could be useful.
Tony Proctor
> It looks the API IsTextUnicode is not reliable even when it repeatedly
> runs on the same string.
[quoted text clipped - 22 lines]
>
> RBS
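A rough sketch of the Byte-array approach Tony describes, in VB6. The function name FileLooksUnicode is hypothetical. The declaration and the &HF mask follow the ones posted earlier in the thread, and whether that mask is the right set of tests for a given file is an assumption. The raw bytes are read without any String conversion, and the first array element is passed ByRef As Any so the API receives a pointer to the data.

```vb
Option Explicit

'Buffer declared ByRef As Any so an array element can be passed directly
Private Declare Function IsTextUnicode Lib "advapi32" _
    (lpBuffer As Any, _
    ByVal cb As Long, _
    lpi As Long) As Long

Private Const IS_TEXT_UNICODE_UNICODE_MASK As Long = &HF

'Hypothetical helper, not from the thread
Public Function FileLooksUnicode(ByVal sFile As String) As Boolean

    Dim hFile As Integer
    Dim abData() As Byte
    Dim lFlags As Long

    On Error GoTo CleanUp

    'Plain Open For Binary creates a missing file, so check it exists first
    If Len(sFile) = 0 Then Exit Function
    If Len(Dir$(sFile)) = 0 Then Exit Function

    hFile = FreeFile
    Open sFile For Binary Access Read As #hFile

    If LOF(hFile) > 0 Then
        'Read the raw bytes; no ANSI/Unicode conversion takes place
        ReDim abData(0 To LOF(hFile) - 1)
        Get #hFile, 1, abData

        'lpi is [in] [out]: tests to run on entry, test results on return
        lFlags = IS_TEXT_UNICODE_UNICODE_MASK
        FileLooksUnicode = (IsTextUnicode(abData(0), UBound(abData) + 1, lFlags) <> 0)
    End If

CleanUp:
    If hFile <> 0 Then Close #hFile

End Function
```

The result is still only the API's guess, as Dean's links point out; checking lFlags after the call, or checking for a BOM first, gives more to go on than the Boolean alone.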