Hi,
I’m trying to use a C DLL. Most of the functions take int or double or char* and I think there are examples of passing in all these in the docs.
However, one function I need returns a wchar_t* of NULL-terminated UTF16-encoded bytes – it is owned by the DLL and could be up to 2MB. How do I copy this into a String?
I finally figured out how to do it. I doubt that my solution is the most efficient (which is why I haven’t marked it as solved), but it works for me (using Julia 0.6):
function utf16tostring(p::Ptr{UInt16}; maxlen=100_000)
    text = ""
    i = 1
    while i <= maxlen
        u = unsafe_load(p, i)
        if u == 0; break end
        text *= string(convert(Char, u))
        i += 1
    end
    text
end
For reference, you really want to avoid calling *= in a loop, as it allocates a new string on every iteration. Here transcode is the best solution, but in general for this kind of situation you’d better collect the characters in an array and only then create a string from it. Of course, it’s even better if you can load the string directly from memory.
I looked in base/c.jl and found cwstring() which is great for creating wchar_t to pass into C functions, so thanks for mentioning that!
But using transcode() doesn’t work because I only have a NULL-terminated Ptr{Cwchar_t} of unknown length and transcode() won’t accept a pointer. Nor can I see any way to convert a Ptr{Cwchar_t} to a Vector{Cwchar_t}. I also tried changing my return type from Ptr{Cwchar_t} to Cwstring and then doing convert(UInt16, p) where p is a Cwstring but that didn’t work either.
Also, I guessed that *= is bad in loops (like Python) but I don’t know the length of the string, so I’m stuck.
We could probably stand to make this a little easier to use. The minimalist goal of transcode was to allow calling Windows APIs without needing a full-blown UTF-16 string type in Base. PRs and API design ideas welcome.
This also won’t handle non-BMP characters (i.e. characters that are encoded via two surrogate characters in UTF-16).
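To make the surrogate-pair problem concrete, here is a small sketch (assuming Julia 1.x rather than the 0.6 used in this thread; `naive_utf16` is a made-up name that mimics the per-unit loop above). It compares converting each 16-bit code unit on its own with letting transcode decode the whole buffer:

```julia
# U+1F600 lies outside the BMP, so UTF-16 needs two code units for it.
s = "a\U1F600"
units = transcode(UInt16, s)   # UInt16[0x0061, 0xd83d, 0xde00]

# Hypothetical per-unit conversion, mirroring the loop above: every 16-bit
# code unit is treated as if it were a complete character.
naive_utf16(units) = join(Char(u) for u in units)

naive = naive_utf16(units)
proper = transcode(String, units)

println(length(units))   # 3 code units for 2 characters
println(proper == s)     # true: transcode recombines the surrogate pair
println(naive == s)      # false: each surrogate was encoded separately
```

Strings that contain only BMP characters come out the same either way, which is why the problem is easy to miss in testing.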
One option might be to use the UTF16String type that was moved to the LegacyStrings package.
You can use String(utf16(p)), where p is a Ptr{UInt16} or Ptr{Int16}, to get a UTF16String from a pointer to a null terminated UTF-16 string, and then return it as a UTF-8 encoded String type.
I tried LegacyStrings but it gave lots of warnings so I gave up on it. So I tried a new solution that didn’t keep appending to a string:
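function utf16tostring(p::Ptr{UInt16}; maxlen=100_000)
    chars = zeros(UInt16, maxlen)
    i = 1
    while i <= maxlen
        u = unsafe_load(p, i)
        if u == 0; break end
        chars[i] = u
        i += 1
    end
    # Slice the code units before transcoding: indexing the resulting String
    # with 1:i-1 would count UTF-8 bytes, not UTF-16 code units.
    transcode(String, chars[1:i-1])
end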
It runs about 5% faster on my test data.
I do think that the standard library should provide a Cwstring to String conversion function though.
Unless you know a plausible value for maxlen, it’s more efficient either to scan the string once to compute its length before allocating the vector, or to start with an empty vector and call push! repeatedly. push! over-allocates, growing the vector by more than is needed each time, so this pattern does not trigger a reallocation for every added value.
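The push! variant might look roughly like this. It is only a sketch, assuming Julia 1.x; `utf16_collect` is a hypothetical name, and it drops the maxlen guard, so the caller has to guarantee that the data really is NUL-terminated:

```julia
# Hypothetical helper: collect code units with push!, then transcode once.
function utf16_collect(p::Ptr{UInt16})
    units = UInt16[]              # start empty; push! grows it as needed
    i = 1
    while true
        u = unsafe_load(p, i)     # reads past the end if there is no NUL
        u == 0 && break
        push!(units, u)
        i += 1
    end
    return transcode(String, units)
end

# Usage: build a NUL-terminated UTF-16 buffer in Julia and read it back.
buf = push!(transcode(UInt16, "héllo"), 0x0000)
result = GC.@preserve buf utf16_collect(pointer(buf))
println(result)
```

The GC.@preserve is only there because the test buffer is owned by Julia; memory owned by the DLL does not need it.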
EDIT: like the code @ihnorton linked to, it would be even better to compute the length of the string, then use unsafe_wrap(Array, p, len) and call transcode on the result. That way you avoid an intermediate copy.
Thanks, that’s led to a new (slightly faster) and much shorter version that doesn’t preallocate lots of memory:
function utf16tostring(p::Ptr{UInt16}; maxlen=100_000)
    len = 0
    # Check the limit first so that we never read past maxlen code units.
    while len < maxlen && unsafe_load(p, len + 1) != 0; len += 1; end
    transcode(String, unsafe_wrap(Array{UInt16,1}, p, len))
end
I decided to keep the maxlen limit to be on the safe side.
By “this”, I guess you meant the utf16tostring method posted above. The transcode function, on the other hand, certainly handles non-BMP characters when translating between UTF-8 and UTF-16 in either direction.
Determining the length by searching for a NUL 16-bit word, then wrapping the Ptr{UInt16} in a Vector{UInt16}, then calling transcode, should be the right thing to do.
I agree that this should be built-into Base, however. In analogy with unsafe_string, maybe just an unsafe_transcode(String, p::Cwstring, length=wcslen(p)) method. LegacyStrings should be updated if needed, but it shouldn’t be needed just to convert data to/from String.
Yes, I meant the original version in this post, which treated each 16-bit code unit as if it were a character.
The final version, which uses transcode, should be better. However, scanning first for a 16-bit 0, wrapping the pointed-to words as a Vector, and then scanning the string yet again to calculate its length in UTF-8 before finally creating the String will not be that efficient.
Yes, although I’d have separate methods instead of calling wcslen(p), to avoid scanning twice. Also, Cwstring can be 16-bit or 32-bit characters, depending on the platform, so for UTF-16 the methods would be
unsafe_transcode(String, p::Ptr{UInt16}) and unsafe_transcode(String, p::Ptr{UInt16}, len::Int)
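As a rough illustration of that two-method shape, here is a sketch assuming Julia 1.x. The name `my_unsafe_transcode` is hypothetical (nothing like it exists in Base), and it simply delegates to unsafe_wrap and transcode, so it still makes the extra pass inside transcode that was mentioned above:

```julia
# Hypothetical helpers; these names are made up for illustration.

# Known length: wrap the memory without copying, then transcode.
function my_unsafe_transcode(::Type{String}, p::Ptr{UInt16}, len::Integer)
    units = unsafe_wrap(Array, p, len)   # no copy; p must stay valid meanwhile
    return transcode(String, units)
end

# Unknown length: scan for the terminating 16-bit zero first.
function my_unsafe_transcode(::Type{String}, p::Ptr{UInt16})
    len = 0
    while unsafe_load(p, len + 1) != 0
        len += 1
    end
    return my_unsafe_transcode(String, p, len)
end

# Usage with a Julia-owned, NUL-terminated test buffer.
buf = push!(transcode(UInt16, "wide \U1F600 text"), 0x0000)
a = GC.@preserve buf my_unsafe_transcode(String, pointer(buf))
b = GC.@preserve buf my_unsafe_transcode(String, pointer(buf), 4)
println(a)
println(b)
```

A caller that already knows the length (for example, because the C API returns it) can use the three-argument method and skip the NUL scan entirely.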