How do you convert a wchar_t* returned by a C DLL to a String?

mark · March 8, 2017, 7:27am

Hi,
I’m trying to use a C DLL. Most of the functions take int or double or char* and I think there are examples of passing in all these in the docs.

However, one function I need returns a wchar_t* of NULL-terminated UTF16-encoded bytes – it is owned by the DLL and could be up to 2MB. How do I copy this into a String?

Thanks.

mark · March 8, 2017, 2:15pm

I finally figured out how to do it. I doubt that my solution is the most efficient (which is why I haven’t marked it as solved), but it works for me (using Julia 0.6):

function utf16tostring(p::Ptr{UInt16}; maxlen=100_000)
    text = ""
    i = 1
    while i <= maxlen
        u = unsafe_load(p, i)
        if u == 0; break end
        text *= string(convert(Char, u))
        i += 1
    end
    text
end

ihnorton · March 8, 2017, 3:00pm

For 0.6, use transcode(UInt8, s::Vector{Cwchar_t}) to convert to UTF8. There are also some Windows-only helper functions for Cwstring in base/c.jl.

(Cwchar_t == UInt16 on Windows).
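As a minimal sketch of the transcode call mentioned above (assuming Julia 1.x, and using UInt16 directly rather than Cwchar_t so that it behaves the same on every platform), the following converts a vector of UTF-16 code units to a String. The variable names are only illustrative.

```julia
# UTF-16 code units for "hé" (no terminator; the vector carries its own length)
units = UInt16[0x0068, 0x00e9]

utf8 = transcode(UInt8, units)      # Vector{UInt8} holding the UTF-8 bytes
s = String(utf8)                    # builds the String from those bytes

# transcode(String, units) performs both steps in one call
s2 = transcode(String, units)
```

Note that this only covers the case where the code units are already in a Julia vector; getting from a raw pointer to a vector is a separate step.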

nalimilan · March 8, 2017, 7:31pm

For reference, you really want to avoid calling *= in a loop, as it allocates a new string on each iteration. Here transcode is the best solution, but in general for this kind of situation you’d better collect the characters in an array and only then create a string from it. Of course, it’s even better if you can load the string directly from memory.
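To illustrate the general pattern (not the transcode route), here is a small sketch, assuming Julia 1.x, that collects characters in a vector and builds the String once at the end. The function name collectchars is hypothetical.

```julia
# Hypothetical helper: build a String from code points without using *= in a loop.
function collectchars(codepoints)
    chars = Char[]
    for cp in codepoints
        push!(chars, Char(cp))   # appending to a vector is amortized O(1)
    end
    return String(chars)         # a single String is created at the end
end

s = collectchars(UInt32[0x68, 0xe9, 0x1f600])
```

Each *= on a String copies everything accumulated so far, so the loop in the first version does quadratic work in the worst case, while this one stays linear.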

mark · March 8, 2017, 8:10pm

I looked in base/c.jl and found cwstring() which is great for creating wchar_t to pass into C functions, so thanks for mentioning that!

But using transcode() doesn’t work because I only have a NULL-terminated Ptr{Cwchar_t} of unknown length and transcode() won’t accept a pointer. Nor can I see any way to convert a Ptr{Cwchar_t} to a Vector{Cwchar_t}. I also tried changing my return type from Ptr{Cwchar_t} to Cwstring and then doing convert(UInt16, p) where p is a Cwstring but that didn’t work either.

Also, I guessed that *= is bad in loops (like Python) but I don’t know the length of the string, so I’m stuck.

yuyichao · March 8, 2017, 9:05pm

http://docs.julialang.org/en/latest/manual/calling-c-and-fortran-code.html#Accessing-Data-through-a-Pointer-1

And more specifically http://docs.julialang.org/en/latest/stdlib/c.html#Base.unsafe_wrap-Tuple{Union{Type{Array{T,N}%20where%20N},%20Type{Array{T,N}},%20Type{Array}},Ptr{T},Tuple{Vararg{Int64,N}}}%20where%20N%20where%20T

ihnorton · March 8, 2017, 9:26pm

As far as finding the length, here’s some helper code from older Julia versions.

StefanKarpinski · March 8, 2017, 11:00pm

We could probably stand to make this a little easier to use. The minimalist goal of transcode was to allow calling Windows APIs without needing a full-blown UTF-16 string type in Base. PRs and API design ideas welcome.

ScottPJones · March 9, 2017, 3:23am

This also won’t handle non-BMP characters (i.e. characters that are encoded as a surrogate pair, two 16-bit code units, in UTF-16).
One option might be to use the UTF16String type that was moved to the LegacyStrings package.
You can use String(utf16(p)), where p is a Ptr{UInt16} or Ptr{Int16}, to get a UTF16String from a pointer to a null terminated UTF-16 string, and then return it as a UTF-8 encoded String type.
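A short sketch of the surrogate issue, assuming Julia 1.x: converting each 16-bit code unit to a Char on its own does not give the same result as decoding the pair together. The variable names are only illustrative.

```julia
units = UInt16[0xd83d, 0xde00]    # U+1F600 encoded as a UTF-16 surrogate pair

# Naive: treat every 16-bit code unit as its own character.
naive = String([Char(u) for u in units])

# transcode decodes the pair into a single code point.
correct = transcode(String, units)
```

Here correct holds the single character U+1F600, whereas naive holds two lone surrogates and so differs from it.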

mark · March 10, 2017, 9:25am

I tried LegacyStrings but it gave lots of warnings so I gave up on it. So I tried a new solution that didn’t keep appending to a string:

function utf16tostring(p::Ptr{UInt16}; maxlen=100_000)
    chars = zeros(UInt16, maxlen)
    i = 1
    while i <= maxlen
        u = unsafe_load(p, i)
        if u == 0; break end
        chars[i] = u
        i += 1
    end
    # Slice the code units before transcoding: String indices are byte offsets, not UTF-16 units
    transcode(String, chars[1:i-1])
end

It runs about 5% faster on my test data.

I do think that the standard library should provide a Cwstring to String conversion function though.

nalimilan · March 10, 2017, 12:31pm

Unless you know a plausible value for maxlen, it’s going to be more efficient either to go over the string to compute its length before allocating the vector, or to start with an empty vector and call push! repeatedly. push! allocates a bigger vector than is needed, so this kind of pattern does not trigger a reallocation for each added value.
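The push! variant could look roughly like this (a sketch assuming Julia 1.x; the name utf16_via_push is hypothetical, and the test buffer stands in for memory that would really come from the DLL):

```julia
# Hypothetical variant: grow a vector with push! until the NUL terminator is found.
function utf16_via_push(p::Ptr{UInt16})
    units = UInt16[]
    i = 1
    while (u = unsafe_load(p, i)) != 0
        push!(units, u)    # no reallocation on most iterations
        i += 1
    end
    return transcode(String, units)
end

buf = UInt16[0x0068, 0x0069, 0x0000]    # "hi" followed by a NUL terminator
s = GC.@preserve buf utf16_via_push(pointer(buf))
```

GC.@preserve is only needed here because the pointer comes from a Julia array; it keeps buf alive while the raw pointer is in use.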

EDIT: like the code @ihnorton linked to, it would be even better to compute the length of the string, then use unsafe_wrap(Array, p, len) and call transcode on the result. That way you avoid an intermediate copy.

mark · March 10, 2017, 1:46pm

Thanks, that’s led to a new (slightly faster) and much shorter version that doesn’t preallocate lots of memory:

function utf16tostring(p::Ptr{UInt16}; maxlen=100_000)
    len = 0
    # Check the limit first so we never read past maxlen elements
    while len < maxlen && unsafe_load(p, len + 1) != 0; len += 1; end
    transcode(String, unsafe_wrap(Array{UInt16,1}, p, len))
end

I decided to keep the maxlen limit to be on the safe side.

ScottPJones · March 10, 2017, 10:31pm

What sorts of warnings did you see? We use LegacyStrings extensively, and don’t see any warnings (on v0.5).

stevengj · March 10, 2017, 10:59pm

By “this”, I guess you meant the utf16tostring method posted above. The transcode function, on the other hand, certainly handles non-BMP characters when translating between UTF-8 and UTF-16 in either direction.

Determining the length by searching for a NUL 16-bit word, then wrapping the Ptr{UInt16} in a Vector{UInt16}, then calling transcode, should be the right thing to do.

I agree that this should be built into Base, however. In analogy with unsafe_string, maybe just an unsafe_transcode(String, p::Cwstring, length=wcslen(p)) method. LegacyStrings should be updated if needed, but it shouldn’t be needed just to convert data to/from String.

mark · March 11, 2017, 7:24am

I’m using the 0.6 alpha.

ScottPJones · March 11, 2017, 5:13pm

Yes, I meant the original version in this post, which treated each 16-bit code unit as if it were a character.

The final version should be better. However, scanning first for a 16-bit 0, wrapping the pointed-to words as a Vector, and then scanning the string yet again to calculate the length in UTF-8 before finally creating the String will not be that efficient.

Yes, although I’d have separate methods instead of calling wcslen(p), to avoid scanning twice. Also, Cwstring can hold 16-bit or 32-bit characters depending on the platform, so for UTF-16 specifically:
unsafe_transcode(String, p::Ptr{UInt16}) and unsafe_transcode(String, p::Ptr{UInt16}, len::Int)
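For concreteness, the two proposed signatures could be sketched as below, assuming Julia 1.x. unsafe_transcode is hypothetical and not part of Base. This sketch only shows the shape of the API: it still scans for the terminator and then lets transcode make its own passes, so it does not achieve the single-scan efficiency discussed above.

```julia
# Hypothetical API sketch; `unsafe_transcode` is NOT a Base function.
function unsafe_transcode(::Type{String}, p::Ptr{UInt16}, len::Integer)
    # Wrap the foreign memory without copying; transcode allocates the UTF-8 result.
    return transcode(String, unsafe_wrap(Array, p, len))
end

function unsafe_transcode(::Type{String}, p::Ptr{UInt16})
    len = 0
    while unsafe_load(p, len + 1) != 0    # scan for the 16-bit NUL terminator
        len += 1
    end
    return unsafe_transcode(String, p, len)
end

# "h", then U+1F600 as a surrogate pair, then the NUL terminator
buf = UInt16[0x0068, 0xd83d, 0xde00, 0x0000]
s = GC.@preserve buf unsafe_transcode(String, pointer(buf))
```

The returned String owns its own bytes, so it remains valid after the DLL frees or reuses the original buffer.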

ScottPJones · March 11, 2017, 5:14pm

That explains it: LegacyStrings needs to be updated (or dropped and replaced with something better for v0.6).
