UNICODE string from C++ to Julia and vice versa

I’m embedding Julia in a C++ project. Now I need to pass an arbitrary Unicode string to Julia, e.g.
std::wstring test = L"šČô_φ_привет_일보";
Any tips, hints, or a model example of a solution would be appreciated. Thanks.

Just pasting "šČô_φ_привет_일보" into Julia code works fine. Julia strings are all unicode.

(More precisely Julia strings are UTF8-encoded Unicode. Any decent editor should be able to edit UTF8 text, and if you paste Unicode into a UTF8 document it should convert the encoding to UTF8 as needed.)

Sorry, I just realized that because you are embedding Julia, you probably have a wchar_t string in C++ that you want to convert to a Unicode String in Julia at runtime.

Since your std::wstring is a wchar_t* string (UTF-16 on Windows and UTF-32 everywhere else), it is in a different encoding than the UTF-8 encoding used by the Julia String type, so you need to either convert it (the transcode function in Base can do this) or use the WString type in the LegacyStrings package (to use it in Julia without conversion, e.g. to share memory for large strings).
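
To see which of the two encodings a given compiler uses, a small check along these lines may help. It is only a sketch: the helper names wchar_encoding_name and emoji_code_units are made up for illustration, and the mapping "2 bytes means UTF-16, 4 bytes means UTF-32" is the usual convention rather than a guarantee of the C++ standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: names the encoding implied by the width of wchar_t.
// Assumes the common convention (2 bytes -> UTF-16, 4 bytes -> UTF-32).
const char* wchar_encoding_name() {
    assert(sizeof(wchar_t) == 2 || sizeof(wchar_t) == 4);
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}

// Hypothetical helper: number of wchar_t code units used for U+1F600, a
// character outside the Basic Multilingual Plane. UTF-16 needs a surrogate
// pair (2 units); UTF-32 needs a single unit.
std::size_t emoji_code_units() {
    const std::wstring s = L"\U0001F600";
    return s.size();
}
```

The second helper is a reminder that testString.size() counts code units, not characters, so the length passed to Julia should be the code-unit count.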

e.g. suppose you pass the data as a wchar_t *p (or p::Ptr{Cwchar_t} in Julia) and a length L. Then you could call transcode(String, unsafe_wrap(p, L)) to get a Julia String object.

Thanks @stevengj. I’m starting to understand, but what is a sensible way to pass the data as a wchar_t *p to Julia using julia.h? So far I have:

jl_function_t *funcTransc = jl_get_function(jl_base_module, "transcode");
std::wstring testS = L"a_šČô_φ_привет_일보";
const wchar_t *s = testS.c_str();

uint64_t Len = testS.length();
jl_value_t* L = jl_box_uint64(Len);

jl_value_t* p = ???
jl_call2(funcTransc, …);

I would write as much as possible of your glue/conversion code in Julia rather than in C/C++. It will be a lot easier. You can even call cfunction in Julia to return a C function pointer to C, so that you don’t have to mess around with jl_call and jl_value_t* conversions.

I am building an API for application executables that run on each PC. I understand that I should do as little as possible in C++, but I still need a C++ connector to Julia.
Any ideas, or another way to get a Unicode string from C++ to Julia?
Thanks.

You can write the glue code in Julia and then call it from C++.

For example, you could write:

jl_value_t *wstr2jl = jl_eval_string("wstr2jl(wptr, len)  = transcode(String, unsafe_wrap(Ptr{Cwchar_t}(wptr), len))");

and then do

std::wstring test = L"šČô_φ_привет_일보";
jl_value_t *jtest = jl_call2(wstr2jl, jl_box_voidpointer(test.c_str()), jl_box_uint64(test.size()));

Note that the result of jl_box_voidpointer has to be rooted before calling jl_box_uint64.

Thanks @stevengj and @yuyichao, but I get an error with jl_box_voidpointer. I am probably not passing the pointer to the string to Julia correctly. First, I copied your code and got these errors:

Error (active)		argument of type "const wchar_t *" is incompatible with parameter of type "void *" 
Error	C2664	'jl_value_t *jl_box_voidpointer(void *)': cannot convert argument 1 from 'const wchar_t *' to 'void *'

Second, I modified the code to cast to a void pointer and added println() calls:

jl_value_t *wstr2jl = jl_eval_string("function wstr2jl(wptr, len) println(1); out = transcode(String, unsafe_wrap(Ptr{Cwchar_t}(wptr), len)); println(2); out end");
std::wstring testString = L"šČô_φ_привет_일보";
void *my_ptr1 = static_cast <void*>(const_cast <wchar_t*>(testString.c_str()));
jl_value_t *jtest = jl_call2(wstr2jl, jl_box_voidpointer(my_ptr1), jl_box_uint64(testString.size()));

but println(2) is never reached. How do I pass a const wchar_t * pointer to Julia correctly?

The unsafe_wrap function should take 3 arguments, the first being Array in this case:

unsafe_wrap(Array,Ptr{Cwchar_t}(wptr), len)

Inspired by this thread, I’m also attempting to add wstring support to CxxWrap.jl, though it fails on Windows right now. The conversion functions that get called are here:
https://github.com/JuliaInterop/CxxWrap.jl/blob/master/src/CxxWrap.jl#L405-L406

edit: Actually, it only fails on MSVC, not on MinGW. The failing string compare is here:
https://ci.appveyor.com/project/barche/cxxwrap-jl/build/1.0.164/job/xmitpqjk6s1ne63v#L1278

No idea why this fails…

Thanks @barche. That is the Julia part, but what does your C++ part look like? That is, how do you get the wchar_t pointer to Julia? If I use the generic function jl_box_voidpointer(), then I probably have to cast the Ptr{Void} to a Ptr{Cwchar_t}, right?
Correct me if I’m wrong. Thank you very much for taking the time to help me.
After adding Array to unsafe_wrap, execution reaches println(2) (Windows 8.1); now I just need to check the result.

You do this in the Julia code, as I showed in my example. (I forgot to add the Array first argument, as @barche pointed out, and @yuyichao pointed out that you need to root the pointer). Then you call the Julia code from C++. The point, as I’ve been trying to emphasize, is to do as little as possible in C++; write all the glue in Julia.

Thanks, from C++ to Julia it works perfectly. How do I get Unicode back in the reverse direction, from Julia to C++? So far I have:

jl_value_t *jl2wstr = jl_eval_string("function jl2wstr() s = \"a_šČô_φ_привет_일보\";  out = transcode(Cwchar_t, s); println(out); out end");
jl_array_t *jl2wstrOut = (jl_array_t*)jl_call0(jl2wstr);
	if (jl2wstrOut)
	{
		wchar_t* xData2 = (wchar_t*)jl_array_data(jl2wstrOut);
		f << xData2;
		printf("\noutStr = [");
		for (size_t i = 0; i < jl_array_len(jl2wstrOut); ++i)
			printf("%X ", xData2[i]);
		printf("]\n");
	}

But the conversion is wrong. I get
UInt16[0x0061,0x005f,0x009a,0x003f,0xdb37,0xdf9f,0x003f,0x003f,0x003f,0x003f,0x003f,0x003f,0x005f,0x003f,0x003f
instead of
UInt16[0x0061,0x005f,0x0161,0x010c,0x00f4,0x005f,0x03c6,0x005f,0x043f,0x0440,0x0438,0x0432,0x0435,0x0442,0x005f,0xc77c,0xbcf4]

I use jl_new_bits to box the pointer in the correct type:

jl_value_t* wchar_dt = jl_get_global(jl_base_module, jl_symbol("Cwchar_t"));
jl_value_t* ptr_dt = jl_apply_type((jl_value_t*)jl_pointer_type, jl_svec1(wchar_dt));
jl_value_t* ptr = jl_new_bits(ptr_dt, (void*)&x); // Boxed Ptr{Cwchar_t} to pass to Julia, x is the C pointer

This is pieced together a bit from different places around CxxWrap.jl, so it is not tested in the form above (and it is missing GC rooting, and is v0.5-only). The objective is also different, since this code is used to wrap C++ functions that return or consume std::wstring, so if you’re using the embedding interface directly I fully support @stevengj's advice to do as much as possible in Julia (obviously way simpler, too).

The problem is probably that you are entering this string in C++, not in Julia. i.e. the argument to jl_eval_string is a C++ string… how is that encoded by your compiler? (The same Julia code works fine for me in the REPL, so I think the problem is just the encoding of the string by your C++ compiler.)

If your compiler supports C++11, you should do jl_eval_string(u8"...") so that the Julia code is UTF8-encoded.
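
As a rough illustration of what the u8 prefix buys you, the sketch below inspects the bytes of a u8 literal. The function name utf8_bytes_of_sample is made up for this example, and the characters are written as \u escapes so that the result does not depend on how the source file itself is saved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the raw bytes of a u8 literal holding
// U+0161 (š) followed by U+03C6 (φ).
// The reinterpret_cast keeps this compiling both before C++20 (where the
// literal is const char[]) and from C++20 on (where it is const char8_t[]).
std::vector<unsigned char> utf8_bytes_of_sample() {
    const unsigned char* p =
        reinterpret_cast<const unsigned char*>(u8"\u0161\u03C6");
    std::vector<unsigned char> bytes;
    for (std::size_t i = 0; p[i] != 0; ++i) {
        bytes.push_back(p[i]);
    }
    assert(!bytes.empty());
    return bytes;
}
```

The expected bytes are C5 A1 for š and CF 86 for φ, i.e. the UTF-8 encoding. One caveat: from C++20 on, a u8 literal has type const char8_t[], so passing it to a function that takes const char* (as jl_eval_string does) would need a cast like the one above.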

Not sure if it’s related, but for CxxWrap I had to change the encoding of the C++ source file from UTF-8 to UTF-8-BOM for MSVC to do the right thing.

In MSVC I use File >> Advanced Save options… >> Unicode (UTF-8 without signature) so that the text displays correctly.
But I still do not know how to read the Array{Cwchar_t} returned from Julia back in C++.
Can I not use jl_unbox_voidpointer()?
Is this the right way?
jl_array_t *jl2wstrOut = (jl_array_t*)jl_call0(jl2wstr);
wchar_t* xData2 = (wchar_t*)jl_array_data(jl2wstrOut);

That’s basically what I’m doing too, does it give an error?
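
Once you have the data pointer and the element count, one cautious way to get a std::wstring is to copy by explicit length instead of treating the buffer as a null-terminated C string. This is only a sketch: wstring_from_buffer is a made-up helper, and in the code above its arguments would be the results of jl_array_data and jl_array_len. I would not assume that the Julia array ends in a terminating zero, and the copy should be made while the array is still protected from the garbage collector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds a std::wstring from a raw buffer and an
// explicit element count, so no terminating L'\0' is needed in the buffer.
std::wstring wstring_from_buffer(const wchar_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    if (len == 0) {
        return std::wstring();
    }
    // Copies exactly len code units starting at data.
    return std::wstring(data, len);
}
```

After the copy, the std::wstring owns its own memory and no longer depends on the lifetime of the Julia array.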

@barche If I pass the pointer itself instead of x, the program fails. See the code:

jl_value_t* wchar_dt = jl_get_global(jl_base_module, jl_symbol("Cwchar_t"));
jl_value_t* ptr_dt = jl_apply_type((jl_value_t*)jl_pointer_type, jl_svec1(wchar_dt));
//jl_value_t* ptr = jl_new_bits(ptr_dt, (void*)&testString); //pointer to std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > *
const wchar_t* ap = testString.c_str();
jl_value_t* ptr = jl_new_bits(ptr_dt, (void*)&ap); // pointer to const wchar_t **

If I pass e.g. a const wchar_t ** or a std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > *, the program runs. If I pass ap (a const wchar_t *), testString.c_str(), or testString.data(), the program fails. This seems wrong to me, because the function jl_new_bits expects void * data.
Why do I need to pass a pointer to the pointer instead of the pointer itself?
Thanks.
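
One possible explanation, consistent with the (void*)&x in @barche's snippet: jl_new_bits copies the bits of the new value from the address given as its second argument. For a Ptr{Cwchar_t} the value is the pointer itself, so the data argument has to point at the pointer variable. The sketch below imitates that with plain memcpy. BoxedPtr and box_pointer_bits are made-up stand-ins for illustration, not part of julia.h.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Stand-in for a boxed Ptr{Cwchar_t}: its payload is one pointer-sized value.
struct BoxedPtr {
    const wchar_t* value;
};

// Hypothetical stand-in for the copying step: reads sizeof(pointer) bytes
// FROM the address `data` and stores them as the payload.
BoxedPtr box_pointer_bits(const void* data) {
    assert(data != nullptr);
    BoxedPtr box;
    std::memcpy(&box.value, data, sizeof(box.value));
    return box;
}
```

With const wchar_t* ap = testString.c_str();, calling box_pointer_bits(&ap) stores ap itself. Calling box_pointer_bits(ap) would instead copy the first few bytes of the character data and interpret them as an address, which would explain the crash.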