I have an Array{String,1} which I want to convert into an array of data types to use with loadndsparse in JuliaDB. There appears to be no way to convert a string such as “Union{Missing, Int64}” into the corresponding type.
Going from string to type is the same as going from code to result. It’s not conversion; it’s running the code, and you can do that with parsing and eval.
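As a minimal sketch of the parse-then-eval approach described here, assuming the strings contain nothing but trusted type expressions (the variable names are only illustrative):

```julia
# Parse the string into an expression, then evaluate it in the current module.
type_string = "Union{Missing, Int64}"
expr = Meta.parse(type_string)   # an Expr, not yet a type
T = eval(expr)                   # the type the expression evaluates to

# The same idea applied to a whole vector of strings.
type_strings = ["Int64", "Float64", "Union{Missing, Int64}"]
parsed_types = [eval(Meta.parse(s)) for s in type_strings]
```

Because eval runs whatever the string contains, this is only reasonable when the strings are as trustworthy as the code itself.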
That’s almost always a terrible idea. Why do you have a string of expressions that you want to evaluate to type?
I got the types of a sample of the data I want to load, but some columns were of type Missing due to the sparseness of the data, whereas in the full data they will be a Union of Missing and Int64. I want to use the correct types in the colparsers argument to JuliaDB.loadndsparse. So I called describe() on the sample dataframe, exported the result to CSV (which is also broken and requires using show()), filled in the blanks, and reimported the types as strings.
And here I am…
So I made a dict from string to type and mapped the dataframe column…
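A lookup table like the one mentioned here might look roughly as follows. This is a sketch, not the poster's actual code: the name type_lookup and the listed entries are assumptions chosen for illustration.

```julia
# Hypothetical lookup table: only type names expected in the schema are listed.
type_lookup = Dict{String,Type}(
    "Int64" => Int64,
    "Float64" => Float64,
    "String" => String,
    "Union{Missing, Int64}" => Union{Missing, Int64},
)

# Map a column of type names (strings) to the corresponding types.
type_strings = ["Int64", "Union{Missing, Int64}", "Float64"]
column_types = [type_lookup[s] for s in type_strings]
```

An unknown string raises a KeyError instead of being executed, so no code from the input is ever run.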
It’s still not clear why you need to go from types to strings and back to types at all. Why can’t you just call eltypes() on your DataFrame to get a vector of element types (as types, not strings)?
Because the dataframe has missing entries. Does Julia really not have an easy way to load an external data schema for an input file?
I’m still missing something here. As I understand it, you’re saying that saving the results of describe() to a CSV file is giving a different column element type than eltypes() would? Is that right? And can you give an example of the kind of situation you’re seeing? It’s hard to be helpful in the abstract.
Describe is running on a sample of the full dataset. Some of the columns are sparse and therefore empty, or the source is using numbers for chars and other such stupidities which require manual adjustment.
Like I mentioned, I managed a solution, so this topic belongs more in the “we should cover this case if we’re serious about beating Python” bin. I have a new topic in Usage regarding my current travails with JuliaDB, which is my immediate concern.
What you mean by
?
Is it done with a script or did you just open a file to edit it?
If it’s a script then you should be able to manipulate the data directly without saving it to a file and not having to worry about retrieving the type from the file.
If it’s manual editing, well, could you just edit (a copy of) the original data? Are you using the save-to-CSV step as a way to convert the original data to a more editable format? Are you saving the edited data as the new original data for all processing, or are you just looking for an interactive way to fix the data as they come in before you do more processing on them?
Also, taking a step back, if the editing is not scriptable, since you are saying that you are “loading” the data, I assume the type information you need was all in that original data. What format is that?
Type information is external. Think of it like a file containing the column names, except it’s types.
I’m sorry the stuff about sampling the data and generating a dataframe of the Describe() output and editing that was confusing. That’s just something I did as a first pass because the data is too big to load into memory via normal CSV methods.
The issue is that the data has a separate list of types that I would like to load as such, much like column names can be separately stored and loaded.
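One way to picture such an external schema is a small text file with one column name and type name per line. The sketch below assumes a hypothetical "name,type" layout and a hand-written table of known types; neither comes from JuliaDB itself.

```julia
# Hypothetical schema file: one "column_name,type" pair per line.
schema_path = tempname()
write(schema_path, """
id,Int64
price,Float64
count,Union{Missing, Int64}
""")

# Hand-written table of the type names the schema is allowed to use.
known_types = Dict{String,Type}(
    "Int64" => Int64,
    "Float64" => Float64,
    "Union{Missing, Int64}" => Union{Missing, Int64},
)

# Build a column name => type dictionary from the file.
schema = Dict{String,Type}()
for line in eachline(schema_path)
    isempty(strip(line)) && continue
    # limit=2 keeps the comma inside "Union{Missing, Int64}" intact
    name, type_name = split(line, ','; limit=2)
    schema[String(strip(name))] = known_types[String(strip(type_name))]
end
```

A dictionary like this could then be handed to whichever loader argument accepts per-column types, such as the colparsers argument mentioned earlier; the exact form that argument expects should be checked against the JuliaDB documentation.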
The question I was asking from the start is why you have a list of “external” types to begin with. You shouldn’t need that.
So what do you do when the data is too large and sparse to have its types detected? I have 2000+ columns to load here.
If you want to manually specify the type, that’s exactly what you should do. And you manually specify the types as code.
Meta.eval(Meta.parse("Float64"))
This gives me a result that is a data type. The Meta package appears to be built in and doesn’t need to be added.
Yes, just don’t treat that string as data. You have to trust that string as much as you trust your code.
Also, don’t use Meta.eval. Use eval.
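The likely reason for that advice: Meta.eval is the eval that belongs to the Meta module, so it evaluates the expression inside that module's namespace, while plain eval evaluates in the module where it is called. A small sketch, using Core.eval to make the target module explicit:

```julia
expr = Meta.parse("Union{Missing, Float64}")

# Plain eval evaluates in the module where it is called.
T1 = eval(expr)

# Core.eval takes the target module as an explicit first argument.
T2 = Core.eval(Main, expr)
```

For types defined in Base both give the same result; the difference matters once the string names a type defined in your own module.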