Split string to multiple columns

Pretty simple, but I’ve been banging my head against it for a while.

I have an array of some thousands of strings representing data files named like “abc-xyz.csv”. I can split on “-” and “.” easy enough, and get a vector of substring arrays.

At most I can coerce this into a vector of arrays of strings. I can’t even get this into a dataframe properly; it comes in as thousands of columns and 3 rows, and there does not appear to be a way to just invert a dataframe.

I’ve loaded the filename array into a dataframe and applied the split, but this just gives me a single column containing 3-element arrays, which I am again stuck with.

This is an issue because the information in the filename describes the observation, and its parts are variables to be used in the analysis. The goal is to populate these variables from the filenames, then load the files, each of which contains one observation, and append the data to the row containing the filename.

I’m also curious about the best way to do this, since I could see it getting into a lot of interesting concurrent IO questions. But for now I just want to learn how to not be abused by the type system, thanks.

With your array of substrings, can you do

DataFrame(vcat(A...))

?

This seems to create a dataframe with columns for “offset” and “ncodeunits” in addition to the source string.

I’m not entirely sure what layout you want for the dataframe.
The vector of substring arrays looks like [["abc", "xyz"], ["ghi", "klm"], …]
And you want a DataFrame like this:

header: [First, Second]
row1: ["abc", "xyz"]
row2: ["ghi", "klm"]

Is that correct? Especially, do you have some code/minimum working example?

Yep. Maybe I should be iterating over the rows and just doing it all at once, i.e. putting the elements of the split string into some union tuple with the other data from the file and making the dataframe out of those, somehow?

As a first “just works” idea (probably with poor performance), you can try to put zip(arr...) into the DataFrame rather than just arr. That should somewhat flip it over.
(Here arr is the vector of substring arrays)
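An alternative that avoids splatting a long vector (splatting thousands of arrays into one call can be hard on the compiler) is to pull out one column at a time with broadcasting. This is only a minimal sketch: the sample file names and the column names part1, part2 and ext are made up for illustration.

```julia
using DataFrames

# Hypothetical sample standing in for the real vector of substring arrays
arr = [split(name, r"[-.]") for name in ["abc-xyz.csv", "ghi-klm.csv"]]

# Build one vector per column: the i-th piece of every file name.
# String.(...) converts the SubStrings into plain Strings.
df = DataFrame(
    part1 = String.(getindex.(arr, 1)),
    part2 = String.(getindex.(arr, 2)),
    ext   = String.(getindex.(arr, 3)),
)
```

This assumes every file name splits into exactly three pieces; a name with a different shape would either throw a BoundsError or silently drop extra pieces.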

Sorry I can’t put more input/output up right now, I’m phone posting from my desk. I’ll try to replicate this truly amazing error later with dummy inputs… but yes, the zip method completely breaks Julia, at least on Windows…

The reason you get many columns rather than many rows is that the DataFrame constructor takes its input column by column (AFAIK for type-stability reasons). Using zip was an attempt to regroup your data into three columns. Another way to build the DataFrame is to do it incrementally. For that I recommend looking here:

and for size hints to the DataFrame, there:
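For what it's worth, here is a rough sketch of what the incremental approach could look like (my own illustration, not taken from the pages referred to above). The file names and the column names part1, part2 and ext are hypothetical.

```julia
using DataFrames

# Hypothetical file names following the "abc-xyz.csv" pattern
filenames = ["abc-xyz.csv", "ghi-klm.csv"]

# Start with an empty DataFrame whose column types are fixed up front
df = DataFrame(filename = String[], part1 = String[], part2 = String[], ext = String[])

for name in filenames
    parts = split(name, r"[-.]")   # split on either "-" or "."
    # push! appends one row; the SubStrings are converted to String
    push!(df, (filename = name, part1 = parts[1], part2 = parts[2], ext = parts[3]))
end
```

Declaring the column types first keeps the columns concretely typed, at the cost of growing the table one row at a time.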

Thanks, I’ll keep reading!

How about this

julia> using DataFrames, Random

julia> strings = ["$(randstring(3)).$(randstring(3))" for _ in 1:10]
10-element Array{String,1}:
 "ooJ.qvz"
 "jnk.pGB"
 "LcH.FID"
 "Yh0.ipI"
 "Cm0.aAd"
 "YA6.bNs"
 "kLF.dTe"
 "RTe.7cE"
 "dw4.j7S"
 "3vh.XC6"

julia> df = DataFrame([(a=a,b=b) for (a,b) in split.(strings, ".")])
10×2 DataFrame
│ Row │ a         │ b         │
│     │ SubStrin… │ SubStrin… │
├─────┼───────────┼───────────┤
│ 1   │ ooJ       │ qvz       │
│ 2   │ jnk       │ pGB       │
│ 3   │ LcH       │ FID       │
│ 4   │ Yh0       │ ipI       │
│ 5   │ Cm0       │ aAd       │
│ 6   │ YA6       │ bNs       │
│ 7   │ kLF       │ dTe       │
│ 8   │ RTe       │ 7cE       │
│ 9   │ dw4       │ j7S       │
│ 10  │ 3vh       │ XC6       │

This works because the DataFrame() constructor can take an array of named tuples, treating each tuple as one row.
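Adapting the same idea to the file name pattern from the original question might look like the sketch below. The file names and column names are placeholders, and it assumes every name splits into exactly three pieces.

```julia
using DataFrames

# Hypothetical file names following the "abc-xyz.csv" pattern
filenames = ["abc-xyz.csv", "ghi-klm.csv"]

# One named tuple per file name; each tuple becomes one row
rows = map(filenames) do name
    a, b, ext = split(name, r"[-.]")   # split on either "-" or "."
    (filename = name, a = String(a), b = String(b), ext = String(ext))
end

df = DataFrame(rows)
```

Keeping the original file name as its own column makes it easy to look up and load each file later.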

That’s the ticket, thanks!