Split string to multiple columns

FredC · July 30, 2019, 8:54pm

Pretty simple, but I’ve been banging my head against it for a while.

I have an array of some thousands of strings representing data files named like “abc-xyz.csv”. I can split on “-” and “.” easy enough, and get a vector of substring arrays.

At most I can coerce this into a vector of arrays of strings. I can’t even get this into a dataframe properly; it comes in as thousands of columns and 3 rows, and there does not appear to be a way to just invert a dataframe.

I’ve loaded the filename array into a dataframe and applied the split, but this just gives me a single column containing 3-element arrays, which I am again stuck with.

This is an issue because the information in the filename string describes the observation, and its parts are variables to be used in analysis. The goal is to populate these variables from the filenames, then load the files, each of which contains one observation, and append the data to the row containing the filename.

I’m also curious about the best way to do this, since I could see it getting into a lot of interesting concurrent IO questions. But for now I just want to learn how to not be abused by the type system, thanks.

kevbonham · July 30, 2019, 9:03pm

With your array of substrings, can you do

DataFrame(vcat(A...))

?

FredC · July 30, 2019, 9:19pm

This seems to create a dataframe with columns for “offset” and “ncodeunits” in addition to the source string.
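
A possible explanation, offered as an assumption rather than a confirmed diagnosis: `split` returns `SubString` views, and those two column names match the internal fields of `SubString`, so the constructor may have treated each substring as a row-like object. The sketch below only shows the field names and one way to get plain `String`s; the sample filename is the one from the first post.

```julia
# Splitting on several delimiters at once returns SubString views, not Strings.
parts = split("abc-xyz.csv", ['-', '.'])

# SubString is a struct with three fields; two of them match the unexpected columns.
substring_fields = fieldnames(eltype(parts))

# Converting to String copies the text and drops the view bookkeeping.
plain = String.(parts)
```

Converting early with `String.(...)` costs a copy per part, but it keeps later steps from depending on the original filename strings.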

rapus95 · July 30, 2019, 9:26pm

I’m not entirely sure what the requested layout of the dataframe is.
The vector of substring arrays looks like [["abc", "xyz"], ["ghi", "klm"], …]
And you want to have a DataFrame like this:

header: [First, Second]
row1: ["abc", "xyz"]
row2: ["ghi", "klm"]
…

Is that correct? More importantly, do you have some code or a minimal working example?

FredC · July 30, 2019, 9:32pm

Yep. Maybe I should be iterating over the rows and just doing it all at once, i.e. putting the elements of the split string into some union tuple with the other data from the file and making the dataframe out of those, somehow?

rapus95 · July 30, 2019, 9:34pm

As a first “just works” idea (probably with poor performance), you can try to put zip(arr...) into the DataFrame rather than just arr. That should somewhat flip it over.
(Here arr is the vector of substring arrays)
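
A sketch of the same "flip" without splatting, since `zip(arr...)` passes every inner array as a separate argument and that can get very expensive with thousands of them. It assumes each filename splits into exactly three parts; the sample names and the column names `prefix`, `suffix` and `ext` are hypothetical.

```julia
# Hypothetical sample input: one 3-element array of parts per filename.
arr = [split(name, ['-', '.']) for name in ["abc-xyz.csv", "ghi-klm.csv"]]

# Build one column per position instead of calling zip(arr...).
ncols = 3  # assumption: every filename has exactly three parts
columns = [[String(parts[i]) for parts in arr] for i in 1:ncols]

# A NamedTuple of equal-length vectors is a column table;
# with DataFrames loaded, DataFrame(table) should accept it.
table = (prefix = columns[1], suffix = columns[2], ext = columns[3])
```

This keeps the work in ordinary comprehensions, so the cost grows with the number of filenames rather than with the number of function arguments.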

FredC · July 30, 2019, 9:45pm

Sorry I can’t put more input/output up right now, I’m phone posting from my desk. I’ll try to replicate this truly amazing error later with dummy inputs… but yes, the zip method completely breaks Julia, at least on Windows…

rapus95 · July 30, 2019, 9:50pm

The reason you have multiple columns rather than rows in your dataframe is that the DataFrame constructor accepts its inputs column by column (AFAIK for type stability reasons). Using zip here was an attempt to merge your data into three columns. Another way to build the DataFrame would be to do it incrementally. For that I recommend looking here:

and for size hints to the DataFrame, there:
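
As a rough sketch of the incremental approach (not the linked material), one option is to create a DataFrame with empty typed columns and `push!` one row per filename. The sample filenames and the column names are assumptions made for illustration, and each name is assumed to split into exactly three parts.

```julia
using DataFrames

# Start with empty, typed columns (column names are hypothetical).
df = DataFrame(prefix = String[], suffix = String[], ext = String[])

for name in ["abc-xyz.csv", "ghi-klm.csv"]
    prefix, suffix, ext = split(name, ['-', '.'])
    # push! appends one row; the SubStrings are converted to String on insertion.
    push!(df, (prefix = prefix, suffix = suffix, ext = ext))
end
```

Because the columns are declared up front, every row is checked against the same element types as it is added.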

FredC · July 30, 2019, 9:51pm

Thanks, I’ll keep reading!

kevbonham · July 31, 2019, 8:13pm

How about this

julia> using DataFrames, Random

julia> strings = ["$(randstring(3)).$(randstring(3))" for _ in 1:10]
10-element Array{String,1}:
 "ooJ.qvz"
 "jnk.pGB"
 "LcH.FID"
 "Yh0.ipI"
 "Cm0.aAd"
 "YA6.bNs"
 "kLF.dTe"
 "RTe.7cE"
 "dw4.j7S"
 "3vh.XC6"

julia> df = DataFrame([(a=a,b=b) for (a,b) in split.(strings, ".")])
10×2 DataFrame
│ Row │ a         │ b         │
│     │ SubStrin… │ SubStrin… │
├─────┼───────────┼───────────┤
│ 1   │ ooJ       │ qvz       │
│ 2   │ jnk       │ pGB       │
│ 3   │ LcH       │ FID       │
│ 4   │ Yh0       │ ipI       │
│ 5   │ Cm0       │ aAd       │
│ 6   │ YA6       │ bNs       │
│ 7   │ kLF       │ dTe       │
│ 8   │ RTe       │ 7cE       │
│ 9   │ dw4       │ j7S       │
│ 10  │ 3vh       │ XC6       │

This works because the DataFrame() constructor can take an array of named tuples.
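
Adapting the same idea to filenames shaped like the ones in the original question might look like the sketch below. The sample filenames and the field names `prefix`, `suffix` and `ext` are hypothetical, and the destructuring assumes every filename splits into at least three parts.

```julia
using DataFrames

filenames = ["abc-xyz.csv", "ghi-klm.csv"]  # sample names in the thread's format

# One named tuple per filename; field names are hypothetical.
# Destructuring assumes at least three parts per filename.
rows = [(prefix = String(a), suffix = String(b), ext = String(c))
        for (a, b, c) in (split(name, ['-', '.']) for name in filenames)]

df = DataFrame(rows)  # columns are Vector{String} rather than SubString
```

Wrapping each part in `String` is optional; it trades a small copy for columns that do not hold views into the original filenames.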

FredC · August 1, 2019, 4:25pm

That’s the ticket, thanks!
