Converting to DataFrame

I am loading a large .dta file and then converting it to a DataFrame. However, because of the file’s size, the conversion takes a very long time (~20 minutes). I don’t need all of the columns in the file, so I am wondering whether anyone knows of a way to convert only a subset of them.

Specifically:

using DataFrames, StatFiles

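df_1 = load("data.dta");   # step 1: load the .dta file
df_2 = DataFrame(df_1);    # step 2: convert to a DataFrame (the slow part)
df_3 = df_2[!, :col1];     # step 3: keep only the column I need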

Step 2 takes a long time, so if there were a way to swap steps 2 and 3, namely to select the columns I want first and only then convert to a DataFrame, that would speed things up a lot. Any help would be greatly appreciated.

Thanks!

I think you can convert the .dta file to .csv first, read and manipulate the CSV (e.g. with CSV.jl), and then convert that to a DataFrame.
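A minimal sketch of what the CSV route could look like, assuming the data has already been exported to CSV (for example with Stata's `export delimited`). The tiny temporary file and the column names `col2` and `col3` are stand-ins invented for illustration; only `col1` comes from the question. CSV.jl's `select` keyword restricts the result to the listed columns:

```julia
using CSV, DataFrames

# Stand-in for the exported file: a tiny CSV written to a temporary path.
# In practice this would be the CSV exported from data.dta.
path = tempname() * ".csv"
write(path, "col1,col2,col3\n1,a,2.5\n2,b,3.5\n")

# Keep only the wanted column(s) while reading; the other columns are
# never turned into DataFrame columns.
df = CSV.read(path, DataFrame; select=[:col1])
```

`select` also accepts several names, e.g. `select=[:col1, :col2]`, so the subset is chosen at read time rather than after building the full DataFrame.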

I don’t think there is a way to load a subset of the columns.

I agree that you should save the data as .csv. If you have access to Stata, that’s very easy. Otherwise, both R and pandas can read .dta files. However, I’m not confident they are any better than Julia’s reader, since I think StatFiles uses the same underlying library as they do.

Thank you both. I ended up calling R from Julia via RCall. That seems to work much faster than loading the .dta file directly in Julia.
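For anyone reading later, here is a rough sketch of how the RCall route might look. The post above does not say which R reader was used, so the haven package and its `read_dta(..., col_select = ...)` argument are an assumption on my part, and the small temporary .dta file is a made-up stand-in so the snippet runs on its own (it needs R with haven installed):

```julia
using RCall, DataFrames

# Assumption: R's haven package is installed. The temporary file below is a
# stand-in; in practice `path` would point at the real data.dta.
R"""
library(haven)
path <- tempfile(fileext = ".dta")
write_dta(data.frame(col1 = c(1, 2), col2 = c("a", "b")), path)

# Read only the wanted column, then drop the tibble class for conversion.
df_r <- as.data.frame(read_dta(path, col_select = "col1"))
"""

# Copy the R data.frame into a Julia DataFrame.
df = rcopy(R"df_r")
```

Because the column selection happens inside R, only the selected columns are copied over to Julia. Columns carrying Stata value labels may need extra handling on the R side before the copy.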

Thanks!

It’s weird that it takes much longer in Julia than in R. @davidanthoff probably knows why.
