I converted Stata data to Julia, taking care to map each variable to the equivalent data type wherever possible:
Stata's byte type to Int8 in Julia, int to Int16, long to Int32, and so on. What I found was that a data set of about 600 megabytes in Stata uses about ten times as much memory in Julia. Even after all data were converted to the equivalent data types, the Julia data set used more than 6 gigabytes of memory (judging by the Windows Task Manager).
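For concreteness, the mapping I aimed for looks roughly like this (the Dict and its name are just my own illustration, not Stata's or any package's API):

```julia
# Illustrative only: Stata storage types and the Julia types I convert them to.
const STATA_TO_JULIA = Dict(
    "byte"   => Int8,     # 1-byte integer
    "int"    => Int16,    # 2-byte integer
    "long"   => Int32,    # 4-byte integer
    "float"  => Float32,  # 4-byte float
    "double" => Float64,  # 8-byte float
)
```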
My questions are:
Has anyone else experienced the same problem?
Is this a problem with my conversion, or one inherent in DataFrames + NullableArrays?
If it is the latter, is there a way to get around this memory problem?
NullableArrays have a bit of overhead, but I would not expect 10x. It's hard to say without an MWE, but my guess would be that you are seeing uncollected garbage left over from your transformations. The memory consumption of the whole process is not that informative; try Base.summarysize, which will show you where the problem is.
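For example, a quick per-column breakdown (assuming `df` is your DataFrame; on recent DataFrames versions you would index with `df[!, name]` instead):

```julia
# Sketch: see which columns account for the memory.
for name in names(df)
    println(name, " => ", Base.summarysize(df[name]), " bytes")
end
println("total => ", Base.summarysize(df), " bytes")
```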
In particular, how are you importing the data? That step can use a lot of memory depending on the method you use. Also, what's the Julia memory usage if you save the data set to a file using serialize and then deserialize it in a clean Julia process?
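Something along these lines (on Julia ≥ 0.7 serialize/deserialize live in the Serialization stdlib; in older versions they are in Base):

```julia
using Serialization  # not needed on Julia < 0.7, where these are in Base

# In the current session:
open("data.jls", "w") do io
    serialize(io, df)
end

# Then, from a clean Julia process:
df = open(deserialize, "data.jls")
Base.summarysize(df)
```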
Thank you for your replies. I am writing a Stata reader program myself. I found that the data set size depends partly on the number of string variables in it. For a data set with mostly numeric variables, the Julia DataFrame is about 1.8 times bigger than the Stata data set, while a data set that contains many string variables ends up about 3 times larger than in Stata. These sizes were calculated using Base.summarysize() as suggested above, with variables imported to the equivalent types as described above. I am wondering whether DataFrames/NullableArrays may be using memory more wastefully for strings than Stata does.
Do you know how Stata encodes strings in memory? If Stata uses pointers into a pool of strings, on the assumption that most strings are repeated many times and most strings require more memory than a pointer, that would cause Stata to use less memory than Julia. Julia doesn't attempt that optimization, since it can interfere with other objectives, like ensuring that strings are represented in memory in the most obvious way possible.
At least what I know about Stata is that it stores missing values using a sentinel, so it won't have the 1-byte-per-entry overhead of NullableArray. That should only give a 12.5% (for 64-bit numbers) or 25% (for 32-bit numbers) overhead for Julia, not the 80% you report. You could check that Base.summarysize() gives something consistent with what you would expect from element sizes, i.e. (64 + 8) bits = 9 bytes per element, so 9*N bytes for a NullableArray column of N 64-bit elements.
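A minimal check, assuming NullableArrays is installed:

```julia
using NullableArrays

N = 1_000_000
col = NullableArray(rand(N))  # Float64 data + Bool mask
Base.summarysize(col)         # expect roughly 9 * N bytes plus a small constant
(8 + 1) * N                   # 8 data bytes + 1 mask byte per element
```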
Strings are a more complex issue due to potential pooling, as John noted. But I would expect Stata to take even less space than that if it pooled strings (use CategoricalArray in Julia for a comparison). I suspect the 3x difference is due to Julia strings currently being stored quite inefficiently (they have some overhead due to the backing array; this should be fixed before 1.0).
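To see the effect of pooling, a comparison along these lines should do (assuming CategoricalArrays is installed; the example values are made up):

```julia
using CategoricalArrays

vals = rand(["apple", "banana", "cherry"], 1_000_000)  # heavily repeated strings
Base.summarysize(vals)                # one String object per element
Base.summarysize(categorical(vals))  # small pool of levels + integer codes
```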
BTW, are you aware of ReadStat.jl? I would avoid reinventing the wheel here.
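For reference, usage is roughly as follows (the exact return type of read_dta has changed across ReadStat.jl versions, so check its README; the file path is a placeholder):

```julia
using ReadStat

data = read_dta("mydata.dta")
```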
Stata reports the data set size in memory, which is very close to (but larger than) the actual size of the data set on disk. My comparison was between the Stata data set on disk and the converted Julia DataFrame with NullableArrays in memory. So Stata's pooling of string variables in memory is not an issue here.
I did more tests and found that the Julia version was 1.8 to 3 times the size of its Stata counterpart, depending on how many string variables there are in the data sets. These tests were all based on Base.summarysize(). So, as nalimilan says above, there seems to be some inefficiency in storing string values in Julia arrays.
I was aware of ReadStat.jl and its read_dta() function. Mine and ReadStat both generate almost identical data sets, except that my program (written entirely in Julia) is at least 10x faster at converting large data sets (>100MB) and handles missing values more accurately (this was a big issue for me, and the original reason I wrote my own program). They both produce similarly sized Julia DataFrame objects from the same Stata data file, as determined by Base.summarysize().
It’s not possible to resolve this question without specifying exactly how both Stata and Julia represent strings using bits. Which exact bits differ when the two represent the string "foo"? What is the encoding strategy both tools are using?
A Stata file writes data by rows (observations), in order, and each string variable is written as a null-terminated string in a fixed-length field (i.e., it occupies the same number of bytes in every observation). I cannot say anything about how they are read into memory, though. Like Julia, Stata version 14 allows UTF-8 characters.
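A minimal sketch of decoding one such field (the function name and signature are mine, not part of any Stata reader):

```julia
# Read one fixed-width, null-terminated string field.
function read_fixed_string(io::IO, width::Int)
    bytes = read(io, width)           # the field always occupies `width` bytes
    nul = findfirst(==(0x00), bytes)  # content ends at the first NUL, if any
    String(nul === nothing ? bytes : bytes[1:nul-1])
end
```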
I don’t think it makes sense to compare a file format and a memory format. That said, if the two are similar in Stata, I would start by comparing data sets with only integers, since that’s the simplest case and I wouldn’t expect any major differences there.
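Concretely, something like this (the file name is hypothetical; with recent DataFrames/Julia you would use plain Vectors with missing instead of NullableArrays):

```julia
using DataFrames, NullableArrays

# Integer-only DataFrame, mirroring the simplest case.
df = DataFrame(a = NullableArray(rand(Int8,  10^6)),
               b = NullableArray(rand(Int32, 10^6)))

Base.summarysize(df)    # in-memory size in Julia
filesize("mydata.dta")  # size of the corresponding Stata file on disk
```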