I’m relatively new to Julia and am trying to do some record linkage with it. As part of that, I need to load two large datasets (1.8 million rows and 800K rows, respectively) and loop through them to compare their strings. I started by trying to initialize an empty array that can store all the comparisons (i.e., a 1.8M × 800K array). When I attempt this, I get an OutOfMemoryError(). I have tried a few different sizes, and only after reducing the size considerably am I able to create the array (examples below). I’m wondering whether what I’m doing is just foolish or whether there is another type of array I should be using to accomplish my goal. Many thanks!
How much memory do you have?
What comparison do you need to do? I think the solution has to be context dependent.
Perhaps you could avoid creating such a large array at all. You might read the strings of your datasets from the source files one at a time (or in reasonably sized batches), make the comparison(s), and write the results to an output file incrementally, without keeping the whole datasets in memory.
File I/O is pretty efficient in Julia, so don’t be wary of reading and writing files frequently. In cases like this, it may be much faster than loading and handling big datasets in memory.
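To illustrate the batch approach suggested above, here is a minimal sketch. It assumes each dataset is a plain text file with one string per line; the function name `compare_streaming`, its `compare` and `keep` arguments, and the tab-separated output layout are all hypothetical choices for this example, not part of any library.

```julia
# Hypothetical helper: read `path_a` in batches, stream `path_b` once per batch,
# and append every pair whose score passes `keep` to `out_path`.
function compare_streaming(compare, keep, path_a, path_b, out_path; batch_size = 10_000)
    open(out_path, "w") do out
        batch = String[]
        open(path_a, "r") do io_a
            while !eof(io_a)
                # Fill one batch from the first file.
                empty!(batch)
                while !eof(io_a) && length(batch) < batch_size
                    push!(batch, readline(io_a))
                end
                # Stream the second file line by line against this batch.
                for b in eachline(path_b)
                    for a in batch
                        score = compare(a, b)
                        keep(score) && println(out, a, '\t', b, '\t', score)
                    end
                end
            end
        end
    end
    return out_path
end

# Tiny usage example with throwaway files and an exact-match comparison.
dir = mktempdir()
file_a = joinpath(dir, "a.txt"); write(file_a, "anna\nbob\ncarl\n")
file_b = joinpath(dir, "b.txt"); write(file_b, "bob\nanna\n")
out = compare_streaming((a, b) -> a == b, identity, file_a, file_b,
                        joinpath(dir, "matches.tsv"); batch_size = 2)
```

Only one batch of the first file is held in memory at a time, so memory use is bounded by `batch_size` rather than by the size of either dataset. The trade-off is that the second file is re-read once per batch, so larger batches mean fewer passes.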
I’m running on a machine with 24 GB of RAM, and the comparison I am trying to do is Levenshtein distance. I think I may just approach it with some blocking, i.e., only comparing records that share a key, to cut down the number of comparisons substantially.
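As a rough sketch of what blocking could look like here: the code below groups one dataset by a blocking key and computes Levenshtein distance only within matching blocks. The key used (lowercased first character), the `max_dist` threshold, and the names `blocking_key` and `blocked_matches` are assumptions made for illustration; a real record-linkage job would pick a key suited to its data. The `levenshtein` function is a plain textbook implementation written out so the example has no package dependencies.

```julia
# Textbook two-row dynamic-programming Levenshtein distance over characters.
function levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    prev = collect(0:n)          # distances from the empty prefix of `a`
    curr = similar(prev)
    for i in 1:m
        curr[1] = i
        for j in 1:n
            cost = s[i] == t[j] ? 0 : 1
            curr[j + 1] = min(prev[j + 1] + 1,   # deletion
                              curr[j] + 1,       # insertion
                              prev[j] + cost)    # substitution
        end
        prev, curr = curr, prev
    end
    return prev[n + 1]
end

# Assumed blocking key: lowercased first character (empty strings share one block).
blocking_key(s) = isempty(s) ? "" : lowercase(string(first(s)))

# Hypothetical helper: compare only strings that fall into the same block.
function blocked_matches(left, right; max_dist = 2, key = blocking_key)
    # Index the right-hand dataset by blocking key.
    blocks = Dict{String, Vector{String}}()
    for r in right
        push!(get!(() -> String[], blocks, key(r)), r)
    end
    matches = Tuple{String, String, Int}[]
    for l in left
        for r in get(blocks, key(l), String[])
            d = levenshtein(l, r)
            d <= max_dist && push!(matches, (l, r, d))
        end
    end
    return matches
end

left  = ["Jon Smith", "Mary Jones", "Alan Turing"]
right = ["John Smith", "Marie Jones", "Jon Smyth", "Alan Touring"]
matches = blocked_matches(left, right)
```

Only the matching pairs are stored, so the result grows with the number of hits rather than with the product of the two dataset sizes. Note that any blocking key can miss true matches: with this one, two records that differ in their first character are never compared.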
Old question of mine covering similar ground (but not well discoverable from the title):
Why do these two functions benchmark the same?
Your immediate problem of not having enough memory may be solved using memory mapping, but the real question is whether you need all those distances at the same time, and whether you can use problem-specific knowledge to avoid calculating them at all. Performance might be a problem, though. For us to be able to give advice in that direction, we’re going to need more information about what you’re doing with that data.
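For reference, a minimal sketch of the memory-mapping idea using Julia's `Mmap` standard library. The helper names `dense_bytes` and `mmap_distances`, the choice of `UInt8` storage (distances clamped to 255), and the toy distance function in the usage lines are assumptions for this example. It is worth doing the arithmetic first: a dense 1.8M × 800K matrix has 1.44 × 10^12 entries, so even at one byte per entry the backing file would be about 1.44 TB, and at eight bytes per entry about 11.5 TB.

```julia
using Mmap

# Bytes needed for a dense rows × cols matrix with element type T.
dense_bytes(rows, cols, ::Type{T}) where {T} = rows * cols * sizeof(T)

# Hypothetical helper: fill a disk-backed matrix with pairwise distances.
function mmap_distances(dist, left, right, path)
    io = open(path, "w+")
    try
        # The array lives in the file at `path`; the OS pages it in and out on demand.
        D = Mmap.mmap(io, Matrix{UInt8}, (length(left), length(right)))
        for j in eachindex(right), i in eachindex(left)
            # Clamp so each distance fits in one byte.
            D[i, j] = min(dist(left[i], right[j]), typemax(UInt8))
        end
        Mmap.sync!(D)   # flush changes to disk
        return D
    finally
        close(io)
    end
end

# Size of the full problem from the original question.
full_u8  = dense_bytes(1_800_000, 800_000, UInt8)   # one byte per entry
full_i64 = dense_bytes(1_800_000, 800_000, Int64)   # eight bytes per entry

# Toy usage with a stand-in distance (difference in string length).
path = joinpath(mktempdir(), "distances.bin")
D = mmap_distances((a, b) -> abs(length(a) - length(b)),
                   ["ab", "abcd"], ["a", "abc", "abcde"], path)
```

Memory mapping removes the RAM limit but not the disk and compute cost, which is why reducing the number of pairs is likely the more practical route at this scale.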