I have been using an R package that really has only one main function. It does a lot behind the scenes, and part of that work is prohibitively slow: on data of the size I need, it would never finish in any reasonable time. I want to use this as an opportunity to learn, and I’m hopeful the community is willing to guide me in my inexperience. My plan is as follows:
- Profile the existing code to ascertain exactly what is taking the most time (currently narrowed down to two functions, which I am now going through)
- Write the process identified above in plain language, ignoring the objects/structures that were used to implement it in R and focusing on what is actually being executed
- Find existing code written in Julia that is closest to what I need to carry out, based on the outcome of steps 1 and 2
- Modify the existing code in small steps, using tests to ensure I am getting the expected results, until the process described in step 2 is achieved. Use the same input data in R and in the rewrite to ensure the same output.
- Investigate whether the types initially used in Julia can be optimised for serial execution, and adjust accordingly
- Implement in parallel on CPU
- Any hope of executing on GPU?
For those of you who have done this (and more) before: is this an advisable way to proceed?
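To make the testing step concrete, here is a minimal sketch of the kind of check I have in mind, using Julia's `Test` standard library. The function `column_means` and the reference values are purely hypothetical stand-ins: the real routine would be whatever the profiling step identifies, and the reference output would be exported from R for the same input.

```julia
using Test

# Hypothetical stand-in for the routine being ported from R.
# The real function would replace this once the slow step is identified.
function column_means(x::AbstractMatrix{<:Real})
    n = size(x, 1)
    # sum over rows gives a 1×ncol matrix; vec flattens it to a vector
    return vec(sum(x; dims=1)) ./ n
end

# The same input that is fed to R, and the output R produced for it
# (e.g. from colMeans(x)). Hard-coded here for illustration; in practice
# both could be read from a file written by R.
x = [1.0 2.0; 3.0 4.0; 5.0 6.0]
r_reference = [3.0, 4.0]

@testset "matches R reference output" begin
    result = column_means(x)
    @test length(result) == length(r_reference)
    # Compare with a tolerance rather than ==, since floating-point
    # results may not match bit for bit across languages.
    @test isapprox(result, r_reference; rtol=1e-8)
end
```

The idea would be to rerun this test set after every small modification, so that any change that breaks agreement with the R output shows up immediately.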