Hello, and thanks for taking the time to read my question, even if you can’t help me with it.
For the past couple of months, I’ve been working on a project in which I must analyze code change patterns in open source repositories. The general idea is to mine all the repositories available in a given programming language and extract relevant insights from them. I’ve chosen Julia as my object of analysis, out of convenience (the General registry provides, I believe, a representative sample of open source projects in the language) and ideology (I like Julia).
In order to proceed with this project, I need to find a way to parse Julia source code into a code property graph representation (abstract syntax tree + function call graph + dependency graph) and then enrich it with developer social network data (who committed which changes to which modules at which timestamps under which commit message, who is the repository owner, etc).
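For the commit-metadata half of that plan, here is a minimal sketch, written in Python purely for illustration (the same logic ports easily to Julia). It reads `git log --numstat` output and counts how often each author touched each `.jl` file. The function names (`read_git_log`, `parse_git_log`, `author_file_edges`) and the control-character separators in the log format are my own choices, not part of any existing tool. Rename handling and author de-duplication are left out.

```python
import subprocess
from collections import defaultdict

# Assumed layout: \x01 marks the start of a commit header, \x1f separates
# its fields (hash, author name, ISO author date, subject line).
LOG_FORMAT = "%x01%H%x1f%an%x1f%aI%x1f%s"


def read_git_log(repo_path):
    """Run `git log --numstat` inside repo_path and return the raw text."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(text):
    """Turn the raw log text into a list of commit dictionaries."""
    commits = []
    for raw_line in text.split("\n"):
        line = raw_line.rstrip("\r")
        if line.startswith("\x01"):
            # Header line: split into the four fields from LOG_FORMAT.
            sha, author, date, subject = line[1:].split("\x1f", 3)
            commits.append({"sha": sha, "author": author, "date": date,
                            "subject": subject, "files": []})
        elif line.strip() and commits:
            # numstat line: "<added>\t<deleted>\t<path>".
            parts = line.split("\t", 2)
            if len(parts) != 3:
                continue
            added, deleted, path = parts
            commits[-1]["files"].append({
                "path": path,
                # git prints "-" instead of a count for binary files.
                "added": None if added == "-" else int(added),
                "deleted": None if deleted == "-" else int(deleted),
            })
    return commits


def author_file_edges(commits, suffix=".jl"):
    """Count commits per (author, file) pair, restricted to one suffix."""
    edges = defaultdict(int)
    for commit in commits:
        for changed in commit["files"]:
            if changed["path"].endswith(suffix):
                edges[(commit["author"], changed["path"])] += 1
    return dict(edges)
```

Calling `author_file_edges(parse_git_log(read_git_log("path/to/clone")))` on a local clone would give the weighted author-to-file edges, which could then be joined to the nodes of the code graph by file path.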
The approach I’ve chosen is to inspect the Julia source code, find and extract the facilities responsible for parsing, and adapt them to my needs. I may be wrong here, but so far I believe this can be done if I recompile (the Julia version of) flisp and use it to interpret the Scheme parsing scripts provided with the Julia source code.
First, do you think my approach is sound, or have I misunderstood how the parsing process in Julia works? Second, would you have any insights on how I can tinker with the flisp engine to execute the provided parsing scripts?
Thanks in advance for any help, even if not much. Also, English is my second language, and I have no formal education in computer science, so I apologize for any ideas put poorly herein.
You probably want to use the modern JuliaSyntax parser rather than the legacy flisp parser. From Julia 1.10 (current stable) onward, it is the parser used by default.
It should be easier to work with because it is written in Julia, so you can call it directly from Julia, and it provides several levels of representation. It is also much better documented than the flisp parser.
Tree-sitter is a parser generator that can parse many languages. There is a Julia grammar for it, though I’m not sure how complete or up-to-date it is.
I can think of two reasons to favor the new parser over Meta.parse, even though the latter now uses it internally.
The GreenNode format from JuliaSyntax will include all comments, and for the sort of correlation to Git commits @hsolerkalinovski is describing, having the comments available allows those to be correlated back to individual developers as well.
The other is that the GreenNode makes it very easy to reconstitute the raw string form of a given parse tree, which might be a useful operation. Exprs include line numbers, but that is a lower-fidelity way of mapping back to the source.
I checked the JuliaSyntax module after you pointed me towards it, and for the time being I will use it as my Julia parser backend. Thanks.
About research ethics, I was not aware that Julia had specific community guidelines on that (though quite minimal), so thanks for pointing me towards that as well. Before publishing any research results, I intend to query the community for extra steps such as anonymity requirements (for instance, whether disguising repository contributors through made-up names is enough), proper code contribution recognition, whether any empirical findings are sound, and so on.
One of the main concerns is doing experiments on people by making PRs etc. There was a big kerfuffle last year when some researchers made a bunch of malicious PRs to open source projects to see how hard it was to get them merged.
I knew of one experiment where researchers PRed a malicious component into the Linux OS. As I see it, that was not just unethical, it was actually a crime.
Yeah, so definitely don’t do anything like that. If you have any doubts, you can ask here, or if you don’t want to reveal publicly what you’re going to do, you can contact stewards@julialang.org and run it by them (us, I’m one of them).