Is a solution using Julia strictly necessary? If not, one option is to use jq to convert the JSONL file to a CSV file and then read the CSV file directly. In my experience, this is much faster for large files.
The first line (parsed) is:
{
  "id": "53a7258520f7420be8b514a9",
  "title": "Semantic Wikipedia.",
  "authors": [
    {
      "name": "Max Völkel",
      "id": "53f47915dabfaefedbbb728f"
    },
    {
      "name": "Markus Krötzsch",
      "id": "53f44a27dabfaedf435dbf2e"
    },
    {
      "name": "Denny Vrandecic",
      "id": "5433f551dabfaebba5832602"
    },
    {
      "name": "Heiko Haller",
      "id": "53f322dddabfae9a84460560"
    },
    {
      "name": "Rudi Studer",
      "id": "53f556b9dabfaea7cd1d5e32"
    }
  ],
  "venue": {
    "raw": "WWW",
    "id": "547ffa8cdabfaebedf84f229"
  },
  "year": 2006,
  "n_citation": 639,
  "page_start": "585",
  "page_end": "594",
  "lang": "en",
  "volume": "",
  "issue": "",
  "url": [
    "http://doi.acm.org/10.1145/1135777.1135863"
  ]
}
You can use jq to flatten the nested JSONL like this, emitting one row per entry in authors (here, several rows for the first JSON line):
head -n 1 aminer_papers_0.txt | jq -r '.id as $id | .title as $title | .year as $year |
.authors[] | [ $id, $title, .name, .id, $year] | @csv'
which yields
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Max Völkel","53f47915dabfaefedbbb728f",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Markus Krötzsch","53f44a27dabfaedf435dbf2e",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Denny Vrandecic","5433f551dabfaebba5832602",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Heiko Haller","53f322dddabfae9a84460560",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Rudi Studer","53f556b9dabfaea7cd1d5e32",2006
Of course, you can include any other fields you want in each row by binding more variables in the command above with the as $name syntax.
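As a rough sketch of that idea, the variant below also carries the venue name and the citation count into each row. The paths .venue.raw and .n_citation are taken from the parsed record above; the function name flatten_authors is only illustrative, and .authors[]? is my assumption about how to treat records without an authors array (they are skipped instead of raising an error).

```shell
# Hypothetical helper: reads JSONL on stdin and writes one CSV row
# per (paper, author) pair to stdout.
flatten_authors() {
  jq -r '.id as $id | .title as $title | .year as $year |
         .venue.raw as $venue | .n_citation as $ncit |
         .authors[]? | [$id, $title, .name, .id, $year, $venue, $ncit] | @csv'
}

# Usage: head -n 1 aminer_papers_0.txt | flatten_authors
```

Fields that are absent in a record (for example an author without an id) come out as empty CSV cells, which CSV.read should then treat as missing.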
Processing the complete file this way on my laptop took just 280.69 s, and reading the result with CSV.read took only 16 s (including compilation time):
Row │ 53a7258520f7420be8b514a9 Semantic Wikipedia. Max Völkel 53f47915dabfaefedbbb72 ⋯
│ String String? String? Union{Missing, String} ⋯
──────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 53a7258520f7420be8b514a9 Semantic Wikipedia. Markus Krötzsch 53f44a27dabfaedf435dbf ⋯
2 │ 53a7258520f7420be8b514a9 Semantic Wikipedia. Denny Vrandecic 5433f551dabfaebba58326
3 │ 53a7258520f7420be8b514a9 Semantic Wikipedia. Heiko Haller 53f322dddabfae9a844605
4 │ 53a7258520f7420be8b514a9 Semantic Wikipedia. Rudi Studer 53f556b9dabfaea7cd1d5e
5 │ 53a7280320f7420be8ba5e96 Parsing. Ralph Grishman missing ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋱
20843090 │ 53e99e61b7602d97027281bd Subject: On "the System of Crimi… WANG Bin 54408976dabfae7f9b33f6
20843091 │ 53e99e61b7602d97027281be Spectroscopic and Theoretical St… Peng Chen 54457221dabfae862da1cd
20843092 │ 53e99e61b7602d97027281be Spectroscopic and Theoretical St… Kiyoshi Fujisawa 53f42c20dabfaee0d9ae9b
20843093 │ 53e99e61b7602d97027281be Spectroscopic and Theoretical St… Edward I. Solomon 5487a490dabfae8a11fb3c ⋯
2 columns and 20843084 rows omitted
The header is missing here, so the first data row was read as the column names; this was just an example of what can be achieved.
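One way to get a proper header, sketched here under assumptions, is to write a header line to the output file first and then append the jq rows. The function name jsonl_to_csv, the column names, and the output file name in the usage line are all made up for illustration; the jq filter is the same as in the command earlier in this answer.

```shell
# Hypothetical helper: jsonl_to_csv INPUT.jsonl OUTPUT.csv
jsonl_to_csv() {
  # Header row first; these column names are illustrative, not from the dataset.
  printf '%s\n' '"paper_id","title","author_name","author_id","year"' > "$2"
  # Then append one CSV row per (paper, author) pair.
  jq -r '.id as $id | .title as $title | .year as $year |
         .authors[] | [$id, $title, .name, .id, $year] | @csv' "$1" >> "$2"
}

# Usage: jsonl_to_csv aminer_papers_0.txt aminer_authors.csv
```

CSV.read takes the first row as the header by default, so the resulting file should load with named columns without extra arguments.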