Processing JSON from a .txt file and converting to a DataFrame

Hi all -

I’m just looking for some insight into why the below code is returning a dataframe containing just the first line of my json file. If you’d like to try working with the file I’m working with, you can download the aminer_papers_0.zip from the Microsoft Open Academic Graph site, I’m using the first file in that group of files.

using JSON3, DataFrames, CSV
file_name = "path/aminer_papers_0.txt"
json_string = read(file_name, String)
js = JSON3.read(json_string)
df = DataFrame([js])

The resulting DataFrame has just one line, but the column titles are correct, as is the first line. To me the mystery is why the rest isn’t getting processed. I think I can rule out that read() is only reading the first JSON object, because I can index into the resulting object and see many JSON objects:

My first guess was maybe the newline \n was causing escape issues, and tried to use chomp to get rid of them, but couldn’t get it to work.

Anyway - any help would be greatly appreciated!

EDIT: To include a link to just the single file I’ve been trying to process: Dropbox - File Deleted - Simplify your life

I can’t help with this specific problem. But have you checked out JSONTables.jl? Could be of use.

Already asked - the JSON is nested. You have a link to the file here Processing JSON from a .txt file and converting to a DataFrame in Julia - Stack Overflow (it is 9GB, I do not have enough RAM to process it)

1 Like

It seems to be nested but not so terribly. If you want to train a classifier, you might be interested in GitHub - CTUAvastLab/JsonGrinder.jl: Towards more automatic processing of structured data

So I got a very useful answer over on SO: Processing JSON from a .txt file and converting to a DataFrame in Julia - Stack Overflow

It seems that the key is passing the jsonlines=true argument JSON3.read() command. My only challenge now is that the the conversion to DataFrame is taking a really long time (granted, it’s a very big JSON file) - I’m at a couple of hours now with it still running. Maybe that’s just the best I can hope for, but I am thinking maybe conversion from JSONTable will be faster?

JSONTables seems to complain after some long minutes about input data that cannot be considered a table:

using JSONTables, DataFrames

json_source = read("aminer_papers_0.txt", String);
jst = jsontable(json_source)
# df = DataFrame(jst)

ERROR: ArgumentError: input `JSON3.Object` must only have `JSON3.Array` values to be considered a table
Stacktrace:
 [1] jsontable(x::JSON3.Object{Base.CodeUnits{UInt8, String}, Vector{UInt64}})      
   @ JSONTables C:\Users\jrafa\.julia\packages\JSONTables\g5bSA\src\JSONTables.jl:26
 [2] jsontable(source::String)
   @ JSONTables C:\Users\jrafa\.julia\packages\JSONTables\g5bSA\src\JSONTables.jl:15
 [3] top-level scope
   @ REPL[4]:100: 

Is a solution using Julia strictly necessary? If not, an option is to use jq to convert the JSONL file to a CSV file and then read the CSV file directly. In my experience, this is much faster when using large files.

The first line (parsed) is:

{
  "id": "53a7258520f7420be8b514a9",
  "title": "Semantic Wikipedia.",
  "authors": [
    {
      "name": "Max Völkel",
      "id": "53f47915dabfaefedbbb728f"
    },
    {
      "name": "Markus Krötzsch",
      "id": "53f44a27dabfaedf435dbf2e"
    },
    {
      "name": "Denny Vrandecic",
      "id": "5433f551dabfaebba5832602"
    },
    {
      "name": "Heiko Haller",
      "id": "53f322dddabfae9a84460560"
    },
    {
      "name": "Rudi Studer",
      "id": "53f556b9dabfaea7cd1d5e32"
    }
  ],
  "venue": {
    "raw": "WWW",
    "id": "547ffa8cdabfaebedf84f229"
  },
  "year": 2006,
  "n_citation": 639,
  "page_start": "585",
  "page_end": "594",
  "lang": "en",
  "volume": "",
  "issue": "",
  "url": [
    "http://doi.acm.org/10.1145/1135777.1135863"
  ]
}

You can use jq to translate the nested JSONL like this (here, creating several rows for the first JSON line), including the nested values in authors as different rows:

head -n 1 aminer_papers_0.txt | jq -r '.id as $id | .title as $title | .year as $year |
       .authors[] | [ $id, $title, .name, .id, $year] | @csv'

which yields

"53a7258520f7420be8b514a9","Semantic Wikipedia.","Max Völkel","53f47915dabfaefedbbb728f",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Markus Krötzsch","53f44a27dabfaedf435dbf2e",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Denny Vrandecic","5433f551dabfaebba5832602",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Heiko Haller","53f322dddabfae9a84460560",2006
"53a7258520f7420be8b514a9","Semantic Wikipedia.","Rudi Studer","53f556b9dabfaea7cd1d5e32",2006

Of course, you can have the entries you want saved into each row by saving more variables in the command above by using $.

Processing the complete file in my laptop this way just took 280.69s, and then reading it using CSV.read took only 16s (including compilation time):

      Row │ 53a7258520f7420be8b514a9  Semantic Wikipedia.                Max Völkel         53f47915dabfaefedbbb72 ⋯
          │ String                    String?                            String?            Union{Missing, String} ⋯
──────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
        1 │ 53a7258520f7420be8b514a9  Semantic Wikipedia.                Markus Krötzsch    53f44a27dabfaedf435dbf ⋯
        2 │ 53a7258520f7420be8b514a9  Semantic Wikipedia.                Denny Vrandecic    5433f551dabfaebba58326
        3 │ 53a7258520f7420be8b514a9  Semantic Wikipedia.                Heiko Haller       53f322dddabfae9a844605
        4 │ 53a7258520f7420be8b514a9  Semantic Wikipedia.                Rudi Studer        53f556b9dabfaea7cd1d5e
        5 │ 53a7280320f7420be8ba5e96  Parsing.                           Ralph Grishman     missing                ⋯
    ⋮     │            ⋮                              ⋮                          ⋮                     ⋮           ⋱
 20843090 │ 53e99e61b7602d97027281bd  Subject: On "the System of Crimi…  WANG Bin           54408976dabfae7f9b33f6
 20843091 │ 53e99e61b7602d97027281be  Spectroscopic and Theoretical St…  Peng Chen          54457221dabfae862da1cd
 20843092 │ 53e99e61b7602d97027281be  Spectroscopic and Theoretical St…  Kiyoshi Fujisawa   53f42c20dabfaee0d9ae9b
 20843093 │ 53e99e61b7602d97027281be  Spectroscopic and Theoretical St…  Edward I. Solomon  5487a490dabfae8a11fb3c ⋯
                                                                                 2 columns and 20843084 rows omitted

I’m missing the header here, this was just an example of what can be achieved.

1 Like

This is really great - I never knew about jq. Thanks!

1 Like