Intent of `Base.dump` function

cce · May 17, 2019, 1:30pm

As someone making custom data structures, I’m curious: how is dump meant to be used? Are there specific formatting constraints implied? What would count as an expected extension of this function to a user-defined type, and what would be considered abuse? The actual description leaves a bit to be desired.

Show every part of the representation of a value. The depth of the output is truncated at maxdepth.

The example seems to imply some conventions we should follow?

  julia> dump(x)
  MyStruct
    x: Int64 1
    y: Tuple{Int64,Int64}
      1: Int64 2
      2: Int64 3

Specific questions:

a) It seems to be a hierarchical key/value format with 2-space indentation and a type header. Is this, or should this be, strictly followed?
b) Like a DataFrame, our data could be quite large, with many columns/rows. Besides depth, would adding width and/or height keyword arguments to constrain the dump size be unconscionable?
c) When truncating, how should a truncation be indicated? … works in many contexts for us.
d) Given the hierarchical key/value format, is it OK to repeat keys? Our structure is much like XML, so forcing unique keys would doubly nest our output and break it from its logical structure.
e) Must the dump output exactly match the innards of the object? Our application has quite a bit of bookkeeping that is distracting for someone who just wants to see their own data.
f) Is there some parser, or another way, to check that our dump output is something the community and its tools expect?

It seems that DataFrame adds its own “summary” line, which, by the way, doesn’t match the output of summary. In the example below, I’ve used … to truncate the “height” of the frame.

DataFrame  9 observations of 5 variables
  name: ["JEFFERY A", …]
  department: ["POLICE", …]
  position: ["SERGEANT", …]
  salary: Union{Missing, Int64}[101442, …]

Thank you for any advice/thoughts. I think dump fits the intent of what I wish to do, and I think something like dump could be very helpful in the REPL or in notebooks. However, it would need a way to prune the data further to be valuable.

jeff.bezanson · May 17, 2019, 4:49pm

The original intent of dump is to show the entire internal representation of an object: everything in it, like a hex dump of a file. As such it wasn’t meant to be overloaded for new types. dump is there exactly for cases where e.g. there is some metadata not printed by show. If dump also hides metadata, then what do you do when you want to see it?

Is it possible to have show show all the data? Aside from the compact and limit flags, in every case I’ve seen show prints everything.
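
To illustrate the distinction drawn here, a minimal sketch follows. The `Reading` type and its fields are hypothetical, invented only for this example; `show`, `dump`, and `sprint` are the standard Base functions. The custom `show` hides a metadata field, while the un-overloaded `dump` still reports it.

```julia
# Hypothetical type, used only for illustration.
struct Reading
    value::Float64
    source::String   # metadata that the custom `show` below leaves out
end

# A custom `show` that prints only the value.
Base.show(io::IO, r::Reading) = print(io, "Reading(", r.value, ")")

r = Reading(1.5, "sensor-7")

shown  = sprint(show, r)   # goes through the custom `show` method
dumped = sprint(dump, r)   # walks every field of the struct

println(shown)
print(dumped)
```

If `dump` were overloaded for `Reading` to hide `source` as well, there would be no built-in way left to see that field, which is the concern raised above.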

cce · May 17, 2019, 6:06pm

Jeff, thanks for your guidance.

Let me explain our concrete use case. Our internal structure is a tree (soon to be a graph). We currently use show to provide a compact tabular display, and in the vast majority of cases it works great. If we add one more level of nesting, we show a compact tuple or vector representation using `,` and `;` as delimiters within an individual cell. That works for limited hierarchical cases, but falls apart completely once the data is nested more than 3 levels deep.

    julia> show(chicago[@query employee.group(department)])
      │ department  employee{name,position,salary,rate}           │
    ──┼───────────────────────────────────────────────────────────┼
    1 │ FIRE        JAMES A, FIRE ENGINEER-EMT, 103350, missing; …│
    2 │ OEMC        LAKENYA A, CROSSING GUARD, missing, 17.68; DI…│
    3 │ POLICE      JEFFERY A, SERGEANT, 101442, missing; NANCY A…│

Alternatively, here’s the tree version of the same information (truncated differently, of course). You can see that it’s more like XML or s-expression style data, with repeated keys. We use # for cases where the data label is not provided.

    julia> dump(chicago[@query employee.group(department)])
    DataKnot
      #:
        department: "FIRE"
        employee:
          name: "JAMES A"
          position: "FIRE ENGINEER-EMT"
          salary: 103350
        employee:
          name: "DANIEL A"
          position: "FIRE FIGHTER-EMT"
          salary: 95484
        ⋮
      #:
        department: "OEMC"
        employee:
          name: "LAKENYA A"
          position: "CROSSING GUARD"
          rate: 17.68
        ⋮
      ⋮

Anyway, I was thinking we could override dump to provide this alternative visualization. Even with truncated data, it’s still far more verbose than one might want to see on a regular basis. That said, when you are debugging tree transformations, seeing the data structured as a tree, without the projection onto a tabular display, is quite important. So having a helpful, memorable name for this kind of logical-structure output matters, and dump fits that bill. For me, dump seems like an entirely appropriate verb for 95% of our users for this kind of output.

The downside is that dump is no longer useful for debugging the actual underlying structure. That said, I think anyone doing a deep dive into the underlying representation of this sort of data has other tools at their disposal. Perhaps we could add an option to dump that shows the native formatted result? Or is this an abuse too far?

jeff.bezanson · May 17, 2019, 6:22pm

You could try implementing the AbstractTrees.jl interface, which then provides a print_tree function. If that doesn’t quite do it, I would just name and write your own custom function for printing in this format. I don’t think it helps to overload dump.
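
As a rough sketch of the "write your own custom function" option, using only Base: the `Node` alias, the `node` helper, the function name `print_tree_sketch`, and the `maxitems` keyword are all assumptions made up for this example, not part of any existing package. A node is stored as a vector of label/value pairs so that labels may repeat, and `⋮` marks truncated siblings.

```julia
# Hypothetical representation: a node is a vector of label => value pairs,
# so labels may repeat (as in XML); any other value is treated as a leaf.
const Node = Vector{Pair{String,Any}}
node(pairs...) = Pair{String,Any}[pairs...]

function print_tree_sketch(io::IO, n::Node; indent::Int=0, maxitems::Int=typemax(Int))
    pad = " " ^ indent
    for (i, (label, value)) in enumerate(n)
        if i > maxitems
            println(io, pad, "⋮")   # mark the siblings that were cut off
            break
        end
        if value isa Node
            println(io, pad, label, ":")
            print_tree_sketch(io, value; indent=indent + 2, maxitems=maxitems)
        else
            println(io, pad, label, ": ", repr(value))
        end
    end
end

dept = node(
    "department" => "FIRE",
    "employee" => node("name" => "JAMES A", "salary" => 103350),
    "employee" => node("name" => "DANIEL A", "salary" => 95484),
)

# Show at most two entries per level; the second employee is replaced by ⋮.
out = sprint(io -> print_tree_sketch(io, dept; maxitems=2))
print(out)
```

Because this is a separately named function, `dump` keeps its original meaning of showing the raw internal representation.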

cce · May 17, 2019, 6:40pm

Jeff, thanks for your guidance. The thing is, dump is a very nice, short name, suitable for the data scientists who might be using what we’re building. What constitutes a dump of the data varies with the audience, yes? Anyway, this kind of display is needed in our tutorial, so it’s not an obscure feature. Finding a good short name without collisions is hard…

Anyway, what do you think of DataFrame’s approach to show? In a similar vein, we could do show(chicago, layout=:tree). The default layout we’d call :table.

Regarding AbstractTrees.jl, our internal structure is a column-oriented graph. I don’t think this library would be of much help to us, unfortunately. Thanks for pointing it out though.

jeff.bezanson · May 17, 2019, 6:55pm

I like it. If you’re ok with show(x, layout=:tree) I think that’s the best option so far.
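
One way the agreed-upon `layout` keyword might look, as a minimal sketch: the `Knot` type and its `rows` field are hypothetical stand-ins for the poster's type, and the two layouts are heavily simplified. Only the idea of a `layout` keyword with `:table` as the default comes from the thread.

```julia
# Hypothetical stand-in for the poster's type.
struct Knot
    rows::Vector{Pair{String,Int}}   # name => salary
end

# `show` with a `layout` keyword; :table is the default, as proposed above.
function Base.show(io::IO, k::Knot; layout::Symbol=:table)
    if layout === :table
        for (name, salary) in k.rows
            println(io, name, " │ ", salary)
        end
    elseif layout === :tree
        println(io, "Knot")
        for (name, salary) in k.rows
            println(io, "  #:")
            println(io, "    name: ", repr(name))
            println(io, "    salary: ", salary)
        end
    else
        throw(ArgumentError("unknown layout: $layout"))
    end
end

# Convenience method so that `show(k, layout=:tree)` prints to stdout.
Base.show(k::Knot; kwargs...) = show(stdout, k; kwargs...)

k = Knot(["JEFFERY A" => 101442, "JAMES A" => 103350])

table = sprint(show, k)                              # default layout
tree  = sprint(io -> show(io, k; layout=:tree))      # tree layout
print(table)
print(tree)
```

Callers that know nothing about the keyword, such as `sprint(show, k)`, still get the default table layout, and `dump(k)` remains available for the raw fields.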
