# How to reduce redundancy in a list inside a DataFrame

Hi all. I have two questions, one very basic and one not so basic.

I have a DataFrame that looks like this:

``````df = DataFrame([])
df.list = [[1,2,3,6,12,13,14],[20,21,23,24,50],[1,3,5,7,10,11,12,13]]
``````

And I need to generate a new column, that only stores the βpointsβ of each array. This means, only the first and last number in a sequence of consecutive numbers.
It would look like this:

``````df.new = [[1,3,12,14],[20,21,23,24],[10,13]] #it is important that they are made up of pairs!
``````

My solution, so far, looks something like this:

``````	df.list2 = [[],[],[]]

for r in eachrow(df) #loop the df
j=0 #this is for "walking" down the array
for i in r.list
j+=1
if j == 1 #append the fist number. This needs to go, maybe the fist numbers doesn't have a consecutive pair.
append!(r.list2, r[1][j])

elseif j != 1
if (r[1][j]-1) != r[1][j-1]  #check if the number is not consecutive to the previous one
append!(r.list2, r[1][j])

elseif j == length(r[1]) #This also should go. It works to append the last value, but it might be an isolated one!
append!(r.list2,r[1][j])

end
end
end
end

df

``````

That kinda works. With the first and last part of the loop, I can append the βpointsβ of each string, and with the middle I can identify some of the numbers that I need. Here is the output:

``````df.list2 = [[1,6,12,14],[20,23,50], [1,3,5,7,10,13]]
``````

The problem is, first, Iβm not sure how to create an empty column automatically, so I can append all my values (if I do something like ` df.new = missings(Array, nrow(df))`, that is what I usually do, I get an error, because I am trying to append values, as it were a list).

Secondly, I could use some help with the loop. I thought about using ` diff`, so I could save me some trouble, but I am not sure if it will improve the loop.
Right now I am still getting the βloneβ numbers (i e `1,3,5,7` in the last list), and I donβt really know how to get rid of them, and I should find a way to append the last number in the list (in the second list, I can append the ` 20, 23`, but I canβt append the ` 21, 24`, or the `3, 12` on the first one).

I think I can do it with a little more wresting with the for loop, and the different conditions, but any help is welcome.

Thanks a lot! cheers

Writing a loop is fine. Your logic is sufficiently complicated to merit a loop.

But you should put things in functions to organize your code and make it readable

``````julia> function get_consec(x)
t = [x[1]]
for i in 2:length(x)
# Consecutive with last one
if x[i] == x[i-1] + 1
# If its the last one, add it
if i == length(x)
push!(t, x[i])
# If its the end of the chain, push it
elseif x[i] != x[i+1] - 1
push!(t, x[i])
end
# Not consecutive, start a new chain
else
if i != length(x)
if x[i] == x[i+1] -1
push!(t, x[i])
end
end
end
end
return t
end;

julia> get_consec(df[1,1])
4-element Vector{Int64}:
1
3
12
14
``````

using DataFramesMeta.jl you can do this as

``````julia> using DataFramesMeta

julia> @rtransform df :list2 = get_consec(:list)
3Γ2 DataFrame
Row β list                          list2
β Arrayβ¦                        Arrayβ¦
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ
1 β [1, 2, 3, 6, 12, 13, 14]      [1, 3, 12, 14]
2 β [20, 21, 23, 24, 50]          [20, 21, 23, 24]
3 β [1, 3, 5, 7, 10, 11, 12, 13]  [1, 10, 13]
``````

I guess I didnβt get the logic fully right in the function. But the point stands, write a function and use `@rtransform`.

1 Like

The logic is:

``````julia> function identify_pairs(v::AbstractVector)
local start, stop
initialized = false
res = eltype(v)[]
for x in v
if initialized
if x == stop + 1
stop = x
else
if stop > start
push!(res, start, stop)
initialized = false
end
start = stop = x
initialized = true
end
else
start = stop = x
initialized = true
end
end
if stop >  start
push!(res, start, stop)
end
return res
end
identify_pairs (generic function with 1 method)

julia> transform(df, :list => ByRow(identify_pairs) => :new)
3Γ2 DataFrame
Row β list                          new
β Arrayβ¦                        Arrayβ¦
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββ
1 β [1, 2, 3, 6, 12, 13, 14]      [1, 3, 12, 14]
2 β [20, 21, 23, 24, 50]          [20, 21, 23, 24]
3 β [1, 3, 5, 7, 10, 11, 12, 13]  [10, 13]
``````
4 Likes

Thanks a lot! I should try to use functions more often, especially in this cases. I understand the logic, cheers!

Thanks!! As I said in the previous comment, I should tap more often on functions. I am marking this one as the solution. Thanks again!

1 Like

Fyi, this Matlab post provides a compact alternative, probably less efficient than Bogumilβs solution.

A possible Julia translation here into (~10% slower for your input data) :

``````function sortedendpoints(v)
d = diff(eltype(v)[0; diff(v).==1; 0])
return transpose([v[d.>0] v[d.<0]])[:]
end

transform(df, :list => ByRow(sortedendpoints) => :new)
``````
1 Like

This also works:

``````transform(df, :list => ByRow(f->filter(x->(x-1 β f) β» (x+1 β f),f)) => :new)
``````

(slower in run-time, faster in type-time)

3 Likes

@Dan, what a beautiful one-liner, which also shows the intention most clearly. You might consider adding it to this list.

1 Like

Thatβs really neat! Thanks! It still amazeme that something that in my (newbie) head was a really complex problem can be solved in many ways, and sometimes like this, in one line. Got lots to learn

Just that this one has quadratic time complexity, so if your arrays get longer you might run into problems quickly.

1 Like