Count number of substrings in dataframe column


#1

I have a dataset (read in from a CSV). I’m trying to count how many observations contain a substring (for simplicity, let’s call it “X”). In R, I can do this

sum(stringr::str_count(data$CITY, "X")) > 0

I’m having trouble figuring out what the analagous Julia code would be. I’ve tried

sum(contains.(data[:CITY], "X")) > 0

But the LHS gives me a Nullable array that can’t be compared with 0. I can force 0 to be Nullable (e.g. Nullable(0), but then I can’t use this comparison in an if statement e.g.

if sum(contains.(data[:CITY], "X")) > Nullable(0)
    # do some stuff
end

Any thoughts on the appropriate way to make this count and compare to another number?


#2

You appear to be using an obsolete version of DataFrames. I recommend updating DataFrames to the most recent tagged version.

Note that currently missing data is represented by missing and currently there is no contains(::Missing, ::String) method. The best thing to do is to define contains(::Missing, ::AbstractString) = missing for yourself.

Note also that your expression above will return missing if data[:CITY] contains any missing values. You can use Missings.skip to return an iterator which excludes missings.


#3

Thanks @ExpandingMan. I see that I’m using 0.10.1 for DataFrames in spite of running Pkg.update(). I guess that one of the dependents is requiring that version of DataFrames. Is it possible to force an upgrade to DataFrames? I’ve tried Pkg.pin("DataFrames", v"0.11.5") but get the following error

ERROR: GitError(Code:ENOTFOUND, Class:Reference, Revspec ‘06c352352fde5e90d738959e54d
43efb255bcc15’ not found.)


#4

You cannot pin a package to a version you don’t have yet. You can maybe use this trick to figure out what’s holding DataFrames from upgrading, see also this thread. Then if you don’t need them, you could remove the packages that are holding you back and proceed to update DataFrames. Gadfly or ExcelReaders are likely candidates of packages that have not yet become compatible with the new DataFrames, but maybe in your case it’s something else.

As per the problem per se, once you update to DataFrames 0.11, you can do either:

count(t -> contains(t, "X"), data[:CITY])

or, if you have missing data you want to skip in the CITY column:

count(t -> contains(t, "X"), skipmissing(data[:CITY]))


#5

Yeah, unfortunately the package manager is in really poor shape right now and problems like that are common. The good news is that it is going to be completely overhauled in 0.7. For now you might want to just use git to clone packages directly. I also recommend doing Pkg.rm on a bunch of packages you haven’t been using recently. Sometimes it seems that cleaning old things up a bit and reinstalling helps. Lastly, try a Pkg.update().


#6

git clone is a bit of a last resort, but I have to admit that I manually uninstalled a large chunk of my .julia/v0.6 folder to update DataFrames in a timely fashion. Totally worth it though, I think the new version is a major improvement.

Now it should be a bit easier though, I can only think of Gadfly and ExcelReaders as packages who haven’t updated yet and they both have open PRs.


#7

You can also work with DataFrames 0.10, just use readtable to read CSV files rather than CSV.jl with that version.


#8

Thanks everyone - upgrading DataFrames worked. Unfortunately - I ended up just uninstalling every package to do so. Looking forward to the overhaul of the package manager :wink:


#9

The Query.jl version of this looks as follows:

 df |> @filter(contains(_.city, "n")) |> @count()

At least once this is merged, this example exposed a bug in Query.jl that I just fixed :slight_smile: