Count number of substrings in dataframe column

jacobcvt12 · February 13, 2018, 3:26pm

I have a dataset (read in from a CSV). I’m trying to count how many observations contain a substring (for simplicity, let’s call it “X”). In R, I can do this

sum(stringr::str_count(data$CITY, "X")) > 0

I’m having trouble figuring out what the analagous Julia code would be. I’ve tried

sum(contains.(data[:CITY], "X")) > 0

But the LHS gives me a Nullable array that can’t be compared with 0. I can force 0 to be Nullable (e.g. Nullable(0), but then I can’t use this comparison in an if statement e.g.

if sum(contains.(data[:CITY], "X")) > Nullable(0)
    # do some stuff
end

Any thoughts on the appropriate way to make this count and compare to another number?

ExpandingMan · February 13, 2018, 3:38pm

You appear to be using an obsolete version of DataFrames. I recommend updating DataFrames to the most recent tagged version.

Note that currently missing data is represented by missing and currently there is no contains(::Missing, ::String) method. The best thing to do is to define contains(::Missing, ::AbstractString) = missing for yourself.

Note also that your expression above will return missing if data[:CITY] contains any missing values. You can use Missings.skip to return an iterator which excludes missings.

jacobcvt12 · February 13, 2018, 3:57pm

Thanks @ExpandingMan. I see that I’m using 0.10.1 for DataFrames in spite of running Pkg.update(). I guess that one of the dependents is requiring that version of DataFrames. Is it possible to force an upgrade to DataFrames? I’ve tried Pkg.pin("DataFrames", v"0.11.5") but get the following error

ERROR: GitError(Code:ENOTFOUND, Class:Reference, Revspec ‘06c352352fde5e90d738959e54d
43efb255bcc15’ not found.)

piever · February 13, 2018, 4:12pm

You cannot pin a package to a version you don’t have yet. You can maybe use this trick to figure out what’s holding DataFrames from upgrading, see also this thread. Then if you don’t need them, you could remove the packages that are holding you back and proceed to update DataFrames. Gadfly or ExcelReaders are likely candidates of packages that have not yet become compatible with the new DataFrames, but maybe in your case it’s something else.

As per the problem per se, once you update to DataFrames 0.11, you can do either:

count(t -> contains(t, "X"), data[:CITY])

or, if you have missing data you want to skip in the CITY column:

count(t -> contains(t, "X"), skipmissing(data[:CITY]))

ExpandingMan · February 13, 2018, 4:33pm

Yeah, unfortunately the package manager is in really poor shape right now and problems like that are common. The good news is that it is going to be completely overhauled in 0.7. For now you might want to just use git to clone packages directly. I also recommend doing Pkg.rm on a bunch of packages you haven’t been using recently. Sometimes it seems that cleaning old things up a bit and reinstalling helps. Lastly, try a Pkg.update().

piever · February 13, 2018, 5:56pm

git clone is a bit of a last resort, but I have to admit that I manually uninstalled a large chunk of my .julia/v0.6 folder to update DataFrames in a timely fashion. Totally worth it though, I think the new version is a major improvement.

Now it should be a bit easier though, I can only think of Gadfly and ExcelReaders as packages who haven’t updated yet and they both have open PRs.

nalimilan · February 13, 2018, 5:59pm

You can also work with DataFrames 0.10, just use readtable to read CSV files rather than CSV.jl with that version.

jacobcvt12 · February 13, 2018, 6:21pm

Thanks everyone - upgrading DataFrames worked. Unfortunately - I ended up just uninstalling every package to do so. Looking forward to the overhaul of the package manager

davidanthoff · February 13, 2018, 11:07pm

The Query.jl version of this looks as follows:

 df |> @filter(contains(_.city, "n")) |> @count()

At least once this is merged, this example exposed a bug in Query.jl that I just fixed

Topic		Replies	Views
Counting missing in a dataframe General Usage dataframes	6	2421	April 28, 2021
Counting in dataframes Data dataframes	7	1573	June 7, 2023
Read NULL as missing in dataframe General Usage dataframes , csv	1	971	June 30, 2021
On iterating through columns of a data frame and check if they contain String values General Usage type , dataframes , data_structures	4	2080	April 28, 2020
DataFrame can't get filter to get rid of rows where a column contains a blank " " New to Julia dataframes	8	1073	February 19, 2022

Count number of substrings in dataframe column

Related topics