[RFC] GenderInference.jl

kevbonham · March 25, 2019, 9:49am

I made a little package, GenderInference.jl, that’s similar to Gender in R and SexMachine in Python. Basically, you give it a first name, and it guesses the gender:

julia> using GenderInference

julia> gender("Kevin")
:male

You can also look at a specific year or range of years, and get raw counts and percentages.

julia> gendercount("stefan", 1976:2002)
(female = 42, male = 11217)

julia> percentfemale("jeff")
0.002147423092289253

I haven’t registered it yet, but I’d love thoughts on the API and especially my handling of when data isn’t available. I’ve only got data for 1880-2017, so what happens when you ask for dates outside that range? I’ve opted to use missing in most cases:

julia> gendercount("jane", 1880:1900)
(female = 7348, male = 0)

julia> gendercount("jane", 1879:1900)
(female = missing, male = missing)

If you ask for a name that doesn’t have any entries, but in a year that’s in range, gendercount() gives zeros but the percent{female/male}() functions give missing. Does that make sense/seem intuitive?

julia> gendercount("Viral")
(female = 0, male = 63)

julia> gendercount("Viral", 1980)
(female = 0, male = 0)

julia> percentmale("Viral")
1.0

julia> percentmale("Viral", 1980)
missing
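The behaviour shown above could be sketched roughly as follows. This is only an illustration, not the package's actual code: `TOY_COUNTS`, `TOY_YEARS`, `toy_gendercount` and `toy_percentmale` are hypothetical names, and the per-year numbers are made up so that they sum to the totals in the examples.

```julia
# Hypothetical in-memory table: name => (year => (female, male)).
# The per-year numbers are toy values, not real SSA counts.
const TOY_COUNTS = Dict(
    "viral" => Dict(2000 => (female = 0, male = 40), 2001 => (female = 0, male = 23)),
)
const TOY_YEARS = 1880:2017   # years the data is assumed to cover

function toy_gendercount(name, years = TOY_YEARS)
    # Any requested year outside the covered range makes the answer unknown.
    all(in(TOY_YEARS), years) || return (female = missing, male = missing)
    byyear = get(TOY_COUNTS, lowercase(name), Dict{Int,Any}())
    f = m = 0
    for y in years
        c = get(byyear, y, (female = 0, male = 0))   # in range but no entry: zero
        f += c.female
        m += c.male
    end
    return (female = f, male = m)
end

function toy_percentmale(name, years = TOY_YEARS)
    c = toy_gendercount(name, years)
    total = c.female + c.male
    # missing propagates through +, and a zero total has no meaningful ratio.
    (ismissing(total) || total == 0) && return missing
    return c.male / total
end
```

With this sketch, `toy_gendercount("Viral", 1980)` gives zeros while `toy_percentmale("Viral", 1980)` gives `missing`, because 0/0 is not a usable fraction.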

Since I’ve never studied data structures and algorithms, there’s probably a lot to be desired with respect to how I’m building, storing, and accessing the data, so any comments/suggestions there would also be most welcome. Thanks!

oxinabox · March 25, 2019, 9:58am

Cool!

I made a very similar package.

But it is probably based on worse data, it doesn’t have counts or percentages, just 5 categories, and it is old.
Also NameToGender is a near direct port of SexMachine, so it is GPL, which I do not like.

The best data I am aware of is discussed in:

I think it is important to allow it to be parameterized by country.

If you ask for a name that doesn’t have any entries, but in a year that’s in range, gendercount() gives zeros but the percent{female/male}() functions give missing. Does that make sense/seem intuitive?

Yes, that is what NameToGender.jl does.

kevbonham · March 25, 2019, 10:15am

Oof! Sorry for not seeing that. I did a bit of searching a couple of months ago and didn’t see it. Do you think it makes sense to combine somehow?

Neat. The R package mentions this dataset which is also just male/female without counts, and I was thinking of falling back to that list if no count data is available. And I’m already doing case-insensitive matching so that’s no big deal.

Totally agree. Gender also does this, though the package README only seems to list US sources of data. I see you linked a couple of other country sources in that issue - I suppose that’s a start.

I also don’t currently support letters other than A-Za-z. Eg.

julia> gendercount("jose")
(female = 4166, male = 560679)

julia> gendercount("josé")
(female = 0, male = 0)
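One possible way to make "josé" match "jose" is to normalize names before lookup, using the `Unicode` standard library. This is a sketch under the assumption that folding accents is acceptable for the data at hand; `normalize_name` is a hypothetical helper, not part of the package.

```julia
using Unicode

# Hypothetical helper: fold case and strip combining marks (accents) before lookup.
normalize_name(name) = Unicode.normalize(name, stripmark = true, casefold = true)

normalize_name("José") == "jose"   # true
```

Note that stripping marks also maps "ñ" to "n", which may merge names that are distinct in some languages.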

Tamas_Papp · March 25, 2019, 10:38am

This looks like a thin wrapper for accessing a particular database, ie the frequency count of various names (grouped by gender, year, and potentially country). I wonder if it would not be better to

cooperate on the database itself, sharing it between various languages,
dumping it in a common format (CSV comes to mind),
just using standard, existing tools for loading and lookup.

kevbonham · March 25, 2019, 10:59am

Essentially, yes. At the moment, there’s only one data source, and it’s already in CSV format (though separated by year rather than name).

A bunch of packages already use this data source - it’s the US social security administration. I’d be surprised if they want to do more than they’re already doing, but the data is in really good shape - I essentially parsed the csvs and made a dictionary.

At the moment, this doesn’t make sense because as I said it’s already in that form at the SSA. But I think that if I start combining a bunch of sources, then you’re right that it would probably be worth building a package that just pulls them all together and spits them out in a uniform way.

I didn’t actually benchmark, though I assume that filtering a dataframe on a given name is slower than dictionary lookup. I’m going to be using this to look at a dataset of a few hundred thousand names, so taking the time to construct the dictionary seemed worth the cost. I definitely could be wrong though.

Tamas_Papp · March 25, 2019, 11:06am

No argument there, but creating a fast lookup table (eg as a Dict) is a one-liner and I am not sure this calls for a package.

kevbonham · March 25, 2019, 12:40pm

Oh, I see. Maybe not, but this is more or less just a slightly more polished version of code I was working on anyway, and other languages have packages like this.

Is there a concern about having a package, or are you just thinking about where effort is best spent?

Tamas_Papp · March 25, 2019, 12:53pm

I would just curate the data somehow, and use the standard tools (eg CSV reader packages, DataFrame and similar for manipulation, the built-in Dict implementation). This way I would benefit from improvements of the latter, and keep things more flexible: most possible groupings and queries can be implemented in under 5 lines even with the Dict construction thrown in.
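As a rough illustration of how short the Dict construction can be, here is a sketch that uses only Base. The `name,sex,count` row layout is an assumption about the yearly files, and the rows below are made-up values rather than real data.

```julia
# Toy rows in an assumed "name,sex,count" layout; the counts are invented.
rows = ["Mary,F,10", "Mary,M,1", "John,M,12", "mary,F,5"]

# (lowercased name, sex) => total count across all rows
counts = Dict{Tuple{String,String},Int}()
for row in rows
    name, sex, n = split(row, ',')
    key = (lowercase(String(name)), String(sex))
    counts[key] = get(counts, key, 0) + parse(Int, n)
end

get(counts, ("mary", "F"), 0)   # 15
```

Reading the rows from files (for example with a CSV reader package) would replace the hard-coded `rows` vector.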

But that is just my opinion, if you prefer a package for this then that’s fine.

davidbp · March 25, 2019, 1:22pm

Maybe building a character model instead of a simple lookup would add more value to the project since it could generalise to misspelled names (or even unseen names).
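To make "character model" a bit more concrete, here is a minimal sketch of one possible reading: a naive Bayes classifier over character bigrams. The training list is a hand-picked toy set, the equal priors and add-one smoothing are assumptions, and all names (`ngrams`, `score`, `guess`) are hypothetical.

```julia
# Character n-grams of a name, padded so that prefixes and suffixes are distinct features.
function ngrams(name; n = 2)
    chars = collect("<" * lowercase(name) * ">")
    return [String(chars[i:i+n-1]) for i in 1:length(chars)-n+1]
end

# Toy training data, labelled by hand for illustration only.
train = [("maria", :female), ("anna", :female), ("julia", :female),
         ("mario", :male), ("marco", :male), ("pedro", :male)]

# Count how often each n-gram occurs under each label, and collect the vocabulary.
counts = Dict(:female => Dict{String,Int}(), :male => Dict{String,Int}())
vocab = Set{String}()
for (name, label) in train, g in ngrams(name)
    counts[label][g] = get(counts[label], g, 0) + 1
    push!(vocab, g)
end

# Naive Bayes log-score with add-one smoothing; equal class priors are assumed.
function score(name, label)
    total = sum(values(counts[label]))
    return sum(log((get(counts[label], g, 0) + 1) / (total + length(vocab)))
               for g in ngrams(name))
end

guess(name) = score(name, :female) > score(name, :male) ? :female : :male

guess("carla"), guess("carlo")   # (:female, :male), neither name is in the training list
```

Because the model scores pieces of a name rather than the whole string, it can produce a guess for unseen or misspelled names, at the cost of being wrong in ways a lookup table is not.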

Is there any NER package available in julia?
There are several problems with names that are made up of two words (at least in Spanish). “Jose Maria” is a male name but “Maria” is a female name.

oxinabox · March 25, 2019, 2:05pm

I think it makes sense to deprecate mine for yours.
Particularly if you add country and more up to date data.

This is useful.
Small packages are useful.
Low effort to maintain.
Extensible.

It is a nontrivial wrapper because it handles missings and percentiles.

I guess one could create a TabularDataSource that abstracts around this general notion, but that seems over-engineered when no other use case has shown up.

kevbonham · March 25, 2019, 6:15pm

Ok, I’m fine with this. Would you want to pull it into the same org or let it be free-standing?

kevbonham · March 25, 2019, 6:16pm

Considering I’m not really sure what you mean by character model, I think this is beyond my abilities :-). Happy to accept a PR though

oxinabox · March 25, 2019, 9:19pm

I think JuliaText is a good place for it, but there’s no rush, nor necessity.

Considering I’m not really sure what you mean by character model, I think this is beyond my abilities :-). Happy to accept a PR though

And this is why JuliaText is a good place for it.

kevbonham · March 26, 2019, 2:47pm

I’m happy to do this, and it probably makes sense to do before registering. Someone on slack suggested the name could be clearer - any thoughts on that?

giordano · March 26, 2019, 3:12pm

I tried to install the package, but after issuing using GenderInference I see thousands of lines like this:

Do you want to download the dataset from https://www.ssa.gov/oact/babynames/names.zip to "/home/mose/.julia/datadeps/US Census - names"?
[y/n]
Do you want to download the dataset from https://www.ssa.gov/oact/babynames/names.zip to "/home/mose/.julia/datadeps/US Census - names"?
[y/n]

and seems to be stuck in an infinite loop that I can’t stop unless I forcibly kill Julia from the outside (Ctrl + C doesn’t even work).

Is this the expected behaviour?

kevbonham · March 26, 2019, 3:21pm

Yikes! Definitely not. That’s a message from DataDeps.jl, but you should only get it once. I’ll take a look when I get home, but this was not happening on my system (and the tests pass on a fresh environment so…)

giordano · March 26, 2019, 3:28pm

I can reproduce the error also in an empty project to which I added only this package:

julia --project=/tmp
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _  |  |
  | | |_| | | | (_| |  |  Version 1.1.0 (2019-01-21)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(tmp) pkg> status
    Status `/tmp/Project.toml`
  [d768500e] GenderInference v0.0.0 #master (https://github.com/kescobo/GenderInference.jl)

DNF · March 26, 2019, 5:15pm

Shouldn’t percentmale be 100.0 here? Alternatively, the name could be fractionmale.

dellison · March 27, 2019, 3:36am

I can reproduce @giordano’s issue, but I have some suggestions for fixing it.

I took a quick look at the code, and to me it looks like your use of datadep"US Census - names" here means that julia will try to (down)load the whole dataset at package precompilation time to calculate the value of the const NAMES. You can fix this by putting __precompile__(false) at the very top of the module file (see docs), or by restructuring the code so that it doesn’t need to download the dataset to precompile. (Either way, it might also be a good idea to have the register(DataDep(...)) happen inside an __init__()).
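The restructuring option might look something like the sketch below, which defers loading until first use instead of computing a `const` at precompile time. `LazyNamesSketch`, `load_names` and `namecounts` are hypothetical names, and `load_names` is a stand-in for whatever actually reads the datadep; the DataDeps.jl calls themselves are left as comments.

```julia
module LazyNamesSketch

# Counts how often the (hypothetical) loader runs, to show it happens once.
const LOADS = Ref(0)

# Stand-in for parsing the files under datadep"US Census - names".
# In the real package this is the step that would trigger the download prompt.
function load_names()
    LOADS[] += 1
    return Dict("kevin" => 1)   # toy value
end

# Left unassigned at precompile time; filled in on first use at run time.
const NAMES = Ref{Dict{String,Int}}()

function namecounts()
    isassigned(NAMES) || (NAMES[] = load_names())
    return NAMES[]
end

function __init__()
    # With DataDeps.jl, the register(DataDep(...)) call would go here,
    # so that registration happens at load time rather than precompile time.
    return nothing
end

end # module

LazyNamesSketch.namecounts()["kevin"]   # first call loads the data, later calls reuse it
```

The lookup functions would then call `namecounts()` rather than referring to a precomputed global.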

If you don’t mind another kind of feedback (regarding the package’s name): even if it’s a mathematically appropriate term for statistical inference, it seems to me that providing code to “infer” a binary gender from surface features (i.e., a person’s name) could be seen as assuming some specific political ideas about gender, even if you don’t mean it that way.

bennedich · March 27, 2019, 7:27am

Nice work! I think it’s super useful to have this data readily available as a package in Julia. Some feedback below.

How about exposing the raw data through your package? I've done a project in the past where I used this data, but what I needed then was a list of the most common names per year, not to do lookups for specific names. I realize that this is not the point of your package, but it seems like it would be easy to expose, and could increase the usefulness of your package (perhaps it's already exposed but I couldn't find it in the README?)

For faster support of range queries, the data structure I'd choose for this is not a dictionary of year -> count, but a cumulative array of counts. Then, when you want to calculate the counts for a range like 1900 to 2000, the calculation would simply be:

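# name_counts[i] is read here as the cumulative count through year START_YEAR + i - 1
name_counts = counts["Kevin"]
# total for 1900 through 2000 (needs 1900 > START_YEAR so the second index is valid)
count = name_counts[2000 + 1 - START_YEAR] - name_counts[1900 - START_YEAR]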

For consecutive range queries, this reduces the complexity from linear time in range size to constant. It makes lookups for single years slightly slower however, since you’d need to do a subtraction (but still LOTS faster than using a dictionary). If you wish to avoid that, you could keep an array with raw counts per year as well.
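A small self-contained sketch of the cumulative-array idea, with made-up counts. Unlike the two-line snippet above, this variant stores a leading zero so that a range starting at `START_YEAR` needs no special case; `rangecount` is a hypothetical helper.

```julia
const START_YEAR = 1880

# Toy per-year counts for one name, starting at START_YEAR (invented numbers).
yearly = [5, 0, 3, 7, 2]          # 1880, 1881, 1882, 1883, 1884

# Leading zero: cumulative[i] is the total over all years before START_YEAR + i - 1.
cumulative = [0; cumsum(yearly)]

# Total over an inclusive range of years, in constant time.
function rangecount(cumulative, years::UnitRange)
    lo = first(years) - START_YEAR + 1
    hi = last(years) - START_YEAR + 2
    return cumulative[hi] - cumulative[lo]
end

rangecount(cumulative, 1881:1883)   # 0 + 3 + 7 = 10
```

If memory matters, the same array could be stored as `Int32.(cumulative)`, provided the totals fit in 32 bits.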

Also, if space is an issue, consider not using 64 bit ints for the counts. 32 bits should probably suffice.

I would be more clear about the fact that this is only valid for US names. By stating:

“A package to infer a person’s gender based on…”

in the opinion of a non-American, this plays on the stereotype that Americans are unaware that there are other countries out there than the US. Perhaps more appropriate would be something like:

“A package to infer an American’s gender based on…”

percentfemale("Kristofer", 2019)

Since you are using the names of other prolific Julia contributors, I’m wondering if this should be “Kristoffer”? (Unlike the US, this name is usually spelled with two "f"s in Sweden.)
