[RFC] GenderInference.jl

At one point I tried this and had problems with it. Don’t remember why, and I didn’t try that hard to fix it, but I can revisit. Thanks for the ideas!

A totally valid concern, and this goes beyond the package name. On the other hand, if a trans person selects a name that reflects their gender identity, presumably this tool would pick that up. Not sure how to deal with the binary nature though - do you have any suggestions?

Not exposed at the moment, you’re correct. You can do GenderInference.NAMES to get the dictionary for now. I agree it would be good to expose it more directly. I think based on earlier discussion I’m going to split out data curation from data queries, which might make this a bit more straightforward.

Interesting - the only downside being keeping track of which years have data as the data set expands. And keeping track of two different vectors (or I guess I could have a vector of tuples).

Solid idea. Should be easy :slight_smile:

What? Where? :laughing:

I guess the ambition is to support more countries, do you think it would be sufficient to be more explicit that it’s only a US dataset for now?

:grimacing: Thanks for pointing that out … Will fix.

I would hold off with the two different vectors until you determine that it’s really necessary. Subtracting two numbers adjacent in memory should be cheap enough for most applications, and I suspect that a low memory footprint would be preferred over saving a nanosecond per lookup. There are also cache benefits to having a single, more compact, vector, so it might not even be slower in practice.

As for the total size of the dataset – I’m seeing 97,310 unique names, so with one count per year over 140 years that’s 13,623,400 counts, which for 32 bit counts corresponds to ~52 MiB of data. That doesn’t sound terrible to me, but if you add support for more countries for example, it might add up.
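A quick way to sanity-check that estimate. This is only the arithmetic from the paragraph above; the variable names are mine and nothing here comes from the package itself.

```julia
# Back-of-the-envelope size of a dense names-by-years table of 32-bit counts.
n_names = 97_310                     # unique names reported above
n_years = 140                        # one count per year
bytes_per_count = sizeof(UInt32)     # 4 bytes for a 32-bit count

n_counts = n_names * n_years
total_mib = n_counts * bytes_per_count / 2^20

println("counts: ", n_counts, ", size: ", round(total_mib, digits=1), " MiB")
```

Swapping `UInt32` for a narrower type such as `UInt16` would halve the figure, but only if no single yearly count exceeds what that type can hold.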

One way to reduce the memory footprint would be to use a variable number of bits per name to store the counts. So, for the name William for example, you might need 22 bits to represent the maximum count, while for a name like Xzaviar, 7 bits may be enough. And then pack the counts as tightly as possible (bit by bit). I’m guessing that the majority of names are quite rare, so this would save a ton of space.
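A minimal sketch of that bit-packing idea, assuming the counts for one name are packed back to back into a `UInt64` buffer. The function names `bits_needed`, `pack_counts` and `unpack_count` are hypothetical, and this is a deliberately simple bit-at-a-time loop rather than a tuned implementation.

```julia
# Smallest number of bits that can represent `maxcount` (at least 1).
bits_needed(maxcount::Integer) = max(1, ndigits(maxcount, base=2))

# Pack each count into `nbits` bits of a UInt64 buffer, back to back.
function pack_counts(counts::Vector{<:Integer}, nbits::Int)
    buf = zeros(UInt64, cld(length(counts) * nbits, 64))
    for (i, c) in enumerate(counts)
        for b in 0:nbits-1
            if (c >> b) & 1 == 1
                pos = (i - 1) * nbits + b          # zero-based bit position
                buf[pos ÷ 64 + 1] |= UInt64(1) << (pos % 64)
            end
        end
    end
    return buf
end

# Read the i-th count back out of the packed buffer.
function unpack_count(buf::Vector{UInt64}, nbits::Int, i::Int)
    c = 0
    for b in 0:nbits-1
        pos = (i - 1) * nbits + b
        bit = (buf[pos ÷ 64 + 1] >> (pos % 64)) & 1
        c |= Int(bit) << b
    end
    return c
end

# Made-up yearly counts for a rare name.
rare = [5, 0, 100, 37]
nbits = bits_needed(maximum(rare))
packed = pack_counts(rare, nbits)
```

With a maximum count of 100 the width comes out at 7 bits, so the four counts fit in a single `UInt64` word instead of four 32-bit integers.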

You could also save only the range where each name has any data. So if the first year that Xzaviar appears is 2001, and the last is 2017, there’s no need to store a bunch of zeros for years 1880 - 2000. You could store the first year (2001), the count of years (17), the bit size (7), and then 17 * 7 bits of counts.
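Here is a rough sketch of the trimmed-range idea on its own, without the bit packing. The type `TrimmedCounts` and the functions `trim_counts` and `count_in_year` are hypothetical names, and the counts in the example are made up.

```julia
# Counts for one name, stored only between its first and last non-zero year.
struct TrimmedCounts
    first_year::Int
    counts::Vector{UInt32}
end

# Drop leading and trailing zero years from a dense per-year vector
# whose first element corresponds to `start_year`.
function trim_counts(full::Vector{<:Integer}, start_year::Int)
    lo = findfirst(!iszero, full)
    lo === nothing && return TrimmedCounts(start_year, UInt32[])
    hi = findlast(!iszero, full)
    return TrimmedCounts(start_year + lo - 1, UInt32.(full[lo:hi]))
end

# Years outside the stored range are treated as zero.
function count_in_year(t::TrimmedCounts, year::Int)
    i = year - t.first_year + 1
    return 1 <= i <= length(t.counts) ? Int(t.counts[i]) : 0
end

# Dense vector for 1880-2019 with made-up data only in 2001-2017.
full = zeros(Int, 140)
full[(2001 - 1880 + 1):(2017 - 1880 + 1)] .= 5
xz = trim_counts(full, 1880)
```

For this example `xz` keeps 17 counts instead of 140, and lookups for earlier or later years fall through to zero.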

Sounds like a good plan :slight_smile:

Well, I think it’s sensible just to provide a model over the census data as it is, that’s the package’s purpose. I didn’t look at that part of the code too closely but it looks like that’s what you’re doing already. :slight_smile: If the model is specific to the US census data, maybe it makes sense to name the package in a way that makes that obvious.

Yeah, I thought about that. Really don’t like fraction though. Makes me think of eg 3/4. I thought about pfemale or propfemale for “proportion”, but not really a fan…

That’s the way it is now (it’s social security rather than census), but the goal is to expand beyond that. Maybe BirthNameGenders.jl?

But in that case it should return 100.0 instead of 1.0, wouldn’t you say? If I get percentsomething() returning 1.0, I will for sure think it’s 1%.


Ah, right- social security data, not census. Thanks for the correction!

Or perhaps NameGenderDemographics.jl, or something sort of like that? I dunno, it’s hard to come up with something that’s both generic and obvious. In any case, I’m glad that you see my point and are giving it some thought. :slight_smile:

For sure! Probably the best thing to do is ask some gender queer people rather than trying to speculate about what they would find objectionable. I only know one trans person (to my knowledge). Maybe Twitter?

So after a little more thought, I’ve found myself settling on a stronger position than I had at first: basically, I think that it is categorically impossible for an ethically sound model like this to exist at all.

Here’s a brief article on this specific use case from an HCI researcher who I admire: https://ironholds.org/names-gender/ I find their arguments very convincing, so I think at this point I would urge those in the thread to consider this carefully, and excuse myself from the discussion.

Thanks for sharing! Definitely a useful perspective to keep in mind, though I don’t entirely agree.

One last thought:

From my point of view, it seems that when you did get the opinion of a person like you described, and who additionally is actually a researcher whose area of expertise is precisely what we are discussing, you quickly dismissed them and decided that you know better.

(edited, since I think I was a little combative originally… sorry about that)

Is there no room between fully agreeing and quickly dismissing?

Doesn’t need to be. I’d be happy to discuss over DM or email or another forum if you’d rather not continue the conversation here.

Not a quick dismissal, I take this perspective seriously. Also don’t think I know better, I’m just not entirely convinced that this is everywhere and always a bad idea, and would like to hear more perspectives. I’ve reached out to my trans friend (who also works with a lot of trans people).

Also, recall that we were discussing the name of the package, not whether the method itself was ethical.

I am pretty convinced that this is not a good way to measure diversity at a conference or in a company, which is that author’s main thrust it seems. I’m interested in using it to measure gender representation in publications (See here for an example), where it is not feasible to survey or otherwise determine the gender of hundreds of thousands of authors, so the work just wouldn’t be done without this or similar methods. We are also up front about many of the limitations of this method that are mentioned in the blog post.

I do think I will include a link to that post in the README, so users are at least made aware of that perspective.

It’s a fraught subject. Makes sense that one or both of us would get combative. No hard feelings :slightly_smiling_face:


Alright - I’m back to this. When I put register() inside __init__, (see here - init currently commented out), I get

ERROR: LoadError: LoadError: LoadError: KeyError: key "US Census - names" not found

And apparently I can’t do a global const declaration inside the __init__ function.

I’ve removed precompilation on that branch though - any chance you could test to see if the same infinite loop occurs?
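On the `const` problem: one common pattern is to declare the constant container at module level and only fill it inside `__init__`. A minimal sketch follows, where the module name `NameTableDemo`, the loader `load_names` and the values are all hypothetical stand-ins; in the real package the `register()` call would go where the loader is called.

```julia
module NameTableDemo

# Declared once at module level; the binding is constant, but the Dict is mutable.
const NAMES = Dict{String,Float64}()

# Hypothetical stand-in for reading the real data files (values are made up).
load_names() = Dict("Sam" => 0.5, "Sally" => 1.0)

# Runs when the module is loaded at runtime, not at precompile time.
function __init__()
    merge!(NAMES, load_names())
end

end # module
```

A `Ref` declared as `const` at module level and assigned inside `__init__` would work the same way if the table needs to be replaced wholesale rather than filled in place.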

Can I clarify something? I should still use a dictionary for the names to access these arrays, right?

Yes, I think that makes most sense, to allow for constant time lookups by name.
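Something along these lines, as a sketch only: a `Dict` keyed by name whose values are the per-year count vectors. The function name `name_count`, the 1880 start year as an argument, and the counts themselves are assumptions for illustration.

```julia
# Made-up per-year counts; the first element of each vector is the start year.
counts_by_name = Dict{String,Vector{Int}}(
    "Alex" => [3, 0, 7],
    "Robin" => [0, 2, 4],
)

# Hash lookup by name, then plain indexing by year; unknown names or years give 0.
function name_count(d::Dict{String,Vector{Int}}, name::String, year::Int, start_year::Int)
    v = get(d, name, nothing)
    v === nothing && return 0
    i = year - start_year + 1
    return checkbounds(Bool, v, i) ? v[i] : 0
end

alex_1882 = name_count(counts_by_name, "Alex", 1882, 1880)
```

Both steps are constant time on average, so the cost of a query does not grow with the number of names in the table.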

I’m reviving this thread based on comments from here. This revival will be focused on the ethics of the approach rather than the technical aspects. For those uninterested in this aspect of the discussion, please feel free to unwatch/unsubscribe (and also feel free to DM me if you don’t know how to do that).

I don’t think it comes down to different readings. Rather, I think the author is responding to situations (like conferences) where a survey would be just as easy, yield better/more accurate data, and not have all of the problems this approach has. I agree that the author would likely extend the argument to other use-cases, but doesn’t really address them, and as a consequence, I find it less persuasive.

For example, two of the three primary objections don’t really seem to apply for the use-case I’m primarily interested in, or at least the case wasn’t made. Reason (1) that it’s “inaccurate” isn’t applicable because (a) no measurement in science is perfect, and one can assess and account for error (b) the author erroneously talks about the databases as coming only from the top 100 names, but most that I’ve seen come from things like social security records that include all names (I think as long as they show up at least 2 times in a given year) (c) I can show how it compares (favorably) to other more laborious methods such as manually looking at social media profiles and looking at self-stated gender (d) the biases w/r/t things like Asian names being underrepresented or impossible to infer from, especially after romanization, are real and should be acknowledged, but even if we expect gender trends in science publishing to be radically different between those that are represented in the database and those that aren’t, pointing out disparities only in the communities that are represented still seems worthwhile.

The third objection, that it’s usually unnecessary, only addresses the conference situation, and no solution for my use case is offered. Yes, surveys of academics are possible, but far more costly and time consuming, and suffer from their own problems of bias. In any case, I would be unable to do this due to time and budget constraints. So one could argue that the work isn’t worth doing, given the other objections. Or one could argue that there’s a better way to do it, given time and budget constraints, but these arguments weren’t made.

The final objection, that it’s morally horrifying, I just find unconvincing. Quoting from the piece here:

The voids in these datasets don’t cover everyone evenly. Rather, they often fall straight down lines of race, culture and ethnicity.

Totally agree.

Accordingly, names that are largely unique to non-white groups are far more likely to be excluded from a top-N dataset than common names used by white people, for the simple reason that there are fewer non-white people.

This has a glimmer of truth but not for the reasons stated. As I said above, most of the software I’ve seen (and this package) use datasets that are far more inclusive. There are lots of non-white babies born in the US, and so lots of non-white names included in the datasets. Add to that, there are plenty of non-white babies named David or Sara.

That said, there are clearly biases. I’ve already mentioned a general problem of Asian names, especially when given romanized spelling, but there’s also the problem of non-ASCII characters being excluded or paved over in a way that obscures/changes the implied gender, plus the fact that the entire continents of Africa and South America tend not to publish such data sets.

All stipulated.

Congratulations: your methodology is racially biased.

So, this is true, but that’s different than saying it’s racist.

The result is, invariably, that you end up with a model that underrepresents people of colour, be they from European/North American contexts or elsewhere. Both are vital, non-excludable populations to consider in even the most half-hearted inclusion initiative.

I agree with all of this, but all models are wrong. Some are useful. This author seems to be arguing that, unless you can get a census-like count, there’s no value in assessing the gender make up of anything, ever. From earlier in the post:

there’s considerable variation in the data: ambiguous names that you can at best probabilistically tie to a binary gender. “Sam” could be of any gender or none.

Again, stipulated. This point is written as if it’s some kind of scandal. In some situations, a probabilistic model can provide valuable, if imperfect insight. At least, I think so. I’m open to being persuaded.

The issue of erasure of trans/non-binary people is the part I’m most conflicted about, though I find this person’s arguments really unconvincing, eg

Frankly, claiming that birth name maps immutably to gender in the first place is the kind of essentialist TERFy nonsense that has no place in inclusion efforts.

This feels like fighting a straw man - I would certainly never claim such a thing. I won’t go through them all, but many other points strike me the same way - as if my algorithm would say that Sam is 80% likely to be male, so whenever I meet someone named Sam I’ll treat them like a stereotype of a man and refuse any other information.

There’s no argument that addresses what to do if I have a dataset of 100 Sams, 100 Sallys, 100 Roberts and 100 Yishans. Should I say that, because the Sams and the Yishans are ambiguous, and that some of the Roberts and Sallys might be trans or non-binary people that I have absolutely no information?

But even getting past all of that, it could be possible that simply trying to study the inclusion of women, without addressing trans people, is too exclusionary. Does this mean that we can’t talk about the gender pay gap without also figuring out whether there’s a trans pay gap? Can we talk about how women take on more childcare responsibilities in general, and how this is being exacerbated by COVID with children being out of school, if we can’t also assess the impacts of childcare responsibilities on non-binary people?

I 100% believe that trans and non-binary people have very different experiences, are generally more marginalized, and deserve to be acknowledged and taken seriously. And the same for POC. If I were on a committee that oversees grant proposals, I would absolutely look to fund efforts trying to assess their contributions in science publishing, which could obviously not be done with this method. But none of that seems to indicate that trying to address the role of women (even if it mostly only applies to cis women) is bad.

I don’t mind at all :slightly_smiling_face:. You’re right - this was not the best way to refer to this person. In the heat of the moment, I was responding to someone that I felt was accusing me of the equivalent of “I have a friend who’s black, so I can’t be racist.” I don’t think that having a non-binary friend means that I can’t have bias, but I sought out that person’s opinion because they are non-binary and also spend their life and career in trans-activism. My response was intemperate, to be sure, but I don’t think the author of that piece necessarily had more credibility than my friend.

This is an excellent paper! I think I will need to re-read it a couple of times (currently on vacation, so didn’t do a deep dive), but I think that there’s a lot of stuff that’s relevant to my research and to this package.

I will note though that these authors took a similar approach for one of their papers, and are not arguing against the practice categorically. They offer a number of suggestions to make such research more aware of and cognizant of these biases to be sure - and I think this is super valuable.


Congratulations: your methodology is racially biased.

So, this is true, but that’s different than saying it’s racist…

A distinction without a difference. I guess all I can say at this point is that while I hope these exchanges have been valuable, I also hope that they haven’t inadvertently contributed to a little bit of precedent for the Julia forums being a place where the upsides of self-consciously trans-exclusionary and racially biased research methodologies can be discussed openly.

I hope that the Julia forums will remain a place where all kinds of research methodologies can be discussed openly, even if they are imperfect from a particular point of view.

Working with data always involves trade-offs and assumptions, some of which are prone to biases. Sometimes it is difficult to do any better though (until someone comes up with a better methodology, or data), in which case it is important to be aware of these biases.

Stifling discussion about them would have the opposite effect.