You brought this up in the thread that I started. Before I answer, I’ll admit that it’s entirely possible I’m thinking about this completely wrong, and have the wrong mental model of what you’re proposing. I’m relatively new to this sort of thing, so I’m definitely open to being educated. That said:
Wouldn’t `join` rely on the main data having samples as rows and features as columns to match the metadata? I do this sometimes, but it often results in DataFrames with tens of thousands of columns (and only a few hundred rows). One reason I’ve found this problematic is that I often need to do calculations on features within each sample.
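To make sure I’m picturing the same thing, here’s a toy version of that join (names invented, and I’m using Python/pandas purely for illustration; the shape is what matters):

```python
import pandas as pd

# Wide table: one row per sample, one column per microbial feature.
# In my real data this is a few hundred rows by tens of thousands of columns.
abundance = pd.DataFrame(
    {"sample": ["s1", "s2"], "microbe_a": [10, 0], "microbe_b": [5, 20]}
)

# Sample metadata keyed the same way.
metadata = pd.DataFrame({"sample": ["s1", "s2"], "age": [34, 51]})

# The join only lines up because samples are rows in *both* tables.
combined = abundance.merge(metadata, on="sample")
print(combined)
```

So as far as I can tell, the join approach commits me to the samples-as-rows, features-as-columns orientation.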
One common example: in my sample data, each microbe has a count, and I want to convert that to relative abundance (the count divided by the sum of counts across all species in that sample). This is a within-sample property. If samples are rows in a table alongside (what I’m calling) sample metadata, I first have to select the columns that are my microbial species (this is complicated now, but might be helped by having column labels), then convert the DataFrame to a matrix (since I need to do calculations on rows, and things like `sum` are not defined for rows of DataFrames), then do the calculation (and I gather that taking, e.g., the sum of a row vector is less efficient than of a column vector).
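Concretely, the select-then-matrix-then-rowwise dance I mean looks something like this (again a numpy/pandas sketch with invented names, just to show the steps):

```python
import numpy as np
import pandas as pd

# Samples as rows; metadata columns and species-count columns mixed together.
df = pd.DataFrame(
    {
        "sample": ["s1", "s2"],
        "age": [34, 51],            # sample metadata
        "microbe_a": [10.0, 0.0],   # feature counts
        "microbe_b": [30.0, 20.0],
    }
)

# Step 1: pick out the columns that are actually species counts.
species_cols = ["microbe_a", "microbe_b"]

# Step 2: drop down to a plain matrix so I can work row-wise.
counts = df[species_cols].to_numpy()

# Step 3: divide each count by its within-sample (row) total.
rel_abund = counts / counts.sum(axis=1, keepdims=True)
print(rel_abund)
```

Every row of `rel_abund` sums to 1, which is the within-sample property I’m after.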
Of course, I can just hold on to the original table, where samples remain as columns, but then we’re back to the same problem. Another option would be to do the calculations on each sample while the samples are still columns, before combining them into the samples-as-rows table, but I often need different calculations for different subsets of the samples depending on the patient data.