Processed predictions dataset?

jeremygygi · December 11, 2023, 2:14pm

Apologies if this was answered elsewhere, but I don’t see a preprocessed RDS object with the predictions dataset. Is there any reason for this?

All data are available in raw format (as TSV files), but the preprocessed datasets link contains only the training data. It would be great to have the prediction dataset available with the same preprocessing pipeline applied, so I don’t need to re-engineer it myself.

Pramod · December 21, 2023, 6:32pm

Thanks @jeremygygi, for making this suggestion. We did not initially provide a preprocessed RDS object for the prediction dataset. The preprocessing applied to the training dataset was primarily for demonstration purposes. However, recognizing its potential utility, we are now considering providing a preprocessed RDS object for the prediction dataset as well.

To give you an overview, here are the steps we plan to replicate from the training dataset preprocessing:

1. Identify and retain only the features that overlap between the training and prediction datasets.
2. Normalize the cell frequency and plasma cytokine assays using the baseline (day 0) median. Note that the plasma antibody assay dataset has already been normalized to baseline (reflected in the ‘MFI_normalized’ column), and that we did not normalize the TPM counts.
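
The two steps above can be sketched roughly as follows. This is only an illustration in Python, not the organizers' pipeline (their code is in R). The function names, the record layout (`subject`, `day`, one key per feature), and the analyte names are hypothetical, and dividing by the baseline median is an assumption, since the post only says the median is "used".

```python
from statistics import median


def shared_features(training_columns, prediction_columns):
    """Return the features present in both datasets, in training order."""
    prediction_set = set(prediction_columns)
    return [name for name in training_columns if name in prediction_set]


def normalize_by_baseline_median(records, features, baseline_day=0):
    """Scale each feature by its median over the baseline (day 0) records.

    Assumption: "normalization using the baseline median" means division
    by that median. A zero median would raise ZeroDivisionError here.
    """
    baseline = [r for r in records if r["day"] == baseline_day]
    medians = {f: median(r[f] for r in baseline) for f in features}

    normalized = []
    for r in records:
        row = {"subject": r["subject"], "day": r["day"]}
        for f in features:
            row[f] = r[f] / medians[f]
        normalized.append(row)
    return normalized


# Made-up example data: three subjects at baseline, one follow-up visit.
training_columns = ["IL6", "TNF", "CXCL10"]
prediction_columns = ["TNF", "IL6", "IFNG"]

records = [
    {"subject": "s1", "day": 0, "IL6": 2.0, "TNF": 10.0, "IFNG": 1.0},
    {"subject": "s2", "day": 0, "IL6": 4.0, "TNF": 30.0, "IFNG": 2.0},
    {"subject": "s3", "day": 0, "IL6": 6.0, "TNF": 20.0, "IFNG": 3.0},
    {"subject": "s1", "day": 14, "IL6": 8.0, "TNF": 40.0, "IFNG": 4.0},
]

features = shared_features(training_columns, prediction_columns)
normalized = normalize_by_baseline_median(records, features)
```

Features that appear in only one dataset (`CXCL10` and `IFNG` in this toy example) are dropped before normalization, so both datasets end up with the same columns.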

The relevant files can be accessed here. We have also made the code available on [RPubs] and [GitHub].

akonst3 · December 23, 2023, 1:42pm

@Pramod, to follow up on this question: will the data we are predicting (e.g. IgG PT 14 days post-booster) also be processed when it is analyzed on your end? Since we are building models on processed training data (not just the baseline, but the task values as well), I think it would make the most sense for the task data themselves to be processed too. Thanks.

Pramod · January 7, 2024, 3:16pm

@akonst3 Thank you for your follow-up question. Yes, you are correct. The data used for making predictions, for all six tasks (including IgG PT 14 days post-booster), will indeed be processed during analysis on our end.

I’d like to highlight that our processing methodology was designed to preserve the ranks for every task, so the ordering of values is the same in the raw and processed datasets. As a result, processing does not affect the Spearman rank correlations that we use for evaluation.
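
To illustrate why rank-preserving processing leaves the evaluation unchanged, here is a minimal Python sketch with made-up numbers. It assumes the processing is a strictly increasing transformation (division by a positive median is used here purely as an example, not as a description of the actual pipeline). The helper names are hypothetical.

```python
from statistics import mean, median


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up model predictions and raw task values for five subjects.
predicted = [0.2, 1.5, 0.9, 3.1, 2.4]
raw = [120.0, 800.0, 950.0, 2400.0, 1500.0]

# Example of a strictly increasing transformation: divide by a positive median.
processed = [v / median(raw) for v in raw]

rho_raw = spearman(predicted, raw)
rho_processed = spearman(predicted, processed)
```

Because the transformation does not change the ordering of the values, `rho_raw` and `rho_processed` are identical, whichever version of the task data a submission is scored against.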

Pramod · January 7, 2024, 3:18pm

@jeremygygi and @akonst3, the processed prediction data is now available on the website. Please let us know if you encounter any issues accessing or utilizing this data.

jeremygygi · January 8, 2024, 9:22pm

Hi @Pramod, I was able to load the processed prediction data successfully! One minor issue: I think you may have accidentally saved the Olink data twice (instead of the Ab titers, which appear to be missing). The matrix under $abtiter$processed_similar_to_training has exactly the same dimensions and column names as the one under $plasma_cytokine_concentrations$processed_similar_to_training. Can this be fixed? Thanks!
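
For anyone who wants to run the same kind of check, a rough sketch follows. It is written in Python for illustration only (the actual object is an R list), and the table layout, column names, and values are all made up.

```python
from itertools import combinations


def find_lookalike_tables(tables):
    """Return pairs of table names with identical column names and row counts."""
    suspects = []
    for (name_a, a), (name_b, b) in combinations(tables.items(), 2):
        same_columns = a["columns"] == b["columns"]
        same_row_count = len(a["rows"]) == len(b["rows"])
        if same_columns and same_row_count:
            suspects.append((name_a, name_b))
    return suspects


# Hypothetical stand-ins for the assay tables; names and values are made up.
tables = {
    "abtiter": {
        "columns": ["feature_1", "feature_2"],
        "rows": [[0.1, 0.2], [0.3, 0.4]],
    },
    "plasma_cytokine_concentrations": {
        "columns": ["feature_1", "feature_2"],
        "rows": [[0.1, 0.2], [0.3, 0.4]],
    },
    "pbmc_cell_frequency": {
        "columns": ["cell_type_1"],
        "rows": [[5.0], [6.0], [7.0]],
    },
}

suspects = find_lookalike_tables(tables)
```

Matching shapes and column names do not prove that two tables are duplicates, but two different assays sharing the same column names is a strong hint that one was saved in place of the other.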

Pramod · January 8, 2024, 10:06pm

Thanks, @jeremygygi, for spotting that. I will update the RDS object and CSV file and let you know.

Pramod · January 9, 2024, 12:16am

@jeremygygi I have updated the RDS object. Please check here and let me know if it looks good.