Apologies if this was answered elsewhere, but I don’t see a preprocessed RDS object with the predictions dataset. Is there any reason for this?
All data are available in raw format (as TSV files), but the preprocessed datasets link only contains the training data. It would be great to also have the prediction dataset available with the same preprocessing pipeline applied, so that I can feed it to my models without having to re-engineer the pipeline myself.
Thanks, @jeremygygi, for making this suggestion. We did not initially provide a preprocessed RDS object for the prediction dataset because the preprocessing applied to the training dataset was primarily for demonstration purposes. However, recognizing its potential utility, we are now considering providing a preprocessed RDS object for the prediction dataset as well.
To give you an overview, here are the steps we plan to replicate from the training dataset preprocessing:
Identify and retain only the features that overlap between the training and prediction datasets.
Normalize the cell frequency and plasma cytokine assays using the baseline (day 0) median. Note that the plasma antibody assay dataset has already been normalized to baseline (reflected in the ‘MFI_normalized’ column), and that we did not perform any normalization on TPM counts.
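For anyone who wants to try the two steps above locally in the meantime, here is a minimal sketch in plain Python. It is not the organizers' pipeline: the feature names and values are made up, and it assumes that "normalization using the baseline median" means dividing each feature by its median over day 0 samples, which may differ from what the published code does.

```python
from statistics import median


def shared_features(training_cols, prediction_cols):
    """Keep only features present in both datasets, in training order."""
    prediction_set = set(prediction_cols)
    return [c for c in training_cols if c in prediction_set]


def normalize_by_baseline_median(rows, features, baseline_day=0):
    """Divide each feature by its median over baseline (day 0) samples.

    rows: list of dicts with a 'day' key plus one key per feature.
    Assumption: normalization is a plain division by the baseline median.
    """
    medians = {
        f: median(r[f] for r in rows if r["day"] == baseline_day)
        for f in features
    }
    normalized = []
    for r in rows:
        out = {"day": r["day"]}
        for f in features:
            out[f] = r[f] / medians[f]
        normalized.append(out)
    return normalized


# Hypothetical feature names and values, for illustration only.
training_cols = ["Monocytes", "Bcells", "NK"]
prediction_cols = ["Bcells", "Monocytes", "Tregs"]
features = shared_features(training_cols, prediction_cols)

rows = [
    {"day": 0, "Monocytes": 10.0, "Bcells": 4.0},
    {"day": 0, "Monocytes": 20.0, "Bcells": 8.0},
    {"day": 0, "Monocytes": 30.0, "Bcells": 6.0},
    {"day": 14, "Monocytes": 40.0, "Bcells": 3.0},
]
normalized = normalize_by_baseline_median(rows, features)
```

A feature whose baseline median is zero would raise a division error here, so a real pipeline would need to decide how to handle that case.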
The relevant files can be accessed here. The code is also available on [Rpubs] and [GitHub].
@Pramod, to follow up on this question: will the data for which we make the predictions (e.g. IgG PT 14 days post-booster) also be processed when it is analyzed on your end? Since we are building the models using training data that is processed (not just the baseline, but the task values as well), I think it would make the most sense for the task data itself to be processed too. Thanks.
@akonst3 Thank you for your follow-up question. Yes, you are correct. The data used for making predictions, for all six tasks (including IgG PT 14 days post-booster), will indeed undergo processing during analysis on our end.
I’d like to highlight that our processing methodology was developed so that it preserves the ranks for every task between the raw and processed datasets. As a result, the processing does not affect the Spearman rank correlations that we use for evaluation.
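To illustrate why rank-preserving processing leaves the evaluation unchanged, here is a small self-contained sketch. It assumes the processing is a rescaling by a positive constant (the median), which is just one example of a rank-preserving transformation and not necessarily the actual pipeline. All values are invented.

```python
from statistics import median


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values: model predictions and observed task values.
predicted = [0.2, 1.5, 0.9, 3.1, 2.4]
raw_observed = [120.0, 480.0, 510.0, 900.0, 760.0]

# Assumed processing: divide by the median (a positive constant).
scale = median(raw_observed)
processed_observed = [v / scale for v in raw_observed]

rho_raw = spearman(predicted, raw_observed)
rho_processed = spearman(predicted, processed_observed)
```

Because the rescaling does not change the ordering of the observed values, both calls see identical ranks and `rho_raw` equals `rho_processed`.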
@jeremygygi and @akonst3, the processed prediction data is now available on the website. Please let us know if you encounter any issues accessing or utilizing this data.
Hi @Pramod, I was able to load the processed prediction data successfully! One minor issue: I think you may have accidentally saved the Olink data twice (instead of the Ab titers, which appear to be missing). The matrix under $abtiter$processed_similar_to_training has exactly the same dimensions and column names as $plasma_cytokine_concentrations$processed_similar_to_training. Can this be fixed? Thanks!
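In case it helps others double-check their download, the comparison described above can be sketched as follows. This is only an illustration in plain Python: the row counts, the column names, and the third assay name are hypothetical, and the summaries would have to be filled in from the loaded object.

```python
from itertools import combinations


def find_lookalike_tables(tables):
    """Return pairs of table names whose row count and column names match.

    tables: dict mapping name -> {"n_rows": int, "columns": list of str}
    """
    pairs = []
    for a, b in combinations(sorted(tables), 2):
        ta, tb = tables[a], tables[b]
        if ta["n_rows"] == tb["n_rows"] and ta["columns"] == tb["columns"]:
            pairs.append((a, b))
    return pairs


# Hypothetical summaries of three assay matrices.
tables = {
    "abtiter": {"n_rows": 21, "columns": ["P01579", "P05112"]},
    "plasma_cytokine_concentrations": {"n_rows": 21, "columns": ["P01579", "P05112"]},
    "pbmc_cell_frequency": {"n_rows": 21, "columns": ["Monocytes", "Bcells"]},
}

suspicious = find_lookalike_tables(tables)
```

Matching shapes and column names only flag a likely duplicate; comparing the values themselves would confirm it.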