Clarification about prediction target

I followed the provided code that creates the batch-effect-corrected data, and then computed the prediction target Y. However, there can be negative values for IgG, e.g., subject ID 10, specimen ID 77, timepoint 0, IgG-PT = -0.024370729.

When I then compute the fold change, it comes out negative.

Should I use the original data to compute the prediction target?

Hi @timtsang ,

Apologies for missing your post earlier. Batch correction methods, like ComBat, often introduce negative values due to centering and scaling, particularly for measurements close to zero. For fold-change tasks, I would use the original normalized data to avoid issues with negative values.
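To make that recommendation concrete, here is a minimal sketch of computing the fold-change target from the original (non-batch-corrected) data. The column names `subject_id`, `timepoint`, and `igg_pt` and the day-14 target are illustrative assumptions, not the actual challenge schema:

```python
import pandas as pd

def igg_fold_change(df: pd.DataFrame, day: int = 14) -> pd.Series:
    """Per-subject fold change (day `day` / day 0) for IgG-PT.

    Assumes `df` has hypothetical columns subject_id, timepoint,
    and igg_pt, holding the strictly positive values of the
    original (non-batch-corrected) data.
    """
    # One row per subject, one column per timepoint.
    wide = df.pivot(index="subject_id", columns="timepoint", values="igg_pt")
    baseline = wide[0]
    # Fold change is only meaningful for positive baselines; ComBat's
    # centering can push near-zero values below zero, which is why the
    # original data is the safer input here.
    return wide[day] / baseline.where(baseline > 0)
```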

I would also like to invite @all contestants to share insights or suggestions on how to better handle negative values in such cases.

Best,
Pramod

Can you clarify whether the test (challenge) data is normalized in the same way as the training data? I'm asking because this determines whether we should stick to the already-normalized data instead of doing our own normalization…

Hi @singha53 ,

Thanks for your query!

While data processing and normalization can vary based on individual preferences, the CMI-PB team has implemented a standardized data processing method inspired by the approach used in the 2nd CMI-PB challenge. The codebase is also available on GitHub at [GitHub - CMI-PB/cmi-pb-3rd-public-challenge-data-prep: 3rd (public) challenge processed data preparation].

We have provided both harmonized and batch-corrected data here, in their respective folders. You can use the batch-corrected data directly for building models and making predictions, since both the training and challenge data are processed in the same way. If you choose to develop your own batch-correction pipeline, you can instead start from the harmonized data or the raw data that was not batch-corrected.
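As an illustration of the two options, here is a hedged sketch; the file paths are hypothetical placeholders, and the actual folder layout is documented in the cmi-pb-3rd-public-challenge-data-prep repository:

```python
import pandas as pd

# Option 1: use the pre-computed batch-corrected data for both training
# and challenge sets (both went through the same pipeline).
train = pd.read_csv("batch_corrected/training_data.tsv", sep="\t")       # hypothetical path
challenge = pd.read_csv("batch_corrected/challenge_data.tsv", sep="\t")  # hypothetical path

# Option 2: start from the harmonized (not batch-corrected) data and
# apply your own batch-correction pipeline to both sets consistently.
harmonized = pd.read_csv("harmonized/training_data.tsv", sep="\t")       # hypothetical path
```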

I hope this clarifies!

Best,
Pramod

Hi Pramod, apologies, I meant: is the "hidden" challenge data normalized in the same way as the provided training and challenge data, since this might affect the validation results?

Hi @singha53 ,

The “hidden” challenge data has not undergone any normalization. We will evaluate contestants’ submissions using the raw challenge data. All data preprocessing, including harmonization and normalization, was applied only to the baseline challenge data. Since we are using Spearman rank correlation for evaluation, we have verified that the rank order remains consistent between the raw and normalized data across all tasks. I hope this helps.
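To illustrate the last point: Spearman correlation depends only on ranks, so any strictly monotone transform of the values (such as a log-based normalization) leaves the score unchanged. A quick self-contained check with scipy, outside the challenge codebase:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
raw = rng.lognormal(size=50)     # stand-in for raw challenge measurements
normalized = np.log(raw) - 3.0   # a strictly monotone transform of `raw`
preds = rng.normal(size=50)      # stand-in for a contestant's predictions

rho_raw, _ = spearmanr(preds, raw)
rho_norm, _ = spearmanr(preds, normalized)
# Ranks are identical, so the Spearman scores match exactly.
assert np.isclose(rho_raw, rho_norm)
```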

Let me know if you have any more questions. Thank you for bringing up this important question.

Best,
Pramod
