What are the best approaches to handling data noise, feature filtering, and batch effects between the 2020 and 2021 datasets?

Participants need to merge two longitudinal datasets (2020 and 2021) to make predictions on the 2022 baseline dataset. I am creating this thread to invite discussion about the best approaches to merging the two datasets.

We have started a GitHub codebase to look closely into these data-merging issues here.

I suggest diagnosing batch effects per omic, both within and across the 2020 and 2021 datasets. BatchQC and proBatch can be used for this (some experimentation may be needed for each omic). Once the batch effects are quantified, ComBat or another method can be used to adjust for them, depending on the omic and the strength of the effect. Before running batch-effect diagnostics, special consideration should be given to cases where machinery or other aspects of data collection have changed.
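To make the "quantify first, then adjust" idea concrete, here is a rough Python sketch using only NumPy. It is not BatchQC, proBatch, or ComBat: the diagnostic is a plain per-feature eta-squared (the share of variance explained by the year label), and the adjustment is a naive per-batch mean centering standing in for a real correction method. The function names `batch_variance_fraction` and `center_per_batch` and the synthetic data are all assumptions made up for illustration.

```python
import numpy as np


def batch_variance_fraction(X, batch):
    """Per-feature fraction of variance explained by batch (eta-squared).

    X: (n_samples, n_features) array; batch: (n_samples,) array of labels.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum(axis=0)
    ss_between = np.zeros(X.shape[1])
    for b in np.unique(batch):
        rows = X[batch == b]
        ss_between += rows.shape[0] * (rows.mean(axis=0) - grand_mean) ** 2
    # Constant features have zero total variance; report 0 for them.
    return np.divide(ss_between, ss_total,
                     out=np.zeros_like(ss_between), where=ss_total > 0)


def center_per_batch(X, batch):
    """Naive location-only adjustment (NOT ComBat): shift each batch so its
    per-feature mean equals the overall per-feature mean."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    grand_mean = X.mean(axis=0)
    adjusted = X.copy()
    for b in np.unique(batch):
        mask = batch == b
        adjusted[mask] += grand_mean - X[mask].mean(axis=0)
    return adjusted


# Synthetic stand-in for one omic: 40 samples per year, 5 features,
# with an artificial shift added to the first two features in 2021.
rng = np.random.default_rng(0)
n_per_year, n_features = 40, 5
x2020 = rng.normal(0.0, 1.0, size=(n_per_year, n_features))
x2021 = rng.normal(0.0, 1.0, size=(n_per_year, n_features))
x2021[:, :2] += 3.0
X = np.vstack([x2020, x2021])
year = np.array([2020] * n_per_year + [2021] * n_per_year)

before = batch_variance_fraction(X, year)
after = batch_variance_fraction(center_per_batch(X, year), year)
print("eta-squared before:", np.round(before, 3))
print("eta-squared after: ", np.round(after, 3))
```

Features with a high eta-squared before adjustment are the ones worth a closer look. Note that mean centering also removes any genuine biological difference between years that is confounded with batch, which is one reason a method such as ComBat, with covariates, may be preferable in practice.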
