Questions about the master_processed_training_data RDS object

Hello
The Ab titer, cytokine, and cell frequency assays in the RDS object all have normalized and batch-corrected matrices, but the gene expression assay only has a batch-corrected matrix. Why is that? Was there no normalization done before batch correction?

For the other assays, which matrix should we use, the batch-corrected or the normalized one?
What are the colnames in each matrix? Do they correspond to specimen ID or subject ID?

Thank you

Thanks, @rtippalagama, for raising these queries.

  1. That’s correct. We initially normalized the Ab titer, cytokine, and cell frequency assays on a year-wise basis using baseline median normalization, and afterward applied a batch effect correction pipeline. For the gene expression assay, we followed the standard batch effect correction pipeline without any prior normalization. That’s why you see both normalized and batch-corrected files for the Ab titer, cytokine, and cell frequency assays, but only batch-corrected files for the gene expression dataset (a minimal sketch of the per-year normalization follows this list).

  2. We leave the choice of the most suitable matrix to the contestants. Some models/packages come with built-in functionality for batch effect correction (e.g., DESeq), which may be better suited for specific tasks (see the sketch after this list).

  3. Both the batch-corrected and the normalized matrices have feature names as row names and specimen_id as column names (a quick check is sketched below).
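
To make step 1 concrete, here is a minimal sketch of baseline median normalization applied separately per year. The column names (`specimen_id`, `dataset`, `planned_day`) and the baseline definition (`planned_day == 0`) are illustrative assumptions, not necessarily the names used in our pipeline:

```r
# Baseline median normalization, per year: divide each feature by the
# median of that feature across the year's baseline (day 0) specimens.
normalize_by_baseline_median <- function(mat, meta) {
  out <- mat
  for (yr in unique(meta$dataset)) {
    ids      <- meta$specimen_id[meta$dataset == yr]
    baseline <- meta$specimen_id[meta$dataset == yr & meta$planned_day == 0]
    med <- apply(mat[, colnames(mat) %in% baseline, drop = FALSE], 1,
                 median, na.rm = TRUE)
    cols <- colnames(mat) %in% ids
    out[, cols] <- mat[, cols] / med  # med recycles down the rows (per feature)
  }
  out
}
```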
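For step 2, an illustration of built-in batch handling, shown here with DESeq2. `counts` and `coldata` are placeholders, and `dataset` and `condition` are assumed column names (the year factor and an outcome variable, respectively):

```r
library(DESeq2)

# Including the year as a covariate lets the model absorb the batch effect,
# so raw counts can be used directly instead of pre-corrected values.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ dataset + condition)
dds <- DESeq(dds)
```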
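And for step 3, a quick way to confirm the orientation (the file name and sublist names are taken from this thread; adjust to your local copy):

```r
master <- readRDS("master_processed_training_data.RDS")
ge <- master$pbmc_gene_expression$batchCorrected_data
head(rownames(ge))  # feature (gene) names
head(colnames(ge))  # specimen_id values
```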

I hope this provides clarity. Please don’t hesitate to reach out if you have any further questions.


@Pramod, just to clarify, does the ‘batch’ in batch correction refer to the year of the study (i.e., the two studies are corrected for having been performed in different years)? So normalization is per year, and batch correction is across years?

Hi @akonst3 ,

That’s right! The “batch” refers to the year of the study, i.e., “2020_dataset” and “2021_dataset” (a total of two studies within the training dataset). We first normalized per year (dataset) and then performed batch effect correction across years.
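
Purely as an illustration of the across-year step (using ComBat from the sva package, which is not necessarily the tool in our pipeline; see the linked code for the actual implementation), where `norm_mat` is the per-year-normalized features × specimens matrix and `year` labels each column as “2020_dataset” or “2021_dataset”:

```r
library(sva)

# Remove the year-to-year batch effect from the already-normalized matrix.
corrected <- ComBat(dat = norm_mat, batch = year)
```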


@pramod, sorry, one follow-up question in this regard. In the index at /downloads/cmipb_challenge_datasets/current/2nd_challenge/processed_datasets/, I see there are both ‘harmonized_training_data’ and ‘processed_training_data’ RDS files. Can you please explain the difference between them?

Hi @akonst3,

Sure. After downloading the raw data files from the CMI-PB website, we first performed data harmonization: excluding features with low variance and identifying the features that overlap between the two datasets in the training data (sketched after the list below). After this step, we performed data normalization and batch effect correction.

  • harmonized_training_data contains the overlapping features between the 2020 and 2021 datasets. The related code can be found here

  • processed_training_data contains the training dataset after normalization and batch effect correction. The related code can be found here
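
For illustration, the harmonization step amounts to something like the following sketch (object names and the variance threshold are illustrative; the actual implementation is in the linked code):

```r
# Keep only features measured in both years...
shared <- intersect(rownames(mat_2020), rownames(mat_2021))
mat_2020 <- mat_2020[shared, ]
mat_2021 <- mat_2021[shared, ]

# ...and drop features with (near-)zero variance across all specimens.
# The cutoff of 0 is a placeholder for the pipeline's actual threshold.
v <- apply(cbind(mat_2020, mat_2021), 1, var, na.rm = TRUE)
mat_2020 <- mat_2020[v > 0, ]
mat_2021 <- mat_2021[v > 0, ]
```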


Thank you @pramod. I have (another) follow-up question:
When loading the processed_training_data, each omic dataset has several sublists, including metadata and batchCorrected_data. I noticed that for the pbmc_gene_expression data there is no normalized_data (just raw_data and batchCorrected_data). Can you explain how ‘raw’ the raw pbmc gene expression data is? That is, was any sample-level correction or other post-processing applied when these data were collected? The UNDERSTAND THE DATA page on the CMI-PB blog states that the pbmc_gene_expression data will have both raw and TPM counts for each gene.
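
For reference, this is how I checked which matrices each assay ships with (file name taken from the thread title):

```r
master <- readRDS("master_processed_training_data.RDS")
names(master$pbmc_gene_expression)
# e.g. "metadata" "raw_data" "batchCorrected_data" -- no "normalized_data"
```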