2nd Challenge Dataset changes tracking

Pramod · October 13, 2023, 12:16am

As the 2nd challenge progresses, contestants may notice inconsistencies or issues in the dataset, and the challenge datasets may be modified over time as a result. This page organizes and tracks all changes to the datasets. Older (legacy) versions of the dataset are stored in the legacy repository, and the updated (current) datasets are available here.

…/legacy/…/2023-10-05

Datasets are made available to 2nd challenge contestants via API and direct download.
This version of the dataset can be found in the legacy repository here.

…/legacy/…/2023-12-04

A few contestants reported issues when accessing the data files. The identified issues include:

Inconsistencies in the actual dates relative to the boost. A more detailed discussion on this can be found here.
The names of cell populations in the prediction dataset differed from those in the training dataset. A more detailed discussion on this can be found here.
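A name mismatch like the second issue can be spotted by comparing the distinct cell population names in the two files. The following is a minimal sketch, not the organizers' code: the helper names are hypothetical, the `cell_type_name` column is taken from the R snippet later in this thread, and the in-memory rows (including the "B cells" spelling) are made up for illustration.

```python
import csv
import io


def column_values(tsv_file, column):
    """Return the set of distinct values found in one column of a TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row[column] for row in reader}


def name_mismatches(training_file, prediction_file, column="cell_type_name"):
    """Return (names only in training, names only in prediction), both sorted."""
    train = column_values(training_file, column)
    pred = column_values(prediction_file, column)
    return sorted(train - pred), sorted(pred - train)


# Made-up stand-ins for the real training and prediction files.
training = io.StringIO("specimen_id\tcell_type_name\n1\tBcells\n1\tNK\n")
prediction = io.StringIO("specimen_id\tcell_type_name\n2\tB cells\n2\tNK\n")

only_train, only_pred = name_mismatches(training, prediction)
print(only_train)  # ['Bcells']
print(only_pred)   # ['B cells']
```

With the real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles. Two empty lists mean the names agree.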

…/legacy/…/2023-12-21

Our student contestants reported that antibody titer data for subject_id 98 was missing from the “2022BD_plasma_ab_titer.tsv” file. We checked and confirmed that the data was indeed missing. We fixed this issue and replaced the old data file with a corrected file that includes antibody titer data for subject_id 98 (specimen_ids 740, 741, 742).
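Contestants who want to confirm that their local copy is the corrected one can check whether the three specimen IDs appear in the file. This is a rough sketch under assumptions: the function name is hypothetical, the file is assumed to have a `specimen_id` column, and the sample rows and other column names below are invented.

```python
import csv
import io


def missing_specimens(tsv_file, expected_ids, id_column="specimen_id"):
    """Return the expected specimen IDs that have no row in the TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    seen = {row[id_column] for row in reader}
    # IDs are compared as strings because csv yields text values.
    return sorted({str(i) for i in expected_ids} - seen, key=int)


# Made-up stand-in for the titer file; the real column layout may differ.
sample = io.StringIO(
    "specimen_id\tantigen\tvalue\n"
    "740\tPT\t1.0\n"
    "741\tPT\t1.2\n"
)

print(missing_specimens(sample, [740, 741, 742]))  # ['742']
```

An empty list for `[740, 741, 742]` indicates that the file already contains rows for all three specimens.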

…/current/… Current and final dataset version (updated on Jan 05, 2024)
In response to a suggestion from one of the contestants, we have processed the prediction dataset in the same manner as the training dataset to make prediction easier. We have provided both the processed data and the relevant code. This is the current version of the challenge dataset and is accessible here. You can still access the old data file here.

Joe · October 13, 2023, 8:55pm

Could you please check the file “2020LD_pbmc_cell_frequency.tsv” in this path (cmipb_challenge_datasets/current/2nd_challenge/raw_datasets/training_data/)? The file size suggests that not all cell frequency data are included in the file. See below:

unique(cellfreq_2020$cell_type_name)
[1] “Monocytes” “CD33HLADR” “Classical_Monocytes” “Non-Classical_Monocytes” “Intermediate_Monocytes” “Bcells”
[7] “CD3CD19” “CD3CD19neg” “CD3 Tcells” “CD4Tcells” “CD8Tcells” “Tregs”
[13] “TemraCD4” “NaiveCD4” “TemCD4” “TcmCD4” “TemraCD8” “NaiveCD8”
[19] “TemCD8” “TcmCD8” “NK” “Basophils” “mDC” “pDC”
[25] “ASCs (Plasmablasts)”

Pramod · October 13, 2023, 10:40pm

Hi @Joe, the 2020 cohort has different feature counts compared to the 2021 and 2022 datasets. The 2020 cohort’s cell frequency dataset comprises 25 cell types, as you pointed out, while the 2021 and 2022 datasets each have 50 cell types.

Similarly, the 2020 plasma_cytokine_concentration (Olink) data has 263 cytokines, while the 2021 and 2022 plasma_cytokine_concentration datasets each have 45 cytokines. Genes in all three datasets are identical. I hope this information is helpful.
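To see these per-cohort differences directly, one option is to count the distinct feature names in each cohort's file. The sketch below is illustrative only: `count_distinct` is a hypothetical helper, the toy rows do not reflect the real data, and `cell_type_name` is the column used in the R call above (the corresponding column in the cytokine files is not stated in this thread).

```python
import csv
import io


def count_distinct(tsv_file, column):
    """Count the distinct values in one column of a TSV file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return len({row[column] for row in reader})


# Toy stand-ins for two cohort files; real files would be opened from disk.
cohorts = {
    "2020": io.StringIO("cell_type_name\nMonocytes\nBcells\nMonocytes\n"),
    "2021": io.StringIO("cell_type_name\nMonocytes\nBcells\nNK\n"),
}

counts = {label: count_distinct(f, "cell_type_name") for label, f in cohorts.items()}
print(counts)  # {'2020': 2, '2021': 3}
```

Models trained across cohorts would typically need to be restricted to the features shared by all of them.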