Please reply with any questions you had from Mahita’s presentation on 3/22.
CMIPB_Overview (1).pptx (51.7 KB)
Thank you for the detailed presentation on your models. I have two questions that I hope you can help clarify:
- As I understand, your method selectively includes subjects with assay data for the task variables. How did you arrive at this decision? Did you explore any imputation methods before concluding that imputation was ineffective, leading you to choose the simpler approach of filtering based on subjects?
- You mentioned combining features; did this include the combination of multi-omics features? I am interested in understanding your approach, as I believe this particular step is a standout aspect of your model (I believe the other one is the regression method that you used).
Thank you!
Hi @Pramod
Here are my responses to your questions:
- I did not explore any imputation methods. Since the dataset was small, I thought it better to focus on the quality of the data rather than the quantity, as this makes a large difference to the predictive performance of a supervised learning model. I also did not want to impute the missing assay data for the task variables because every subject is unique, and imputed values could introduce outliers and skew the model's learning.
- Combining features is part of the regression method. My objective was to first evaluate each feature as an individual predictor and see how well it predicted the output. I then tried pairwise combinations of features, then triples, and so on, until I found the set of features that gave the best performance, measured as the Spearman correlation between the true and predicted values.
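For readers following along, the two steps described above (keeping only subjects with complete data, then searching feature combinations of increasing size and scoring each by Spearman correlation) can be sketched roughly as below. This is a minimal illustration and not the code used for the submission: the names `complete_cases`, `sum_model`, and `best_feature_subset` are hypothetical, and `sum_model` is only a placeholder standing in for whatever regressor is actually fitted.

```python
from itertools import combinations
from statistics import mean


def complete_cases(rows, columns):
    """Keep only subjects (rows) that have a value for every listed column."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; treat as 0
    return cov / (var_x * var_y) ** 0.5


def sum_model(train_rows, test_rows, features, target):
    """Placeholder model (assumption): predicts with the sum of the features.

    Swap in a real regressor that is fitted on train_rows.
    """
    return [sum(r[f] for f in features) for r in test_rows]


def best_feature_subset(rows, features, target, fit_predict=sum_model, max_size=None):
    """Try single features, then pairs, then triples, ...; keep the best score."""
    best_score, best_subset = float("-inf"), ()
    max_size = max_size or len(features)
    truth = [r[target] for r in rows]
    for size in range(1, max_size + 1):
        for subset in combinations(features, size):
            preds = fit_predict(rows, rows, subset, target)
            score = spearman(truth, preds)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score


# Toy data (made up for illustration): one subject is missing feature "b".
rows = [
    {"a": 1, "b": 5, "y": 1},
    {"a": 2, "b": 3, "y": 2},
    {"a": 3, "b": None, "y": 3},
    {"a": 4, "b": 1, "y": 4},
]
clean = complete_cases(rows, ["a", "b", "y"])
subset, score = best_feature_subset(clean, ["a", "b"], "y")
print(subset, score)
```

Two caveats on this sketch: it scores each subset on the same rows the model sees, which is optimistic, so a held-out or cross-validated score would be the safer choice on a small dataset; and the number of subsets grows exponentially with the number of features, so `max_size` is there to cap the search.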
Hope this helps.
Thank you,
Thanks @Mahita_Jarjapu, for the detailed response. We observed in the first challenge that models using a similar approach to data filtering as yours failed. It’s great to see that the combined dataset (2020+2021) with complete subject data could pick up the required signal for making predictions. I will look forward to learning more about the Cat Regressor.
Best,
Pramod