APPLYING MACHINE LEARNING TO IMPROVE THE ACCURACY OF POLYGENIC RISK SCORES WITH AUTISM SPECTRUM DISORDER DATA

Authors

  • Trịnh Thị Xuân, Lê Thị Thanh Thuỳ, Tạ Văn Nhân, Hoàng Đỗ Thanh Tùng, Trương Nam Hải, Trần Đăng Hưng

Keywords:

Complex diseases, polygenic risk scores, GWAS, SNPs, SNP arrays, machine learning, autism

Abstract

Polygenic risk scores (PRS) are relative estimates of disease risk computed from a set of variants with identified effects. In recent years there have been many attempts to bring PRS calculation into clinical practice; however, the selection of genetic variants that affect disease has not been accurate, so model performance has fallen short of expectations. In this study, we implemented different models to choose the set of variants that gives the best prediction. The data were taken from Genome-Wide Association Studies (GWAS) of Autism Spectrum Disorder (ASD). The original set of variants was reduced by Clumping and Thresholding (“C + T”), Penalized Logistic Regression (PLR), and Recursive Feature Elimination based on Support Vector Machine (SVM-RFE). Of the three, the SVM-RFE method yielded the set of SNPs with which the prediction model performed best.
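As a rough illustration of the kind of score discussed in the abstract, the sketch below computes a PRS as a weighted sum of effect-allele dosages after a simple p-value filter. It is not the authors' pipeline: it shows only the thresholding half of “C + T” (clumping by linkage disequilibrium is omitted), and the SNP identifiers, effect sizes, p-values, and function names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical GWAS summary statistics: SNP id -> (effect size beta, p-value).
# These values are made up for illustration only.
summary_stats: Dict[str, Tuple[float, float]] = {
    "rs1": (0.20, 1e-6),
    "rs2": (-0.10, 0.03),
    "rs3": (0.05, 0.40),
}


def threshold_snps(stats: Dict[str, Tuple[float, float]], p_max: float) -> List[str]:
    """Keep SNPs whose p-value is at most p_max (thresholding step only, no clumping)."""
    return [snp for snp, (_, p_value) in stats.items() if p_value <= p_max]


def polygenic_score(
    dosages: Dict[str, int],
    stats: Dict[str, Tuple[float, float]],
    snps: List[str],
) -> float:
    """Sum beta * dosage over the selected SNPs.

    Dosage is the count (0, 1 or 2) of effect alleles. A SNP missing from
    `dosages` is treated as 0 here, which is a simplifying assumption.
    """
    return sum(stats[snp][0] * dosages.get(snp, 0) for snp in snps)


# One hypothetical individual's effect-allele counts.
person = {"rs1": 2, "rs2": 1, "rs3": 0}

selected = threshold_snps(summary_stats, p_max=0.05)
score = polygenic_score(person, summary_stats, selected)
print(selected, round(score, 2))
```

The p-value cutoff (`p_max`) is the tuning knob in this sketch. Feature-selection methods such as PLR or SVM-RFE replace this fixed filter with a choice of SNPs learned from the data.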
