E-mail addresses: [email protected] (S.E. Clare), [email protected] (S.A. Khan), [email protected] (Y. Luo).
scope of genes. When genes and mutations are studied together, novel biological interactions and pathways can be identified to provide further biological and clinical insights. Many groups have previously utilized feature selection methods to remove irrelevant and redundant information and thereby manage complexity. Vector Quantization (VQ) and Principal Component Analysis (PCA) have been widely used for feature selection. More recently, attention has been drawn to non-negative matrix factorization (NMF). In a face recognition study, Lee et al. suggested that NMF could outperform VQ and PCA for feature recognition. In addition, the non-negativity constraint of NMF is important because non-negative representations are more realistic, easier to interpret, and prevalent in real-world applications. In particular, NMF has been applied to disease subtype studies using gene expression data [17,18] and sequencing data [19–21]. With the aim of uncovering the genetic complexity behind cancer development and identifying mutations that directly affect processes involved in oncogenesis, we propose a framework utilizing NMF.
In our proposed framework, NMF was applied to discover latent factors from somatic mutations. The discovered latent factors were used to train an SVM model for cancer type classification. The NMF-SVM combination was rigorously evaluated and compared to different baselines. Association studies between the factor matrices derived from NMF and cancer type were performed using penalized logistic regression. The major factors associated with each cancer type were investigated, and significant genes were identified for pathway discovery analysis. In addition to serving as a disease type classifier, the proposed framework can be used to elucidate novel biological interactions and pathways for disease. The details of the study are reported below.
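As a rough illustration of the latent-factor step in this framework, the sketch below factorizes a small non-negative subjects-by-genes matrix X into W (subjects by factors) and H (factors by genes) so that X is approximately WH. It assumes the Lee and Seung multiplicative update rules for the Frobenius objective, which is one common NMF solver and not necessarily the one used in this study. The toy matrix, the rank k = 2, and all function names are hypothetical. The SVM step is not shown.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Frobenius norm of X - WH, i.e. the reconstruction error."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2
               for xr, ar in zip(x, wh)
               for xv, av in zip(xr, ar)) ** 0.5


def nmf(x, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative X (n x m) into W (n x k) and H (k x m).

    Uses multiplicative updates (an assumed solver); eps guards
    against division by zero.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical gene-burden matrix: 4 subjects (rows) x 3 genes (columns).
X = [
    [3.0, 0.0, 1.0],
    [2.5, 0.0, 0.8],
    [0.0, 2.0, 0.1],
    [0.0, 1.8, 0.0],
]
W, H = nmf(X, k=2)
```

Each row of W gives one subject's loadings on the latent factors and could serve as the feature vector passed to a downstream classifier; each row of H gives one factor's weights over the genes.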
2. Material and methods
2.1. Mutation profiles
As a pilot study, data for four prevalent cancer types were retrieved from The Cancer Genome Atlas (TCGA): Glioblastoma Multiforme (GBM), Breast invasive carcinoma (BRCA), Lung Squamous Cell Carcinoma (LUSC), and Prostate Adenocarcinoma (PRAD). Somatic mutations were identified from 2431 tumors (Table 1). SnpEff and ANNOVAR were used to annotate 24,588 missense mutations and 57,319 nonsense mutations in the study cohort. Each mutation was functionally scored for being potentially deleterious using SIFT, PolyPhen2 (PP2), and CADD scores. For genes containing multiple mutations, the SIFT, PP2, and CADD scores, as well as mutational frequency, were each collapsed into a single per-gene variable, known as gene burden, and studied separately. Predicted pathogenicity scores (SIFT, PP2, and CADD) were calculated for each mutation within a gene and summed to obtain the gene burden for that gene. That is, gene burden represents a gene’s total predicted pathogenicity based on mutation data. Thus, gene burden was used to represent the damage level of a gene from multiple perspectives. The workflow of the methods used in this study is illustrated in Fig. 1.
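The collapsing step described above can be sketched as a per-sample, per-gene sum. This is a minimal illustration only: the record layout, the field names (`sample`, `gene`, `cadd`), and the score values are hypothetical, and the same function would be called once per score type.

```python
from collections import defaultdict


def gene_burden(mutations, score_key):
    """Sum one per-mutation score over all mutations in each (sample, gene)."""
    burden = defaultdict(float)
    for mut in mutations:
        burden[(mut["sample"], mut["gene"])] += mut[score_key]
    return dict(burden)


def mutation_frequency(mutations):
    """Count the mutations observed in each (sample, gene)."""
    counts = defaultdict(int)
    for mut in mutations:
        counts[(mut["sample"], mut["gene"])] += 1
    return dict(counts)


# Hypothetical annotated mutations (values are made up for illustration).
mutations = [
    {"sample": "S1", "gene": "TP53", "cadd": 25.0},
    {"sample": "S1", "gene": "TP53", "cadd": 30.0},
    {"sample": "S1", "gene": "PTEN", "cadd": 12.5},
    {"sample": "S2", "gene": "TP53", "cadd": 18.0},
]

cadd_burden = gene_burden(mutations, "cadd")
frequency = mutation_frequency(mutations)
```

Arranging the resulting values with samples as rows and genes as columns yields one non-negative matrix per score type.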
Table 1. The number of samples in each cancer type and the corresponding number of somatic mutations. Mutations annotated with moderate effects are missense mutations or in-frame shift mutations. Mutations annotated with high effects are nonsense mutations. Numbers in parentheses are standard deviations.
Cancer Sample size Somatic Somatic high
2.2. Gene pre-selection
Prior to modeling, we evaluated whether a subset of representative genes could be derived without loss of information, in order to achieve a more balanced sample-to-feature ratio and reduce noise. The collapsed score of each gene was used as the input variable, and the cancer type was used as the output variable. A multinomial logistic regression was fit, and the P-value for the null hypothesis that the corresponding coefficient is zero was used as the indicator for pre-selection. The selection criterion for this initial screening was a P-value less than or equal to a cutoff. To reduce noise and prevent model overfitting, we tested the model using cutoff thresholds of 0.05, 0.1, 0.2, 0.5, and 1. We compared the results derived from each threshold and selected the most reasonable cutoff based on prediction accuracy and number of features.
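The screening rule can be sketched as a simple threshold sweep. The example assumes the per-gene P-values have already been obtained from the multinomial logistic regression fits (not shown here); the gene names and P-values are hypothetical.

```python
# Cutoffs tested in the text; a cutoff of 1 keeps every gene.
CUTOFFS = (0.05, 0.1, 0.2, 0.5, 1.0)


def preselect(p_values, cutoff):
    """Return the genes whose P-value is less than or equal to the cutoff."""
    return sorted(gene for gene, p in p_values.items() if p <= cutoff)


# Hypothetical per-gene P-values (made up for illustration).
p_values = {"GENE_A": 0.01, "GENE_B": 0.08, "GENE_C": 0.30, "GENE_D": 0.90}

# Genes retained at each cutoff, and the resulting feature counts.
selected = {cutoff: preselect(p_values, cutoff) for cutoff in CUTOFFS}
n_features = {cutoff: len(genes) for cutoff, genes in selected.items()}
```

The feature count at each cutoff is one of the two quantities weighed when choosing the threshold; the other, prediction accuracy, would come from fitting the downstream model on each gene subset.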
2.3. Applying NMF to discover latent factors of somatic mutations
Genes passing the selection threshold were used as inputs for the NMF study. Assume there were N subjects and M selected genes. The