Exome sequencing study revealed novel susceptibility loci in subarachnoid hemorrhage (SAH)

Aim To expand our current understanding of the genetic basis of subarachnoid hemorrhage (SAH), and reveal the susceptibility genes in SAH risk. Methods We conducted whole-exome sequencing (WES) in a cohort of 196 individuals, including 94 SAH patients and 94 controls, as well as 8 samples that belong to two pedigrees. Systematically examination for rare variations (through direct genotyping) and common variations (through genotyping and imputation) for SAHs were performed in this study. Results A total of 16,029 single-nucleotide polymorphisms (SNPs) and 108,999 short indels were detected in all samples, and among them, 30 SNPs distributed on 17 genes presented a strong association signal with SAH. Two novel pathogenic gene variants were identified as associated risk loci, including mutation in TPO and PALD1. The statistical analysis for rare, damaging variations in SAHs identified several susceptibility genes which were involved in degradation of the extracellular matrix and transcription factor signal pathways. And 25 putative pathogenic genes for SAH were also identified basic on functional interaction network analysis with the published SAH-associated genes. Additionally, pedigree analysis revealed autosomal dominant inheritance of pathogenic genes. Conclusion Systematical analysis revealed a key role for rare variations in SAH risk and discovered SNPs in new complex loci. Our study expanded the list of candidate genes associated with SAH risk, and will facilitate the investigation of disease-related mechanisms and potential clinical therapies.


Introduction
Subarachnoid hemorrhage (SAH), the rarest but most fatal type of stroke, has shown an annual incidence of 8-10/100,000 persons (2007), 30-day case fatality of 35-45% in western countries [17,20]. In China, annual incidence (per 100,000 persons) of SAH was 6.2, which is slightly lower than in western countries [40]. The majority of patients with SAH usually suffered from ruptured intracranial aneurysm (IA). SAH risk was considered to be related to smoking, hypertension, and poor socioeconomic status [2,14]. Moreover, studies based on molecular mechanisms have shown that genetic factors also play an important role in the formation, growth and rupture of IA [6,11,24,32,38]. Therefore, IA is identified as a complex disease that influenced by various genes and environmental factors. Conducting early detection and intervention by identifying risk factors may facilitate to avoid the formation and rupture of IA, and is crucial for the reduced incidence of SAH [12]. However, comprehensive knowledge of pathogenic and ruptured mechanisms of IA has not yet been defined.
Some studies have focused on the pathogenic mechanisms of IA in China, and several susceptibility genes have been identified. Due to the limitation of technology and small sample size, it is still necessary to further study the genetic factors of IA in China. Although genome-wide association analysis (GWAS) studies have found some novel gene loci related to IAs, they can only explain part of the genetic risk. Most of the GWAS studies have focused on both unruptured and ruptured IAs, thus, the gene loci highly related to ruptured IAs have not been completely detected. Moreover, rare, damaging variants also play an important role in complex diseases. With advances in sequencing technology, genetic analysis is gradually extending to rare variants, which often have more obvious functional consequences of harmful phenotypes [13,29].
In this study, we set out to systematically examine rare variation (through direct genotyping) and common variation (through genotyping and imputation) for SAHs by whole-exome sequencing (WES) in a cohort of 196 samples. Mendelian inheritance analysis for the SAH pedigrees was also performed to identified the susceptibility genes. Our study constitutes a detailed simultaneous assessment of causal variations in a large sample of SAHs, offer an opportunity to better understand both the biological and genetic architecture of this type of complex disease.

Study cohorts
We prospectively collected 196 samples, including 8 samples that belong to two pedigrees from Central Hospital of Baotou. The cohorts included 94 SAH cases(with ruptured intracranial aneurysm confirmed by Digital Subtraction Angiography, DSA and computed tomography, CT) and 94 controls(for each case, 1 control without SAH will be sort for interview. Controls will be matched on the basis of the following criteria: gender (sex) 10-year age strata (ie 10-19, 20-29, etc) sector of suburb of residence in Baotou (North, East, etc).
Controls will be chosen from the spouse, relative or friend of patients without SAH who are currently in the same hospital as the case.). This study was approved by the Human Research Ethics Committee of Central Hospital of Baotou, and all participants provided written informed consent. Comprehensive clinical information was provided in Table S1, including height, weight, BMI, gender and age etc.

Whole-exome sequencing
Genomic DNA was isolated from peripheral whole blood samples of participants by using Genomic DNA Extraction Kit (Invitrogen, South San Francisco, CA, USA). The Qubit 3.0 fluorometer and gel electrophoresis were used to evaluate DNA quantity and integrity, respectively. The sequencing paired-end libraries were constructed for each sample and captured using SureSelect Human All Exon V6 kit (Agilent Technologies, Santa Clara, CA, USA) following the manufacturer's instructions. All libraries were sequenced on BGI-SEQ 500 platform at BGI to obtain a desired depth of~100X. The sequencing depths of each sample are listed in Table S2.

Whole-exome sequencing (WES) data processing and variant calling
To get high quality data, Trimmomatic [5] was used to filter out low-quality reads which contained adaptors, high base error rate (> 50%), and highly unknown base proportion (> 10%) from the raw sequencing data. The cleaned reads were aligned to human reference genome (UCSC hg19) by the Burrows-Wheeler Aligner-MEM (v.0.7.15) [26] with default parameters. All the aligned reads were further processed using Picard tools (v2.5.0) and Genome Analysis Toolkit (GATK, v3.7) [28] with default parameters, which included deduplication, base quality recalibration, and multiple-sequence realignment prior to mutation detection.

Sample quality control
The standard quality screening conducted independently in each sample included SNP and sample call rates (> 90%), Hardy-Weinberg equilibrium, Mendelian errors, gender inconsistencies and checks for population stratification. To obtain a high-quality set of samples, the outlier samples discovered using principal-component analysis in GCTA [39] were removed from further analysis.

Association testing
The single marker association analyses with SAH were performed using an additive genetic model implemented in SNPTEST (http://www.stats.ox.ac.uk/~marchini/software/gwas/snptest.html) for the common SNPs (MAF > 10%). Age, sex, BMI, smoking, drinking, body fat, and diabetes were used as covariates in the analysis.

Rare SNP filtering
We used different allele frequency threshold in several public population databases: 1000G (http://browser.1 000genomes.org/index.html), ExAC, ESP etc., to filter out common variants. Then, only variants with frequency less than the thresholds in all these databases were considered as the rare SNPs of SAHs.

Functional impact prediction
Each variant category has to be assessed with a specific set of tools to predict their functional impact. Here, we assumed that synonymous variants have no functional impact, and all the stop gain and stop loss variants were considered as the deleterious mutations. The functional predictions of missense variants were performed by seven computational methods (SIFT (Ng, 2001 #4097), Polyphen2 [1], MutationTaster [33], CADD [27], REVEL [21], M-CAP [22], LRT [10]). The pathogenicity of missense mutations was assumed if predicted pathogenic by at least five out of the computational methods. The dpsi score were employed to determine the pathogenicity of splicing mutations.

Gene-based burden analysis
Gene-base test were performed for the rare, damaging variants. For each gene, we computed the burden of rare, damaging variants in SAH cases and controls, respectively. Fisher's exact test was applied to determine the significantly associated genes in SAHs. Those genes with a P-value of less than 0.05 were identified as susceptibility genes in SAHs. SKAT-O [25] was also applied for burden test, which allowing for variants with opposite directions of effect to reside in the same gene.

Inheritance analysis in pedigrees
The SNPs were called from the 2 pedigrees, and were further filtered as the filtering criterion of rare SNPs. Then, all the SNPs were subjected to functional impact prediction. Mendelian inheritance analysis was performed for the diseasing causing SNPs with 4 inheritance patterns, including (1) dominant inheritance pattern; (2) recessive inheritance pattern; (3) semidominant inheritance pattern; (4) compound heterozygote inheritance pattern.

The network analysis
The SAH-associated genes were collected from the published studies. The STRING database and associated search tools [35] were used for identifying interacting partners of a list of SAH-associated genes. We employed the identified interacting partners as the candidate pathogenic genes in SAHs.

Cohorts description and whole-exome sequencing
In this study, we performed~100x whole-exome sequencing (WES) for 94 SAH cases, 94 controls, and 2 pedigrees. Comprehensive description of the height, weight, sex, age and the other clinical variables of the cohort are provided in Table S1. In brief, the SAH group included 55 hypertension, 10 diabetic, 11 hyperlipidemia, 46 smoker/former-smoker, and 22 drinker/formerdrinker. The control group included 37 hypertension, 9 diabetic, 9 hyperlipidemia, 42 smoker/former-smoker, and 23 drinker/former-drinker. The statistics of the WES data was provided in Table S2 and S3, including effective bases, SNPs numbers, Indel numbers and Ti/Tv rate etc. for each sample.
We totally discovered 716,029 single-nucleotide polymorphisms (SNPs) and 108,999 short indels in all the samples. We then applied Genome Analysis Toolkit (GATK) VQSR for SNVs to distinguish true sites of genetic variation from sequencing artifacts. Then, 549,553 SNPs were remained, including 148,967 exonic SNPs, for the further analysis (See Method section, Table S4). Following sample quality control, the whole-exome sequences of 93 patients with SAH and 92 controls were jointly analyzed (See Method section).

Imputation into GWAS
For imputation purposes, we conducted a genome-wide single-variant analysis of the common SNPs (minor allele frequency, MAF > 0.1) comparing the 93 SAH cases and 92 controls. The associations with SAH risk were tested using logistic regression adjusted for sex, BMI, smoking, drinking, body fat, and diabetes as covariates. The genomic inflation factor (λ = 1.006) showed no evidence of inflated test statistics. There were 30 SNPs distributed on 17 genes presented a strong association signal with SAH (Table 1, Fig. 1). In these SNPs, three of them were in exon, and one in UTR3, and the other in intron. We obtained two loci reached genome-wide significance, within the introns of two genes TPO and PALD1, respectively (Fig. 2), implies a putative functional role in the pathogenesis of SAHs.
TPO encodes a membrane-bound glycoprotein that plays a major role in thyroid gland function. Mutations in this gene are associated with several disorders of thyroid hormonogenesis, i.e., congenital hypothyroidism, and congenital goiter [31,34]. As depicted in Fig. 2, another SNP within Phosphatase Domain Containing Paladin 1 (PALD1) also showed a significant signal. PALD1 is thought to be involved in the formation of vascular endothelium [36].

The role of rare variations in SAH risk
It is plausible that analysis of rare variants could explain additional disease risk or trait variability. We next investigated the rare variants across the cohorts by applying the frequency filtering (see Method section). The variants were defined as rare if their frequency in various databases were less than the corresponding threshold (Table S5). Each rare variant was assessed with a specific set of tools to predict their functional impact (see Method section).
The 30,651 potential damaging rare variants remained for further analysis.
We employed two gene-based methods to identify the susceptibility genes in SAHs. Rare variants burden testing was performed between SAH cases and control samples by Fisher's exact test, and 38 susceptibility genes, such as gene OBSCN, TJP1, ADGRV1, and FBN3 etc., were obtained (Table 2). However, when variants with opposite directions of effect in the same gene, the testing power will be reduced. We then employed another analysis with SKAT-O [25] to identify the signals, which both allowed for variants with opposite directions of effect to reside in the same gene. The SKAT-O identified 37 signals (Table 3), which were highly overlap with the results from the burden test (92.3%, Fig. 3), which suggested that these genes could be directly involved in ALS risk.  The overlapped genes were further subjected to functional enrichment analysis. These genes were overrepresented in some pathways related to cellular organization, i.e., adherens junction, and degradation of the extracellular matrix; and transcription factor signal, i.e., TGF-beta signal pathway (Table 4).

Pedigree analysis
We performed Mendelian inheritance analysis for two SAH pedigrees with probable inheritance patterns, including (1) dominant inheritance pattern; (2) recessive inheritance pattern; (3) semi-dominant inheritance pattern; (4) compound heterozygote inheritance pattern. Pathogenicity of missense mutations was assumed if predicted pathogenic by at least five out of seven computational methods (SIFT, PolyPhen2, LRT, Mutatation-Taster, M-CAP, CADD, and REVEL). The potential disease causing variants were only performed in dominant inheritance pattern, and there were 35 and 15 SNPs identified in these two pedigrees, respectively (Table 5 and 6). Twelve and seven candidate genes were identified in pedigree 1 and 2, respectively ( Table 7). The gene COL1A2, a pathogenic gene in pedigree 2, was also reported to be associated with SAH phenotype [15].

Putative pathogenic genes for SAHs
Protein-protein interactions were known as mediating many cellular functions, including cell cycle progression,   signal transduction, and metabolic pathways. The genes that interacted with the known SAH genes may influence the SAH phenotypes by participating in the same network/pathway. Basic on previous studies, we collected 28 SAH associated genes (Table S6), and these genes were further assessed the direct and indirect associations with other genes by STRING [35].
In total, we identified 47 putative interacted genes with the SAH (Table S7). To look deep into the pathogenic genes associated with SAH, we selected the overlapped genes among the results from the burden test, SKAT-O analysis and putative interacted genes (Table 8). Finally, we identified 25 putative pathogenic genes for SAH. Among these genes, FBLN2 has been identified a member of fibulin family, and is responsible for maintenance of the adult vessel wall after injury [8]. BMP7 was reported to play an important role in facilitating recovery after stroke in rat [9]. ITGA2 is responsible for adhesion of platelets and other cells to collagens  [16,19], which was known to be highly associated with aortic aneurysm and IA. Moreover, both ITGA2 and TTN were involved in hemostasis [3,30]. Notably, Notch signaling plays a pivotal role during vascular development [4,18]. Mutations in NOTCH3 have been identified as the underlying cause of cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), the most common inherited stroke and dementia syndrome in the group of degenerative small vessel diseases [23]. Our findings demonstrated the therapeutic potential of modifying these signaling in SAHs.

Discussion
Subarachnoid hemorrhage (SAH) is the rarest but most fatal type of stroke, identification of genetic variants that confer susceptibility to SAH is clinically important to prevent it [20]. In the present study, we performed WES for SAH cases and controls, to identify causal variations that associated with SAH risk in China, which enabled us to systemically evaluate protein-altering variants and candidate functional genes. Across GWAS with a total of 188 samples, we found a genome-wide significant association of SNPs in TPO and PALD1 with SAH risk. These two genes are involved in disorders of thyroid hormonogenesis and formation of   HSPG2 (rs3767137) loci as susceptibility sites in the Dutch population. However, in our cohorts, there was no significantly associated signal in these genes, which may due to the different genetic background among the populations. We then investigated the role of low-frequency variants of intermediate effect in SAH risk through rare SNPs analysis. The pathogenic genes with rare, damaging SNPs were enriched in some pathways related to cellular organization, i.e., degradation of the extracellular matrix; and transcription factor signal, i.e., TGF-beta signaling pathway. TGF-beta signaling plays a vital role in vasculogenesis and maintenance of blood vessel, and is involved in aortic aneurysm and IA. These results highlight the functional importance of rare variations in SAH risk.
The two pedigree samples were mainly used to performed Mendelian inheritance analysis, and it revealed autosomal dominant inheritance of pathogenic genes. In the same time, some potential disease causing variants were also found, such as the gene COL1A2 which was reported to be associated with SAH phenotype [15]. Combing the results from the network analysis of known SAH-associated genes, we obtained a list of candidate susceptibility genes. Among these genes, several were demonstrated to be associated with maintenance of blood vessel, including FBLN2, ITGA2, BMP7, and NOTCH3. NOTCH3 is known to be associated with the most common inherited stroke, CADASIL. These potential targets needed to be further validated in experiment models both in vivo and in vitro, which may facilitate to develop clinical strategies for early detection and intervention.
In conclusion, we have identified a key role for rare variations in SAH and discovered SNPs in new complex loci. However, there are still some limitations to our current study due to the small sample size and availability of family genetic data. In future, the identified candidate genes, i.e., TPO, PALD1 and ITGA2, will be necessary to validate in independent study populations or a larger sample size for Chinese population. Determination of genotypes for SNPs in these genes will guide the development of therapeutic strategies for SAH.
Additional file 1: Table S1. Sample background information. Table S2. Basic information of sequencing. Table S3. reads mapping statistics. Table S4. Variation distribution statistics. Table S5. Variation filtering threshold. Table S6. Known genes in SAH. Table S7. Interacted genes with known genes in SAH