GoNL SVs

===============================
Release notes
===============================

* release 6.0: 2016 May 25.
* release 6.1: 2016 Oct 13. Same variants as release 6.0; fixed errors in the annotation of dbSNP IDs.

===============================
General
===============================

The following pipeline was used to call the indels and structural variants:

Twelve tools representing five different algorithmic approaches to variant calling (gapped alignment, split-read mapping, discordant read pairs, read depth, de novo assembly) were used: Pindel, GATK UnifiedGenotyper, GATK HaplotypeCaller, 123SV, BreakDancer, DWAC-Seq, CNVnator, FACADE, Mate-Clever, GenomeSTRiP, SOAPdenovo de novo assembly, Mobster.
Calls from each method were filtered according to that method's best practices.

===============================
Simple InDel set
===============================

A data set of simple indels (1-20 base pairs) was constructed by merging four individual call sets obtained by running GATK HaplotypeCaller, Pindel, Mate-Clever and SOAPdenovo assembly (n=1,739,300).

(File: YYYYMMDD_GoNL_AF_simple_indels.vcf.gz)

===============================
Complex InDel set
===============================

Genomic regions showing a high density of polymorphisms (distance between adjacent polymorphisms below 30 base pairs) were tested for being complex events, or alleles that potentially arose as part of a single mutational event but were called as separate adjacent events (n=52,913).

(File: YYYYMMDD_GoNL_AF_complex_indels.vcf.gz)

===============================
Structural Variants set
===============================

After creation of the algorithm-specific call sets, a consensus set of InDels and SVs was made for each of the SV types (indels, deletions, insertions, duplications, inversions, interchromosomal events, and mobile element insertions). Events were merged per variant type using an algorithm-aware merging strategy.
A consensus region was defined when overlapping regions were identified by two different detection strategies (for example split read and discordant read pair, stratified by AF and event length). The boundaries of the event were determined by the algorithm with the highest breakpoint accuracy (as determined by the calling strategy) in combination with a 50% reciprocal overlap. The resulting set consists of 54,696 genotyped and 4,662 non-genotyped structural variants.

(Files: YYYYMMDD_GoNL_AF_genotyped_SVs.vcf.gz and YYYYMMDD_GoNL_AF_nongenotyped_SVs.vcf.gz)

===============================
Novel segments
===============================

We realigned the individual-specific sets of discordant reads using these new segments as a reference sequence, in order to determine their presence or absence in the libraries of each individual. The dataset of new segments (n=11,350, total length=7.8 Mbp) was divided based on their population frequency (Fixed, > 95%; Common, 5-95%; Rare, < 5%), gender (Male-specific, > 5% of the population) and a match to herpesvirus DNA (4 individuals from two families).

Finally, we used NCBI BLAST to check whether these segments were present in the most recent GRCh38/hg38 genome reference or in the decoy dataset hg38d1. We required 99% identity in the alignment between an assembled segment and the latest genome reference to discard a segment as unreported by GRCh38 (total length=4.3 Mbp).

(File: YYYYMMDD_GoNL_novel_segments.fa.gz)
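
The population-frequency classes used for the novel segments (Fixed, > 95%; Common, 5-95%; Rare, < 5%) can be illustrated with a minimal Python sketch. The function names and the idea of computing frequency as carriers divided by the number of individuals are assumptions made for illustration; they are not taken from the GoNL pipeline.

```python
def population_frequency(carriers, individuals):
    """Fraction of individuals in which a segment was detected.

    Hypothetical helper: assumes presence/absence has already been
    determined per individual (for example from realigned reads).
    """
    if individuals <= 0:
        raise ValueError("individuals must be positive")
    if not 0 <= carriers <= individuals:
        raise ValueError("carriers must be between 0 and individuals")
    return carriers / individuals


def frequency_class(freq):
    """Map a frequency in [0, 1] to the classes named in the text.

    Fixed:  > 95%
    Common: 5% to 95% (both bounds included)
    Rare:   < 5%
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("freq must be within [0, 1]")
    if freq > 0.95:
        return "Fixed"
    if freq >= 0.05:
        return "Common"
    return "Rare"


if __name__ == "__main__":
    # Illustrative numbers only, not GoNL data.
    for carriers in (1, 50, 99):
        freq = population_frequency(carriers, 100)
        print(carriers, frequency_class(freq))
```

The boundary values 5% and 95% fall into the Common class here, which follows directly from the strict inequalities given for Fixed and Rare.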
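
The 50% reciprocal overlap criterion mentioned for the consensus structural variant set can be sketched in a similar way. This is only an illustration of the general reciprocal-overlap test, assuming half-open coordinates on the same chromosome and a "greater than or equal" comparison; the algorithm-aware choice of breakpoints and the stratification by AF and event length described in that section are not modelled, and all names are hypothetical.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smallest fraction of either interval covered by their overlap.

    Intervals are assumed to be half-open [start, end) and to lie on
    the same chromosome.
    """
    if end_a <= start_a or end_b <= start_b:
        raise ValueError("intervals must have positive length")
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return 0.0
    return min(overlap / (end_a - start_a), overlap / (end_b - start_b))


def is_same_event(call_a, call_b, threshold=0.5):
    """True if two (start, end) calls overlap reciprocally by >= threshold."""
    return reciprocal_overlap(*call_a, *call_b) >= threshold


if __name__ == "__main__":
    # Illustrative coordinates only.
    print(is_same_event((100, 200), (150, 250)))  # half of each call is shared
    print(is_same_event((100, 200), (180, 400)))  # only a small shared part
```

Requiring the overlap to cover at least half of both calls prevents a very large call from absorbing a small one that merely falls inside it.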