Analysis | Traces obtained from TSC or NCBI were first trimmed to remove vector sequences using the VecScreen system from NCBI. Low-quality regions containing at least five consecutive bases with Phred scores below 25 (Ewing and Green 1998) were then trimmed using a custom Perl script. The longest high-quality (LHQ) region from each trace was selected for further evaluation, and the remaining trimmed regions of the trace were set aside. Each LHQ region was further required to have an average Phred score of at least 25 and to be longer than 100 bases. Repeats were identified and masked within the LHQ region of each trimmed trace using RepeatMasker and RepBase. The longest unmasked "anchor" region, which had to be at least 50 bases long, was then used to assign each trace to a unique genomic location in build hg17 of the human genome using BLAST. Successfully mapped anchor sequences were required to have a single 100% match to a unique genomic location. Traces containing anchor sequences with more than one perfect match were set aside to avoid traces that mapped to segmental duplications (Bailey et al. 2002). The LHQ regions of successfully mapped traces were then unmasked and aligned to their assigned genomic locations using BLAST2seq (NCBI). Polymorphisms were mined from these alignments using custom Perl scripts. We required the five bases on each side of a polymorphism candidate to have Phred scores of 25 or higher. For SNP discovery, the SNP base itself was also required to have a Phred score of 25 or higher. Single-base-pair INDELs were screened to identify double-hit INDELs, and only these were included in our final collections. Because BLAST allows gaps of at most 16 bases in its alignments, a custom Perl script was developed to identify INDELs longer than 16 bp. Upon encountering a region in the alignment that no longer matched the query, this program split the trace data into two blocks. 
The first block (which matched the query) was kept at its original position, whereas the second block (which did not match the query) was shifted one base at a time until a perfect match was obtained or a distance of 10,000 bases (the maximum distance allowed by the program) was reached. The variant count differs from the published data by two because two INDELs that were later found to be false positives were excluded. |
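
Two of the steps described above, selection of the LHQ region and the block-shifting search for large INDELs, can be illustrated with a minimal Python sketch. This is not the authors' Perl code. The function names, the representation of a trace as a list of Phred scores or a plain string, and the use of exact string comparison in place of a real alignment are all assumptions made for illustration. The shifting function only covers the case where bases present in the reference are missing from the trace; how the original script handled the opposite case is not stated in the text.

```python
from typing import List, Optional, Tuple


def longest_high_quality_region(
    quals: List[int],
    min_phred: int = 25,      # threshold from the text
    min_low_run: int = 5,     # "at least five bases in a row"
    min_length_exclusive: int = 100,  # LHQ must be longer than this
) -> Optional[Tuple[int, int]]:
    """Return the LHQ region as a half-open (start, end) interval, or None.

    Hypothetical helper: runs of >= min_low_run consecutive bases below
    min_phred are cut out, and the longest remaining segment is kept if it
    passes the length and mean-quality filters.
    """
    segments = []
    seg_start = 0
    i, n = 0, len(quals)
    while i < n:
        if quals[i] < min_phred:
            j = i
            while j < n and quals[j] < min_phred:
                j += 1
            if j - i >= min_low_run:  # long enough to count as a low-quality region
                if i > seg_start:
                    segments.append((seg_start, i))
                seg_start = j
            i = j
        else:
            i += 1
    if n > seg_start:
        segments.append((seg_start, n))
    if not segments:
        return None
    start, end = max(segments, key=lambda s: s[1] - s[0])
    length = end - start
    if length <= min_length_exclusive:
        return None
    if sum(quals[start:end]) / length < min_phred:
        return None
    return start, end


def find_large_gap(
    reference: str, trace: str, max_shift: int = 10_000
) -> Optional[Tuple[int, int]]:
    """Return (split_position, shift) for a gap in the trace, or None.

    Hypothetical helper: the trace is split at the first mismatch, the first
    block stays in place, and the second block is slid along the reference
    one base at a time until it matches perfectly or max_shift is reached.
    """
    split = next(
        (k for k, (r, t) in enumerate(zip(reference, trace)) if r != t), None
    )
    if split is None:
        return None  # no mismatch within the compared length
    second = trace[split:]
    for shift in range(1, max_shift + 1):
        start = split + shift
        if start + len(second) > len(reference):
            break  # second block would run off the end of the reference
        if reference[start:start + len(second)] == second:
            return split, shift
    return None
```

For example, a trace that lacks 20 reference bases after position 8 would be reported by `find_large_gap` as a split at 8 with a shift of 20, a gap larger than the 16-base limit mentioned above.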