融合过滤和相似度计算的高错误率基因组数据敏感序列识别孙辉.pdf

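小型微型计算
…are aligned against the sensitive sequences in the GWAS Catalog database to accurately identify
disease-related sequences; finally, according to the short-sequence identification results, two mask sequences
are generated for the sequence under identification, serving as the result of identifying sensitive sequences in
the sequencing data. Experimental results show that, compared with the similar algorithms LRF and SRF, the
proposed algorithm improves the average recognition accuracy for sensitive sequences in sequencing data with
error rates of 2% to 20% by …% and …%, respectively, and improves precision by …% and …%, respectively,
effectively improving the identification of sensitive sequences in high-error-rate genomic data.
Keywords: sensitive sequence recognition; Pearson correlation coefficient; filtering; similarity calculation; alignment
CLC number: TP301 Document code: A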
Recognizing Sensitive Sequences from Genomic Data with High Error Rate Integrating
Filter and Similarity Calculation
SUN Hui1,2, ZHONG Cheng1,2
1 (School of Computer, Electronics and Information, Guangxi University, Nanning, Guangxi 530004, China)
2 (Key Laboratory of Parallel Distributed Computing Technology in Guangxi Universities, Nanning, Guangxi 530004, China)
Abstract: Existing algorithms struggle to identify sensitive sequences in sequencing data with a high error rate. To address
this, an algorithm for recognizing sensitive sequences that combines filtering and similarity calculation is proposed. First, the
genomic sequence is divided into several short sequences, and a double Bloom filter is constructed to de-duplicate the short sequences.
Second, the local fragments of the short sequences are encoded by k-mer, and the method for computing the similarity of local fragments
of short sequences is improved to identify short tandem repeats. Third, the k-mer-encoded short sequences are aligned against the
sensitive sequences in the GWAS Catalog database to identify disease-related sequences. Finally, according to the short-sequence
identification results, two mask sequences of the sequencing data are generated as the final result of identifying sensitive sequences
in the sequencing data. Experimental re
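
The de-duplication, k-mer encoding, and fragment-similarity steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the "double Bloom filter" is built, so the sketch assumes two independently salted filters that must both report a short sequence as seen before it is dropped. The similarity measure is taken to be the Pearson correlation coefficient listed in the keywords, computed over k-mer count vectors. The names `BloomFilter`, `deduplicate`, `kmer_vector`, and `pearson`, and all sizes and the choice k=2, are hypothetical.

```python
import hashlib
from collections import Counter
from itertools import product
from math import sqrt


class BloomFilter:
    """Minimal Bloom filter over strings (bit count and hash count are illustrative)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3, salt=""):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.salt = salt
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{self.salt}:{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def deduplicate(short_seqs):
    """Keep the first copy of each short sequence.

    Assumption: a sequence counts as a duplicate only when BOTH filters
    report it, which lowers the false-positive rate of a single filter.
    """
    first, second = BloomFilter(salt="a"), BloomFilter(salt="b")
    unique = []
    for s in short_seqs:
        if s in first and s in second:
            continue  # probably seen before
        first.add(s)
        second.add(s)
        unique.append(s)
    return unique


def kmer_vector(fragment, k=2):
    """Encode a fragment as counts of every possible k-mer over ACGT."""
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)
```

In this sketch, two adjacent fragments of a tandem repeat such as `ACACACAC` and `CACACACA` yield strongly correlated k-mer vectors, while an unrelated fragment does not. A real pipeline would still need a threshold and an error-tolerant fragment layout, neither of which is specified in the abstract.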
