小型微型计算

are computationally aligned against the sensitive sequences in the GWAS Catalog database to accurately identify disease-related sequences; finally, according to the short-sequence identification results, two mask sequences of the sequence to be recognized are generated as the result of identifying sensitive sequences in the sequencing data. Experimental results show that, compared with the similar algorithms LRF and SRF, the proposed algorithm improves the average recognition accuracy for sensitive sequences in sequencing data with error rates of 2%~20% by % and %, respectively, and improves precision by % and %, respectively, effectively improving the recognition of sensitive sequences in genomic data with high error rates.

Key words: sensitive sequence recognition; Pearson correlation coefficient; filter; similarity calculation; alignment

CLC number: TP301    Document code: A

Recognizing Sensitive Sequences from Genomic Data with High Error Rate Integrating Filter and Similarity Calculation

SUN Hui1,2, ZHONG Cheng1,2

1(School of Computer, Electronics and Information, Guangxi University, Nanning, Guangxi 530004, China)
2(Key Laboratory of Parallel Distributed Computing Technology in Guangxi Universities, Nanning, Guangxi 530004, China)

Abstract: Existing algorithms have difficulty identifying sensitive sequences in sequencing data with a high error rate. To address this problem, a sensitive-sequence recognition algorithm using filtering and similarity calculation is proposed. Firstly, the genomic sequence is divided into several short sequences, and a double Bloom filter is constructed to de-duplicate the short sequences. Secondly, the local fragments of the short sequences are encoded by k-mer, and the method for computing the similarity of these local fragments is improved to identify short tandem repeats. Thirdly, the k-mer-encoded short sequences are aligned against the sensitive sequences in the GWAS Catalog database to identify disease-related sequences. Finally, according to the short-sequence identification results, two mask sequences of the sequencing data are generated as the final result of identifying sensitive sequences in the sequencing data. Experimental re