If you are working with FASTA files, use BioPython. To get n sequences, use random.sample:

    from Bio import SeqIO
    from random import sample

    with open("foo.fasta") as f:
        seqs = SeqIO.parse(f, "fasta")
        # random.sample needs a sequence, so the parser's iterator is turned into a list first
        print(sample(list(seqs), 2))

Output:

    [SeqRecord(seq=Seq('GAGATCGTCCGGGACCTGGGT', SingleLetterAlphabet()), id='chr1:1154147-1154167', name='chr1:1154147-1154167', description='chr1:1154147-1154167', dbxrefs=[]),
     SeqRecord(seq=Seq('GTCCGCTTGCGGGACCTGGGG', SingleLetterAlphabet()), id='chr1:983001-983021', name='chr1:983001-983021', description='chr1:983001-983021', dbxrefs=[])]

You can extract the strings if needed:

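    # seqs is the SeqIO.parse iterator from the with block; it can only be consumed once
    print([(seq.name, str(seq.seq)) for seq in sample(list(seqs), 2)])

Output:

    [('chr1:1310706-1310726', 'GACGGTTTCCGGTTAGTGGAA'), ('chr1:983001-983021', 'GTCCGCTTGCGGGACCTGGGG')]

If the lines always come in pairs (one header line, one sequence line) and you skip any metadata at the top, you can zip the file with itself:
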
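    from random import sample

    with open("foo.fasta") as f:
        # zip(f, f) reads two consecutive lines per tuple from the same file iterator
        print(sample(list(zip(f, f)), 2))

This gives you the lines paired in tuples:
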
    [('>chr1:983001-983021\n', 'GTCCGCTTGCGGGACCTGGGG\n'), ('>chr1:984333-984353\n', 'CTGGAATTCCGGGCGCTGGAG\n')]

To get the sampled records ready for writing:

    from Bio import SeqIO
    from random import sample

    with open("foo.fasta") as f:
        seqs = SeqIO.parse(f, "fasta")
        samps = ((seq.name, seq.seq) for seq in sample(list(seqs), 2))
        for samp in samps:
            # header line starting with ">", then the sequence on its own line
            print(">{}\n{}".format(*samp))

Output:

    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG
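
To see why the zip approach pairs the lines, here is a minimal standard-library sketch that can be run without a file on disk. It assumes exactly two lines per record. The helper names `sample_pairs` and `write_pairs`, the `skip` parameter, and the metadata line in the sample text are all made up for illustration. The record names and sequences are the ones shown above.

```python
import io
from itertools import islice
from random import Random

# Hypothetical file contents: one made-up metadata line, then header/sequence pairs.
FASTA_TEXT = (
    "# made-up metadata line\n"
    ">chr1:983001-983021\n"
    "GTCCGCTTGCGGGACCTGGGG\n"
    ">chr1:984333-984353\n"
    "CTGGAATTCCGGGCGCTGGAG\n"
    ">chr1:1310706-1310726\n"
    "GACGGTTTCCGGTTAGTGGAA\n"
)


def sample_pairs(handle, k, skip=0, rng=None):
    """Return k random (header, sequence) line pairs from a two-line-per-record file."""
    rng = rng or Random()
    # Discard the first `skip` lines (the metadata) before pairing.
    lines = islice(handle, skip, None)
    # zip pulls from the same iterator twice per tuple, so consecutive
    # lines end up grouped as (header, sequence).
    pairs = list(zip(lines, lines))
    return rng.sample(pairs, k)


def write_pairs(pairs, out):
    """Write sampled pairs to a file-like object; each line keeps its own newline."""
    for header, seq in pairs:
        out.write(header + seq)


picked = sample_pairs(io.StringIO(FASTA_TEXT), 2, skip=1, rng=Random(0))
buffer = io.StringIO()
write_pairs(picked, buffer)
print(buffer.getvalue(), end="")
```

With a real file, the `io.StringIO` objects would be replaced by handles from `open(...)`. If the sequences wrap over several lines, the pairing no longer holds and the SeqIO version is the safer choice.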

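Both approaches above call list(...) on the whole file, which may be a problem for very large inputs. One possible alternative, not part of the answer above, is reservoir sampling, which keeps only n records in memory at a time. The sketch below is a plain-Python version under that assumption; `reservoir_sample` is a hypothetical helper name.

```python
import io
from random import Random


def reservoir_sample(records, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown length."""
    rng = rng or Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1),
            # overwriting a randomly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Stand-in for an open two-line-per-record FASTA file.
handle = io.StringIO(
    ">chr1:983001-983021\nGTCCGCTTGCGGGACCTGGGG\n"
    ">chr1:984333-984353\nCTGGAATTCCGGGCGCTGGAG\n"
    ">chr1:1310706-1310726\nGACGGTTTCCGGTTAGTGGAA\n"
)
picked = reservoir_sample(zip(handle, handle), 2, rng=Random(0))
for header, seq in picked:
    print(header + seq, end="")
```

Any iterable of records works as input, so the iterator returned by SeqIO.parse(f, "fasta") could be passed in the same way. Unlike random.sample, this sketch returns everything it saw when the input holds fewer than k records, rather than raising ValueError.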


