187. 重复的DNA序列

我的思路

一种直觉：字典树。

把字符串 s s s 从头开始每 10 个作为一个新的字符串，加入到字典树中，对于当前新字符串 s i s_i si：

如果 s i s_i si 出现在字典树中，那么将该字符串加入结果中。注意，应该现将字符串加入一个集合中，因为当前字符串可能会有和它重复的字符串，且不止一个。
如果 s i s_i si 没出现在字典树中，那么将它插入到字典树中。

设字典树的深度是 L L L，那么显然， L = 10 L=10 L=10。也就是说，我们能在 O ( L ) O(L) O(L) 的时间复杂度内完成上述
字典树的插入和查找操作。

代码实现

class Solution {
public:
    struct TireTree{
        char ch;
        bool isEnd;
        unordered_map next;
        TireTree() : isEnd(false){}
        TireTree(char _ch) : ch(_ch), isEnd(false){}
    };

    void insert(TireTree* root, string& dna10){
        auto node = root;
        for (auto ch : dna10){
            if (node->next.find(ch) == node->next.end()){
                node->next[ch] = new TireTree(ch);
            }
            node = node->next[ch];
        }
        node->isEnd = true;
    }

    bool find(TireTree* root, string& dna10){
        auto node = root;
        bool flag = true;
        for (auto ch : dna10){
            if (node->next.find(ch) == node->next.end()){
                node->next[ch] = new TireTree(ch);
                flag = false;
            }
            node = node->next[ch];
        }
        node->isEnd = true;
        return flag;
    }

    vector findRepeatedDnaSequences(string s) {
        unordered_set uset;
        TireTree *root = new TireTree();

        int s_n = s.size();

        for (int i = 0; i < s_n - 9; i ++){
            string tmp(s.begin() + i, s.begin() + i + 10);
            if (find(root, tmp)){
                uset.insert(tmp);
            }
        }
        vector ret(uset.begin(), uset.end());
        return ret;
    }
};

时间复杂度分析： O ( N ⋅ L ) O(Ncdot L) O(N⋅L)， N N N 是 s s s 的长度， L = 10 L=10 L=10。
空间复杂度分析： O ( N ⋅ L ) O(Ncdot L) O(N⋅L)，最坏的情况是，对于 s s s 中每一个长度为 L L L 的字符串都符合条件，因此都需要加入到unordered_set和结果中。

优化掉 unordered_set

可以把 isEnd 变成一个计数器cnt,只有当字符串在字典树中出现次数为2 时，也即cnt=2时才将该字符串加入到结果中。

代码实现

class Solution {
public:
    struct TireTree{
        char ch;
        int cnt = 0;
        unordered_map next;
        TireTree(){}
        TireTree(char _ch) : ch(_ch){}
    };

    bool find(TireTree* root, string& dna10){
        auto node = root;
        bool flag = true;
        for (auto ch : dna10){
            if (node->next.find(ch) == node->next.end()){
                node->next[ch] = new TireTree(ch);
                flag = false;
            }
            node = node->next[ch];
        }
        node->cnt += 1; // 先加1
        return flag && (node->cnt == 2); // 因此判断重复次数是否为2 
    }

    vector findRepeatedDnaSequences(string s) {
        vector ret;
        TireTree *root = new TireTree();

        int s_n = s.size();

        for (int i = 0; i < s_n - 9; i ++){
            string tmp(s.begin() + i, s.begin() + i + 10);
            if (find(root, tmp)){
                ret.push_back(tmp);
            }
        }
        
        return ret;
    }
};

时间复杂度分析： O ( N ⋅ L ) O(Ncdot L) O(N⋅L)， N N N 是 s s s 的长度， L = 10 L=10 L=10。
空间复杂度分析： O ( N ⋅ L ) O(Ncdot L) O(N⋅L)，最坏的情况是，对于 s s s 中每一个长度为 L L L 的字符串都符合条件，因此都需要加入到结果中。

Simple is better than complex.

虽然用 C++ 做题，但是 Python之禅亦可借鉴。

直觉固然不错，但是也让你陷入了自我惯性之中。

其实，上述第二种思路的想法，已经让重复二字明朗了。

题意转化

给你一堆长度为 L = 10 L=10 L=10 的字符串列表，找出该列表中重复的字符串。

直觉（笑）：用unordered_map。键为字符串，值为出现次数。最后将unordered_map中值大于 1 的键加入到结果中即可。

代码实现

class Solution {
public:
    vector findRepeatedDnaSequences(string s) {
        vector ret;
        unordered_map umap;
        int s_n = s.size();
        for (int i = 0; i <= s_n - 10; i ++){
            string tmp = s.substr(i, 10);
            if (++umap[tmp] == 2){
                ret.push_back(tmp);
            }
        }
        return ret;
    }
};

字符串分割可以使用 substr(index, len) 函数，index是开始位置，len是长度。

时间复杂度分析： O ( N L ) O(NL) O(NL)
空间复杂度分析： O ( N L ) O(NL) O(NL)

优化——滑动窗口

substr的操作能否优化呢？

由于 s s s 中只有四种字符，因此，可以使用二进制表示这四种字符（2个比特）：

A 表示为 00
C 表示为 01
G 表示为 10
T 表示为 11

因此，一个长为 10 的字符串，就可以使用 20 个比特表示。

我们规定，只使用 int 的低 20 位。

设当前滑动窗口对应的字符串，表示的整数是 x x x 。 x x x 的高位代表串 s s s的左边，低位代表 s s s的右边。
那么当需要得到下一个子串时，此时不再需要substr，我们只需要将滑动窗口右移一位即可，此时一个新的字符串进入窗口，同时窗口最左边的字符离开窗口。将上述操作变成对应的位运算:

滑动窗口右移一位：x = x << 2，每个字符用2个比特表示，并且 x x x的高位代表 s s s的左边，因此，窗口右移一位，那么整数 x x x需要左移2位。
新字符进入滑动窗口：x = x | bin[ch]，ch 是新字符，bin[ch] 是 ch 对应的二进制。 x x x 的低位代表 s s s 的右边。
将最左边的字符离开窗口：x = x &((1 << 20) - 1)。因为我们规定只使用低20位，因此需要将其余位置0。

将三步合并：x = ((x << 2) | bin[ch]) & ((1 << 20) - 1)

AC代码

class Solution {
public:
    unordered_map bin = {{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};  
    vector findRepeatedDnaSequences(string s) {
        vector ret;
        int s_n = s.size();
        if (s_n < 10){ // 剪枝
            return ret;
        }
        unordered_map umap;
        int x = 0;
        for (int i = 0; i < 9; i ++){ // 预处理9个字符
            x = (x << 2) | bin[s[i]];
        }

        for (int i = 0; i <= s_n - 10; i ++){
            x = ((x << 2) | bin[s[i + 10 - 1]]) & ((1 << 20) - 1);
            if (++umap[x] == 2){
                ret.push_back(s.substr(i, 10)); // 抱歉，还是需要substr
            }
        }

        return ret;
    }
};

时间复杂度： O ( N ) O(N) O(N)
空间复杂度： O ( N ) O(N) O(N)

2021.10.09 14:05

187. 重复的DNA序列

C/C++/C#相关栏目本月热门文章