10. Regular Expression Matching

问题

问题是说,我们有一个东西正则表达式(不知道什么是正则表达式不重要),他有一个pattern,可以匹配字符串。匹配的规则就是

  1. 小写字母代表它本身
  2. . 代表任意字符
  3. * 代表他前面的字符可以匹配0次或者1次或者多次

举例来说就是我们有一个pattern a*, 我们可以匹配a, aa,aaaa,或者是空字符串。

给出一个字符串s,和一个pattern p,判断s是否可以匹配p。

解析

我会尽力还原每一步是如何想到的,将他们联系起来。

对于aaa是否匹配a*匹配问题,我们的思路一般是(可能我们有时候都不会注意到这个路径):

  • 首先看s中的第一个a是否匹配p中的第一个,不匹配则停止,匹配则继续看下一个字符;

  • s的第二个字符a是否匹配p的第二个字符,对于p的第二个字符出现了*,我们需要回看p的第一个字符发现是a,那么基于前面的结果(s中的第一个a匹配p中的第一个a),而且当前第二步中,p中*的前一个字符是a,s的当前字符是a,也可以匹配;所以继续匹配

  • s中的第三个a同第二部的匹配思路

  • 最终结束匹配。

不知看到这里给予前面的结果一步步构建最终结果,是否可以联想到动态规划的最优子结构。如果可以想到,我们就按照动态规划的思路去构建代码了:

  • 首先是看状态定义,根据结果长度为len(s)的字符是否匹配长度为len(p),我们定义dp[m][n]

  • 然后转移公式:

    • 这里一开始我直接就按照最难的包含.和*的开始推导。根据题解的思路提示,可以先去思考没有星号的情况,就是只包含小写字母和点号。这个时候的话其实匹配的规则就很明朗,就是对应的字符匹配上,或者对应位置是一个点,那么也算匹配,就继续往下匹配即可,反之则结束。
    • 然后继续考虑加上星号的情况:
      • 如果遇到星号的话,存在几种情况:
        • s中的字符和星号之前的字符不相等
        • s中的字符和星号之前的字符相等
        • 星号之前的字符是.(也可以视为与当前s中的字符和星号之前的点相等)
      • 对于以上几种情况
        • 第一种,相当于当前的某个字符串c*匹配不上s中当前字符,此时我们dp[i][j] 的结果需要基于dp[i][j-2]的结果(此处可以理解为j中这个字符以及星号没用了,匹配了0个字符)
        • 第二种以及第三种,可能匹配0次,或者一次,也可能匹配多次。匹配0次的话dp[i][j] = dp[i - 1][j - 2], 匹配一次的话dp[i][j] = dp[i - 1][j - 2],相当于说p中当前的字符匹配上了s中的*和他前一个字符(总共两个字符),所以是j - 2同第一种。匹配多次的话,我们可以考虑当前的结果是基于上一次的结果(aaa匹配a*基于aa匹配a*的基础),dp[i][j] = dp[i - 1][j],
  • 然后初始化。

    • j和i都为0的时候,两个空字符串一定匹配,所以dp[0][0] = true;;

    • 当p的长度为0时候肯定除了上述的情况(空字符串),其他的都匹配不上,为默认值false;

    • 当s的长度为0的时候,能够匹配的pattern应该长得像a*a*b*c*...,所以当s长度为0,当前i位置, i - 1是星号,i的结果取决于i - 2的结果:

      for (int i = 2; i < n + 1; i++) {
          if (p.charAt(j - 1) == '*' && dp[0][j - 2]) {
              dp[0][i] = true;
          }
      }
      

      可以确定一些偶数位。

代码

class Solution {
    public boolean isMatch(String s, String p) {
        if (s == null || p == null) return false;
        int m = s.length();
        int n = p.length();
        
        boolean[][] dp = new boolean[m + 1][n + 1];
        
        dp[0][0] = true;
        
        // initialization
        for (int i = 2; i < n + 1; i += 2) {
            if (p.charAt(i - 1) == '*' && dp[0][i - 2]) {
                dp[0][i] = true;
            }
        }
        
        for (int i = 1; i < m + 1; i++) {
            for (int j = 1; j < n + 1; j++) {
                if (s.charAt(i - 1) == p.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else if (p.charAt(j - 1) == '.') {
                    dp[i][j] = dp[i - 1][j - 1];
                } else if (p.charAt(j - 1) == '*') {
                    if (p.charAt(j - 2) != '.' && s.charAt(i - 1) != p.charAt(j - 2)) {
                        dp[i][j] = dp[i][j - 2];
                    } else {
                        dp[i][j] = dp[i][j - 2] || dp[i - 1][j] || dp[i - 1][j - 2];
                    }
                }
            }
        }
        return dp[m][n];
    }
}
  1. Regular Expression Matching

Problem

The problem is that we have a thing regular expression (it doesn't matter if you don't know what a regular expression is), and he has a pattern that matches strings. The rules for matching are

  1. lowercase letters for itself
  2. . represents any character
  3. * means that the character in front of him can match 0 times or 1 time or more

For example, if we have a pattern a*, we can match a, aa, aaaa, or the empty string.

Given a string s, and a pattern p, determine if s can match p.

Parsing

I'll try to restore how each step came to mind and connect them.

For the question of whether aaa matches a * match, the general idea is (and we probably don't even notice the path sometimes).

  • First see if the first a in s matches the first in p. Stop if it doesn't match, and move on to the next character;

  • whether the second character a of s matches the second character of p. For the second character of p appears *, we need to look back at the first character of p found to be a, then ** based on the previous result ** (the first a in s matches the first a in p), and the current second step, the previous character of * in p is a, the current character of s is a, can also match; so continue to match

  • the third a in s is the same as the second part of the matching idea

  • finally end matching.

I don't know if it is possible to associate the optimal sub-structure of dynamic programming by seeing the step-by-step construction of the final result given here to the previous result. If it can be thought of, we follow the idea of dynamic programming to build the code: *

  • First is to see state definition, according to the result length of len(s) whether the character matches the length of len(p), we define dp[m][n].

  • Then the transfer formula.

    • Here at the beginning I will directly follow the most difficult ones containing . and *'s to start the derivation. According to the idea of the problem solution hint, you can ** first go to think about the case without asterisks **, that is, containing only lowercase letters and dotted signs. This time, the rules of matching are very clear, that is, the corresponding characters match, or the corresponding position is a dot, then it is also considered a match, and continue to match down, and vice versa is the end.
    • Then consider the case of asterisks.
      • If an asterisk is encountered, there are several cases.
        • The character in s and the character before the asterisk are not equal
        • The character in s and the character before the asterisk are equal
        • The character before the asterisk is . (can also be considered equal to the character in the current s and the dot before the asterisk)
      • For the above cases
        • The first one, which is equivalent to some current string c * does not match the current character in s. At this time, our dp[i][j] result needs to be based on the result of dp[i][j-2] (here it can be understood that this character in j and the asterisk are useless and match 0 characters)
        • The second and the third one may match 0 times, or once, or many times. If we match 0 times, dp[i][j] = dp[i - 1][j - 2], if we match once, dp[i][j] = dp[i - 1][j - 2], which means that the current character in p matches the * in s and his previous character (two characters in total), so it is j - 2 with the first one. Matching multiple times, we can consider that the current result is based on the previous result (aaa match a* based on aa match a*), dp[i][j] = dp[i - 1][j],
  • then initialize.

    • When both j and i are 0, the two empty strings must match, so dp[0][0] = true;;

    • when the length of p is 0 must not match except in the above case (empty string), for the default value of false.

    • When the length of s is 0, the pattern that can be matched should look like a*a*b*c*... , so when the length of s is 0, the current position i, i - 1 is an asterisk and the result of i depends on the result of i - 2.

      for (int i = 2; i < n + 1; i++) {
          if (p.charAt(j - 1) == '*' && dp[0][j - 2]) {
              dp[0][i] = true;
          }
      }
      

      Some even digits can be determined.

Code

class Solution {
    public boolean isMatch(String s, String p) {
        if (s == null || p == null) return false;
        int m = s.length();
        int n = p.length();
        
        boolean[][] dp = new boolean[m + 1][n + 1];
        
        dp[0][0] = true;
        
        // initialization
        for (int i = 2; i < n + 1; i += 2) {
            if (p.charAt(i - 1) == '*' && dp[0][i - 2]) {
                dp[0][i] = true;
            }
        }
        
        for (int i = 1; i < m + 1; i++) {
            for (int j = 1; j < n + 1; j++) {
                if (s.charAt(i - 1) == p.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1];
                } else if (p.charAt(j - 1) == '.') {
                    dp[i][j] = dp[i - 1][j - 1];
                } else if (p.charAt(j - 1) == '*') {
                    if (p.charAt(j - 2) ! = '.' && s.charAt(i - 1) ! = p.charAt(j - 2)) {
                        dp[i][j] = dp[i][j - 2];
                    } else {
                        dp[i][j] = dp[i][j - 2] || dp[i - 1][j] || dp[i - 1][j - 2];
                    }
                }
            }
        }
        return dp[m][n];
    }
}