By Kan Tao (Lingxiang)
This article mainly introduces how mainstream parsers implement the LIKE syntax logic. The author analyzes the advantages and disadvantages of several implementation methods and adopts the state machine method to optimize the performance of the scene step by step.
Recently, we have been optimizing the LIKE syntax of the project. Since we have talked about SQL, we might as well look at how some mainstream parsers implement the LIKE syntax logic. The two mainstream SQL parsers are ANTLR and Calcite.
ANTLR is a powerful syntax parser and generator that can be used to read, process, execute, and transform structured text or binary files. It is widely used in some SQL frameworks of big data. For example, Hive's lexical file is written by ANTLR3, and Presto's lexical file is implemented by ANTLR4. However, ANTLR does not directly implement the specific syntax, so there is no way to find the implementation statement.
Calcite simplifies the process of ANTLR code generation. It provides standard SQL language, multiple query optimization, and the ability to connect various data sources. At the same time, Calcite has a good pluggable architecture design, which allows users to easily put a SQL shell on their systems and provides efficient query performance optimization. Therefore, it won the favor of many developers. The Calcite implementation for LIKE logical matching is attached below:
/** SQL {@code LIKE} function. */
public static boolean like(String s,String pattern){
final String regex = Like.sqlToRegexLike(pattern, null);
return Pattern.matches(regex, s);
}
/** Translates a SQL LIKE pattern to Java regex pattern.*/
static String sqlToRegexLike(String sqlPattern,char escapeChar) {
int i;
final int len = sqlPattern.length();
final StringBuilder javaPattern = new StringBuilder(len + len);
for (i = 0; i < len; i++) {
char c = sqlPattern.charAt(i);
if (JAVA_REGEX_SPECIALS.indexOf(c) >= 0) {
javaPattern.append('\\');
}
if (c == escapeChar) {
if (i == (sqlPattern.length() - 1)) {
throw invalidEscapeSequence(sqlPattern, i);
}
char nextChar = sqlPattern.charAt(i + 1);
if ((nextChar == '_')
|| (nextChar == '%')
|| (nextChar == escapeChar)) {
javaPattern.append(nextChar);
i++;
} else {
throw invalidEscapeSequence(sqlPattern, i);
}
} else if (c == '_') {
javaPattern.append('.');
} else if (c == '%') {
javaPattern.append("(?s:.*)");
} else {
javaPattern.append(c);
}
}
return javaPattern.toString();
}
There are also some other compilers or middleware attached (such as TDDL). Let's take a brief look at the implementation. The whole is similar, while the logic of buildPattern is not the same.
...
try {
Pattern pattern = patterns.get(buildKey(right, escTmp), new Callable<Pattern>() {
@Override
public Pattern call() throws Exception {
return Pattern.compile(buildPattern(right, escTmp), Pattern.CASE_INSENSITIVE);
}
});
Matcher m = pattern.matcher(left);
return m.matches() ? 1l : 0l;
} catch (ExecutionException e) {
throw new FunctionException(e.getCause());
}
...
At this point, to sum up, many projects are completed based on regular expressions. Next, I sorted out several ways I have implemented recently.
Java's regular expressions have a different syntax from SQL's LIKE. The most important thing is that you must escape the Java special characters. A simple handling of the regexParse function is the traversal replacement operation for special characters ([](){}.*+?$^|#\)
.
public static boolean like(final String dest, final String pattern) {
String regex = regexParse(pattern);
regex = regex.replace("_",".").replace("%",".*?");
Pattern p = Pattern.compile(regex,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
return p.matcher(dest).matches();
}
This method is simple and clear at the code level, but the performance is poor. The repeated use of replace method has already performed multiple traversals. Here is a point that can be optimized. replaceChars (str, searchChar, replaceChar)
can be used for single character replacement.
The regular expression execution engine used by the Java language is non-deterministic finite automaton (NFA). This engine is characterized by powerful functions, but the backtracking mechanism leads to slow execution efficiency (when backtracking is serious, it can lead to 100% CPU utilization of the machine and directly jam the machine). Optimization related to Pattern processing can be done in the regular expression, and the compiled Pattern object can be cached to avoid repeatedly compiling patterns (one pattern needs to be cached for each pattern-expr str). Try choosing the lazy mode and exclusive mode and avoid using the greedy mode (default).
The three modes here are greedy mode, lazy mode, and exclusive mode.
The number indicator defaults to the greedy mode. The expression of the greedy mode will continue to match until it cannot be matched. If we find that the result of the expression does not match the expected result, it is probably because we thought the expression would only match the first few characters, but instead, it will keep matching.
In contrast to the greedy mode, the lazy mode will try to match less content, as described above for the percent sign.
The exclusive mode should be regarded as a variant of the greedy mode. It will try to match more content. The difference is that the backtracking mechanism will not be triggered in case of matching failure but will continue to judge backward, so this mode is the most efficient.
What is better than directly writing a customized version of LIKE? It is simple and easy to use. The memory and performance are almost the best. The most complicated situation is O(n*m), which is similar to doing an algorithm problem. It is a pure match process and does not need pre-cache processing. The following double pointer method can be implemented by dynamic planning. (Part of the code is hidden).
public static boolean like(final String dest, final String pattern) {
int destPointer = 0, patternPointer = 0;
int destRecall = -1, patternRecall = -1;
final int patternLen = pattern.length();
final int destLen = dest.length();
while( destPointer < destLen) {
......
......
}
while(patternPointer < patternLen && pattern.chatAt(patternPointer) == '%') {
patternPointer++;
}
return patternPointer == patternLen;
}
There is a scene we have to consider, which is backtracking. For example, it is not the same for pattern = "a%bcd" to match abcde and abcdbcdbcd at the same time. As such, it is necessary to record the backtrace mark. (I will talk about a method that does not need backtrace later.)
If this is the most memory-saving method, only a few internal variables are used to complete the entire work. If you insist on listing the shortcomings, we can say the maintainability is too poor. The logic processing is kind of connected, and if you make some extensions to the syntax in the future, it will be tricky.
A state machine has three components: states, events, and actions.
States: All possible statuses, including the current state and the state to be migrated after the condition is met
Event: When a condition is met, an action is triggered, or a state transition is performed.
Action: The action that is executed after the condition is met. After the action is executed, it can migrate to the new state or remain in the original state. The action is not required. When the condition is met, you can directly migrate to the new state without performing any action.
General use involves the exhaustive method, look-up table method, and state mode.
The simplest implementation of the state machine uses if-else or switch-case to translate each state transition into code regarding the state transition diagram. For a simple state machine, the implementation of the branch logic method is acceptable. For a complex state machine, the disadvantage is that it is easy to miss writing and mistakenly write some state transitions. In addition, the code is filled with if-else, and the readability and maintainability are poor.
The look-up table method is suitable for state machines with many types of implementation states, events, and complex state transitions. The use of a two-dimensional array to represent the state transition table can significantly improve the readability and maintainability of the code. State transitions and action execution are represented by array transitionTable and actionTable in the representation of a two-dimensional array, respectively. In both arrays, the x-axis represents the current state, the y-axis represents the event that occurred, and the values represent the new state after the transition and the action to be performed, respectively.
The look-up table method cites an introductory example here.
For example, there is a string with the pattern SCSAS, and we try to sort out the state transition table corresponding to this string. ! indicates characters unrelated to S, C, and A. The following is the finite-state machine state transition table.
The following figure shows the finite-state machine state transition diagram with the pattern SCSAS.
Next is the match process. Get the dest string and match the characters according to the state data in the table until the value of State = 5 is found.
The state mode is usually a state machine that expresses few implementation states and simple state transitions but with complex business logic contained in event-triggered actions. The state transition and action execution triggered by events in different states are split into different state classes to avoid branch judgment logic.
Compared with the look-up table method, when the state mode introduces more state classes, we recommend using the look-up table method for more states. The state mode is more suitable for fewer states with complex actions.
Then, the implementation of LIKE based on the state mode will be more convenient. We need to do a simple analysis before we start. If the test statement is CBEED like %B%E%D%, this result must be true. How do we implement such a state machine?
The specific disassembly can be divided into two parts: the construction of the state machine and the matching of the operation.
public void compile(final String pattern) {
...
LikeStateMachine machine = LikeStateMachine.build(pattern);
...
}
The process of building is the process of parsing and loading the pattern. I used a linked list to construct it. Implementation is the process of traversing the construction. Compile time complexity O(n).
Next is the match string CBEED process. The implementation of the code is to traverse the matching process. Match time complexity O(n*m).
Then, the final matching result can be like Figure 1 or Figure 2, depending on whether the matching logic is reversed or positive priority.
Figure 1
Figure 2
The difficulty here is that the matching possibilities are not unique and not every percent sign corresponds to the same character. For example:
So, when I did the matcher state design, I defined five state types, which ultimately helped me implement the overall logic. The approach above also needs to be implemented with backtracking in mind, and JUMP is specifically designed for backtracking.
Therefore, five classes related to state machines need to be defined, which is sufficient to complete our functions and extensibility.
I am thinking about how to optimize this reversible scenario. Since the existence of backtracking makes the whole logic complicated, my goal is to make the expression of complex logic simpler. I try to handle more things in the compilation stage to obtain better performance when matching. Compared with the previous case, I carried out a split-LIKE operation on the pattern data, expressing the non-percent sign part in the form of a string, and the effect was unexpectedly good. For example, %BF%EA%D. The following is its optimization process:
I have made scene type definitions for several scenes: %a = TYPE:LEFT a% = TYPE:RIGHT %a% = TYPE:SURROUND
Any pattern with % can be parsed into these three types, and the number of nodes in our state machine has been optimized a lot. Based on this implementation, the backtracking logic no longer exists, and it can be identified through the joint check of the current node and the next node. Finally, O(n) can be achieved. At the same time, for some scenarios, it can be identified through type + length judgment without further traversal. Let's look at the previous backtracking case, pattern = "a%bcd" matches abcde and abcdbcdbcd at the same time. Do special processing for this RIGHT + LEFT scenario.
Can we do further optimization so far? It can be optimized further based on common scenarios. I will try to list these scenarios.
'ab' like 'abc','ab' like 'abc%','ab' like '%abc','ab' like '%abc%'
More than 80% of usage scenarios are the conventional applications above. At this time, I think I can identify such scenarios at compile time and make further optimization.
'ab' like 'abc' ----> equals("abc")
'ab' like 'abc%' (or something like 'abc%%','abc%%%' ends with n '%')----> startsWith("abc")
'ab' like '%abc' (or something like '%%abc','%%%abc' starts with n '%')----> endsWith("abc")
'ab' like '%abc%' (or something like '%%abc%%', surrounded by n '%')----> contains("abc")
'ab' like '%' (or something like '%%', n '%')----> true
Try to use the ability of JDK at the bottom to do such a thing, and there may be no need to traverse the judgment such as length test.
public LikeParserResult compile(final String pattern) {
return parseLikeExpress(pattern);
}
....
public boolean match(final String dest, LikeParserResult likeParserResult) {
switch (likeParserResult.getMatchResult()) {
case LikeMatchResult.TYPE.ALLMATCH:
return true;
case LikeMatchResult.TYPE.EQUALS:
return doEquals(dest, likeParserResult.getFinalPattern());
case LikeMatchResult.TYPE.STARTSWITH:
return doStartsWith(dest, likeParserResult.getFinalPattern());
case LikeMatchResult.TYPE.ENDSWITH:
return doEndsWith(dest, likeParserResult.getFinalPattern());
case LikeMatchResult.TYPE.CONTAINS:
return doContain(dest, likeParserResult.getFinalPattern());
default:
// or another implementation.
return likeParserResult.getLikeStateMachine().match(dest);
}
}
The code given above is to see the operation inside. It can be optimized and streamlined further according to the code style (functional Function, interface, etc.) that you are good at.
...
public LikeParserResult compile(final String pattern) {
return parseLikeExpress(pattern);
}
public boolean match(final String dest, LikeParserResult likeParserResult) {
return likeParserResult.getMatcher().match(dest);
}
...
public class StartsWithMatcher implements Matcher {
...
@Override
public Boolean match(String dest) {
return dest.startsWith(this.pattern);
}
}
...
Algorithm / run 0.1 billion times avgms / business scenario | short-exact match pattern = "halo", dest = "halo"; |
short-percent only pattern = "%", dest = "hello"; |
short-right match pattern = "he%", dest = "hello!"; |
short-left match pattern = "%go", dest ="let's go"; |
short-bidirectional match pattern = "%wo%", dest ="world"; |
Short-right and left match pattern = "he%lo", dest ="hello"; |
Short-Double Left Match pattern = "%he%lo", dest ="hello"; |
state machine (compile in advance) | 11 | 7 | 181 | 321 | 492 | 1198 | 1477 |
double pointer algorithm | 740 | 780 | 1277 | 1710 | 1133 | 1768 | 2152 |
long-exact match pattern ="this is hello world!let's go!", dest = "this is hello world!let's go!"; |
long-percent only pattern = "%%%%%", dest = "this is hello world!let's go!"; |
long-right match pattern = "hello%", dest = "hello world!let's to!"; |
long-left match pattern = "%go", dest ="this is hello world!let's go"; |
Long-two-way matching pattern = "%wo%", dest ="hello world!let's go!"; |
Long-right-left match pattern = "he%go", dest ="hello world, let's go"; |
Long-double left match pattern = "%lo%go", dest ="hello world, let's go"; |
|
state machine (compile in advance) | 10 | 5 | 228 | 356 | 524 | 1182 | 1966 |
double pointer algorithm | 2063 | 1984 | 2433 | 3775 | 3073 | 3294 | 4569 |
short-right double match pattern = "he%lo%", dest ="hello!"; |
long-right double match pattern = "he%ld%", dest ="hello world, let's go"; |
short-multiple two-way matching pattern = "%wo%ld%",dest ="world"; |
long-multi two-way matching pattern = "%wo%et%", dest ="this is hello world!let's go!"; |
Short-Backtrack pattern = "%he%rld", dest ="helloworld world"; |
Long-Backtrack pattern = "%he%rld", dest ="this is hello world world world"; |
||
state machine (compile in advance) | 1366 | 1382 | 1345 | 3433 | 1749 | 2063 | |
double pointer algorithm | 2187 | 4402 | 2072 | 4755 | 4986 | 8650 |
The chart above is a test case designed according to a typical scenario. The general conclusion of the data is that the longer the string is, the slower it will be for the general algorithm. The state machine will perform better than the general algorithm in most cases, and the performance is similar in quite a few cases.
According to the comprehensive evaluation of multiple advantages and disadvantages, I think state machine implementation > direct implementation > regular expression implementation. The compilation process of the state machine and regular expression can be put into the code compilation stage or initialization, which can avoid frequent performance loss. In terms of extensibility and maintainability, I prefer to implement LIKE syntax based on state machine.
Alibaba Cloud Provides Proven Data Security Solution to Empower UAE E-Commerce Platform
1,044 posts | 257 followers
FollowAlibaba Cloud Community - June 9, 2023
Alibaba Clouder - March 3, 2021
Alibaba Cloud MaxCompute - May 5, 2019
Alibaba Clouder - July 2, 2019
Alibaba Cloud MaxCompute - March 20, 2019
Alibaba Cloud MaxCompute - October 18, 2021
1,044 posts | 257 followers
FollowExplore Web Hosting solutions that can power your personal website or empower your online business.
Learn MoreExplore how our Web Hosting solutions help small and medium sized companies power their websites and online businesses.
Learn MoreAn on-demand database hosting service for SQL Server with automated monitoring, backup and disaster recovery capabilities
Learn MoreHigh Performance Computing (HPC) and AI technology helps scientific research institutions to perform viral gene sequencing, conduct new drug research and development, and shorten the research and development cycle.
Learn MoreMore Posts by Alibaba Cloud Community