×
Community Blog Why You Should Use FlashText Instead of RegEx for Data Analysis

Why You Should Use FlashText Instead of RegEx for Data Analysis

When it comes to data analysis, FlashText is a much faster and reliable tool for performing search and replace.

If you have done any text/data analysis, you might already be familiar with Regular Expressions (RegEx). RegEx evolved as a necessary tool to execute text editing. If you are still using RegEx to deal with text processing, then you may have some problems to deal with. Why? When it comes to large-sized texts, the low efficiency of RegEx can make data analysis unacceptably slow.

In this article, we will discuss how you can use FlashText, a python library that is 100 times faster than RegEx to perform data analysis.

RegEx vs. FlashText

Before you proceed with your analysis, you need to clean your source data, even for the simplest text. This often includes searching and replacing keywords. For example, search the corpus for the keyword "Python," or replace all "python" to "Python."

RegEx is an ideal tool if you have to search and replace a few hundred keywords. However, many of these tasks involve Natural Language Processing (NLP). You might come across tens of thousands of such operations. RegEx would require several days to complete such a requirement.

Of course, you may think that you can solve the problem by parallelizing your processes; however, in practice this solution does not make too much of a difference.

Is there any other way to attack this problem?

The creator of FlashText also faced the same problem back then. After some research that did not return any results, he decided to write a new algorithm.

Before understanding the algorithm behind it, let us take a look at the comparison chart showing the speed of FlashText in search vs. Regular Expression in search.

1

You can observe from the above chart that RegEx’s processing time increases on a nearly linear basis as the number of keywords increases. However, on the other hand, the increase in keywords does not affect FlashText.

Moving on, let us look at another chart on keyword replacement.

2

Likewise, when the number of keywords increases, the processing time of FlashText almost remains the same. So, what is FlashText? I have explained it in the following section.

The Smarter and Faster way of Data Cleansing - FlashText

As the name suggests, FlashText is one of the fastest ways to execute search and replace keywords. It is an open source python library on GitHub.

When using FlashText, begin by providing a list of keywords. FlashText uses this list to build an internal Trie dictionary. You then send it a string of text depending on whether you want to search or replace.

For performing replacements, it will create a new string with the replacement keyword. To carry out a search, it will return a list of keywords in the string. These tasks will only iterate through your string once.

Why is FlashText so Fast?

To truly understand the reason behind FlashText’s speed, let us consider an example. Take a sentence that comprises three words "I like Python". Assume that you have a corpus of four words {Python, Java, J2ee, and Ruby}.

If for every word in the corpus, you select it out and see if it appears in the sentence, you need to iterate the string four times.

3

For n words in the corpus, we need n iterations. And each step (is in sentence?) will take its own time. This is the logic behind RegEx matching.
There is also an alternative method that contradicts the first method. That is for each word in the sentence, see if it exists in the corpus.

4

For m words in the sentence, you have m cycles. In the situation, the time spent only depends on the number of words in the sentence. You can quickly perform this step (is in corpus?) using a dictionary.

The FlashText algorithm uses the second method. Moreover, Aho-Corasick algorithm and Trie data structure inspire its algorithm.

How Does FlashText Work?

First, it creates a Trie data structure from the corpus. Like the following graph:

5

Start and EOT (End of Term) indicate the word boundaries. It can be space, comma or line return. It can match keywords when it has boundaries on both sides. This way it can prevent cases such as apple matching for pineapple.

Use the string "I like Python," and search for it character by character.

6

7

As this algorithm searches character by character, when searching for 1 you can easily jump over like because l is not immediately after . This mechanism can let you skip all the non-existent words.

The FlashText algorithm only inspects every character in the string "I like Python." Even if your dictionary contains one million keywords, it does not really affect its operation.

When do You Need to Use FlashText?

I would suggest you use FlashText whenever the number of keywords is greater than 500.

8

In terms of search, if the number of keywords is greater than 500, FlashText will perform better than RegEx.

Additionally, RegEx can search special characters such as "^, $, *, d" but FlashText does not support them.

You cannot match partial words (for example "worddvec"), but it can match full word ("word2vec").

Take a look at the basic usage of FlashText. Give it a try. You will observe that it is much faster than RegEx.

Below are codes which will help you use FlashText

Code: Search for keywords using FlashText.

9

Code: Search for keywords using FlashText.

10

Conclusion

I hope this article convinced with the fact that FlashText is the better tool in comparison to RegEx. I have also especially illustrated a graph, which displays the performance of RegEx and FlashText.

0 1 0
Share on

Alibaba Clouder

2,599 posts | 762 followers

You may also like

Comments