Data Analysis: FlashText or RegEx

Before you begin an analysis, you need to clean your source data, even when it is simple text. Cleaning often involves searching for and replacing keywords: for example, searching the corpus for the keyword "Python", or replacing every "python" with "Python".

The Smarter and Faster way of Data Cleansing - FlashText

As the name suggests, FlashText is one of the fastest ways to search for and replace keywords. It is an open source Python library available on GitHub.

To use FlashText, begin by providing a list of keywords. FlashText uses this list to build an internal trie dictionary. You then pass it a string of text and ask it either to search for the keywords or to replace them.

Why is FlashText so Fast?

To understand the reason behind FlashText’s speed, consider an example. Take a sentence of three words, "I like Python", and a corpus of four keywords: {Python, Java, J2ee, Ruby}.

The first method is to take each word in the corpus and check whether it appears in the sentence. With four words in the corpus, you scan the sentence four times.

In general, n words in the corpus require n iterations, and each check (is it in the sentence?) takes time of its own. This is the logic behind RegEx matching.

The second method reverses the first: for each word in the sentence, check whether it exists in the corpus.

For m words in the sentence, you need m iterations. In this case, the time spent depends only on the number of words in the sentence, and each check (is it in the corpus?) is fast when the corpus is stored in a dictionary.
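The two methods can be compared with a small standard-library sketch. The function names are made up for this illustration, and the example splits the sentence on whitespace for simplicity; it is not how RegEx or FlashText is implemented.

```python
corpus = ["Python", "Java", "J2ee", "Ruby"]
sentence = "I like Python"


def scan_per_keyword(keywords, text):
    """Method 1: for each keyword, walk the sentence (n keywords, n scans)."""
    words = text.split()
    # Each `kw in words` test walks the whole word list again.
    return [kw for kw in keywords if kw in words]


def lookup_per_word(keywords, text):
    """Method 2: for each word in the sentence, look it up in the corpus."""
    keyword_set = set(keywords)  # hash-based lookup, roughly constant time
    return [word for word in text.split() if word in keyword_set]


print(scan_per_keyword(corpus, sentence))
print(lookup_per_word(corpus, sentence))
```

Both functions find the same keyword, but the work in the second one grows with the length of the sentence rather than with the size of the corpus.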

The FlashText algorithm uses the second method. Its design is inspired by the Aho-Corasick algorithm and the trie data structure.
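To give a feel for how a trie supports a single pass over the text, here is a simplified sketch. It is not FlashText's actual implementation: the helper names and the `END` marker are assumptions made for this example, matching is case-sensitive, and a word boundary is taken to be any non-alphanumeric character.

```python
END = "_end_"  # marker key meaning "a keyword ends at this node" (sketch only)


def build_trie(keywords):
    """Store every keyword character by character in nested dictionaries."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[END] = keyword
    return root


def extract(trie, text):
    """Scan the text once and return the keywords that appear as whole words."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # Only start matching at the beginning of a word.
        if text[i].isalnum() and (i == 0 or not text[i - 1].isalnum()):
            node, j, match, match_end = trie, i, None, i
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Accept only if the keyword ends at a word boundary.
                if END in node and (j == n or not text[j].isalnum()):
                    match, match_end = node[END], j
            if match is not None:
                found.append(match)
                i = match_end
                continue
        i += 1
    return found


trie = build_trie(["Python", "Java", "J2ee", "Ruby"])
print(extract(trie, "I like Python"))
```

Because shared prefixes such as "J" in "Java" and "J2ee" are stored once, adding more keywords makes the trie larger but does not add extra passes over the text.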

When do You Need to Use FlashText?

In terms of search, if the number of keywords is greater than 500, FlashText will perform better than RegEx.

However, RegEx can match patterns that use special characters such as "^", "$", "*", and "\d", and FlashText does not support them.

As a result, FlashText cannot match partial-word patterns (for example, "word\dvec"), but it can match complete words (for example, "word2vec").
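The difference can be illustrated with Python's standard `re` module. This sketch only shows what a pattern with the special sequence `\d` (any digit) can express compared with a literal whole-word search; the sample text is invented for the example.

```python
import re

text = "Models: word2vec, word3vec, and keyword2vector."

# \d stands for any digit, so a single pattern covers several spellings.
pattern_hits = re.findall(r"\bword\dvec\b", text)

# A literal whole-word search, the kind of match a fixed keyword list can express.
literal_hits = re.findall(r"\bword2vec\b", text)

print(pattern_hits)
print(literal_hits)
```

With a keyword-based tool, each spelling you want to find has to be listed as its own keyword, whereas the pattern describes the whole family at once.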

Take a look at the basic usage of FlashText and give it a try. With a large keyword list, you should find that it is considerably faster than RegEx.
