
Big Data User Profiling Solution Based on Lindorm


This article starts from the business characteristics of user profiling, providing a comprehensive and multi-angle analysis of why Lindorm, as a storage solution for big data, is a suitable choice for user profiling. The aim is to help readers avoid detours and make the appropriate storage selection directly when faced with similar requirements.

1. Background

In traditional shopping malls, sales staff often read customers' expressions and behaviors to gauge their preferences, catering to their tastes to achieve marketing goals and maximize revenue. In the internet era, stores have moved online, placing buyers and sellers in a "virtual" world. The person on the other end could be male or female, attractive or plain, an ordinary office worker or a wealthy individual, yet such information seems unattainable.

The "virtual" nature of the internet implies the need to use technical means to solve the issue of "reading expressions and observing behaviors." Fortunately, in the age of the internet, especially mobile internet, users leave traces of their "words" and "actions" online at all times and places. Making these data "speak" has become increasingly important. User profiling has emerged to address this need, and it is widely used in targeted marketing, recommendation systems, advertising, risk control, intelligent customer service, and many other fields.

So, what characteristics do user profile data and their application scenarios have? Based on these characteristics, what kind of data storage should be chosen?

2. Characteristics of User Profile Data

[Figure 1: Architecture of a typical user profiling system]

From the architectural diagram, we can see that such a system generally includes four parts: a data collection system, an online query system, an offline analysis system, and a data storage system. After data is collected, it is written into the data storage system and simultaneously archived to the offline system. The offline system periodically trains models on this data to generate new user profiles, and these new profiles are fed back to the online system for upper-layer business queries. The online system therefore contains both detailed data and historical profiles.

Based on the above definitions of the scope of user profile data and its application in business scenarios, what pain points exist for user profile data?

  1. Large Volume of Data: A typical feature of internet applications is having a massive number of users, often counted in tens of millions or even billions. This massive user base leads to the generation of an enormous amount of behavioral data. Some products also need to scrape external data to enrich their data dimensions. Based on extensive detailed data, offline model training produces final user profiles, which often involve high-dimensional data (with hundreds, thousands, or even tens of thousands of fields) numbering in the billions.
  2. High Concurrency for Reads and Writes: The large user base generates large volumes of data that need to be written to the backend storage system in real time, so write concurrency often reaches tens of thousands, hundreds of thousands, or even millions of operations per second. At the same time, for online applications of profile data such as recommendations and advertising, query volume grows as delivery effectiveness improves and operations teams push more campaigns, so read concurrency keeps climbing as well.
  3. Detailed Data Needs Archiving: The user behavior details and other basic data written to the backend storage often need near-real-time archiving to the offline system so that fresh behavior can be quickly reflected in the user profiles.
  4. Large Volume Data Feedback: Data archived to the offline system generates new profile data after analysis, which needs to be fed back to the online system for providing online queries.
  5. Dynamic Column Requirements: The dimensions of user profile data are often constantly changing and enriching, which means the table structure is also constantly evolving.
  6. Diverse and Complex Query Types: Different business needs lead to varying query requirements for user profile data. For example, retrieving user profile data might require single-record queries based on a key; analyzing user behavior data might involve batch retrieval by user IDs; and operational staff might need to query statistical data for a specific dimension based on their requirements.

Given these pain points of user profile data and the fact that profile data generally does not have strong transaction requirements, is there a suitable storage solution?

3. Lindorm for Big Data Scenarios

Since user profiling does not have strong transaction requirements and involves large data volumes with high concurrency for read and write operations, a relational database is not a suitable choice. In this context, I recommend a NoSQL database product called "Lindorm," which can perfectly address the pain points of user profiling and is designed for big data scenarios.

As a semi-structured and structured storage system for big data scenarios, Lindorm has been developed at Alibaba for nearly a decade and continues to receive rapid updates and technical upgrades. It is currently one of the core database products supporting business within the Alibaba ecosystem. Over the years, driven by internal demand for massive structured data storage and processing, Lindorm has undergone extensive large-scale practical testing in terms of functionality, performance, and stability. It is widely applied across Alibaba Group, Ant Group, Cainiao, and Alibaba Digital Media and Entertainment, making it the database product with the largest data volume and the widest business coverage within Alibaba.

The architecture of user profiling based on Lindorm storage can be illustrated with the following diagram:

[Figure 2: User profiling architecture based on Lindorm]

3.1 Cost-Effectiveness

Big data is known for its 5V characteristics, with Volume being the foremost. Therefore, data storage solutions designed for big data scenarios must feature high density and cost-effectiveness. Lindorm, a NoSQL database born in the big data era, inherently possesses the ability to efficiently store and retrieve vast amounts of data at a low cost. Lindorm’s cost-effectiveness is demonstrated in several ways:

Support for Diverse Storage Types

Performance Storage: Optimized for high performance.

Standard Storage: Balanced storage offering a mix of good performance and cost.

Capacity Storage: Focused on high capacity with a lower cost, suitable for less frequently accessed data.

There is always a storage type that fits your business scenario.

Deep Compression Optimization

The most cost-effective storage system would be one that requires no storage at all, but that is clearly unrealistic. The feasible alternative is to minimize the amount of data that actually needs to be stored. To reduce storage costs, Lindorm introduces a new lossless compression algorithm designed to compress quickly while still achieving a high compression ratio. The algorithm does not chase the highest possible compression ratio like LZMA or ZPAQ, nor extreme speed like LZ4; instead, it balances both, reaching a compression speed of over 200 MB/s and a decompression speed of over 400 MB/s (lab data), which comfortably meets Lindorm's throughput requirements. In real-world scenarios, this optimization has improved the compression ratio over LZO by 50% to 100%. For storage-centric businesses, that can translate into up to a 50% reduction in storage costs.
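Compression is configured per column family at table creation time. Since Lindorm's wide table engine is compatible with the open-source HBase API, a minimal sketch might look as follows; the endpoint and table name are hypothetical, and SNAPPY is used only as a stand-in for Lindorm's own algorithm, which the service applies transparently on the server side.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateCompressedTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical endpoint; use your own instance's connection string.
        conf.set("hbase.zookeeper.quorum", "ld-xxxx-proxy-lindorm.lindorm.rds.aliyuncs.com");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableDescriptorBuilder table =
                TableDescriptorBuilder.newBuilder(TableName.valueOf("user_profile"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes("f"))
                        // SNAPPY is a stand-in; Lindorm can apply its own
                        // deep-optimized compression on the server side.
                        .setCompressionType(Compression.Algorithm.SNAPPY)
                        .build());
            admin.createTable(table.build());
        }
    }
}
```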

Hot and Cold Data Separation

Lindorm supports hot and cold data separation within a single storage architecture and table. The system automatically archives cold data to cold storage based on user-defined thresholds for hot/cold data distinction. From the user's perspective, accessing this data is almost identical to accessing a standard table. During queries, users only need to configure a Query Hint or Time Range, and the system will automatically determine whether the query should target the hot or cold data zone. To users, it always appears as a single table, making the process almost entirely transparent.
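As an illustration of the time-range approach, here is a minimal sketch against the HBase-compatible API. The table name and the seven-day boundary are assumptions for the example; only setTimeRange, a standard HBase call, is used to keep the scan inside the hot zone.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class HotZoneScan {
    // Assume the table's hot/cold boundary is configured at seven days.
    static final long BOUNDARY_MS = 7L * 24 * 60 * 60 * 1000;

    static void scanHotRows(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("behavior_log"))) {
            Scan scan = new Scan();
            // Restrict the scan to data newer than the boundary so only the
            // hot zone is touched; widening the range would transparently
            // pull in cold storage as well, since it is still one table.
            scan.setTimeRange(System.currentTimeMillis() - BOUNDARY_MS, Long.MAX_VALUE);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // process hot-zone rows
                }
            }
        }
    }
}
```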

[Figure 3: Hot and cold data separation in Lindorm]

3.2 High-Performance Throughput

Based on empirical tests under identical specifications and data volumes, Lindorm delivers several-fold improvements in throughput and P99 latency over community HBase 2.0 in scenarios including single-row reads, range reads, single-row writes, and batch writes.

[Figures 4-5: Throughput and P99 latency comparison between Lindorm and community HBase 2.0]

The diagram below shows the performance of a real-world, batch-write-heavy workload after migration to Lindorm. Behavior log collection for user profiling often follows the same pattern: records can be accumulated and then written in batches, as sketched after the diagram.

[Figure 6: Batch-write performance of a real-world workload after migration]
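For reference, buffering events and flushing them with a single batched put is the usual pattern against the HBase-compatible API. The table name, column family, rowkey layout, and batch size below are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BehaviorLogWriter {
    static final int BATCH_SIZE = 500; // assumed flush threshold
    private final List<Put> buffer = new ArrayList<>();
    private final Connection conn;

    BehaviorLogWriter(Connection conn) {
        this.conn = conn;
    }

    // Accumulate one behavior event; flush once enough rows are buffered.
    void record(String userId, long eventTimeMs, String action) throws Exception {
        Put put = new Put(Bytes.toBytes(userId + "_" + eventTimeMs)); // rowkey: userId + event time
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("action"), Bytes.toBytes(action));
        buffer.add(put);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // One RPC round trip writes the whole batch, amortizing per-request overhead.
    void flush() throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("behavior_log"))) {
            table.put(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```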

3.3 Real-time Incremental Archiving

Real-time incremental archiving is provided by LTS (Lindorm Tunnel Service), an independent service of Lindorm. LTS listens to the logs generated by Lindorm, parses them, and synchronizes the changes to offline systems such as Hadoop or MaxCompute. The data synchronized to the offline systems is partitioned by time, making it convenient to run T+1, H+1, or other periodic computations.

[Figure 7: Real-time incremental archiving via LTS]

Under such a synchronization mechanism, on one hand, the archiving process is decoupled from online storage, ensuring that online read and write operations are completely unaffected by data archiving. On the other hand, detailed data can achieve near-real-time synchronization to offline storage for analysis, thereby efficiently updating user profile data.

3.4 Bulkload Technology

Unlike relational databases, Lindorm uses an LSM tree architecture. Reading a record stored in Lindorm requires merging the data in the corresponding shard's memory (i.e., the memstore) with the latest version of the record in the multiple LDFiles owned by that shard, and then returning the merged result to the client. Based on this principle, Lindorm can directly generate a new LDFile and "insert" it into the system, thereby loading "new" data in bulk. This gives it a significant advantage over relational databases and many other NoSQL systems. The loading process completely bypasses the normal write path of the storage engine, including the WAL and memstore, involving only the essential physical IO and network overhead, which greatly improves loading performance and reduces the impact on online business requests.
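The same LSM bulkload pattern exists in the open-source HBase tooling that Lindorm's wide table API is compatible with, so a rough sketch of the flow looks like the following. This is illustrative rather than Lindorm's exact production tooling; the HDFS path and table name are hypothetical, and the store files are assumed to have been generated offline (for example by a MapReduce or Spark job writing HFileOutputFormat2 output).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;

public class ProfileBulkload {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(TableName.valueOf("user_profile"));
             RegionLocator locator = conn.getRegionLocator(table.getName())) {
            // The files under this (hypothetical) path are ready-made store
            // files produced offline; loading them skips the WAL and memstore
            // entirely, leaving only physical IO and network overhead.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("/offline/profiles/hfiles"), admin, table, locator);
        }
    }
}
```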

3.5 Dynamic Columns

Lindorm’s wide table model supports features such as multiple column families, dynamic columns, TTL, and multiple versions, making it well-suited for use cases where the table structure is unstable and frequently requires changes, such as user profiles.
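A brief sketch of what dynamic columns mean in practice through the HBase-compatible API: column qualifiers are created per row at write time, so a new profile dimension requires no schema change. The table, family, and field names below are illustrative.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumnsExample {
    static void writeProfile(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            Put put = new Put(Bytes.toBytes("user-10086"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("age"), Bytes.toBytes("28"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("city"), Bytes.toBytes("Hangzhou"));
            // A dimension introduced later by the profiling pipeline: no
            // ALTER TABLE is required, the new column simply appears on write.
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("night_owl_score"), Bytes.toBytes("0.83"));
            table.put(put);
        }
    }
}
```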

3.6 Multi-dimensional & Complex Queries

For single-key queries based on the rowkey or scans based on a rowkey prefix, Lindorm itself meets business needs well. For multi-dimensional queries that combine a small number of columns in fixed query patterns, Lindorm's built-in high-performance global secondary index can also satisfy the requirements while maintaining strong throughput and performance.
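For concreteness, the two basic access patterns look like this through the HBase-compatible API; the table name and rowkey layout are assumed for illustration.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyQueries {
    static void query(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("user_profile"))) {
            // Point lookup: fetch one user's profile by rowkey.
            Result profile = table.get(new Get(Bytes.toBytes("user-10086")));
            System.out.println(profile.isEmpty() ? "profile not found" : "profile found");

            // Prefix scan: fetch all behavior rows for one user, assuming
            // rowkeys of the form "userId_eventTime".
            Scan scan = new Scan().setRowPrefixFilter(Bytes.toBytes("user-10086_"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // each row is one behavior event for user-10086
                }
            }
        }
    }
}
```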

[Figure 8: Global secondary index in Lindorm]

When dealing with more complex queries, such as fuzzy searches and arbitrary combinations of conditions, secondary indexes alone can fall short. This is where the Lindorm search engine, LSearch, comes into play. LSearch is a distributed search engine designed for massive datasets and compatible with the standard open-source Solr interface. It can seamlessly serve as index storage for the wide table and time-series engines, accelerating retrieval queries. Its overall architecture aligns with the wide table engine: automatic data partitioning, multiple replicas per partition, and Lucene-based indexes. LSearch supports full-text search, aggregation calculations, and complex multi-dimensional queries, along with horizontal scaling, one-write-multiple-read, cross-data-center disaster recovery, and TTL (time-to-live), meeting the high-efficiency retrieval needs of massive amounts of data.
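Because LSearch is compatible with the standard Solr interface, a query mixing arbitrary conditions with fuzzy matching can be issued through standard SolrJ. The endpoint, collection name, and field names below are assumptions for illustration.

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class LSearchQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper endpoint; use your LSearch instance's address.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("ld-xxxx-lsearch:2181"), Optional.empty()).build()) {
            // Arbitrary condition combination plus a fuzzy (wildcard) match.
            SolrQuery q = new SolrQuery("city:Hangzhou AND age:[20 TO 30] AND nickname:*cat*");
            q.setRows(20);
            QueryResponse resp = client.query("user_profile_index", q);
            for (SolrDocument doc : resp.getResults()) {
                // The document id maps back to the wide-table rowkey.
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```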

[Figure 9: LSearch architecture]

4. Overview of Lindorm Core Capabilities

Lindorm, through its comprehensive and multi-faceted capabilities, effectively meets the demanding requirements of user profiling for large volumes of data, high concurrency, real-time archiving, efficient and stable bulk data loading, dynamic columns, and complex multi-dimensional queries.

Of course, Lindorm's abilities extend far beyond just these features. It possesses a series of capabilities that a storage system should have in the context of big data and massive data volumes:

Multi-Model Database: Supports wide tables, time series, search, and files.

Separation of Storage and Compute: Based on an architecture that separates storage and compute, it provides extreme flexibility and scalability in both computation and storage, and introduces Serverless services, enabling on-demand instant elasticity and pay-as-you-go capabilities.

Cost-Effective: Supports hot and cold data separation and pursues optimal compression solutions for better cost efficiency.

Advanced Indexing and Searching: Equipped with global secondary indexing, multi-dimensional retrieval, time series indexing, among other features.

LDInsight Tool: An intelligent service tool for system management, data access, and fault diagnosis through a no-code interface.

LTS (Lindorm Tunnel Service, formerly BDS): Supports easy-to-use data exchange, processing, and subscription capabilities, meeting user requirements for data migration, real-time subscription, data lake storage, data warehouse backflow, cross-unit active-active configurations, backup, and recovery.

5. Case Study

Financial Risk Control System of a Large Third-Party Payment Company

The risk control system is the cornerstone of any financial system. The risk control system provided by this third-party payment company offers the highest level of security in the industry, with a loss rate as low as 0.5 per million; by comparison, the world's second-lowest loss rate is 6 per ten thousand (figures published in 2018). Behind this achievement are various models and rules crafted by the security team, and the data storage supporting these rules and models is provided by Lindorm.

When you make a payment or scan a QR code, the whole action may take well under a second. Within roughly 100 milliseconds of that window, the system retrieves the user's security profile data for transaction verification: the payment scenario, the background of the recipient, the environment at the time of payment, and behavioral characteristics such as shopping preferences and usual shopping times, all to assess whether the transaction poses any risk. If risks are identified, the system will alert both parties to the transaction or even block it. Behind each transaction, over a hundred risk models and more than five hundred risk strategies are computed.

The user security profile data mentioned above refers to the detailed data depicted in the diagram below, which is archived, analyzed, and then re-imported into Lindorm as daily account data.

[Figure 10: Flow of user security profile data in the risk control system]

For a single transaction to require such a large number of models and rules, one can only imagine the demands placed on the supporting data systems during the peak period of the Double Eleven shopping festival.

More Case Studies

Best practices within Alibaba Group:

[Figure 11: Lindorm best practices within Alibaba Group]

Advertising: Real-time storage of large amounts of advertising and marketing data:

[Figure 12: Real-time storage of advertising and marketing data]

6. References

https://www.alibabacloud.com/help/en/lindorm/latest/scenarios
https://www.alibabacloud.com/help/en/lindorm/latest/migrate-the-advertising-data-of-an-international-marketing-company-to-lindorm
https://www.alibabacloud.com/help/en/lindorm/latest/education-industry
