Rowkeys are unique identifiers for table rows and are used for queries and partitioning. This topic describes the considerations and limits to keep in mind when you design rowkeys, and provides design examples.
- Question 1: Is a rowkey unique?
In HBase, records that have the same rowkey are treated as multiple versions of one record. By default, a query returns only the latest version of the record. Rowkeys must therefore be unique unless the multiversion concurrency control (MVCC) feature is used.
Best practice: Use a rowkey in the same way that a primary key is used in a database table. A rowkey identifies a record and can be one field or a combination of multiple fields. If the rowkey is [userid], each user has only one record. If the rowkey is [userid][orderid], each user can have multiple records.
- Question 2: How do I use a rowkey?
In HBase, you can use a rowkey in the following ways:
  - The GET method uses a complete rowkey to query data. For example, you can execute the following statement to query data:
    SELECT * FROM table WHERE Rowkey = 'abcde'
    Note: In the GET method, a complete rowkey must be provided. This means that the values of all fields that constitute the rowkey must be known.
  - The scan method uses a rowkey range to query data. For example, you can execute the following statement to query data:
    SELECT * FROM table WHERE Rowkey > 'abc' AND Rowkey < 'abcx'
    Note: In the scan method, the leading (leftmost) part of the rowkey must be provided. For example, you can query all words that start with the prefix pre or prefi from an English dictionary, but you cannot query words that end with prefi or that contain prefi in the middle.
  - Create a table as an index table.
  - Use a filter to discard the data that you do not need on the server.
  - Use secondary indexes.
  - Use the reverse scan method to sort data in descending order. This way, the first record is the most recent data entry. When you use this method, specify the following setting:
    scan.setReversed(true)
    Note: A reverse scan performs worse than a normal scan. If data must be sorted in descending order in most scenarios, design the rowkey to provide that order instead. For example, change [hostname][log-event][timestamp] to [hostname][log-event][Long.MAX_VALUE - timestamp].
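The prefix rule for scans follows from rowkeys being stored in sorted order. The following is a minimal sketch in plain Java, assuming that a TreeMap stands in for an HBase table. The class name, method names, and sample words are hypothetical, and no HBase client API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a table: a TreeMap keeps its keys sorted,
// which mimics how rowkeys are ordered.
class PrefixScanSketch {
    static final TreeMap<String, String> TABLE = new TreeMap<>();

    static {
        for (String word : new String[] {"apple", "prefix", "prepare", "pretty", "reprefix", "zoo"}) {
            TABLE.put(word, "definition of " + word);
        }
    }

    // A prefix maps to one contiguous key range: [prefix, prefix + '\uffff').
    static List<String> scanByPrefix(String prefix) {
        return new ArrayList<>(
                TABLE.subMap(prefix, true, prefix + Character.MAX_VALUE, false).keySet());
    }

    // A suffix does not map to a key range, so every row has to be examined.
    // This corresponds to a full scan combined with a filter.
    static List<String> filterBySuffix(String suffix) {
        List<String> hits = new ArrayList<>();
        for (String key : TABLE.keySet()) {
            if (key.endsWith(suffix)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(scanByPrefix("pre"));   // range lookup
        System.out.println(filterBySuffix("fix")); // examines every key
    }
}
```

The range lookup touches only the keys between the two bounds, whereas the suffix search visits every key. This is why an index table or a secondary index is usually preferable when a non-leading field must be queried often.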
- Question 3: Do hot spots occur if data is fully distributed?
No, hot spots do not occur if data is fully distributed. Hashing distributes data across partitions, which prevents one server from being brought down by a hot spot while the other servers sit idle. This way, the distributed architecture and concurrent processing are used efficiently.
Best practice:
  - Add a Message Digest Algorithm 5 (MD5) hash prefix to the rowkey. For example, change the rowkey [userId][orderid] to [md5(userid).subStr(0,4)][userId][orderid].
  - Reverse a field in the rowkey. For example, change the rowkey [userId][orderid] to [reverse(userid)][orderid].
  - Add a bucket that is calculated by using the modulo operation to the rowkey. For example, change the rowkey [timestamp][hostname][log-event] to [bucket][timestamp][hostname][log-event], where long bucket = timestamp % numBuckets.
  - Append a random number to the rowkey. For example, change the rowkey [userId][orderid] to [userId][orderid][random(100)].
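The first three rowkey transformations can be sketched in plain Java as follows. This is an illustration under stated assumptions, not an HBase API: the class and method names are hypothetical, rowkeys are built as strings for readability, and the fixed widths in bucketed() assume fewer than 100 buckets and non-negative millisecond timestamps.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helpers that build string rowkeys for illustration only.
class SaltedRowkeySketch {
    // [md5(userid).subStr(0,4)][userid][orderid]
    static String md5Prefixed(String userId, String orderId) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format(Locale.ROOT, "%02x", b));
            }
            // Keep only the first four hex characters as the prefix.
            return hex.substring(0, 4) + userId + orderId;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // [reverse(userid)][orderid]
    static String reversed(String userId, String orderId) {
        return new StringBuilder(userId).reverse() + orderId;
    }

    // [bucket][timestamp][hostname][log-event], where bucket = timestamp % numBuckets.
    // Zero padding keeps each numeric field at a fixed width so that keys sort correctly.
    static String bucketed(long timestamp, int numBuckets, String hostname, String logEvent) {
        long bucket = timestamp % numBuckets;
        return String.format(Locale.ROOT, "%02d%013d%s%s", bucket, timestamp, hostname, logEvent);
    }

    public static void main(String[] args) {
        System.out.println(md5Prefixed("abc", "42"));
        System.out.println(reversed("12345", "A1"));
        System.out.println(bucketed(1700000000123L, 16, "host1", "cpu"));
    }
}
```

The hash prefix and the reversed field are deterministic, so a GET for a known user can still rebuild the complete rowkey. With buckets, a time range query must issue one scan per bucket and merge the results.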
- Question 4: What are the limits on the length of a rowkey?
Keep a rowkey as short as possible. A short rowkey reduces the volume of stored data and makes queries and writes more efficient.
Best practice:
  - Replace the STRING data type with the LONG or INT data type. For example, replace '2015122410' with Long(2015122410).
  - Replace a name with a code. For example, replace 'Taobao' with tb.
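The saving from a numeric encoding can be checked by comparing encoded lengths. The following sketch uses only the Java standard library. The class and method names are hypothetical, and the big-endian layout produced by ByteBuffer is an assumption about how the number would be serialized into the rowkey.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers that compare the encoded size of the same value.
class RowkeyLengthSketch {
    static byte[] asString(String value) {
        return value.getBytes(StandardCharsets.UTF_8); // one byte per ASCII digit
    }

    static byte[] asLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array(); // always 8 bytes
    }

    static byte[] asInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array(); // always 4 bytes
    }

    public static void main(String[] args) {
        System.out.println(asString("2015122410").length); // 10
        System.out.println(asLong(2015122410L).length);    // 8
        System.out.println(asInt(2015122410).length);      // 4
    }
}
```

The value 2015122410 fits in an INT because it is smaller than Integer.MAX_VALUE (2147483647). Larger values require a LONG.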
- Question 5: Can a scan return data that I do not need?
Yes, a scan can return data that you do not need. For example, the rowkey of table1 is a combination of column1, column2, and column3. If you want to query all records for which the value of column1 is host1, execute the scan 'table1', {STARTROW => 'host1', STOPROW => 'host2'} statement. If a record for which the value of column1 is host12 exists, that record is also returned.
Best practice:
  - Specify a fixed length for a field in the rowkey. For example, change the rowkey [column1][column2] to [rpad(column1,'x',20)][column2].
  - Add a delimiter to the rowkey. For example, change the rowkey [column1][column2] to [column1][_][column2].
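The host1 and host12 overlap, and the effect of the two fixes, can be reproduced with a sorted map. This is a minimal sketch that assumes a TreeMap as a stand-in for table1. The class name, the helper names, and the sample column2 value cpu are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for table1 that shows how field boundaries affect a scan.
class FieldBoundarySketch {
    // rpad(column1, 'x', 20): pad on the right to a fixed width.
    // Assumes that the value is not longer than width.
    static String rpad(String value, char pad, int width) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) {
            sb.append(pad);
        }
        return sb.toString();
    }

    // Returns the rowkeys in [startRow, stopRow), like a scan with an exclusive stop row.
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    static TreeMap<String, String> tableOf(String... rowkeys) {
        TreeMap<String, String> table = new TreeMap<>();
        for (String rowkey : rowkeys) {
            table.put(rowkey, "value");
        }
        return table;
    }

    public static void main(String[] args) {
        // [column1][column2]: the host12 row falls inside the host1 range.
        TreeMap<String, String> plain = tableOf("host1cpu", "host12cpu", "host2cpu");
        System.out.println(scan(plain, "host1", "host2"));

        // [column1][_][column2]: the delimiter ends the field, so host12 is excluded.
        TreeMap<String, String> delimited = tableOf("host1_cpu", "host12_cpu", "host2_cpu");
        System.out.println(scan(delimited, "host1_", "host1_" + Character.MAX_VALUE));

        // [rpad(column1,'x',20)][column2]: a fixed width also separates the two hosts.
        String host1 = rpad("host1", 'x', 20);
        TreeMap<String, String> padded = tableOf(
                host1 + "cpu", rpad("host12", 'x', 20) + "cpu", rpad("host2", 'x', 20) + "cpu");
        System.out.println(scan(padded, host1, host1 + Character.MAX_VALUE));
    }
}
```

Both fixes rely on a character that does not occur in the field itself. If column1 values can end with x or contain an underscore, a different padding character or delimiter is needed.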
Common design examples
- Rowkeys are designed for log data and time series data in the following scenarios:
  - Data of a metric measured in a period of time needs to be queried from a host. The rowkey is designed as [hostname][log-event][timestamp] to meet this requirement.
  - The latest records of a metric need to be queried from a host. The rowkey is designed as [hostname][log-event][timestamp], where timestamp = Long.MAX_VALUE - timestamp, to meet this requirement.
  - The data that you want to query has only the time dimension, or a large volume of data in a dimension needs to be queried. The rowkey is designed as [bucket][timestamp][hostname][log-event], where long bucket = timestamp % numBuckets, to meet this requirement.
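The effect of storing Long.MAX_VALUE - timestamp can be seen by sorting a few keys. The following sketch assumes string rowkeys in which the timestamp is zero-padded to 19 digits so that lexicographic order matches numeric order. The | delimiter, the class name, and the sample host and metric values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical log rowkeys. A TreeMap mimics the sorted order of rowkeys.
class LogRowkeySketch {
    // [hostname][log-event][timestamp]: the oldest record sorts first.
    static String ascendingKey(String host, String event, long timestamp) {
        return String.format(Locale.ROOT, "%s|%s|%019d", host, event, timestamp);
    }

    // [hostname][log-event][Long.MAX_VALUE - timestamp]: the newest record sorts first.
    static String descendingKey(String host, String event, long timestamp) {
        return String.format(Locale.ROOT, "%s|%s|%019d", host, event, Long.MAX_VALUE - timestamp);
    }

    // Returns the timestamps in the order in which a forward scan would read them.
    static List<Long> scanOrder(boolean newestFirst, long... timestamps) {
        TreeMap<String, Long> table = new TreeMap<>();
        for (long ts : timestamps) {
            String key = newestFirst
                    ? descendingKey("host1", "cpu", ts)
                    : ascendingKey("host1", "cpu", ts);
            table.put(key, ts);
        }
        return new ArrayList<>(table.values());
    }

    public static void main(String[] args) {
        System.out.println(scanOrder(false, 1000L, 3000L, 2000L)); // oldest first
        System.out.println(scanOrder(true, 1000L, 3000L, 2000L));  // newest first
    }
}
```

With the inverted timestamp, a normal forward scan that is limited to the first few rows returns the latest records, so no reverse scan is needed.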
- Rowkeys are designed for transaction data in the following scenarios:
  - The transaction records generated in a specific period of time need to be queried for a seller. The rowkey is designed as [seller id][timestamp][order number] to meet this requirement.
  - The transaction records generated in a period of time need to be queried for a buyer. The rowkey is designed as [buyer id][timestamp][order number] to meet this requirement.
  - Data needs to be queried by using order numbers. The rowkey is designed as [order number] to meet this requirement.
  - Data needs to be queried from three tables. The rowkey for a buyer table is designed as [buyer id][timestamp][order number]. The rowkey for a seller table is designed as [seller id][timestamp][order number]. The rowkey for an index table is designed as [order number].