How to Connect Tableau to MaxCompute Using HiveServer2 Proxy

By Jonathan Peng, Staff Solutions Architect

Nowadays, customers are using different BI tools to gain new insights into their treasure trove of data. And Tableau is one of the popular tools in the market for enterprises. In this tutorial, we will demonstrate how to connect Tableau to Alibaba Cloud's most advance big data platform MaxCompute.

What Is Hive?

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.

What Is HiveServer2?

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

The Thrift interface definition language (IDL) for HiveServer2 is available at https://github.com/apache/hive/blob/trunk/service/if/TCLIService.thrift

Thrift documentation is available at http://thrift.apache.org/docs/

What Is HiveServer2 Proxy?

HiveServer2 Proxy is a proxy, which is obtained after custom development based on the original HiveServer2. It does the job of accepting the Thrift request submitted by the client, de-serializing it and converting it into a request that MaxCompute can recognize, then submitting it to MaxCompute for processing, and converting the response to a client-recognizable process after MaxCompute has finished processing it. The Thrift response conforms to the Hive interface specification, enabling the Hive ecosystem to interoperate with MaxCompute. In short, its function is to provide a way for these tools to interact with MaxCompute without modifying the Hive ecosystem, so that we can reuse existing Hive tools. MaxCompute's powerful big data engine.

This approach is also applying to other BI tools such as Beeline and QlikView, which can connect to HiveServer2.

Deploying HiveServer2 Proxy

Before you install HiveServer2 Proxy, you need to install Java 1.7.0 or Java 1.8.0 in your Linux server. And download the HiveServer2 Proxy for ODPS from here http://repo.aliyun.com/download/apache-hive-2.1.0-odps-proxy.tar.gz.

First, we need to configure the HiveServer2 Proxy running environment, including JAVA_HOME, HIVE_HOME and HADOOP_HOME.

[root@HS2Proxy apache-hive-2.1.0-odps-proxy]# cd /usr/lib/jvm/jre-1.8.0-openjdk
[root@HS2Proxy jre-1.8.0-openjdk]# export JAVA_HOME=$(pwd)
[root@HS2Proxy jre-1.8.0-openjdk]# echo $JAVA_HOME
/usr/lib/jvm/jre-1.8.0-openjdk 
[root@HS2Proxy ~]# cd ~/apache-hive-2.1.0-odps-proxy
[root@HS2Proxy apache-hive-2.1.0-odps-proxy]# export HIVE_HOME=$(pwd)
[root@HS2Proxy apache-hive-2.1.0-odps-proxy]# echo $HIVE_HOME
/root/apache-hive-2.1.0-odps-proxy
[root@HS2Proxy apache-hive-2.1.0-odps-proxy]# export HADOOP_HOME=$(pwd)/hadoop
[root@HS2Proxy apache-hive-2.1.0-odps-proxy]# echo $HADOOP_HOME
/root/apache-hive-2.1.0-odps-proxy/hadoop

After we have finished the configuration for environment, we need to configure the hive-site.xml under folder apache-hive-2.1.0-odps-proxy/conf, you can find the property description below, and there should be 4 properties you need to change, odps.accessid, odps.accesskey, odps.project, and odps.projects.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.execution.engine</name>
    <value>odps</value>
    <description>Hive execution engine, here is odps by default, no need to modify</description>
  </property>
  <property>
    <name>hive.session.impl.classname</name>
    <value>org.apache.hive.service.cli.session.HiveSessionOdpsImpl</value>
    <description>HiveSession's odps plugin, no need to modify</description>
  </property>
  <property>
    <name>odps.accessid</name>
    <value>customeraccessid</value>
    <description>Please modify it to your accessid</description>
  </property>
  <property>
    <name>odps.accesskey</name>
    <value>customeraccesskey</value>
    <description> Please modify it to your accesskey</description>
  </property>
  <property>
    <name>odps.project</name>
    <value>odpsdemo</value>
    <description>Please change to your default project</description>
  </property>
  <property>
    <name>odps.projects</name>
    <value>odpsdemo</value>
    <description> Please change it to your project list. If there are multiple, please separate them with a comma. This configuration will take effect in show schemas. </description>
  </property>
  <property>
    <name>odps.endpoint</name>
    <value>https://service.odps.aliyun.com/api</value>
    <description>ODPS endpoint</description>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>20000</value>
    <description>HiveServer2 Thrift Server starts the service port in binary mode, which can be modified as appropriate to avoid port conflicts.</description>
  </property>
  </configuration>

Now we can start hiveserver2 proxy by running command bin/hiveserver2 under folder apache-hive-2.1.0-odps-proxy, and you can run tail -f /tmp/$USER/hive.log to see if the service is started.

[root@HS2Proxy conf]# tail -f /tmp/$USER/hive.log
2019-02-12T11:58:48,891  INFO [main] service.AbstractService: Service:HiveServer2 is started.
2019-02-12T11:58:48,893  INFO [main] server.Server: jetty-7.6.0.v20120127
2019-02-12T11:58:48,958  INFO [main] webapp.WebInfConfiguration: Extract jar:file:/root/apache-hive-2.1.0-odps-proxy/lib/hive-service-2.1.0.jar!/hive-webapps/hiveserver2/ to /tmp/jetty-0.0.0.0-10002-hiveserver2-_-any-/webapp
2019-02-12T11:58:49,068  INFO [Thread-7] thrift.ThriftCLIService: Starting ThriftBinaryCLIService on port 20000 with 5...500 worker threads
2019-02-12T11:58:49,109  INFO [main] handler.ContextHandler: started o.e.j.w.WebAppContext{/,file:/tmp/jetty-0.0.0.0-10002-hiveserver2-_-any-/webapp/},jar:file:/root/apache-hive-2.1.0-odps-proxy/lib/hive-service-2.1.0.jar!/hive-webapps/hiveserver2
2019-02-12T11:58:49,147  INFO [main] handler.ContextHandler: started o.e.j.s.ServletContextHandler{/static,jar:file:/root/apache-hive-2.1.0-odps-proxy/lib/hive-service-2.1.0.jar!/hive-webapps/static}
2019-02-12T11:58:49,148  INFO [main] handler.ContextHandler: started o.e.j.s.ServletContextHandler{/logs,file:/tmp/root/}
2019-02-12T11:58:49,168  INFO [main] server.HiveServer2: Web UI has started on port 10002
2019-02-12T11:58:49,167  INFO [main] server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:10002
2019-02-12T11:58:49,168  INFO [main] http.HttpServer: Started HttpServer[hiveserver2] on port 10002

If all the services are started and no errors are prompted, it means the proxy has been successfully deployed.

Connecting Tableau to HiveServer2 Proxy

If you don't already have Tableau, you can download Tableau for a 14-day trial from https://www.tableau.com/products/desktop/download

After installed Tableau, we need to use connector that support Hive to connect to HiveServer2 Proxy, including Cloudera Hadoop, Hortonworks Hadoop Hive and MapR Hadoop Hive, all connector needs to install their own drivers before using it.

After we install the driver, now we can connect Tableau to HiveServer2 Proxy and access to MaxCompute tables.

Project Demo

I will use Cloudera Hadoop as a connector and use the settings described in the following sections. You can fill in any username and password, because HiveServer2 Proxy do not support authentication at this moment, and the actual authentication is by using accessId and accesskey. Click Sign In and connect to Proxy.

Now, we can connect to MaxCompute project's table and analyze the data by using Tableau.

Community

How to Connect Tableau to MaxCompute Using HiveServer2 Proxy

What Is Hive?

What Is HiveServer2?

What Is HiveServer2 Proxy?

Deploying HiveServer2 Proxy

Connecting Tableau to HiveServer2 Proxy

Project Demo

Read previous post:

Read next post:

Alibaba Clouder

You may also like

Comments

Raja_KT March 10, 2019 at 1:25 pm

Alibaba Clouder

Related Products

MaxCompute

DataWorks

DataV