Monday, August 12, 2013

Installing Hive using Cloudera Manager CDH 4.6+ on an already set up HDFS cluster

Hive is an Apache project from the Hadoop ecosystem, licensed under the Apache License.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. The same language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.

Hive home: http://hive.apache.org/

 

Apache Hive Getting Started Guide

Check out the Getting Started Guide on the Apache Hive wiki.

How to install Hive using Cloudera Manager (CM) CDH 4.x

 

 Prerequisite: The Hadoop Cluster

The following assumes that CDH 4.x was used to set up a Hadoop cluster, and that the Hadoop services (HDFS, MapReduce, ZooKeeper) are up and running.
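Before adding Hive, it can be worth sanity-checking that those services are actually reachable. The sketch below (Python, using CDH4's default ports; host names and ports are assumptions, adjust them to your cluster) simply tests whether the relevant TCP ports accept connections:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Default CDH4 ports (assumptions; adjust to your cluster's configuration):
# 8020  - HDFS NameNode RPC
# 50070 - HDFS NameNode web UI
# 2181  - ZooKeeper client port
for name, port in [("NameNode RPC", 8020),
                   ("NameNode web UI", 50070),
                   ("ZooKeeper", 2181)]:
    status = "up" if port_open("localhost", port) else "DOWN"
    print(f"{name} (port {port}): {status}")
```

Run it on the master node; a "DOWN" line usually means the service is stopped or listening on a non-default port.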

 

Adding the Hive Service

From the Cloudera Manager (CM) menu, top right, click on Actions:


In the drop-down menu, select "Add Service", then "Hive". Accept the dependency on ZooKeeper.

Depending on the size of the cluster, it may be advisable to use an external database for the relational metadata Hive needs in order to support its SQL-like syntax.
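For reference, the embedded metastore uses a local Derby database, while a remote setup points the metastore at an external server; the difference shows up in the javax.jdo.option.ConnectionURL property of hive-site.xml (the values below are standard defaults / illustrative, not taken from this cluster):

```xml
<!-- Embedded (default) Derby metastore: single-user, local only -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<!-- Remote metastore DB, e.g. MySQL (host and db name are illustrative) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive</value>
</property>
```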

 

Configuring Hive to use an external database instead of the embedded one

This is well documented on the Cloudera web site: see the Remote DB Installation Guide.

 

Hive shell RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

On both of my clusters, the same issue came up when launching the "hive" shell:

  > hive
  hive > SHOW TABLES;


This generated the following error in both cases:

root@master1:/var/log/hadoop-hdfs# hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.2.0.jar!/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201308120940_141559527.txt
hive> show tables;
FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask


And the (abbreviated) associated log traces (hive.log):

2013-08-12 09:40:45,119 ERROR metastore.RetryingRawStore (RetryingRawStore.java:invoke(133)) - JDO datastore error. Retrying metastore command after 1000 ms (attempt 1 of 1)
2013-08-12 09:40:46,133 ERROR exec.Task (SessionState.java:printError(401)) - FAILED: Error in metadata: java.lang.RuntimeException: Unable to instantiate 

[..]
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1082)
    ... 23 more
Caused by: javax.jdo.JDODataStoreException: Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.autoCreateTables"
NestedThrowables:
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.autoCreateTables"
 [..]

org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:61)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4010)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:120)
    ... 28 more
Caused by: org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.autoCreateTables"
    at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:455)
[..]

    at org.datanucleus.store.rdbms.RDBMSStoreManager.getExtent(RDBMSStoreManager.java:1332)
    at org.datanucleus.ObjectManagerImpl.getExtent(ObjectManagerImpl.java:4149)
    ... 52 more

2013-08-12 09:40:46,140 ERROR ql.Driver (SessionState.java:printError(401)) - FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask


Note: Cloudera redirects the Hive shell logs to a temporary directory, not to /var/log/hive as one might expect. /var/log/hive is reserved for the MetaStore process, not the shell logs. The following locations are used instead:
 /tmp/hadoop/hive.log
or
 /tmp/root/hive.log
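A small helper can save hunting for the right log file. This Python sketch just lists the candidate locations described above for the current user:

```python
import getpass
import os

def hive_shell_logs(user=None):
    """Candidate locations for the Hive shell log under CDH (per this post)."""
    user = user or getpass.getuser()
    return ["/tmp/hadoop/hive.log", "/tmp/%s/hive.log" % user]

for path in hive_shell_logs():
    print(path, "->", "exists" if os.path.exists(path) else "missing")
```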

There can be many reasons for this exception, as a quick web search shows. In most cases it is due to a problem communicating with the remote database, or an access-rights issue when the Hive shell attempts to modify some of the tables.


In my case, I am using MySQL as the remote DB server. When connecting to MySQL directly, all the Hive metadata tables are there, and the "hive" user has full access rights on the database. So access rights are not the issue.

It is still worth checking in /var/log/hive that the right properties for the remote DB are set and used by the Hive MetaStore. The /var/log/hive/hadoop-....log.out logs will display an INFO line with the properties in use:

 2013-08-08 13:54:43,251 INFO DataNucleus.Persistence: DataNucleus Persistence Factory initialised for datastore URL="jdbc:mysql://localhost:3306/hive?useUnicode=true&characterEncoding=UTF-8" driver="com.mysql.jdbc.Driver" userName="hive"

What is critical, though, is that the Hive shell itself finds and uses the right properties.

In the Cloudera case, the Hive component lives in /usr/lib/hive, and its "conf" directory is a symlink to /etc/hive/conf, which contains the hive-site.xml used by the Hive shell.

What to do?

Verify that this config file does contain the remote DB information (javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionDriverName, etc.):

 <?xml version="1.0" encoding="UTF-8"?>
 <!--Autogenerated by Cloudera CM on 2013-08-09T06:59:55.560Z-->
 <configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://127.0.0.1:3306/hive?useUnicode=true&amp;characterEncoding=UTF-8</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>******</value>
   </property>
 </configuration>
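To confirm quickly which javax.jdo.* properties the shell's config actually carries, the file can be parsed programmatically. A minimal Python sketch (the embedded sample mirrors the file above; in practice, read /etc/hive/conf/hive-site.xml from disk):

```python
import xml.etree.ElementTree as ET

def jdo_properties(xml_text):
    """Extract the javax.jdo.* properties from a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", "")
        if name.startswith("javax.jdo."):
            props[name] = prop.findtext("value", "")
    return props

# Hypothetical inline sample; replace with open(".../hive-site.xml").read()
sample = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1:3306/hive?useUnicode=true&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""

for name, value in sorted(jdo_properties(sample).items()):
    print(name, "=", value)
```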


For some reason, the CDH 4.6 CM does not update this file when regenerating the Hive configuration after a change in the CM GUI. This might be a fault in my installation, but it happened twice on two different clusters...

Fixing the properties file solves the problem. The proper configuration options can be found in the latest CM-generated config file, e.g.:
 /run/cloudera-scm-agent/process/5011-hive-HIVEMETASTORE/hive-site.xml
Here 5011 is the latest revision number.

Copying the latest generated file over the shell's config is one potential solution:
 cp /run/cloudera-scm-agent/process/5011-hive-HIVEMETASTORE/hive-site.xml /etc/hive/conf/hive-site.xml
 # make it readable by all (no execute bit needed on a config file)
 chmod 644 /etc/hive/conf/hive-site.xml

Updating the file manually is another.
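Either way, it helps to see exactly which properties drifted between the shell's hive-site.xml and the latest CM-generated one. A minimal Python sketch (the inline XML strings are hypothetical stand-ins for the two files):

```python
import xml.etree.ElementTree as ET

def site_properties(xml_text):
    """Map property names to values in a hive-site.xml document."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name", ""): p.findtext("value", "")
            for p in root.findall("property")}

def diff_properties(shell_xml, metastore_xml):
    """Return {name: (shell_value, metastore_value)} for properties that differ."""
    a, b = site_properties(shell_xml), site_properties(metastore_xml)
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b))
            if a.get(k) != b.get(k)}

# Hypothetical stale shell config vs. the CM-generated one:
stale = ("<configuration><property>"
         "<name>javax.jdo.option.ConnectionURL</name>"
         "<value>jdbc:derby:;databaseName=metastore_db;create=true</value>"
         "</property></configuration>")
fresh = ("<configuration><property>"
         "<name>javax.jdo.option.ConnectionURL</name>"
         "<value>jdbc:mysql://127.0.0.1:3306/hive</value>"
         "</property></configuration>")

for name, (old, new) in diff_properties(stale, fresh).items():
    print(f"{name}: shell={old!r} metastore={new!r}")
```

In practice, pass the contents of /etc/hive/conf/hive-site.xml and of the latest /run/cloudera-scm-agent/process/...-hive-HIVEMETASTORE/hive-site.xml.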

In both cases, though, the remaining problem is that any future updates to Hive properties made in Cloudera CM will still not be reflected in the Hive shell's configuration.