Installing and configuring the Hive metastore with a MySQL backend.
Hive Metastore
Hive, a powerful data warehousing system built on top of Hadoop, relies on a component known as the metastore to efficiently manage metadata about the data stored within it. This metadata is crucial for organizing, querying, and processing data in Hive. In this blog post, we’ll explore the role of the Hive Metastore and the significance of selecting the right relational database management system (RDBMS) for it.
Metadata Management
The Hive Metastore serves as a central repository for storing and managing metadata related to datasets within Hive. This metadata includes essential information such as:
Table schemas
Data types
Column names
Physical data file locations
Without the metastore, Hive would struggle to efficiently keep track of these critical metadata elements.
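Once Hive is up and running, all of this metadata is visible for any table with a single statement. For example (using the u_data table we create in the Testing section at the end of this post):
hive -e "DESCRIBE FORMATTED u_data;"
The output includes the column names and types, the table owner, and the physical HDFS location of the underlying data files.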
Schema Definition
In Hive, tables are defined using a schema that specifies column names and their associated data types. This schema is fundamental for querying and processing data correctly. The metastore takes on the responsibility of storing and managing these table schemas, offering users the flexibility to do the following (sketched in the example after this list):
Create new tables
Modify existing schemas
Drop tables that are no longer needed
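For example, the full lifecycle of a table schema looks like this from the shell (a minimal sketch; the demo table and its columns are purely illustrative):
hive -e "CREATE TABLE demo (id INT, name STRING); ALTER TABLE demo ADD COLUMNS (created_at STRING); DROP TABLE demo;"
Each statement is recorded as a metadata change in the metastore.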
Metadata Scaling
In large-scale data environments, metadata can become extensive. Fortunately, the Hive Metastore can be scaled out to handle substantial volumes of metadata, for example by running multiple metastore service instances against the same backing database. It can also be configured for high availability, ensuring continuous access to metadata even in the face of hardware failures or other issues.
Metastore and the Relational Database Management System (RDBMS)
By default, Hive uses Apache Derby as the RDBMS for its metastore. Derby is a lightweight, open-source, Java-based RDBMS, and it serves as the default choice for small to medium-scale Hive installations, testing, and development purposes. Notably, Derby comes bundled with Hive, simplifying the initial setup.
While Derby is suitable for certain use cases, in this blog post, we will use MySQL as the RDBMS for Hive’s metastore for various reasons, such as:
Scalability: MySQL excels in handling larger and more complex workloads, making it ideal for scaling Hive in data-rich environments.
Performance: MySQL often delivers superior performance for complex queries and large datasets, making it well-suited for high-performance scenarios.
Backup and Recovery: Advanced backup and recovery options in MySQL enhance data integrity and reliability.
Enterprise Support: Organizations familiar with MySQL can leverage existing expertise for Hive’s metadata management.
Integration: MySQL’s compatibility with various tools simplifies data integration and management processes.
Community and Documentation: A large user community and extensive documentation resources make MySQL a dependable choice for Hive’s metadata management.
Prerequisites
The following prerequisites are needed before installing MySQL and altering Hive’s configuration.
A properly configured Hadoop, YARN, Spark, and HDFS cluster.
An installation of Hive.
Setting up MySQL
My head node runs on RHEL 8. This guide will be for RHEL 8 Linux, but the general overview should be adaptable to other Unix-like systems.
Download the MySQL Repo onto the node
The yum repo for MySQL needs to be downloaded to the system. The appropriate repo RPM for your system can be found on the MySQL Community Downloads page at https://dev.mysql.com/downloads/repo/yum/.
Install the repo using yum
I downloaded the repo RPM to /opt/mysql80-community-release-el8-8.noarch.rpm.
Install the contents of the repo using
yum localinstall /opt/mysql80-community-release-el8-8.noarch.rpm -y
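To confirm the repo was registered before moving on, you can list the enabled repositories (output will vary by system):
yum repolist enabled | grep -i mysql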
Disabling the local mysql module
Disable the local mysql module to prevent conflicts with the installation of the mysql-community-server package.
yum -y module disable mysql
Installing mysql-community-server
Install the Community Server version of MySQL using:
yum -y install mysql-community-server
Installing mysql-connector-java
mysql-connector-java, or the MySQL JDBC (Java Database Connectivity) driver, is a Java-based library that provides connectivity between Java applications and the MySQL database management system. It allows Java programs to interact with MySQL databases by facilitating the exchange of data and SQL queries.
Install it by running
yum -y install mysql-connector-java
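You can check where the package placed the jar, since we will need this path later when wiring up Hive and Spark:
rpm -ql mysql-connector-java | grep jar
On RHEL 8 this should include /usr/share/java/mysql-connector-java.jar.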
Initializing the mysql service
Initialize MySQL by running:
service mysqld start
This will initialize the MySQL service. However, this installation of MySQL needs to be secured!
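Before securing it, you can confirm the service actually came up:
systemctl status mysqld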
Securing MySQL and initializing the server
Running mysql_secure_installation
MySQL version 8.0 or higher generates a temporary random password in /var/log/mysqld.log after installation.
Find your random password by running
grep 'A temporary password' /var/log/mysqld.log |tail -1 |awk '{split($0,a,": "); print a[2]}'
Once you have that, run mysql_secure_installation.
This script will guide you through setting up a root password, removing anonymous users, disallowing remote root login, and other security-related configurations.
Adding new users and setting up the metastore
Log in to MySQL by running
mysql -u root -p
and enter your password.
Run the following:
mysql> CREATE DATABASE metastore;
Query OK, 1 row affected (0.02 sec)
mysql> use metastore;
Database changed
mysql> create user 'hive'@'localhost' identified by '<root_password>';
Query OK, 0 rows affected (0.02 sec)
mysql> grant all PRIVILEGES on *.* to 'hive'@'localhost' with grant option;
Query OK, 0 rows affected (0.02 sec)
mysql> create user 'hdfs'@'localhost' identified by '<root_password>';
Query OK, 0 rows affected (0.04 sec)
mysql> grant all PRIVILEGES on *.* to 'hdfs'@'localhost' with grant option;
Query OK, 0 rows affected (0.03 sec)
mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.02 sec)
mysql> exit
Bye
Replace <root_password> with your actual password. This will create the metastore database, create the users hive and hdfs, and grant them all necessary privileges to work within the metastore and its data.
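To verify the grants took effect, log in as the new hive user and confirm the metastore database is visible:
mysql -u hive -p -e "SHOW DATABASES;"
The metastore database should appear in the output.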
Bonus! Automate this!
If you’ve been following my posts, you’ll know that I’ve been using Ansible to automate this process. The steps above are interactive, though, and if you’re deploying MySQL across production-scale systems, stepping through interactive prompts on every node is an unnecessary expenditure of time. How do we get around this?
In this section, I’ve added a bash script that automates the process of securing MySQL, adding new users and creating the metastore database.
#!/bin/bash
# Obtaining the Temporary Password
root_temp_pass=$(grep 'A temporary password' /var/log/mysqld.log |tail -1 |awk '{split($0,a,": "); print a[2]}')
echo "root_temp_pass:"$root_temp_pass
# Creating a mysql_secure_installation.sql script to secure MySQL
cat > mysql_secure_installation.sql << EOF
# Ensuring that the server cannot be accessed without a password
# The password will be the cluster password
# Adding a user 'hdfs' and a user 'hive'
ALTER USER 'root'@'localhost' IDENTIFIED BY 'root_password';
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'root_password';
CREATE USER 'hdfs'@'localhost' IDENTIFIED BY 'root_password';
# Removing the anonymous users
DELETE FROM mysql.user WHERE User='';
# Disallowing remote login for root
DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');
# Removing the demo database
DROP DATABASE IF EXISTS test;
DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';
# Make our changes take effect
FLUSH PRIVILEGES;
EOF
# Running the SQL script to secure the installation of MySQL
mysql -uroot -p"$root_temp_pass" --connect-expired-password <mysql_secure_installation.sql
# Creating another script to create the metastore and grant privileges to the hive, hdfs and root users for it
cat > mysql_privilege_grant.sql << EOF
# Granting privileges to the users hive, hdfs and root
CREATE DATABASE metastore;
use metastore;
grant all PRIVILEGES on *.* to "hive"@"localhost" with grant option;
grant all PRIVILEGES on *.* to "hdfs"@"localhost" with grant option;
grant all PRIVILEGES on *.* to "root"@"localhost" with grant option;
FLUSH PRIVILEGES;
EOF
# Running the SQL script to create the metastore and grant privileges to the root, hive and hdfs users for the metastore
mysql -u root -p"root_password" --connect-expired-password <mysql_privilege_grant.sql
Replace root_password with your actual password and run this script. Remember to delete this script and the SQL files it creates once it finishes running!
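For example, assuming you saved the script as setup_metastore.sh (the filename is arbitrary):
bash setup_metastore.sh
rm -f setup_metastore.sh mysql_secure_installation.sql mysql_privilege_grant.sql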
Linking MySQL and Hive (and Spark!)
Deleting a duplicate library
We will need to remove the log4j-slf4j-impl-2.17.1.jar file from Hive, since the same SLF4J binding is already provided elsewhere on the CLASSPATH (Hadoop ships its own), and the presence of multiple duplicate bindings may cause Hive to fail.
Change directories to your Hive installation, enter the lib directory, and delete the log4j-slf4j-impl-2.17.1.jar file.
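For example, using the Hive installation path from later in this post:
rm /opt/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar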
Adding the MySQL JDBC library to Hive
Copy the mysql-connector-java.jar file located at /usr/share/java/mysql-connector-java.jar to the lib directory of your Hive installation. This ensures that Hive can utilize the MySQL JDBC library to connect to MySQL databases seamlessly.
For example,
cp /usr/share/java/mysql-connector-java.jar /opt/apache-hive-3.1.3-bin/lib
Adding the MySQL JDBC library to Spark
Copy the mysql-connector-java.jar file located at /usr/share/java/mysql-connector-java.jar to the jars directory of your Spark installation.
This will enable Spark to also utilize the MySQL JDBC library to connect to MySQL databases seamlessly.
For example,
cp /usr/share/java/mysql-connector-java.jar /opt/spark-3.3.2-bin-hadoop3/jars
Creating a hive-site.xml file
Create a file named hive-site.xml in the conf directory of your Hive installation with the following content:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    <description>URL for establishing a connection to the Hive Metastore database.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Fully qualified class name of the JDBC driver used for connecting Hive and MySQL.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>User name used to connect to the MySQL database for Hive Metastore.</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root_password</value>
    <description>Password for authenticating and connecting to the MySQL database.</description>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>true</value>
    <description>When set to 'true', this option allows automatic creation of the schema (database structure) if it doesn't already exist when the Hive Metastore initializes.</description>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
    <description>When set to 'true', it indicates that the datastore (MySQL database) structure is fixed and should not be altered automatically by the Hive Metastore.</description>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>true</value>
    <description>When set to 'true', this option allows automatic creation of the metastore tables if they don't already exist.</description>
  </property>
</configuration>
Replace root_password with the password you set for the hive user.
Editing the spark-env.sh file
Add the following line to the spark-env.sh file located in the conf directory of your Spark installation, pointing at the connector jar we copied into Spark's jars directory:
export SPARK_CLASSPATH=/opt/spark-3.3.2-bin-hadoop3/jars/mysql-connector-java.jar
Editing the spark-defaults.conf file
Add the following line to the spark-defaults.conf file located in the conf directory of your Spark installation, adjusting localhost:9000 to match the fs.defaultFS value in your core-site.xml:
spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse
Restarting the MySQL service
Run the following commands to restart the MySQL service:
service mysqld stop
service mysqld start
Initializing the Metastore
Run the following commands to create the Hive warehouse directories in HDFS and initialize the metastore schema:
hdfs dfs -mkdir /tmp
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hive
hdfs dfs -mkdir /user/hive/warehouse
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse
schematool -dbType mysql -initSchema
This will initialize the Hive Metastore.
This should successfully have Hive configured to use the MySQL database as the backend RDBMS for metadata storage!
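As a quick sanity check, schematool can also report the schema version it just created in MySQL:
schematool -dbType mysql -info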
Testing
Now we will load a table into Hive and see if it is present where it should be.
Download the MovieLens 100k data files from the GroupLens datasets page (which also hosts a README.txt file and an index of the unzipped files):
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
Unzip the files
unzip ml-100k.zip
Start the Hive server and open a Hive shell.
Then, create a table by running the following:
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
And load u.data into the table that was just created:
LOAD DATA LOCAL INPATH '<path>/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;
Replace <path> with your path to the ml-100k folder.
Count the number of rows in table u_data (the MovieLens 100k dataset contains exactly 100,000 ratings, so this should return 100000):
SELECT COUNT(*) FROM u_data;
Enter the mysql shell.
mysql -u hive -p
and enter your password.
Run the following in the MySQL shell:
use metastore;
Followed by
show tables;
This will be your output:
+-------------------------------+
| Tables_in_metastore           |
+-------------------------------+
| AUX_TABLE                     |
| BUCKETING_COLS                |
| CDS                           |
| COLUMNS_V2                    |
| COMPACTION_QUEUE              |
| COMPLETED_COMPACTIONS         |
| COMPLETED_TXN_COMPONENTS      |
| CTLGS                         |
| DATABASE_PARAMS               |
| DBS                           |
| DB_PRIVS                      |
| DELEGATION_TOKENS             |
| FUNCS                         |
| FUNC_RU                       |
| GLOBAL_PRIVS                  |
| HIVE_LOCKS                    |
| IDXS                          |
| INDEX_PARAMS                  |
| I_SCHEMA                      |
| KEY_CONSTRAINTS               |
| MASTER_KEYS                   |
| MATERIALIZATION_REBUILD_LOCKS |
| METASTORE_DB_PROPERTIES       |
| MIN_HISTORY_LEVEL             |
| MV_CREATION_METADATA          |
| MV_TABLES_USED                |
| NEXT_COMPACTION_QUEUE_ID      |
| NEXT_LOCK_ID                  |
| NEXT_TXN_ID                   |
| NEXT_WRITE_ID                 |
| NOTIFICATION_LOG              |
| NOTIFICATION_SEQUENCE         |
| NUCLEUS_TABLES                |
| PARTITIONS                    |
| PARTITION_EVENTS              |
| PARTITION_KEYS                |
| PARTITION_KEY_VALS            |
| PARTITION_PARAMS              |
| PART_COL_PRIVS                |
| PART_COL_STATS                |
| PART_PRIVS                    |
| REPL_TXN_MAP                  |
| ROLES                         |
| ROLE_MAP                      |
| RUNTIME_STATS                 |
| SCHEMA_VERSION                |
| SDS                           |
| SD_PARAMS                     |
| SEQUENCE_TABLE                |
| SERDES                        |
| SERDE_PARAMS                  |
| SKEWED_COL_NAMES              |
| SKEWED_COL_VALUE_LOC_MAP      |
| SKEWED_STRING_LIST            |
| SKEWED_STRING_LIST_VALUES     |
| SKEWED_VALUES                 |
| SORT_COLS                     |
| TABLE_PARAMS                  |
| TAB_COL_STATS                 |
| TBLS                          |
| TBL_COL_PRIVS                 |
| TBL_PRIVS                     |
| TXNS                          |
| TXN_COMPONENTS                |
| TXN_TO_WRITE_ID               |
| TYPES                         |
| TYPE_FIELDS                   |
| VERSION                       |
| WM_MAPPING                    |
| WM_POOL                       |
| WM_POOL_TO_TRIGGER            |
| WM_RESOURCEPLAN               |
| WM_TRIGGER                    |
| WRITE_SET                     |
+-------------------------------+
74 rows in set (0.01 sec)
The created table should be located in TBLS.
Run the following command:
select * from TBLS;
This is what you should see:
+--------+-------------+-------+------------------+-------+------------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | OWNER_TYPE | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | IS_REWRITE_ENABLED |
+--------+-------------+-------+------------------+-------+------------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+
|      1 |  1695343066 |     1 |                0 | root  | USER       |         0 |     1 | u_data   | MANAGED_TABLE | NULL               | NULL               | 0x00               |
+--------+-------------+-------+------------------+-------+------------+-----------+-------+----------+---------------+--------------------+--------------------+--------------------+