
Hive Join Introduction

[TOC]

Joins in Hive only support equi-join conditions; that is, the join condition between tables in the ON clause can only use =, not <, > or other comparison operators. In addition, the conditions in ON can only be combined with AND, not OR.

The Hive execution engine "translates" HQL into MapReduce jobs. When joining multiple tables, if every table is joined on the same column (i.e., the same column appears in every JOIN ... ON clause), only one MR Job is generated:
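A minimal sketch of this case (the table and column names a, b, c, key, key1 are illustrative, following the common example from the Hive documentation):

```sql
SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.key = b.key1
JOIN c ON c.key = b.key1;
```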

Here the three tables a, b, and c are all joined on the same field; that is, the same field appears in both JOIN ... ON clauses at the same time, so only one MR Job is generated.

If, among multiple tables, one table is joined on at least two different fields (i.e., at least two columns of the same table appear in the JOIN clauses), at least two MR Jobs are generated:
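A sketch of this case (same illustrative names as above, with a second join column key2):

```sql
SELECT a.val, b.val, c.val
FROM a
JOIN b ON a.key = b.key1
JOIN c ON c.key = b.key2;
```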

Here the three tables are joined on two different fields, and both fields, b.key1 and b.key2, belong to table b. The join proceeds as follows: first, tables a and b are joined on a.key = b.key1, corresponding to the first MR Job; the result of joining a and b is then joined to c, corresponding to the second MR Job.

This is because the columns in the JOIN ... ON condition are used as the key of the Map output. If the join has multiple key columns, their combination is used as the key. The Map phase distributes rows to the Reduce side by this key, and the actual join is completed in the Reduce phase. Therefore, if multiple tables are joined on different columns, all the related data cannot be shuffled into the same Reduce task in a single round of MR.

Hive supports the commonly used SQL join types, such as inner joins, left outer joins, and right outer joins, plus Hive's own map-side join. Among these, the map-side join is an important technique for optimizing Hive join queries.

First prepare three tables.

employee table:

dept department table:

Salary table:
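The original table definitions were not preserved; the following hypothetical schemas (column names are assumptions) are consistent with the join examples that follow:

```sql
CREATE TABLE employee (
  emp_id   INT,
  emp_name STRING,
  dept_id  INT
);

CREATE TABLE dept (
  dept_id   INT,
  dept_name STRING
);

CREATE TABLE salary (
  emp_id INT,
  salary DOUBLE
);
```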

When performing an inner join on multiple tables, only the rows that match the ON condition in all tables are returned, similar to taking an intersection.
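A sketch of an inner join over the three sample tables (schemas assumed as above):

```sql
SELECT e.emp_name, d.dept_name, s.salary
FROM employee e
JOIN dept d   ON e.dept_id = d.dept_id
JOIN salary s ON e.emp_id  = s.emp_id;
```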

In a LEFT OUTER JOIN, all records from the table on the left side of the JOIN operator that satisfy the WHERE condition are retained. If the table on the right side has no record matching the ON condition, the columns selected from the right table are NULL. If there is no WHERE condition, all records of the left table are retained.
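For example, a left outer join that keeps every employee even when no matching department exists (names assumed as above):

```sql
SELECT e.emp_name, d.dept_name
FROM employee e
LEFT OUTER JOIN dept d ON e.dept_id = d.dept_id;
```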

The execution order of the standard query keywords is FROM -> ON -> WHERE -> GROUP BY -> HAVING -> ORDER BY. ON filters the tables before the join is performed. For a left join, ON is only effective on the right table; the left table is always fully retained.

For large data volumes, it helps to use WHERE conditions to filter out non-matching rows when writing SQL. However, for left and right outer joins, the WHERE condition is executed after the ON condition: the ON condition produces an intermediate result, and the WHERE condition then filters that intermediate result.

Therefore, to optimize Hive SQL execution in scenarios that require outer joins: a condition that should be applied before the join is performed belongs after ON, while a condition that should filter the joined result belongs after WHERE. Use the WHERE condition to filter the main (preserved) table.

It is important to note that if you need to filter the main table before left-joining it to the secondary table, it is best to write the main table as a subquery, which reduces the amount of data coming from the main table:
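A sketch of this pattern, filtering the main table in a subquery before the join (table names and the filter dept_id = 1 are illustrative):

```sql
SELECT t.emp_name, s.salary
FROM (SELECT emp_id, emp_name
      FROM employee
      WHERE dept_id = 1) t        -- filter the main table first
LEFT OUTER JOIN salary s
  ON t.emp_id = s.emp_id;
```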


RIGHT OUTER JOIN is the mirror of LEFT OUTER JOIN: all records from the table on the right side of the JOIN operator that satisfy the WHERE condition are retained, and if the table on the left side has no record matching the ON condition, the columns selected from the left table are NULL.

FULL OUTER JOIN keeps the rows of both tables that satisfy the WHERE condition, similar to a union; columns for which the ON condition finds no match are filled with NULL.
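For example, a full outer join that keeps unmatched rows from both sides (names assumed as above):

```sql
SELECT e.emp_name, s.salary
FROM employee e
FULL OUTER JOIN salary s ON e.emp_id = s.emp_id;
```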

LEFT SEMI JOIN takes the table before the keyword as the main table and returns the main-table records whose key also appears in the secondary table. Earlier Hive versions did not support the IN or EXISTS constructs of standard SQL; LEFT SEMI JOIN can be used to achieve the same effect.
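A sketch of a LEFT SEMI JOIN and the IN subquery it replaces (names assumed as above):

```sql
-- Equivalent to:
--   SELECT * FROM employee WHERE emp_id IN (SELECT emp_id FROM salary)
SELECT e.*
FROM employee e
LEFT SEMI JOIN salary s ON e.emp_id = s.emp_id;
```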

What needs to be emphasized is that in a LEFT SEMI JOIN, the columns of the right table may only appear in the ON condition; they cannot be referenced in the SELECT or WHERE clauses.

A Cartesian product is a join whose result contains the number of rows of the left table multiplied by the number of rows of the right table.
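For example (names assumed as above):

```sql
-- Returns rows(employee) x rows(dept) records
SELECT e.emp_name, d.dept_name
FROM employee e
CROSS JOIN dept d;
```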

Joins in Hive fall into two categories: Common Join (the join is completed in the Reduce stage) and Map Join (the join is completed in the Map stage).

If Map Join is not specified and the conditions for Map Join are not met, the Hive parser executes a Common Join by default; that is, the join is completed in the Reduce stage, and the whole process includes the Map, Shuffle, and Reduce stages.

Take the following HQL as an example to illustrate the process:
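The original example was not preserved; the following query (names assumed as above) illustrates the process. In the Map stage, each task tags its rows with the source table and emits emp_id as the key; the Shuffle stage groups rows with the same emp_id into the same Reduce task; the Reduce stage combines the rows from the two tables to produce the join result.

```sql
SELECT e.emp_name, s.salary
FROM employee e
JOIN salary s ON e.emp_id = s.emp_id;
```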

Map Join is usually used when a small table is joined with a large table. How small the small table must be is determined by the parameter hive.mapjoin.smalltable.filesize, which specifies the maximum total size of the small table; its default value is 25000000 bytes, i.e., about 25 MB.

Before Hive 0.7, you had to use the hint /*+ MAPJOIN(table) */ to trigger a Map Join; otherwise a Common Join was executed. Since version 0.7, the conversion to Map Join happens automatically by default, controlled by the parameter hive.auto.convert.join, whose default is true.
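A sketch of the pre-0.7 hint form, assuming dept is the small table (names assumed as above):

```sql
SELECT /*+ MAPJOIN(d) */ e.emp_name, d.dept_name
FROM employee e
JOIN dept d ON e.dept_id = d.dept_id;
```

On newer versions the hint is unnecessary as long as hive.auto.convert.join remains true.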

The process works as follows: first, Task A is executed locally on the client. It scans the data of the small table b, converts it into a HashTable data structure, writes it to a local file, and then loads that file into the DistributedCache.

Next, Task B runs as a map-only MR job (it has no Reduce stage). Its Map Tasks scan the large table a; in the Map stage, each record of a is probed against the HashTable of table b from the DistributedCache, and the join results are output directly.

Since a Map Join has no Reduce stage, the Map tasks write the result files directly; there are as many result files as there are Map Tasks.

Map Join is more efficient than Common Join, but the "small table" condition is not always met. In that case, a bucket map join can be used.

A bucket map join requires both joined tables to be bucketed on the join field (each bucket corresponds to one file on HDFS), and the number of buckets of the small table must be a multiple of the number of buckets of the large table.

Example of creating a bucket table:
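The original DDL was not preserved; a sketch consistent with the description below (the name column is an assumption):

```sql
CREATE TABLE my_user (
  uid  INT,
  name STRING
)
CLUSTERED BY (uid) INTO 32 BUCKETS;
```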

This way, the my_user table has 32 buckets, and each row is assigned to a bucket based on the hash value of uid modulo 32.

If the two tables are bucketed on the connection field, you can perform bucket map join. Specifically:
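Enable the optimization with:

```sql
set hive.optimize.bucketmapjoin = true;
```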

For the two tables in a bucket map join, if the data within each bucket is also sorted on the join field, you can go further and perform a sort merge bucket map join.

The table creation statement is:
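The original DDL was not preserved; a sketch, adding SORTED BY to the bucketed table above (the name column is an assumption):

```sql
CREATE TABLE my_user (
  uid  INT,
  name STRING
)
CLUSTERED BY (uid) SORTED BY (uid) INTO 32 BUCKETS;
```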

This way, when two corresponding buckets need to be joined, they can be traversed sequentially with a merge operation similar to the merge step of merge sort, completing the join in a single pass. Only part of the small table's data needs to be read at a time while the large table is matched row by row, so this kind of join is not limited by memory size and can even perform full outer joins.

When performing sort merge bucket map join, the attributes that need to be set are:
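```sql
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```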

During a join, after the Map stage ends, rows with the same key are shuffled into the same Reduce task. If the data is evenly distributed, each Reduce task processes a roughly balanced amount of data. But if the data is noticeably skewed, some Reduce tasks process far too much data, take far too long, and become the bottleneck.

When joining two large tables, if one of them has many NULL or 0 values in the join key, those rows are all shuffled to a single Reduce task, which slows the job down.

In this case, you can replace the abnormal values with random values to spread the keys out, distributing them evenly across multiple Reduce tasks, for example:
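A sketch of this technique (table and column names log, users, user_id are illustrative): NULL keys are replaced with distinct random strings, which can never match the right table, so the rows keep their left-outer-join NULLs but are spread across many reducers.

```sql
SELECT a.*, b.*
FROM log a
LEFT OUTER JOIN users b
  ON (CASE WHEN a.user_id IS NULL
           THEN concat('hive_', rand())   -- scatter NULL keys across reducers
           ELSE a.user_id END) = b.user_id;
```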

When the skewed key values are all valid data, the solution is:

Set the following parameters:
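```sql
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 100000;  -- skew threshold; 100000 is the default
```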

Hive cannot determine in advance which keys will cause skew at runtime, so the parameter hive.skewjoin.key controls the skew threshold: once a key's row count exceeds this value, its additional rows are sent to reducers that have not yet reached the threshold. Generally it can be set to 2-4 times (total number of records to be processed / number of reducers).
