Get top 1 row of each group in HiveQL (2024)

MySQL versions before 8.0 do not support the ROW_NUMBER() function, which assigns a sequence number within a group, but as a workaround we can use MySQL session variables.

These variables do not require declaration, and can be used in a query to do calculations and to store intermediate results.

@current_country := country

This code is executed for each row and stores the value of the country column in the @current_country variable.

@country_rank := IF(@current_country = country, @country_rank + 1, 1)

In this code, if @current_country is the same as the country of the current row, we increment the rank; otherwise we set it to 1. For the first row @current_country is NULL, so the rank is also set to 1.

For correct ranking, we need to have ORDER BY country, population DESC
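The per-row logic can be traced outside the database. The following is a minimal Python sketch, not MySQL code: the two local variables stand in for @current_country and @country_rank, the sort stands in for the ORDER BY clause, and the function name rank_within_country is made up for illustration. The sample rows are a few of the cities used in this example.

```python
# Emulates the session-variable ranking in plain Python.
cities = [
    ("Lyon", "France", 422000),
    ("Paris", "France", 2181000),
    ("Marseille", "France", 808000),
    ("Leeds", "United Kingdom", 770800),
    ("London", "United Kingdom", 7825300),
]


def rank_within_country(rows):
    # Equivalent of ORDER BY country, population DESC.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    current_country = None  # plays the role of @current_country
    country_rank = 0        # plays the role of @country_rank
    ranked = []
    for city, country, population in ordered:
        # IF(@current_country = country, @country_rank + 1, 1)
        country_rank = country_rank + 1 if current_country == country else 1
        # @current_country := country
        current_country = country
        ranked.append((city, country, population, country_rank))
    return ranked


ranked = rank_within_country(cities)
for row in ranked:
    print(row)
```

Without the sort, rows of the same country would not be adjacent and the counter would reset in the wrong places, which is why the ORDER BY is required.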

So if we just execute the subquery:

SELECT city, country, population,
   @country_rank := IF(@current_country = country, @country_rank + 1, 1) AS country_rank,
   @current_country := country
FROM cities ORDER BY country, population DESC

We get the list of cities ranked by population within the country:

city         country         population  country_rank  current_country
Paris        France          2181000     1             France
Marseille    France          808000      2             France
Lyon         France          422000      3             France
London       United Kingdom  7825300     1             United Kingdom
Birmingham   United Kingdom  1016800     2             United Kingdom
Leeds        United Kingdom  770800      3             United Kingdom
New York     United States   8175133     1             United States
Los Angeles  United States   3792621     2             United States
Chicago      United States   2695598     3             United States

  • Selecting Range

When we have a rank assigned to each city within its country, we can retrieve the required range:

Get top 2 for each country:

SELECT city, country, population FROM (/* subquery above */) ranked WHERE country_rank <= 2;

Get the city with the 3rd largest population for each country:

SELECT city, country, population FROM (/* subquery above */) ranked WHERE country_rank = 3;

ROW_NUMBER() - Oracle, SQL Server and PostgreSQL

In Oracle, SQL Server and PostgreSQL you can achieve the same functionality using ROW_NUMBER function:
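A minimal runnable sketch of the ROW_NUMBER approach follows. It is an assumption-laden illustration rather than code for any of the databases named above: it uses SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer) only because it runs without a server. The ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) form is standard SQL window syntax.

```python
import sqlite3

# In-memory SQLite database used as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("Paris", "France", 2181000),
        ("Marseille", "France", 808000),
        ("Lyon", "France", 422000),
        ("London", "United Kingdom", 7825300),
        ("Birmingham", "United Kingdom", 1016800),
        ("Leeds", "United Kingdom", 770800),
    ],
)

# Number the cities inside each country by population, then keep ranks 1 and 2.
query = """
SELECT city, country, population
FROM (
    SELECT city, country, population,
           ROW_NUMBER() OVER (PARTITION BY country
                              ORDER BY population DESC) AS country_rank
    FROM cities
) ranked
WHERE country_rank <= 2
ORDER BY country, population DESC
"""
top_two = conn.execute(query).fetchall()
for row in top_two:
    print(row)
```

PARTITION BY replaces the @current_country bookkeeping, and the ORDER BY inside OVER replaces the outer sort that the session-variable version depends on.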

In most big data scenarios, we need to group rows that have the same values and sort rows in ascending or descending order. Tables also have varying numbers of columns. Using * in the select statement retrieves every column, while the alternative of listing all the column names manually is very hard when the table has a large number of columns. So the requirement here is to exclude column(s) from a select query in Hive.

System requirements :

  • Install Ubuntu in a virtual machine
  • Install a single-node Hadoop machine
  • Install Apache Hive

Step 1 : Prepare the dataset

Here we use an employee dataset in comma-separated values (CSV) format to create a Hive table on the local machine.

[Figure: sample rows of the employee CSV dataset]

Before creating a table, open the Hive shell and create a database.

[Figure: opening the Hive shell]

Create the database and switch to it with the following queries:

CREATE DATABASE dezyre_db;
USE dezyre_db;

[Figure: Hive shell output after creating and selecting the database]

Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we create a Hive table to load the data into. Use the statement below to create it:

CREATE TABLE employee (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string,
  dept string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");

[Figure: the CREATE TABLE statement running in the Hive shell]

Loading the data into hive table and verifying the data

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;
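To make the table definition concrete, here is a rough Python sketch of what the comma delimiter and the skip.header.line.count property mean for a file of this shape. It is not how Hive reads the file internally, and the two data rows are hypothetical values invented for illustration, not rows from empdata.csv.

```python
import csv
import io

# Hypothetical sample in the shape of the employee table (header + 2 rows).
sample = io.StringIO(
    "employee_id,company_id,seniority,salary,join_date,quit_date,dept\n"
    "1001,7,28,89000,2014-03-24,2015-10-30,customer_service\n"
    "1002,7,20,183000,2013-04-29,2014-04-04,marketing\n"
)

reader = csv.reader(sample, delimiter=",")  # fields terminated by ','
next(reader)  # skip one header line, like "skip.header.line.count"="1"

# Cast each field to the type declared in the CREATE TABLE statement:
# four int columns followed by three string columns.
rows = [
    (int(r[0]), int(r[1]), int(r[2]), int(r[3]), r[4], r[5], r[6])
    for r in reader
]
for row in rows:
    print(row)
```

Without the header-skipping property, the header line would be loaded as a data row.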

Verify the data by running a select query on the table.

[Figure: output of a select query on the employee table]

Step 3 : Group by usage in hive

The GROUP BY clause groups the rows of a result set by the values of one or more columns, usually so that an aggregate such as count(*) can be computed for each group.

Here we run an example query using GROUP BY on the Hive table:

Select dept, count(*) as countof_emp from employee group by dept ;

The query returns the number of employees in each dept:

[Figure: employee counts per dept]
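The same GROUP BY query can be tried without a Hive installation. The sketch below runs it on SQLite through Python's sqlite3 module, which is a stand-in engine and an assumption of this example. The table is cut down to three columns, and the five rows are hypothetical, invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_id INTEGER, dept TEXT, salary INTEGER)")
# Hypothetical rows, not taken from empdata.csv.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1, "engineer", 90000),
        (2, "engineer", 120000),
        (3, "sales", 70000),
        (4, "marketing", 65000),
        (5, "sales", 82000),
    ],
)

# One output row per distinct dept, with the number of rows in that group.
counts = conn.execute(
    "SELECT dept, count(*) AS countof_emp "
    "FROM employee GROUP BY dept ORDER BY dept"
).fetchall()
print(counts)
```

The ORDER BY dept is added here only to make the output order predictable; GROUP BY alone does not guarantee one.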

Step 4: Order by usage in Hive

ORDER BY sorts the returned rows by one or more columns, in ascending or descending order; the default is ascending. Here we run an example query using ORDER BY on the Hive table:

Select * from employee order by salary desc;

The query returns the employee details sorted from the highest salary to the lowest:

[Figure: employee rows ordered by salary, descending]
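The difference between the default order and DESC can be checked with a small sketch. As an assumption, it uses SQLite through Python's sqlite3 module in place of Hive, with hypothetical salary values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_id INTEGER, salary INTEGER)")
# Hypothetical rows for illustration.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(1, 90000), (2, 120000), (3, 70000), (4, 65000), (5, 82000)],
)

# No direction given: ascending is the default.
ascending = [r[0] for r in conn.execute(
    "SELECT salary FROM employee ORDER BY salary")]
# DESC reverses the order, highest salary first.
descending = [r[0] for r in conn.execute(
    "SELECT salary FROM employee ORDER BY salary DESC")]

print(ascending)
print(descending)
```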

Step 5: Excluding the column

Excluding columns is useful for wide tables. When a table has only a few columns, we simply list the column names in the select statement. But when the table contains many columns, for example 90, it is easier to exclude the few unwanted columns from the select statement, as follows:

Before excluding a column, we need to set the following property, which makes Hive interpret backquoted names in the select list as regular expressions:

hive> set hive.support.quoted.identifiers=NONE;

Also enable the property that prints the column names in the output:

hive> set hive.cli.print.header=true;

Here we exclude the quit_date column with a query that uses a regular expression in the select list.
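The query itself is not reproduced in this text. With hive.support.quoted.identifiers set to none, a commonly used Hive form is a backquoted regular expression in the select list, along the lines of SELECT `(quit_date)?+.+` FROM employee; treat that as an assumption about what was intended here. The Python sketch below illustrates the idea behind it, keeping every column whose name does not match a pattern. It uses Python's re module rather than Hive's Java regex engine, and the helper name columns_except is made up for illustration.

```python
import re

# Column names from the employee table defined in Step 2.
columns = ["employee_id", "company_id", "seniority", "salary",
           "join_date", "quit_date", "dept"]


def columns_except(all_columns, exclude_pattern):
    """Return the columns whose full name does NOT match exclude_pattern."""
    excluded = re.compile(exclude_pattern)
    return [c for c in all_columns if not excluded.fullmatch(c)]


kept = columns_except(columns, r"quit_date")

# Build an explicit column list as an alternative to the regex form.
query = "SELECT {} FROM employee".format(", ".join(kept))
print(query)
```

Generating the explicit list this way is an option when the regular expression feature is not enabled, at the cost of regenerating the query whenever the table schema changes.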

How do I get the first row of every group in SQL?

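To get a single value per group in SQL, you can use the GROUP BY clause with the MIN or MAX aggregate function. To get the entire first row of each group, number the rows within each group with ROW_NUMBER() and keep only the rows numbered 1.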
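The sketch below contrasts getting one value per group with getting the whole top row per group. It rests on several assumptions: SQLite (3.25 or newer) through Python's sqlite3 module stands in for Hive, the employee table is reduced to three columns, and the rows are hypothetical. Hive also provides row_number() as a windowing function, so the query shape should carry over, but it has not been run on Hive here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_id INTEGER, dept TEXT, salary INTEGER)")
# Hypothetical rows for illustration.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1, "engineer", 90000),
        (2, "engineer", 120000),
        (3, "sales", 70000),
        (4, "marketing", 65000),
        (5, "sales", 82000),
    ],
)

# GROUP BY with MAX: one value per group, but not the rest of the row.
max_salary = conn.execute(
    "SELECT dept, MAX(salary) FROM employee GROUP BY dept ORDER BY dept"
).fetchall()

# ROW_NUMBER: the entire highest-paid row of each dept.
top_rows = conn.execute("""
    SELECT employee_id, dept, salary
    FROM (
        SELECT employee_id, dept, salary,
               ROW_NUMBER() OVER (PARTITION BY dept
                                  ORDER BY salary DESC) AS rn
        FROM employee
    ) ranked
    WHERE rn = 1
    ORDER BY dept
""").fetchall()

print(max_salary)
print(top_rows)
```

If two employees in a dept share the top salary, ROW_NUMBER still returns exactly one of them, chosen arbitrarily unless the ORDER BY includes a tie-breaking column.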

What is the difference between SQL and Hive?

Hive is better for analyzing large, complex data sets. SQL databases are better for analyzing less complicated data sets very quickly. SQL databases support Online Transactional Processing (OLTP); Hive doesn't.

What is Hive in big data?

Hive is a data warehouse system that is used to query and analyze large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL.

What is querying data in Hive?

Hive enables data summarization, querying, and analysis of data. Hive queries are written in HiveQL, which is a query language similar to SQL. Hive allows you to project structure on largely unstructured data. After you define the structure, you can use HiveQL to query the data without knowledge of Java or MapReduce.