How to parse information from URL in Hive? - Big Data In Real World

How to parse information from URL in Hive?

How to pivot and unpivot a DataFrame in Spark?
March 23, 2022
How to fix could not connect to the endpoint URL issue in S3?
April 6, 2022
How to pivot and unpivot a DataFrame in Spark?
March 23, 2022
How to fix could not connect to the endpoint URL issue in S3?
April 6, 2022

How to parse information from URL in Hive?

Hive offers 2 functions to work with URLS – parse_url and parse_url_tuple.

With both functions you can extract information like – PROTOCOL, HOST, PATH, QUERY, Query parameters etc.

Let’s see them in action.

parse_url

parse_url is a user defined function. It takes in the URL and the information that we are trying to extract.

We use the key words HOST and PROTOCOL to extract host and protocol respectively.

  SELECT parse_url('https://hadoopinrealworld.com/pages?key1=value1&key2=value2', 'PROTOCOL'), parse_url('https://hadoopinrealworld.com/pages?key1=value1&key2=value2', 'HOST');
 +--------+------------------------+--+
 |  _c0   |          _c1           |
 +--------+------------------------+--+
 | https  | hadoopinrealworld.com  |
 +--------+------------------------+--+
 1 row selected (0.183 seconds) 

parse_url_tuple

parse_url_tuple() is a User Defined Table generating Function (UDTF). <<Check out the post, if you want know the difference between UDF, UDAF and UDTFs)

  SELECT parse_url_tuple('https://hadoopinrealworld.com/pages?key1=value1&key2=value2', 'PROTOCOL', 'HOST', 'PATH', 'QUERY', 'QUERY:key1')
 0: jdbc:hive2://ms2.hirw.com:2181,wk1.hirw.co> ;
 +--------+------------------------+---------+--------------------------+---------+--+
 |   c0   |           c1           |   c2    |            c3            |   c4    |
 +--------+------------------------+---------+--------------------------+---------+--+
 | https  | hadoopinrealworld.com  | /pages  | key1=value1&key2=value2  | value1  |
 +--------+------------------------+---------+--------------------------+---------+--+
 1 row selected (0.615 seconds) 

parse_url vs. parse_url_tuple – which is better?

With parse_url, if we are trying to extract 5 values from the URL we would have to call the function 5 times and this means the same URL is parsed through 5 times.

With parse_url_tuple, we parse the URL only once even when we need to parse or extract 5 values from the URL.

So always use parse_url_tuple in place of parse_url for better performance.

Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

Comments are closed.

How to parse information from URL in Hive?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X