Tuesday, 26 May 2015

Tableau offers several capabilities for connections to Hadoop Hive that are not available for other data sources.


XML processing

While many traditional databases provide XML support, the XML content must first be loaded into the database. Because Hive tables can be linked to a collection of XML files or document fragments stored in the Hadoop file system, Hadoop is much more flexible in analyzing XML content.
Tableau provides a number of functions for processing XML data, which allows you to extract content, perform analysis or computation, and filter the XML data. These functions leverage XPath, a web standard used by Hive and described in more detail in the Hive XPath documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF.

Web and text processing

In addition to XPath operators, the Hive query language offers several ways to work with common web and text data. Tableau exposes the following functions that you can use in calculated fields:
  • JSON Objects: GET_JSON_OBJECT retrieves data elements from strings containing JSON objects.
  • URLs: Tableau offers PARSE_URL to extract the components of a URL such as the protocol type or the host name. Additionally, PARSE_URL_QUERY can retrieve the value associated with a given query key in a key/value parameter list.
  • Text Data: The regular expression find and replace functions in Hive are available in Tableau for complex text processing.
For more information on these functions, refer to the Hive documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF.

On-the-fly ETL

Custom SQL gives you the flexibility of using arbitrary queries for your connection, which allows complex join conditions, pre-filtering, pre-aggregation and more. Traditional databases rely heavily on optimizers, but they can struggle with complex Custom SQL and lead to unexpected performance degradation as you build views. The batch-oriented nature of Hadoop allows it to handle layers of analytical queries on top of complex Custom SQL with only incremental increases to query time.
Because Custom SQL is a natural fit for the complex layers of data transformations seen in ETL, a Tableau connection to Hive based on Custom SQL is essentially on-the-fly ETL. Refer to the Hadoop and Tableau Demo on the Tableau blog to see how you can use Custom SQL to unpivot nested XML data directly in the Custom SQL connection, yielding views built from the unpivoted data.

Initial SQL

Tableau supports initial SQL for Hadoop Hive connections, which allows you to define a collection of SQL statements to perform immediately after the connection is established. For example, you can set Hive and Hadoop configuration variables for a given connection from Tableau to tune performance characteristics. Refer to the Designing for Performance article for more information. You can also register custom UDFs as scripts, JAR files, etc., that reside on the Hadoop cluster. Registering these allows you, other developers, and analysts to collaborate on developing custom data processing logic and quickly incorporating that into Tableau views.
Because initial SQL supports arbitrary Hive query statements, you can use Hive to accomplish a variety of interesting tasks when connecting to Tableau.

Custom analysis with UDFs and Map/Reduce

Although Hive offers additional UDFs that Tableau does not yet support as functions for you to use in calculated fields, Tableau does offer "Pass Through" functions for using UDFs, UDAFs (for aggregation) and arbitrary SQL expressions in the SELECT list. For example, to determine the co-variance between two fields 'f1' and 'f2', the following Tableau calculated field takes advantage of a UDAF in Hive: RAWSQLAGG_REAL("covar_pop(%1, %2)", [f1], [f2])
Similarly, Tableau allows you to take advantage of custom UDFs and UDAFs built by the Hadoop community or by your own development team. Often these are built as JAR files that Hadoop can easily copy across the cluster to support distributed computation. To take advantage of JAR files or scripts, inform Hive of the location of these files and Hive will take care of the rest.

Note: You can also do this with Initial SQLwith one or more SQL statements separated by semicolons:

add JAR /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u1.jar;
add FILE /mnt/hive_backlink_mapper.py;


For more information, refer to the Hive language manual section on CLI: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveResources.

Hive supports explicit control over how to perform the Map/Reduce operations. While Tableau allows you to perform sophisticated analysis without having to learn the Hive query language, as an advanced Hive and Hadoop user, you can take full advantage of this knowledge in Tableau. Using Custom SQL, you can define arbitrary Hive query expressions, including the MAP, REDUCE, and TRANSFORM operators described in the Hive language manual. As with custom UDFs, using custom transform scripts may require you to register the location of those scripts using Initial SQL.

Refer to the following blog for an interesting example of using custom scripts and explicit Map/Reduce transforms: http://www.cloudera.com/blog/2009/09/grouping-related-trends-with-hadoop.

1 comment:


  1. Thanks for sharing valuable information. Your blogs were helpful to tableau learners. I
    request to update the blog through step-by-step. Also, find the tableau news at
    Tableau Online Training Bangalore Blog.

    ReplyDelete