
Custom Processing using Apache Pig UDFs (User Defined Functions)

Pig UDFs can be implemented in Java. Below are the steps to create a UDF using Eclipse.

  • Create a normal Java project and a Java class (the UDF) that extends one of the EvalFunc, StoreFunc, LoadFunc, or FilterFunc classes.
  • Override the exec() function to provide the implementation.

Make sure to download the latest pig.jar and include it in the build path; otherwise the code will not compile. Every new function has to extend the 'EvalFunc' class or another base class such as 'LoadFunc', and all of these dependent classes reside in pig.jar.
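As a concrete illustration, here is a minimal sketch of such a UDF: a hypothetical Trim class matching the com.hadoop.pig.Trim name registered later in this post, assuming it simply strips leading and trailing whitespace.

package com.hadoop.pig;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Trim extends EvalFunc<String> {
    // exec() is invoked once for every input tuple
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // pass nulls through rather than failing the job
        }
        return ((String) input.get(0)).trim();
    }
}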

  • Export the Java project as a JAR file (NewUDF.jar in this example)

  • To use this jar, register it at the grunt prompt as below:
grunt> register 'your_path_to_jar/NewUDF.jar';
  • Define a name for the UDF

With define, a single-word alias can stand in for the whole class, making the script more readable and avoiding the fully qualified class name at every place it is used.

grunt> define TRIM com.hadoop.pig.Trim();
  • Using the UDF
grunt> divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
grunt> trimmed = foreach divs generate TRIM(symbol);

Pig can also be given a set of packages to search when resolving UDF names, by setting the udf.import.list property on the command line.

So we change our invocation to:

pig -Dudf.import.list=org.apache.pig.piggybank.evaluation.string register.pig
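With the import list in place, a UDF in that package can be referenced by its bare class name inside the script. For example, assuming piggybank's UPPER function, the script can write:

upped = foreach divs generate UPPER(symbol);

instead of:

upped = foreach divs generate org.apache.pig.piggybank.evaluation.string.UPPER(symbol);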

Using yet another property, pig.additional.jars, we can get rid of the register command as well. If we add the following to the command line, the register statement is no longer necessary:

-Dpig.additional.jars=/usr/local/pig/piggybank/piggybank.jar
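Combining the two properties, a single invocation (script name hypothetical) needs neither a register statement nor fully qualified UDF names:

pig -Dpig.additional.jars=/usr/local/pig/piggybank/piggybank.jar \
    -Dudf.import.list=org.apache.pig.piggybank.evaluation.string myscript.pig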

Creating a UDF (without Eclipse)

  • Create a folder myNewUdf
  • Create the UDF in a Java file, say 'TrimTo.java'

The class name should be TrimTo, matching the file name, and the package name should be the same as the folder in which the Java file resides, i.e. 'myNewUdf'.
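The body of TrimTo.java is not shown here; a minimal sketch, assuming it trims whitespace like the earlier example and additionally declares its output schema, might look like this:

package myNewUdf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class TrimTo extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        return ((String) input.get(0)).trim();
    }

    // declare to Pig that the result is a chararray
    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY));
    }
}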

 $ cd myNewUdf/
 $ ls -l
 total 8
-rw-rw-r-- 1 userName userName 1162 Feb 21 16:33 TrimTo.java
  • Compile the Java file
:~/pig/myNewUdf$ javac -classpath /home/userName/pig/trunk/pig.jar TrimTo.java

Now the class file is visible:

:~/pig/myNewUdf$ ls -l
total 8
-rw-rw-r-- 1 userName userName 1917 Feb 21 16:45 TrimTo.class
-rw-rw-r-- 1 userName userName 1162 Feb 21 16:33 TrimTo.java
  • Go one level up and create the jar with the same name as the folder, i.e. 'myNewUdf.jar'
:~/pig/myNewUdf$ cd ..
:~/pig$ jar cf myNewUdf.jar myNewUdf
  • In the Grunt Prompt
grunt> REGISTER /home/userName/pig/myNewUdf.jar;
grunt> GrocPricesTrim = FOREACH GrocPrices GENERATE myNewUdf.TrimTo(PRODUCTNAME);
grunt> ILLUSTRATE GrocPricesTrim;
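Note that GrocPrices is assumed to have been loaded beforehand; a hypothetical load, with an invented path and schema, might be:

grunt> GrocPrices = LOAD 'GrocPrices.txt' AS (PRODUCTNAME:chararray, PRICE:float);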

Creating and using Macros

Macros are declared with the define statement. A macro takes a set of input parameters, which are string values substituted for the parameters when the macro is expanded. The name of the output relation is given in the returns clause, and the operators of the macro are enclosed in braces ({}).

-------- macro.pig --------

define dividend_analysis (daily, year, daily_symbol, daily_open, daily_close)
returns analyzed {
divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
date:chararray, dividends:float);
divsthisyear = filter divs by date matches '$year-.*';
dailythisyear = filter $daily by date matches '$year-.*';
jnd = join divsthisyear by symbol, dailythisyear by $daily_symbol;
$analyzed = foreach jnd generate dailythisyear::symbol, $daily_close - $daily_open;
};
------- on the Grunt shell ---------
daily = load '/home/share/Customer-Bigdata-Analysis/NYSE_daily.txt'
as (exchange:chararray, symbol:chararray, date:chararray, open:float,
high:float, low:float, close:float, volume:int, adj_close:float);
import '/home/cs246/PigPPT/macro.pig';
results = dividend_analysis(daily, '2009', 'symbol', 'open', 'close');
describe results;

If you would like to find out more about how Big Data could help you make the most out of your current infrastructure while enabling you to open your digital horizons, do give us a call at +44 (0)203 475 7980 or email us at Salesforce@coforge.com

Other useful links:

Email Classifier using Mahout on Hadoop

Spark Cluster Setup on EC2

Installing SolrCloud on Hadoop
