Pig Latin Statements
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. (This definition applies to all Pig Latin operators except LOAD and STORE, which read data from and write data to the file system.) Pig Latin statements can span multiple lines and must end with a semicolon (;). Pig Latin statements are generally organized in the following manner:
- A LOAD statement reads data from the file system.
- A series of "transformation" statements process the data.
- A STORE statement writes output to the file system; or, a DUMP statement displays output to the screen.
Pig Latin is a relatively simple language that executes statements. A statement is an operation that takes input (such as a bag, which represents a collection of tuples) and emits another bag as its output. A bag is a relation, similar to a table in a relational database: its tuples correspond to rows, and each tuple is made up of fields.
A script in Pig Latin often follows a specific format in which data is read from the file system, a number of operations are performed on the data (transforming it in one or more ways), and then the resulting relation is written back to the file system.
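As a minimal sketch of that load, transform, store pattern, the script below assumes a hypothetical tab-delimited file named students.txt with name, age, and gpa fields. The paths, relation names, and schema are illustrative only.

```pig
-- Hypothetical input: tab-delimited lines of name, age, gpa.
-- Note that a single statement may span several lines and ends with a semicolon.
students = LOAD 'students.txt' USING PigStorage('\t')
           AS (name:chararray, age:int, gpa:float);

-- Transformations: keep only the rows of interest, then project two fields.
adults = FILTER students BY age >= 18;
names  = FOREACH adults GENERATE name, gpa;

-- Write the result back to the file system as comma-delimited text.
STORE names INTO 'adult_students' USING PigStorage(',');

-- While experimenting, DUMP can be used instead to print to the screen:
-- DUMP names;
```

Each statement assigns its output relation to a name (an alias) that later statements use as input, which is what lets a script read as a pipeline.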
Pig has a rich set of data types, supporting not only high-level concepts like bags, tuples, and maps, but also simple data types such as int, long, float, double, chararray, and bytearray. With the simple types, you'll find a range of arithmetic operators (add, subtract, multiply, divide, and modulo) in addition to a conditional operator called bincond that operates similarly to the C ternary operator. As you'd expect, there is also a full suite of comparison operators, including rich pattern matching using regular expressions.
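The following sketch shows how these types and operators might look in a script. It assumes the same hypothetical students.txt file with name, age, and gpa fields; the derived field names are made up for illustration.

```pig
-- Simple types are declared in the schema of the hypothetical input file.
students = LOAD 'students.txt' USING PigStorage('\t')
           AS (name:chararray, age:int, gpa:float);

-- Arithmetic (+, *, %) and the bincond operator (condition ? a : b).
report = FOREACH students GENERATE
             name,
             age + 1    AS next_age,
             gpa * 25.0 AS gpa_percent,
             age % 2    AS age_parity,
             (age >= 18 ? 'adult' : 'minor') AS category;

-- Comparison using a regular expression: names that start with 'A'.
a_names = FILTER students BY name MATCHES 'A.*';
```

MATCHES tests the whole string against the pattern, so 'A.*' is needed here rather than just 'A'.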
All Pig Latin statements operate on relations (and are called relational operators). As noted above, there are operators for loading data from and storing data in the file system. There's a means to FILTER data by iterating the rows of a relation. This functionality is commonly used to remove data from the relation that is not needed for subsequent operations. Alternatively, if you need to iterate the columns of a relation instead of the rows, you can use the FOREACH operator. FOREACH permits nested operations such as FILTER and ORDER to transform the data during the iteration.
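A short sketch of both operators, again assuming the hypothetical students.txt schema. The nested form of FOREACH operates on the bag produced by grouping a relation, so the example groups by age first.

```pig
students = LOAD 'students.txt' USING PigStorage('\t')
           AS (name:chararray, age:int, gpa:float);

-- FILTER: evaluate a condition for every row and keep the rows that pass.
passing = FILTER students BY gpa >= 2.0;

-- FOREACH: produce new rows by choosing or computing columns.
names = FOREACH passing GENERATE name, gpa;

-- Nested FOREACH: run FILTER and ORDER on each group's inner bag.
by_age = GROUP students BY age;
top_by_age = FOREACH by_age {
    strong = FILTER students BY gpa >= 3.5;
    ranked = ORDER strong BY gpa DESC;
    GENERATE group AS age, COUNT(strong) AS n_strong, ranked;
};
```

Inside the braces, students refers to the bag of tuples belonging to the current group rather than to the whole relation.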
The ORDER operator provides the ability to sort a relation based on one or more fields. The JOIN operator performs an inner or outer join of two or more relations based on common fields. The SPLIT operator provides the ability to split a relation into two or more relations based on a user-defined expression. Finally, the GROUP operator groups the data in one or more relations based on some expression.
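These four operators can be sketched as follows. The example assumes two hypothetical input files, students.txt and clubs.txt, that share a name field; all paths, aliases, and fields are illustrative.

```pig
students = LOAD 'students.txt' USING PigStorage('\t')
           AS (name:chararray, age:int, gpa:float);
clubs    = LOAD 'clubs.txt' USING PigStorage('\t')
           AS (name:chararray, club:chararray);

-- ORDER: sort by one or more fields.
ranked = ORDER students BY gpa DESC, name ASC;

-- JOIN: inner join on a common field, then a left outer join.
joined      = JOIN students BY name, clubs BY name;
left_joined = JOIN students BY name LEFT OUTER, clubs BY name;

-- SPLIT: route rows into separate relations based on expressions.
SPLIT students INTO minors IF age < 18, adults IF age >= 18;

-- GROUP: collect the rows that share a key into one bag per key.
by_age = GROUP students BY age;
counts = FOREACH by_age GENERATE group AS age, COUNT(students) AS n;

-- Grouping more than one relation at once (conventionally written COGROUP).
by_name = COGROUP students BY name, clubs BY name;
```

The output of GROUP has one tuple per key, containing the key (named group) and a bag of the matching input tuples, which is why it is usually followed by a FOREACH that aggregates each bag.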