SeqPig Manual


January 23, 2013

Contents

1 Introduction
2 Installation
 2.1 Dependencies
 2.2 Environment variables
 2.3 Instructions for building SeqPig
  2.3.1 Note
 2.4 Running on Amazon Elastic MapReduce
 2.5 Tests
 2.6 Usage
  2.6.1 Pig grunt shell for interactive operations
  2.6.2 Starting scripts from the command line for non-interactive use
3 Examples
 3.1 Operations on BAM files
  3.1.1 Filtering out unmapped reads and PCR or optical duplicates
  3.1.2 Filtering out reads with low mapping quality
  3.1.3 Filtering by regions (samtools syntax)
  3.1.4 Sorting BAM files
  3.1.5 Computing read coverage
  3.1.6 Computing base frequencies (counts) for each reference coordinate
  3.1.7 Pileup
  3.1.8 Collecting read-mapping-quality statistics
  3.1.9 Collecting per-base statistics of reads
  3.1.10 Collecting per-base statistics of base qualities for reads
  3.1.11 Filtering reads by mappability threshold
 3.2 Processing Qseq and Fastq data
  3.2.1 Converting Qseq to Fastq and vice versa
  3.2.2 Clipping bases and base qualities
4 Further information
 4.1 Hadoop parameters
 4.2 Compression

1 Introduction

SeqPig is a library for Apache Pig (http://pig.apache.org/) for the distributed analysis of large sequencing datasets on Hadoop clusters. With SeqPig one can take advantage of the high-level features of Pig to manipulate and analyze sequencing data, writing simple scripts that automatically run as scalable distributed programs on potentially very large Hadoop clusters.

SeqPig provides import and export functions for file formats commonly used in bioinformatics, as well as a collection of Pig user-defined functions (UDFs) designed specifically for processing aligned and unaligned sequencing data. Currently SeqPig supports BAM, SAM, FastQ and Qseq input and output, thanks in part to the functionality provided by the Hadoop-BAM library.

This document can also be found under http://seqpig.sourceforge.net/.
Contact information: mailto:seqpig-users@lists.sourceforge.net

Releases of SeqPig come bundled with Picard/Samtools, which is developed at the Wellcome Trust Sanger Institute, and Seal, which is developed at CRS4. See
http://samtools.sourceforge.net/ and http://biodoop-seal.sourceforge.net/ for more details.

For more examples of SeqPig scripts, see also the wiki pages of two past SEQAHEAD COST hackathons:
http://seqahead.cs.tu-dortmund.de/meetings:fastqpigscripting
http://seqahead.cs.tu-dortmund.de/meetings:2012-05-hackathon:pileuptask
http://seqahead.cs.tu-dortmund.de/meetings:2012-05-hackathon:seqpig_life_savers_page

2 Installation

2.1 Dependencies

  1. A Hadoop cluster (we have tested with Hadoop 0.20.2)
  2. Pig (at least version 0.10)
  3. Hadoop-BAM (https://sourceforge.net/projects/hadoop-bam/)
  4. Seal (http://biodoop-seal.sourceforge.net/)

2.2 Environment variables

  1. Set Hadoop-related variables (e.g., HADOOP_HOME) for your installation
  2. Set PIG_HOME to point to your Pig installation

On a Cloudera Hadoop installation with Pig a suitable environment configuration would be:

export HADOOP_HOME=/usr/lib/hadoop 
export PIG_HOME=/usr/lib/pig

2.3 Instructions for building SeqPig

  1. Download hadoop-bam from https://sourceforge.net/projects/hadoop-bam/.
  2. Download and build the latest Seal git master version from http://biodoop-seal.sourceforge.net/. Note that this requires setting HADOOP_BAM to the installation directory of hadoop-bam.
  3. Inside the cloned SeqPig git repository create a lib/ subdirectory and copy (or link) the jar files from hadoop-bam and Seal to this new directory. The files should be:
    1. ${HADOOP_BAM}/*.jar
    2. seal.jar; to locate it, run  find build/ -name seal.jar  from the Seal directory

    Note: the Picard and Sam jar files are contained in the hadoop-bam release for convenience.

  4. Run ant to build SeqPig.jar.

Once you've built SeqPig, you can move the directory to a location of your preference (if on a shared system, perhaps /usr/local/java/seqpig, else even your home directory could be fine).

Set the environment variable SEQPIG_HOME to point to the installation directory of SeqPig; e.g.,

export SEQPIG_HOME=/usr/local/java/seqpig

For your convenience, you can add the bin directory to your PATH:

$ export PATH=${PATH}:${SEQPIG_HOME}/bin

This way, you'll be able to start a SeqPig-enabled Pig shell by running the seqpig command.
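
Before starting the shell it may be useful to confirm that the variables from Sections 2.2 and 2.3 are set. The following Python sketch is not part of SeqPig; the helper name check_env is hypothetical, and it only checks that each variable is set and names an existing directory.

```python
import os

# Variables named in Sections 2.2 and 2.3 of this manual.
REQUIRED_VARS = ("HADOOP_HOME", "PIG_HOME", "SEQPIG_HOME")


def check_env(environ=None, required=REQUIRED_VARS):
    """Return a list of problems; an empty list means every variable is usable.

    check_env is a hypothetical helper, not something shipped with SeqPig.
    """
    environ = os.environ if environ is None else environ
    problems = []
    for name in required:
        value = environ.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name}={value} is not a directory")
    return problems


if __name__ == "__main__":
    # Print one line per problem; print nothing if the environment looks fine.
    for problem in check_env():
        print(problem)
```

Passing a dictionary instead of relying on os.environ makes it possible to check a planned configuration before exporting it.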

2.3.1 Note

Some of the example scripts in this manual (e.g., Section 3.2.2) require functions from PiggyBank, which is a collection of publicly available user-defined functions (UDFs) that are distributed with Pig but may need to be built separately, depending on your Pig distribution. For more details see https://cwiki.apache.org/confluence/display/PIG/PiggyBank. Verify that PiggyBank has been compiled by looking for the file piggybank.jar under $PIG_HOME:

$ find $PIG_HOME -name piggybank.jar

If PiggyBank hasn't been compiled, go into $PIG_HOME/contrib/piggybank/java and run ant.

2.4 Running on Amazon Elastic MapReduce

Assuming you have started an interactive Pig Job Flow (for example via the AWS console), you can log in to the master node and copy the SeqPig release to the Hadoop user home directory. Then set both SEQPIG_HOME and PIG_HOME correctly (HADOOP_HOME should be set by default). Note that the Pig version installed does not necessarily match the latest Pig release. The advantage, however, is the ability to use S3 buckets for input and output.

Consider the following example for starting SeqPig on EMR that assumes the SeqPig release was installed into /home/hadoop/seqpig.

$ export SEQPIG_HOME=/home/hadoop/seqpig 
$ export PIG_HOME=/home/hadoop/.versions/pig-0.9.2 
$ /home/hadoop/seqpig/bin/seqpig

2.5 Tests

After building SeqPig it may be a good idea to run tests to verify that the environment has been set up correctly. When inside the SeqPig directory execute

$ test/test_all.sh

This test requires that Hadoop is set up correctly. It first imports a BAM file, sorts the reads by coordinate and converts it to SAM for comparing the results. The test should end with the line

TEST: all tests passed

Alternatively, the same script can execute Pig in local mode, which does not require a working Hadoop instance. To run the tests in local mode execute

$ test/test_all.sh -l

If you intend to run SeqPig on an Amazon Elastic MapReduce instance, you can also test input from S3 by providing an S3 path to the file data/input.bam:

$ test/test_all.sh -s <s3_path>

for example:

$ test/test_all.sh -s seqpig/data/input.bam

where seqpig is the name of the S3 bucket.

2.6 Usage

2.6.1 Pig grunt shell for interactive operations

Assuming that all the environment variables have been set as described in the previous sections, you can start the SeqPig-enabled “grunt” shell by running

$ seqpig

If you prefer to run SeqPig in local mode (without Hadoop), which is useful for debugging scripts, you can start it by running

$ seqpig -x local

2.6.2 Starting scripts from the command line for non-interactive use

As an alternative to the interactive Pig grunt shell, users can write scripts that are then submitted to Pig/Hadoop for automated execution. This type of execution has the advantage of being able to handle parameters; for instance, one can parametrize input and output files. See the /scripts directory inside the SeqPig distribution and Section 3 for examples.

3 Examples

This section lists a number of examples of the types of operations that can be performed with SeqPig. Of course, what can be done is not limited to these operations.

3.1 Operations on BAM files

To access sequencing data in BAM files, SeqPig uses Hadoop-BAM, which provides access to all fields and optional attributes in the data records.

All the examples below assume that an input BAM file is initially imported to HDFS via

prepareBamInput.sh input.bam

and then loaded in the grunt shell via

A = load 'input.bam' using BamUDFLoader('yes');

The 'yes' parameter to BamUDFLoader selects whether read attributes are loaded; choose 'no' whenever these are not required.

Once some operations have been performed, the resulting (modified) read data can then be stored into a new BAM file via

store A into 'output.bam' using BamUDFStorer('input.bam.asciiheader');

and can also be exported from HDFS to the local filesystem via

prepareBamOutput.sh output.bam

Note: the Pig store operation requires a valid header for the BAM output file, for example the header of the source file used to generate it. That header is generated automatically by the prepareBamInput.sh script used to import the source file.

Writing the BAM data to the screen (similarly to samtools view) can be done simply by

dump A;

Another very useful Pig command is describe, which returns the schema that Pig uses for a given data bag. Example:

A = load 'input.bam' using BamUDFLoader('yes'); 
describe A;

returns

 BamData: {name: chararray,start: int,end: int,read: chararray,cigar: chararray, 
  basequal: chararray,flags: int,insertsize: int,mapqual:int,matestart: int, 
  materefindex: int,refindex: int,refname: chararray,attributes: map[]}

Notice that all fields except the attributes are standard data types (strings or integers). Specific attributes can be accessed via attributes#'name'. For example,

B = FOREACH A GENERATE name, attributes#'MD'; 
dump B;

will output all read names and their corresponding MD tag. Other useful commands are LIMIT and SAMPLE, which can be used to select a subset of reads from a BAM/SAM file.

B = LIMIT A 20;

will assign the first 20 records of A to B, while

B = SAMPLE A 0.01;

will sample from A with sampling probability 0.01.

3.1.1 Filtering out unmapped reads and PCR or optical duplicates

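Since the flags field of a SAM record is exposed to Pig, one can use it directly to filter out all tuples (i.e., SAM records) that have the corresponding bit set. The following keeps only reads for which neither the unmapped bit (4) nor the duplicate bit (1024) is set.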

A = FILTER A BY (flags/4)%2==0 and (flags/1024)%2==0;

For convenience, SeqPig provides a set of filters that give direct access to the relevant fields. The previous example is equivalent to

run scripts/filter_defs.pig 
A = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);

For a full list of the available filters, see the file scripts/filter_defs.pig.
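
The arithmetic form of the flag test can be checked outside Pig. The following Python sketch is an illustration only, not SeqPig code; the function names are hypothetical. It shows that (flags/4)%2 with integer division is the same test as a bitwise AND against the SAM flag values 4 (read unmapped) and 1024 (PCR or optical duplicate).

```python
# SAM flag bits used in this section (values from the SAM specification).
READ_UNMAPPED = 0x4      # 4: the read itself is unmapped
IS_DUPLICATE = 0x400     # 1024: PCR or optical duplicate


def bit_is_set_arithmetic(flags, bit_value):
    """Mirror the Pig expression (flags/bit_value)%2, using integer division."""
    return (flags // bit_value) % 2 == 1


def bit_is_set_bitwise(flags, bit_value):
    """The same test written as a bitwise AND."""
    return (flags & bit_value) != 0


def keep_read(flags):
    """True for reads that are mapped and not marked as duplicates."""
    return (not bit_is_set_bitwise(flags, READ_UNMAPPED)
            and not bit_is_set_bitwise(flags, IS_DUPLICATE))


if __name__ == "__main__":
    # Print the keep/drop decision for a few sample flag values.
    for flags in (0, 4, 16, 99, 1024, 1028, 1040):
        print(flags, keep_read(flags))
```

The two forms agree for every flag value because each divisor is a power of two, so dividing and taking the remainder modulo 2 isolates exactly one bit.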

3.1.2 Filtering out reads with low mapping quality

Other fields can also be used for filtering, for example the read mapping quality value as shown below.

A = FILTER A BY mapqual > 19;

3.1.3 Filtering by regions (samtools syntax)

SeqPig also supports filtering by samtools region syntax. The following example selects base positions 1 to 44350673 of chromosome 20.

DEFINE myFilter CoordinateFilter('input.bam.asciiheader','20:1-44350673'); 
A = FILTER A BY myFilter(refindex,start,end);

Note that filtering by regions requires a valid SAM header for mapping sequence names to sequence indices. This file is generated automatically when BAM files are imported via the prepareBamInput.sh script.
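
To make the region syntax concrete, the following Python sketch parses a region string of the form name:start-end and tests a read's reference span against it. It is an illustration only: parse_region and overlaps are hypothetical names, the sketch compares sequence names directly instead of mapping them to indices via the SAM header, and it assumes that a read is selected when it overlaps the region by at least one base, which may differ from what CoordinateFilter does.

```python
def parse_region(region):
    """Split a 'name:start-end' region string into (name, start, end).

    Only the full form used in this section is handled; coordinates are
    1-based and inclusive, as in samtools.
    """
    name, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    if not name or not start_text or not end_text:
        raise ValueError(f"expected name:start-end, got {region!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in {region!r}")
    return name, start, end


def overlaps(read_ref, read_start, read_end, region):
    """True if the read's reference span shares at least one base with region.

    Assumption: overlap is the selection rule; CoordinateFilter may differ.
    """
    name, start, end = parse_region(region)
    return read_ref == name and read_start <= end and read_end >= start
```

For example, overlaps("20", 44350600, 44350700, "20:1-44350673") is true under this rule, because the read starts inside the region even though it ends beyond it.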

3.1.4 Sorting BAM files

Sorting an input BAM file by chromosome, reference start coordinate, strand and readname (in this hierarchical order):

A = FOREACH A GENERATE name, start, end, read, cigar, basequal, flags, insertsize, 
mapqual, matestart, materefindex, refindex, refname, attributes, (flags/16)%2; 
A = ORDER A BY refname, start, $14, name;

Note: this is roughly equivalent to executing from the command line

$ pig -param inputfile=input.bam -param outputfile=input_sorted.bam ${SEQPIG_HOME}/scripts/sort_bam.pig

3.1.5 Computing read coverage

Read coverage can be computed over reference-coordinate bins of a fixed size. For example,

B = GROUP A BY start/200; 
C = FOREACH B GENERATE group, COUNT(A); 
dump C;

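will output, for each non-overlapping bin of 200 base pairs, the number of reads whose start coordinate falls into that bin. Note that this groups by start coordinate only, so reads from different reference sequences share bins; to keep them apart, group by (refname, start/200) instead.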
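
The binning relies on integer division: every start coordinate from 0 to 199 maps to bin 0, 200 to 399 to bin 1, and so on. The following Python sketch illustrates the same counting outside Pig. It is not SeqPig code, the function name reads_per_bin is hypothetical, and unlike the Pig statements above it also keys on the reference name.

```python
from collections import Counter


def reads_per_bin(reads, bin_size=200):
    """Count reads per (refname, bin index), where bin index = start // bin_size.

    `reads` is an iterable of (refname, start) pairs. Keying on refname as
    well as on the bin keeps different reference sequences apart.
    """
    counts = Counter()
    for refname, start in reads:
        counts[(refname, start // bin_size)] += 1
    return counts
```

With the pairs ("20", 0), ("20", 199), ("20", 200) and ("21", 10), the first two fall into bin 0 of reference 20, the third into bin 1 of reference 20, and the last into bin 0 of reference 21.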

3.1.6 Computing base frequencies (counts) for each reference coordinate

A = FOREACH A GENERATE read, flags, refname, start, cigar, basequal, mapqual;
A = FILTER A BY (flags/4)%2==0; 
RefPos = FOREACH A GENERATE ReadRefPositions(read, flags, refname, start, cigar, basequal), mapqual; 
flatset = FOREACH RefPos GENERATE flatten($0), mapqual; 
grouped = GROUP flatset BY ($0, $1, $2); 
base_counts = FOREACH grouped GENERATE group.chr, group.pos, group.base, COUNT(flatset); 
base_counts = ORDER base_counts BY chr,pos; 
store base_counts into 'input.basecounts';

Note: this is roughly equivalent to executing from the command line

$ pig -param inputfile=input.bam -param outputfile=input.basecounts -param pparallel=1 ${SEQPIG_HOME}/scripts/basefreq.pig

3.1.7 Pileup

The following generates samtools-compatible pileup output. For a correctly sorted BAM file with MD tags, aligned to the same reference, it should produce the same output as samtools mpileup -A -f ref.fasta -B input.bam:

A = load 'input.bam' using BamUDFLoader('yes'); 
B = FILTER A BY (flags/4)%2==0 and (flags/1024)%2==0; 
C = FOREACH B GENERATE ReadPileup(read, flags, refname, start, cigar, 
    basequal, attributes#'MD', mapqual), start, flags, name; 
C = FILTER C BY $0 is not null; 
D = FOREACH C GENERATE flatten($0), start, flags, name; 
E = GROUP D BY (chr, pos); 
F = FOREACH E {
    G = FOREACH D GENERATE refbase, pileup, qual, start, (flags/16)%2, name;
    G = ORDER G BY start, $4, name;
    GENERATE group.chr, group.pos, PileupOutputFormatting(G, group.pos);
}
G = ORDER F BY chr, pos; 
H = FOREACH G GENERATE chr, pos, flatten($2); 
store H into 'input.pileup' using PigStorage('\t');

Note: this is equivalent to executing from the command line

$ pig -param inputfile=input.bam -param outputfile=input.pileup -param pparallel=1 ${SEQPIG_HOME}/scripts/pileup.pig

The script essentially does the following:

  1. Import BAM file and filter out unmapped or duplicate reads (A, B)
  2. Break up each read and produce per-base pileup output (C, D)
  3. Group all thus generated pileup output based on a (chromosome, position) coordinate system (E)
  4. For each of the groups, sort its elements by their position, strand and name; then format the output according to samtools (F)
  5. Sort the final output again by (chromosome, position) and flatten the nested tuples into plain output fields (G, H)
  6. Store the output to a directory inside HDFS (last line)

There are two optional parameters for pileup.pig: min_map_qual and min_base_qual (both with default value 0) that filter out reads with either insufficient map quality or base qualities. Their values can be set the same way as the other parameters above.

There is an alternative pileup script, which typically performs better but is more sensitive to additional parameters. This second script, pileup2.pig, bins the reads according to intervals on the reference sequence. The pileup output is then generated on a by-bin level and not on a by-position level. This script can be invoked with the same parameters as pileup.pig. In addition, it has tunable parameters that determine the size of the bins (binsize) and the maximum number of reads considered per bin (reads_cutoff), which is similar to the maximum depth parameter that samtools accepts. Note, however, that since reads_cutoff is set on a per-bin level, you may want to choose it depending on the read length and bin size, as well as the amount of memory available on the compute nodes.

3.1.8 Collecting read-mapping-quality statistics

In order to evaluate the output of an aligner, it may be useful to consider the distribution of the mapping quality over the collection of reads. Thanks to Pig's GROUP operator this is fairly easy.

A = load 'input.bam' using BamUDFLoader('yes'); 
B = FILTER A BY (flags/4)%2==0 and (flags/1024)%2==0; 
read_stats_data = FOREACH B GENERATE mapqual; 
read_stats_grouped = GROUP read_stats_data BY mapqual; 
read_stats = FOREACH read_stats_grouped GENERATE group, COUNT($1); 
read_stats = ORDER read_stats BY group; 
STORE read_stats into 'mapqual_dist.txt';

Note: this is equivalent to executing from the command line

$ pig -param inputfile=input.bam -param outputfile=mapqual_dist.txt ${SEQPIG_HOME}/scripts/read_stats.pig

3.1.9 Collecting per-base statistics of reads

Sometimes it may be useful to analyze a given set of reads for a bias towards certain bases being called at certain positions inside the read. The following simple script generates for each reference base and each position inside a read the distribution of the number of read bases that were called.

A = load 'input.bam' using BamUDFLoader('yes'); 
B = FILTER A BY (flags/4)%2==0 and (flags/1024)%2==0; 
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD'); 
D = FOREACH C GENERATE FLATTEN($0); 
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase; 
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase); 
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 as readbase, COUNT($1) AS bcount; 
base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos); 
base_stats = FOREACH base_stats_grouped { 
      TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount; 
      TMP2 = ORDER TMP1 BY bcount desc; 
      GENERATE group.$0, group.$1, TMP2; 
  } 
STORE base_stats into 'outputfile_readstats.txt';

Here is an example output (for a BAM file with 50 reads):

A     0     {(A,19),(G,2)} 
A     1     {(A,10)} 
A     2     {(A,18)} 
A     3     {(A,16)} 
A     4     {(A,14)} 
A     5     {(A,15)} 
A     6     {(A,16),(G,2)} 
... 
A     98    {(A,7)} 
A     99    {(A,14)} 
C     0     {(C,6)} 
C     1     {(C,11)} 
C     2     {(C,9)} 
...

Note: this example script is equivalent to executing from the command line

$ pig -param inputfile=input.bam -param outputfile=outputfile_readstats.txt $SEQPIG_HOME/scripts/basequal_stats.pig

Figure 1 shows the distribution obtained from a sample BAM file.


Figure 1: 3-D Histogram of base qualities over read length for a sample BAM file. The x-axis (values range 1 to 100) shows the index of bases in the read, while the y-axis shows base quality. The z-axis is the scaled frequency. The plot was generated by converting the output of basequal_stats.pig using tools/basequal_stats2matrix.pl and then plotted using tools/plot_basequal_stats.R.
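
The two grouping steps of the per-base script in this section can be illustrated outside Pig. The following Python sketch is not SeqPig code and the function name base_stats is hypothetical. It assumes the reads have already been split into (refbase, basepos, readbase) records, which is the job of ReadSplit in the script, and it breaks ties between equal counts by base letter, a choice the Pig script does not specify.

```python
from collections import Counter, defaultdict


def base_stats(records):
    """Summarise (refbase, basepos, readbase) records.

    Returns {(refbase, basepos): [(readbase, count), ...]} with each list
    ordered by descending count, mirroring the two GROUP steps of the script.
    """
    # First grouping: count every (refbase, basepos, readbase) combination.
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in records)

    # Second grouping: collect the counts under their (refbase, basepos) key.
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))

    # Order each group by count, largest first (ties broken by base letter).
    return {key: sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
            for key, pairs in grouped.items()}
```

Given 19 records ("A", 0, "A") and 2 records ("A", 0, "g"), the entry for ("A", 0) is [("A", 19), ("G", 2)], which has the same shape as the first line of the example output above.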

3.1.10 Collecting per-base statistics of base qualities for reads

Analogously to the previous example collecting statistics for the read bases, we can also collect frequencies for base qualities conditioned on the position of the base inside the reads. If these fall off too quickly for later positions, it may indicate some quality issues with the run. The resulting script is actually fairly similar to the previous one with the difference of not grouping over the reference bases.

A = load 'input.bam' using BamUDFLoader('yes'); 
B = FILTER A BY (flags/4)%2==0 and (flags/1024)%2==0; 
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD'); 
D = FOREACH C GENERATE FLATTEN($0); 
base_stats_data = FOREACH D GENERATE basepos, basequal; 
base_stats_grouped = GROUP base_stats_data BY (basepos, basequal); 
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 as basepos, group.$1 AS basequal, COUNT($1) AS qcount; 
base_stats_grouped = GROUP base_stats_grouped_count BY basepos; 
base_stats = FOREACH base_stats_grouped { 
      TMP1 = FOREACH base_stats_grouped_count GENERATE basequal, qcount; 
      TMP2 = ORDER TMP1 BY basequal; 
      GENERATE group, TMP2; 
} 
STORE base_stats into 'outputfile_basequalstats.txt';

Here is an example output (for a BAM file with 50 reads):

0     {(37,10),(42,1),(51,20),(52,1),(59,1),(61,1),(62,1),(67,2),(68,2),(70,2),(71,4),(72,3),(73,1),(75,2)} 
1     {(53,1),(56,1),(61,1),(63,1),(64,1),(65,2),(67,4),(68,3),(69,2),(70,7),(71,3),(72,3),(73,1),(74,4),(75,2),(76,5),(77,6),(78,2),(80,1)} 
2     {(45,1),(46,1),(51,2),(57,1),(61,1),(65,2),(66,3),(67,2),(69,3),(71,4),(72,2),(73,6),(74,7),(75,1),(76,8),(77,2),(78,3),(80,1)} 
3     {(58,1),(59,1),(60,1),(61,1),(62,1),(64,1),(65,2),(67,2),(68,1),(69,5),(70,1),(71,3),(72,7),(73,2),(74,4),(75,6),(76,2),(77,4),(78,3),(79,1),(81,1)} 
4     {(55,1),(60,1),(61,1),(62,1),(64,1),(66,1),(67,3),(68,2),(69,1),(70,7),(71,2),(72,1),(73,4),(74,2),(75,2),(76,2),(77,2),(78,3),(79,7),(80,4),(81,2)} 
5     {(51,1),(52,2),(54,1),(58,2),(62,2),(63,1),(66,3),(68,4),(70,1),(71,1),(72,2),(73,3),(74,1),(75,8),(76,1),(77,5),(78,1),(79,6),(80,3),(81,3)} 
...

Note: this example script is equivalent to executing from the command line

$ pig -param inputfile=input.bam -param outputfile=outputfile_basequalstats.txt $SEQPIG_HOME/scripts/basequal_stats.pig

3.1.11 Filtering reads by mappability threshold

The script filter_mappability.pig filters reads in a given BAM file based on a given mappability threshold. Both the input BAM file and the mappability file need to reside inside HDFS:

$ pig -param inputfile=/user/hadoop/input.bam -param outputfile=/user/hadoop/output.bam -param regionfile=/user/hadoop/mappability.100kbp.txt -param threshold=90 $SEQPIG_HOME/scripts/filter_mappability.pig

Note that since the script relies on distributing the BAM file header and the mappability file via Hadoop's distributed cache, it is not possible to run it with Pig in local mode.

3.2 Processing Qseq and Fastq data

SeqPig supports the import and export of non-aligned reads stored in Qseq and Fastq files. Because Pig models all records as tuples, which form bags, reads can be processed in much the same way regardless of whether they are stored, for example, in Qseq or Fastq.

3.2.1 Converting Qseq to Fastq and vice versa

The following two lines simply convert an input Qseq into Fastq.

reads = load 'input.qseq' using QseqUDFLoader(); 
STORE reads INTO 'output.fastq' using FastqUDFStorer();

The other direction works analogously.

3.2.2 Clipping bases and base qualities

Assuming there were some problems in certain cycles of the sequencer, it may be useful to clip bases from reads. This example removes the last 3 bases and their qualities and stores the data under a new filename. Note that here we rely on the SUBSTRING and LENGTH string functions, which are part of PiggyBank (see Section 2.3.1).

DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING(); 
DEFINE LENGTH org.apache.pig.piggybank.evaluation.string.LENGTH(); 
reads = load 'input.qseq' using QseqUDFLoader(); 
B = FOREACH reads GENERATE instrument, run_number, flow_cell_id, lane, tile, xpos, ypos, read, qc_passed, control_number, index_sequence, SUBSTRING(sequence, 0, LENGTH(sequence) - 3) AS sequence, SUBSTRING(quality, 0, LENGTH(quality) - 3) AS quality;
store B into 'output.qseq' using QseqUDFStorer();

Note: this example script is equivalent to executing from the command line

$ pig -param inputfile=input.qseq -param outputfile=output.qseq -param backclip=3 $SEQPIG_HOME/scripts/clip_reads.pig
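
The clipping step itself is a pair of substring operations applied to the sequence and to its quality string. The following Python sketch illustrates it outside Pig; it is not SeqPig code, and the function name clip_back is hypothetical. Returning empty strings for reads shorter than the clip length is an assumption of the sketch; how the Pig script treats such reads is not covered here.

```python
def clip_back(sequence, quality, backclip=3):
    """Drop the last `backclip` bases and their quality values.

    Assumption of this sketch: reads shorter than `backclip` become empty
    strings instead of raising an error.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if backclip < 0:
        raise ValueError("backclip must be non-negative")
    keep = max(len(sequence) - backclip, 0)
    return sequence[:keep], quality[:keep]
```

For example, clip_back("ACGTACGT", "IIIIHHGG") returns ("ACGTA", "IIIIH"). Clipping both strings with the same length keeps each remaining base paired with its own quality value.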

4 Further information

4.1 Hadoop parameters

Some scripts (such as pileup2.pig) require an amount of memory that depends on the choice of command line parameters. To tune the performance of such operations on the Hadoop cluster, consider adjusting the relevant Hadoop-specific parameters (for example, memory limits) in mapred-site.xml.

Additionally, it is possible to pass command line parameters (such as mapper and reducer memory limits). For instance, consider the Pig invocation (see tools/tun_all_pileup2.sh)

${PIG_HOME}/bin/pig -Dpig.additional.jars=${SEQPIG_HOME}/lib/hadoop-bam-5.0.jar:${SEQPIG_HOME}/build/jar/SeqPig.jar:${SEQPIG_HOME}/lib/seal.jar:${SEQPIG_HOME}/lib/picard-1.76.jar:${SEQPIG_HOME}/lib/sam-1.76.jar -Dmapred.job.map.memory.mb=${MAP_MEMORY} -Dmapred.job.reduce.memory.mb=${REDUCE_MEMORY} -Dmapred.child.java.opts=-Xmx${CHILD_MEMORY}M -Dudf.import.list=fi.aalto.seqpig -param inputfile=$INPUTFILE -param outputfile=$OUTPUTFILE -param pparallel=${REDUCESLOTS} ${SEQPIG_HOME}/scripts/pileup2.pig

By setting appropriate values for MAP_MEMORY, REDUCE_MEMORY, CHILD_MEMORY and for the number of available reduce slots REDUCESLOTS one may be able to improve performance.
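
When experimenting with different memory settings it can be convenient to generate the three memory-related options from one place. The following Python sketch is an illustration only: pig_memory_options is a hypothetical helper, the numbers in the usage note are placeholders and not recommendations, and the sanity check rests on the assumption that a JVM heap larger than a task's memory limit is never intended.

```python
def pig_memory_options(map_memory_mb, reduce_memory_mb, child_memory_mb):
    """Return the three memory-related -D options from the invocation above.

    All values are in megabytes. Raises ValueError if the child JVM heap
    would exceed either task memory limit (an assumption of this sketch).
    """
    if child_memory_mb > min(map_memory_mb, reduce_memory_mb):
        raise ValueError("child heap exceeds a task memory limit")
    return [
        f"-Dmapred.job.map.memory.mb={map_memory_mb}",
        f"-Dmapred.job.reduce.memory.mb={reduce_memory_mb}",
        f"-Dmapred.child.java.opts=-Xmx{child_memory_mb}M",
    ]
```

For instance, pig_memory_options(2048, 4096, 1536) yields the options for MAP_MEMORY=2048, REDUCE_MEMORY=4096 and CHILD_MEMORY=1536, which can then be placed on the Pig command line shown above.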

4.2 Compression

For optimal performance and space usage it may be advisable to enable the compression of Hadoop map (and possibly reduce) output, as well as temporary data generated by Pig. Compression with Pig can be enabled by setting properties such as

 -Djava.library.path=/opt/hadoopgpl/native/Linux-amd64-64 
 -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
 -Dmapred.output.compress=true 
 -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

on the Pig command line or in the Hadoop configuration. Note that currently not all Hadoop compression codecs are supported by Pig. The details regarding which compression codec to use are beyond the scope of this manual.