SeqPig is a library for Apache Pig http://pig.apache.org/ for the distributed analysis of large sequencing datasets on Hadoop clusters. With SeqPig one can easily take advantage of the advanced high-level features of Pig to manipulate and analyze sequencing data, thus writing simple scripts that automatically run as scalable distributed programs on potentially very large Hadoop clusters.
SeqPig provides a number of features and functionalities. It provides import and export functions for file formats commonly used in bioinformatics, as well as a collection of Pig user-defined-functions (UDF's) that add to Pig functionality specifically designed for processing aligned and unaligned sequencing data. Currently SeqPig supports BAM, SAM, FastQ and Qseq input and output, thanks in part to the functionality provided by the Hadoop-BAM library.
This document can also be found under http://seqpig.sourceforge.net/.
Contact information: mailto:seqpig-users@lists.sourceforge.net
Releases of SeqPig come bundled with Picard/Samtools, which is developed at the Wellcome Trust Sanger Institute, and Seal,
which is developed at CRS4. See
http://samtools.sourceforge.net/ and http://biodoop-seal.sourceforge.net/ for more details.
For more examples of SepPig scripts see also the wiki of two past SEQAHEAD COST hackathons:
http://seqahead.cs.tu-dortmund.de/meetings:fastqpigscripting
http://seqahead.cs.tu-dortmund.de/meetings:2012-05-hackathon:pileuptask
http://seqahead.cs.tu-dortmund.de/meetings:2012-05-hackathon:seqpig_life_savers_page
On a Cloudera Hadoop installation with Pig a suitable environment configuration would be:
Note: the Picard and Sam jar files are contained in the hadoop-bam release for convenience.
Once you've built SeqPig, you can move the directory to a location of your preference (if on a shared system, perhaps /usr/local/java/seqpig, else even your home directory could be fine).
Set the environment variable SEQPIG_HOME to point to the installation directory of SeqPig; e.g.,
For your convenience, you can add the bin directory to your PATH:
This way, you'll be able to start a SeqPig-enabled Pig shell by running the seqpig command.
Some of the example scripts in this manual (e.g., Section 3.2.2) require functions from PiggyBank, which is a collection of publicly available User-Defined Functions (UDF's) that are distributed with Pig but may need to be built separately, depending on your Pig distribution. For more details see https://cwiki.apache.org/confluence/display/PIG/PiggyBank. Verify that PiggyBank has been compiled by looking for the file piggybank.jar under $PIG_HOME:
If PiggyBank hasn't been compiled, go into $PIG_HOME/contrib/piggybank/java and run ant.
Assuming you have started an interactive Pig Job Flow (for example via the AWS console), you can login into the master node and copy the SeqPig release to the Hadoop user home directory. Then set both SEQPIG_HOME and PIG_HOME correctly (HADOOP_HOME should be set by default). Note that the Pig version installed does not necessarily match the latest Pig release. The advantage, however, is the ability to use S3 buckets for input and output.
Consider the following example for starting SeqPig on EMR that assumes the SeqPig release was installed into /home/hadoop/seqpig.
After building SeqPig it may be a good idea to run tests to verify that the environment has been set up correctly. When inside the SeqPig directory execute
This test requires that Hadoop is set up correctly. It first imports a BAM file, sorts the reads by coordinate and converts it to SAM for comparing the results. The test should end with the line
Alternatively, the same script can execute Pig in local mode, which does not require a working Hadoop instance. To run the tests in local mode execute
If you intend to run SeqPig on an Amazon Elastic MapReduce instance, you can also test input from S3 by providing an S3 path to the file data/input.bam:
for example:
where seqpig is the name of the S3 bucket.
Assuming that all the environment variables have been set as described in the previous sections, you can start the SeqPig-enabled “grunt” shell by running
If you prefer to tun SeqPig in local mode (without Hadoop), which is useful for debugging scripts, you can start it by running
Alternatively to using the interactive Pig grunt shell, users can write scripts that are then submitted to Pig/Hadoop for automated execution. This type of execution has the advantage of being able to handle parameters; for instance, one can parametrize input and output files. See the /scripts directory inside the SeqPig distribution and Section 3 for examples.
This section lists a number of examples of the types of operations that can be performed with SeqPig. Of course, what can be done is not limited to these operations.
To access sequencing data in BAM files, SeqPig uses Hadoop-BAM, which provides access to all fields and optional attributes in the data records.
All the examples below assume that an input BAM file is initially imported to HDFS via
and then loaded in the grunt shell via
The 'yes' parameter to BamUDFLoader chooses read attributes to be loaded; choose 'no' whenever these are not required).
Once some operations have been performed, the resulting (modified) read data can then be stored into a new BAM file via
and can also be exported from HDFS to the local filesystem via
Writing the BAM data to the screen (similarly to samtools view) can be done simply by
Another very useful Pig command is describe, which returns the schema that Pig uses for a given data bag. Example:
returns
Notice that all fields except the attributes are standard data types (strings or integers). Specific attributes can be accessed via attributes#'name'. For example,
will output all read names and their corresponding MD tag. Other useful commands are LIMIT and SAMPLE, which can be used to select a subset of reads from a BAM/SAM file.
will assign the first 20 records of A to B, while
will sample from A with sampling probability 0.01.
Since the flags field of a SAM record is exposed to Pig, one can simply use it to filter out all tuples (i.e., SAM records) that do not have the corresponding bit set.
For convenience SeqPig provides a set of filters that allow a direct access to the relevant fields. The previous example is equivalent to
For of a full list of the available filters look at the file scripts/filter_defs.pig.
Other fields can also be used for filtering, for example the read mapping quality value as shown below.
SeqPig also supports filtering by samtools region syntax. The following examples selects base positions 1 to 44350673 of chromosome 20.
Note that filtering by regions requires a valid SAM header for mapping sequence names to sequence indices. This file is generated automatically when BAM files are imported via the prepareBamInput.sh script.
Sorting an input BAM file by chromosome, reference start coordinate, strand and readname (in this hierarchical order):
Computing read coverage over reference-coordinate bins of a fixed size, for example:
will output the number of reads that lie in any non-overlapping bin of size 200 base pairs.
Generating samtools compatible pileup (for a correctly sorted BAM file with MD tags aligned to the same reference, should produce the same output as samtools mpileup -A -f ref.fasta -B input.bam):
The script essentially does the following:
There are two optional parameters for pileup.pig: min_map_qual and min_base_qual (both with default value 0) that filter out reads with either insufficient map quality or base qualities. Their values can be set the same way as the other parameters above.
There is an alternative pileup script which typically performs better but is more sensitive to additional parameters. This second script, pileup2.pig, is based on a binning of the reads according to intervals on the reference sequence. The pileup output is then generated on a by-bin level and not on a by-position level. This script can be invoked with the same paramters as pileup2.pig. However, it has tunable parameters that determine the size of the bins (binsize) and the maximum number of reads considered per bin (reads_cutoff), which is similar to the maximum depth parameter that samtools accepts. However, note that since this parameter is set on a per-bin level you may chose it dependent on the read length and bin size, as well as the amount of memory available on the compute nodes.
In order to evaluate the output of an aligner, it may be useful to consider the distribution of the mapping quality over the collection of reads. Thanks to Pig's GROUP operator this is fairly easy.
Sometimes it may be useful to analyze a given set of reads for a bias towards certain bases being called at certain positions inside the read. The following simple script generates for each reference base and each position inside a read the distribution of the number of read bases that were called.
Here is an example output (for a BAM file with 50 reads):
Figure 1 shows the distribution obtained from a sample BAM file.

Analogously to the previous example collecting statistics for the read bases, we can also collect frequencies for base qualities conditioned on the position of the base inside the reads. If these fall off too quickly for later positions, it may indicate some quality issues with the run. The resulting script is actually fairly similar to the previous one with the difference of not grouping over the reference bases.
Here is an example output (for a BAM file with 50 reads):
The script filter_mappability.pig filters reads in a given BAM file based on a given mappability threshold. Both input BAM and mappability file need to reside inside HDFS
Note that since the script relies on distributing the bam file header and the mappability file via Hadoop's distributed cache, it is not possible to run it with Pig in local mode.
SeqPig supports the import and export of non-aligned reads stored in Qseq and Fastq data. Due to Pig's model that all records correspond to tuples, which form bags, reads can be processed in very much the same way independent on for example whether they are stored in Qseq or Fastq.
The following two lines simply convert an input Qseq into Fastq.
The other direction works analogously.
Assuming there were some problems in certain cycles of the sequencer, it may be useful to clip bases from reads. This example removes the last 3 bases and their qualities and stores the data under a new filename. Note that here we rely on the SUBSTRING and LENGTH string functions, which is part of the PiggyBank (see Section 2.3.1).
Some scripts (such as pileup2.pig) require an amount of memory that depends on the choice of command line parameters. To tune the performance of such operations on the Hadoop cluster, consider the following Hadoop-specific parameters in mapred-site.xml.
Additionally, it is possible to pass command line parameters (such as mapper and reducer memory limits). For instance, consider the Pig invocation (see tools/tun_all_pileup2.sh)
By setting appropriate values for MAP_MEMORY, REDUCE_MEMORY, CHILD_MEMORY and for the number of available reduce slots REDUCESLOTS one may be able to improve performance.
For optimal performance and space usage it may be advisable to enable the compression of Hadoop map (and possible reduce) output, as well as temporary data generated by Pig. Compression with Pig can be enabled by setting properties such as
on the Pig command line or the Hadoop configuration. Note that currently not all Hadoop compression codecs are supported by Pig. The details regarding which compression codec to use depend are beyond the scope of this manual.