Overview

Background

The development of bedtools was motivated by a need for fast, flexible tools with which to compare large sets of genomic features. Answering fundamental research questions with existing tools was either too slow or required modifications to the way they reported or computed their results. We were aware of the utilities on the UCSC Genome Browser and Galaxy websites, as well as the elegant tools available as part of Jim Kent’s monolithic suite of tools (“Kent source”). However, we found that the web-based tools were too cumbersome when working with large datasets generated by current sequencing technologies. Similarly, we found that the Kent source command line tools often required a local installation of the UCSC Genome Browser. These limitations, combined with the fact that we often wanted an extra option here or there that wasn’t available with existing tools, led us to develop our own from scratch. The initial version of bedtools was publicly released in the spring of 2009. The current version has evolved from our research experiences and those of the scientists using the suite over the last year. The bedtools suite enables one to answer common questions of genomic data in a fast and reliable manner. The fact that almost all the utilities accept input from “stdin” allows one to “stream / pipe” several commands together to facilitate more complicated analyses. Also, the tools allow fine control over how output is reported. The initial version of bedtools supported solely 6-column BED files. However, we have subsequently added support for sequence alignments in BAM format, as well as for features in GFF , “blocked” BED format, and VCF format. The tools are quite fast and typically finish in a matter of a few seconds, even for large datasets. This manual seeks to describe the behavior and available functionality for each bedtool. Usage examples are scattered throughout the text, and formal examples are provided in the last two sections, we hope that this document will give you a sense of the flexibility of the toolkit and the types of analyses that are possible with bedtools. If you have further questions, please join the bedtools discussion group, visit the Usage Examples on the Google Code site (usage, advanced usage), or take a look at the nascent “Usage From the Wild” page.

Summary of available tools.

bedtools support a wide range of operations for interrogating and manipulating genomic features. The table below summarizes the tools available in the suite.

Utility Description
annotate Annotate coverage of features from multiple files.
bamtobed Convert BAM alignments to BED (& other) formats.
bamtofastq Convert BAM records to FASTQ records.
bed12tobed6 Breaks BED12 intervals into discrete BED6 intervals.
bedpetobam Convert BEDPE intervals to BAM records.
bedtobam Convert intervals to BAM records.
closest Find the closest, potentially non-overlapping interval.
cluster Cluster (but don’t merge) overlapping/nearby intervals.
complement Extract intervals _not_ represented by an interval file.
coverage Compute the coverage over defined intervals.
expand Replicate lines based on lists of values in columns.
fisher Calculate Fisher statistic b/w two feature files.
flank Create new intervals from the flanks of existing intervals.
genomecov Compute the coverage over an entire genome.
getfasta Use intervals to extract sequences from a FASTA file.
groupby Group by common cols. & summarize oth. cols. (~ SQL “groupBy”)
igv Create an IGV snapshot batch script.
intersect Find overlapping intervals in various ways.
jaccard Calculate the Jaccard statistic b/w two sets of intervals.
links Create a HTML page of links to UCSC locations.
makewindows Make interval “windows” across a genome.
map Apply a function to a column for each overlapping interval.
maskfasta Use intervals to mask sequences from a FASTA file.
merge Combine overlapping/nearby intervals into a single interval.
multicov Counts coverage from multiple BAMs at specific intervals.
multiinter Identifies common intervals among multiple interval files.
nuc Profile the nucleotide content of intervals in a FASTA file.
overlap Computes the amount of overlap from two intervals.
pairtobed Find pairs that overlap intervals in various ways.
pairtopair Find pairs that overlap other pairs in various ways.
random Generate random intervals in a genome.
reldist Calculate the distribution of relative distances b/w two files.
tools/sample Sample random records from file using reservoir sampling.
shift Adjust the position of intervals.
shuffle Randomly redistribute intervals in a genome.
slop Adjust the size of intervals.
sort Order the intervals in a file.
tools/spacing Sample random records from file using reservoir sampling.
tools/split Split a file into multiple files with equal records or base pairs.
subtract Remove intervals based on overlaps b/w two files.
tag Tag BAM alignments based on overlaps with interval files.
unionbedg Combines coverage intervals from multiple BEDGRAPH files.
window Find overlapping intervals within a window around an interval.

Fundamental concepts.

What are genome features and how are they represented?

Throughout this manual, we will discuss how to use bedtools to manipulate, compare and ask questions of genome “features”. Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g. SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that an individual lab or researcher defines (e.g., my novel gene or variant).

The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “-“), and the name of the feature if one is applicable.

The two most widely used formats for representing genome features are the BED (Browser Extensible Data) and GFF (General Feature Format) formats. bedtools was originally written to work exclusively with genome features described using the BED format, but it has been recently extended to seamlessly work with BED, GFF and VCF files.

Existing annotations for the genomes of many species can be easily downloaded in BED and GFF format from the UCSC Genome Browser’s “Table Browser” (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) or from the “Bulk Downloads” page (http://hgdownload.cse.ucsc.edu/downloads.html). In addition, the Ensemble Genome Browser contains annotations in GFF/GTF format for many species (http://www.ensembl.org/info/data/ftp/index.html)

Overlapping / intersecting features.

Two genome features (henceforth referred to as “features”) are said to overlap or intersect if they share at least one base in common. In the figure below, Feature A intersects/overlaps Feature B, but it does not intersect/overlap Feature C.

TODO: place figure here

Comparing features in file “A” and file “B”.

The previous section briefly introduced a fundamental naming convention used in bedtools. Specifically, all bedtools that compare features contained in two distinct files refer to one file as feature set “A” and the other file as feature set “B”. This is mainly in the interest of brevity, but it also has its roots in set theory. As an example, if one wanted to look for SNPs (file A) that overlap with exons (file B), one would use bedtools intersect in the following manner:

bedtools intersect –a snps.bed –b exons.bed

There are two exceptions to this rule: 1) When the “A” file is in BAM format, the “-abam” option must be used. For example:

bedtools intersect –abam alignedReads.bam –b exons.bed

And 2) For tools where only one input feature file is needed, the “-i” option is used. For example:

bedtools merge –i repeats.bed

BED starts are zero-based and BED ends are one-based.

bedtools users are sometimes confused by the way the start and end of BED features are represented. Specifically, bedtools uses the UCSC Genome Browser’s internal database convention of making the start position 0-based and the end position 1-based: (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1) In other words, bedtools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base:

chr1   0        1    first_base

Why, you might ask? The advantage of storing features this way is that when computing the length of a feature, one must simply subtract the start from the end. Were the start position 1-based, the calculation would be (slightly) more complex (i.e. (end-start)+1). Thus, storing BED features this way reduces the computational burden.

GFF starts and ends are one-based.

In contrast, the GFF format uses 1-based coordinates for both the start and the end positions. bedtools is aware of this and adjusts the positions accordingly. In other words, you don’t need to subtract 1 from the start positions of your GFF features for them to work correctly with bedtools.

VCF coordinates are one-based.

The VCF format uses 1-based coordinates. As in GFF, bedtools is aware of this and adjusts the positions accordingly. In other words, you don’t need to subtract 1 from the start positions of your VCF features for them to work correctly with bedtools.

File B is loaded into memory (most of the time).

Whenever a bedtool compares two files of features, the “B” file is loaded into memory. By contrast, the “A” file is processed line by line and compared with the features from B. Therefore to minimize memory usage, one should set the smaller of the two files as the B file. One salient example is the comparison of aligned sequence reads from a current DNA sequencer to gene annotations. In this case, the aligned sequence file (in BED format) may have tens of millions of features (the sequence alignments), while the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.

Feature files must be tab-delimited.

This is rather self-explanatory. While it is possible to allow BED files to be space-delimited, we have decided to require tab delimiters for three reasons:

  1. By requiring one delimiter type, the processing time is minimized.
  2. Tab-delimited files are more amenable to other UNIX utilities.
  3. GFF files can contain spaces within attribute columns. This complicates the use of space-delimited files as spaces must therefore be treated specially depending on the context.

All bedtools allow features to be “piped” via standard input.

In an effort to allow one to combine multiple bedtools and other UNIX utilities into more complicated “pipelines”, all bedtools allow features to be passed to them via standard input. Only one feature file may be passed to a bedtool via standard input. The convention used by all bedtools is to set either file A or file B to “stdin” or “-”. For example:

cat snps.bed | bedtools intersect –a stdin –b exons.bed
cat snps.bed | bedtools intersect –a - –b exons.bed

In addition, all bedtools that simply require one main input file (the -i file) will assume that input is coming from standard input if the -i parameter is ignored. For example, the following are equivalent:

cat snps.bed | bedtools sort –i stdin
cat snps.bed | bedtools sort

Most bedtools write their results to standard output.

To allow one to combine multiple bedtools and other UNIX utilities into more complicated “pipelines”, most bedtools report their output to standard output, rather than to a named file. If one wants to write the output to a named file, one can use the UNIX “file redirection” symbol “>” to do so. Writing to standard output (the default):

bedtools intersect –a snps.bed –b exons.bed
chr1 100100 100101 rs233454
chr1 200100 200101 rs446788
chr1 300100 300101 rs645678

Writing to a file:

bedtools intersect –a snps.bed –b exons.bed > snps.in.exons.bed

cat snps.in.exons.bed
chr1 100100 100101 rs233454
chr1 200100 200101 rs446788
chr1 300100 300101 rs645678

What is a “genome” file?

Some of the bedtools (e.g., genomecov, complement, slop) need to know the size of the chromosomes for the organism for which your BED files are based. When using the UCSC Genome Browser, Ensemble, or Galaxy, you typically indicate which species / genome build you are working. The way you do this for bedtools is to create a “genome” file, which simply lists the names of the chromosomes (or scaffolds, etc.) and their size (in basepairs). Genome files must be tab-delimited and are structured as follows (this is an example for C. elegans):

chrI 15072421
chrII 15279323
...
chrX 17718854
chrM 13794

bedtools includes predefined genome files for human and mouse in the /genomes directory included in the bedtools distribution. Additionally, the “chromInfo” files/tables available from the UCSC Genome Browser website are acceptable. For example, one can download the hg19 chromInfo file here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz

Paired-end BED files (BEDPE files).

We have defined a new file format (BEDPE) to concisely describe disjoint genome features, such as structural variations or paired-end sequence alignments. We chose to define a new format because the existing BED block format (i.e. BED12) does not allow inter-chromosomal feature definitions. Moreover, the BED12 format feels rather bloated when one want to describe events with only two blocks.

Use “-h” for help with any bedtool.

Rather straightforward. If you use the “-h” option with any bedtool, a full menu of example usage and available options (when applicable) will be reported.

BED features must not contain negative positions.

bedtools will typically reject BED features that contain negative positions. In special cases, however, BEDPE positions may be set to -1 to indicate that one or more ends of a BEDPE feature is unaligned.

The start position must be <= to the end position.

bedtools will reject BED features where the start position is greater than the end position.

Headers are allowed in GFF and BED files

bedtools will ignore headers at the beginning of BED and GFF files. Valid header lines begin with a “#” symbol, the work “track”, or the word “browser”. For example, the following examples are valid headers for BED or GFF files:

track name=aligned_read description="Illumina aligned reads”
chr5 100000 500000 read1 50 +
chr5 2380000 2386000 read2 60 -

#This is a fascinating dataset
chr5 100000 500000 read1 50 +
chr5 2380000 2386000 read2 60 -

browser position chr22:1-20000
chr5 100000 500000 read1 50 +
chr5 2380000 2386000 read2 60 -

GZIP support: BED, GFF, VCF, and BEDPE file can be “gzipped”

bedtools will process gzipped BED, GFF, VCF and BEDPE files in the same manner as uncompressed files. Gzipped files are auto-detected thanks to a helpful contribution from Gordon Assaf.

Support for “split” or “spliced” BAM alignments and “blocked” BED features

As of Version 2.8.0, five bedtools (intersect, coverage, genomecov, bamToBed, and bed12ToBed6) can properly handle “split”/”spliced” BAM alignments (i.e., having an “N” CIGAR operation) and/or “blocked” BED (aka BED12) features.

intersect, coverage, and genomecov will optionally handle “split” BAM and/or “blocked” BED by using the -split option. This will cause intersects or coverage to be computed only for the alignment or feature blocks. In contrast, without this option, the intersects/coverage would be computed for the entire “span” of the alignment or feature, regardless of the size of the gaps between each alignment or feature block. For example, imagine you have a RNA-seq read that originates from the junction of two exons that were spliced together in a mRNA. In the genome, these two exons happen to be 30Kb apart. Thus, when the read is aligned to the reference genome, one portion of the read will align to the first exon, while another portion of the read will align ca. 30Kb downstream to the other exon. The corresponding CIGAR string would be something like (assuming a 76bp read): 30M*3000N*46M. In the genome, this alignment “spans” 3076 bp, yet the nucleotides in the sequencing read only align “cover” 76bp. Without the -split option, coverage or overlaps would be reported for the entire 3076bp span of the alignment. However, with the -split option, coverage or overlaps will only be reported for the portions of the read that overlap the exons (i.e. 30bp on one exon, and 46bp on the other).

Using the -split option with bamToBed causes “spliced/split” alignments to be reported in BED12 format. Using the -split option with bed12tobed6 causes “blocked” BED12 features to be reported in BED6 format.

Writing uncompressed BAM output.

When working with a large BAM file using a complex set of tools in a pipe/stream, it is advantageous to pass uncompressed BAM output to each downstream program. This minimizes the amount of time spent compressing and decompressing output from one program to the next. All bedtools that create BAM output (e.g. intersect, window) will now optionally create uncompressed BAM output using the -ubam option.

Implementation and algorithmic notes.

bedtools was implemented in C++ and makes extensive use of data structures and fundamental algorithms from the Standard Template Library (STL). Many of the core algorithms are based upon the genome binning algorithm described in the original UCSC Genome Browser paper (Kent et al, 2002). The tools have been designed to inherit core data structures from central source files, thus allowing rapid tool development and deployment of improvements and corrections. Support for BAM files is made possible through Derek Barnett’s elegant C++ API called BamTools.

comments powered by Disqus