The GenomicVectors Types and Methods

Types

# GenomicVectors.GenomeInfo (Type)

GenomeInfo Type

A GenomeInfo holds information about a genome including its name, chromosome names, chromosome lengths and chromosome offsets into a concatenated, linear genome (genopos). Indexing returns the genopos end of the indexed chromosome.

Examples

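using GenomicVectors

# Three chromosomes with lengths 300000, 200000 and 10000
chrinfo = GenomeInfo("hg19",["chr1","chr2","chrX"],Int64[3e5,2e5,1e4])
genome(chrinfo)       # genome name
chr_names(chrinfo)    # chromosome names
chr_lengths(chrinfo)  # chromosome lengths
chr_ends(chrinfo)     # genopos end of each chromosome
chr_offsets(chrinfo)  # offset of each chromosome into the linear genome
chrinfo[2]            # 500000, the genopos end of chr2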
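
The relationship between chromosome lengths, ends and offsets can be sketched in plain Julia without the package. This is only an illustration: the `sketch_` names are hypothetical, and the offset convention (the number of bases preceding each chromosome) is an assumption rather than something confirmed by the package source.

```julia
# Stand-alone sketch in plain Julia; it does not call GenomicVectors.jl.
lengths = Int64[300_000, 200_000, 10_000]   # chr1, chr2, chrX

# Genopos end of each chromosome: the running total of the lengths.
sketch_ends = cumsum(lengths)               # 300000, 500000, 510000

# Assumed offset convention: total length of the preceding chromosomes.
sketch_offsets = sketch_ends .- lengths     # 0, 300000, 500000
```

Under this reading, `sketch_ends[2]` is 500000, which matches the `chrinfo[2]` result in the example above.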

## GenomeInfo Interface
Much of the functionality of the types in the GenomicVectors.jl
package is provided via the GenomeInfo Interface, which gives
access to information about the genome (name, chromosome lengths, etc.).
This Interface requires the following method:

- chr_info

which returns a GenomeInfo object. That object then provides these methods:

- chr_names
- chr_lengths
- chr_ends
- chr_offsets
- genome
- same_genome

The GenoPos Interface provides access to positional information in the linearized
genome or in chromosome coordinates (e.g. chr4:1000-1020).

# GenomicVectors.GenomicPositions (Type)

GenomicPositions(chrpos, chromosomes, genomeinfo)
GenomicPositions(genopos, genomeinfo)

Represents single-nucleotide positions in a genome.

This type uses its (immutable) GenomeInfo slot object to describe the corresponding genome. Positions can be expressed relative to the concatenated, linearized genome or relative to the chromosome containing a given position.

Sorting is by chromosome, as ordered by chrinfo.

By convention, all positions in a GenomicPositions are considered to be on the plus strand.

Examples

    genomeinfo = GenomeInfo("hg19",["chr1","chr2","chrX"],Int64[3e5,2e5,1e4])
    chrs = ["chr2","chr1","chr2","chrX"]
    pos = Int64[3e4,4.2e3,1.9e5,1e4]
    gpos = genopos(pos,chrs,genomeinfo)        # positions in the linearized genome
    x = GenomicPositions(pos,chrs,genomeinfo)  # built from chromosome coordinates
    y = GenomicPositions(gpos,genomeinfo)      # built from genome coordinates
    same_genome(x, y)
    sort!(y)
    convert(DataTable, y)

# GenomicVectors.GenomicRanges (Type)

GenomicRanges

GenomicRanges represent closed ranges in a genome. This type uses its (immutable) GenomeInfo slot object to describe the corresponding genome. Positions can be expressed relative to the concatenated, linearized genome or relative to the chromosome containing a given position.

Examples

chrinfo = GenomeInfo("hg19",["chr1","chr2","chrX"],Int64[3e5,2e5,1e4])
chrs = ["chr1","chr2","chr2","chrX"]
starts = [100, 200, 300, 400]
ends = [120, 240, 350, 455]
gr = GenomicRanges(chrs,starts,ends,chrinfo)

Indexing

Indexing a GenomicRanges with an array produces a new GenomicRanges.

Getting/setting by a scalar gives/takes a Bio.Intervals.Interval. The leftposition and rightposition in this Interval must be in genome location units and correspond to the same chromosome. The seqname must match the genome of the GenomicRanges. Outgoing Intervals will have the index i as their metadata. This makes it possible to recover the original ordering of the Intervals after conversion to, say, an IntervalCollection. Any metadata for an incoming Interval is ignored.

The each function produces an iterator of (start,end) two-tuples in genome location units. This is used by many internal functions, such as sorting. It is intentionally similar to RLEVectors.each.
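
To illustrate why (start,end) tuples are convenient for sorting, here is a minimal plain-Julia sketch. It does not use the package's each function; the variable names and sample values are made up for illustration. It relies only on Julia comparing tuples element by element.

```julia
# Plain-Julia sketch; the values are hypothetical genome-location ranges.
range_starts = [300_400, 100, 300_200]
range_stops  = [300_455, 120, 300_240]

# Pair each start with its end, much as an iterator of two-tuples would.
spans = collect(zip(range_starts, range_stops))

# Tuples sort by first element, then second, so this orders by start, then end.
order = sortperm(spans)
sorted_spans = spans[order]
```

Here `order` records where each sorted range came from, similar in spirit to the index metadata carried by outgoing Intervals.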

Interfaces

AbstractGenomicVector

Accessing position info

# RLEVectors.starts (Function)

RLEVectors

RLEVectors is an alternate implementation of the Rle type from Bioconductor's IRanges package by H. Pages, P. Aboyoun and M. Lawrence. RLEVectors represent a vector with repeated values as the ordered set of values and repeat extents. In the field of genomics, data of various types measured across the ~3 billion letters in the human genome can often be represented in a few thousand runs. It is useful to know the bounds of genome regions covered by these runs, the values associated with these runs, and to be able to perform various mathematical operations on these values.

RLEVectors can be created from a single vector or from a vector of values and a vector of run ends. In either case, runs of repeated values and zero-length runs will be compressed out. RLEVectors can be expanded to a full vector with collect.

Aliases

Several aliases are defined for specific types of RLEVector (or collections thereof).

FloatRle              RLEVector{Float64,UInt32}
IntegerRle            RLEVector{Int64,UInt32}
BoolRle               RLEVector{Bool,UInt32}
StringRle             RLEVector{String,UInt32}
RLEVectorList{T1,T2}  Vector{ RLEVector{T1,T2} }

Constructors

RLEVectors can be created by specifying either a vector to compress or the run values and run ends.

x = RLEVector([1,1,2,2,3,3,4,4,4])
x = RLEVector([4,5,6],[3,6,9])
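
The two constructor forms correspond to compressing a full vector into runs and to supplying the runs directly. The following plain-Julia sketch shows the idea behind each direction. The helpers `sketch_rle` and `sketch_expand` are hypothetical names for illustration and are not the RLEVectors.jl implementation.

```julia
# Compress a vector into run values and run ends (1-based, inclusive).
function sketch_rle(v::AbstractVector)
    runvalues = eltype(v)[]
    runends = Int[]
    for (i, x) in enumerate(v)
        if !isempty(runvalues) && runvalues[end] == x
            runends[end] = i        # same value: extend the current run
        else
            push!(runvalues, x)     # new value: start a new run
            push!(runends, i)
        end
    end
    return runvalues, runends
end

# Expand run values and run ends back into a full vector.
function sketch_expand(runvalues, runends)
    out = eltype(runvalues)[]
    prev = 0
    for (x, e) in zip(runvalues, runends)
        append!(out, fill(x, e - prev))  # run width is the gap between ends
        prev = e
    end
    return out
end

rvals, rends = sketch_rle([1,1,2,2,3,3,4,4,4])   # ([1,2,3,4], [2,4,6,9])
full = sketch_expand([4,5,6], [3,6,9])           # [4,4,4,5,5,5,6,6,6]
```

Storing run ends rather than run widths means the full length is simply the last run end.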

Describing RLEVector objects

RLEVectors implement the usual descriptive functions for an array as well as some that are specific to the type.

  • length(x): The full length of the vector, uncompressed
  • size(x): The dimensions of the vector, (length(x),), as for any other vector
  • size(x,dim): Returns length(x) for dim == 1
  • starts(x): The index of the beginning of each run
  • widths(x): The width of each run
  • ends(x): The index of the end of each run
  • values(x): The data value for each run
  • isempty(x): Returns a boolean, as for any other vector
  • nrun(x): Returns the number of runs represented in the array
  • eltype(x): Returns the element type of the runs
  • endtype(x): Returns the element type of the run ends
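
The starts, widths and ends of the runs are related by simple arithmetic. The sketch below derives the first two from the run ends in plain Julia; it shows the relationship only, under the assumption of 1-based, inclusive run ends, and is not how the package computes them.

```julia
# Run ends for the vector [4,4,4,5,5,5,6,6,6].
runends = [3, 6, 9]

# Width of each run: the difference between consecutive ends, starting from 0.
runwidths = diff(vcat(0, runends))        # 3, 3, 3

# First index of each run.
runstarts = runends .- runwidths .+ 1     # 1, 4, 7

full_length = runends[end]                # 9, the uncompressed length
run_count = length(runends)               # 3 runs
```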

# RLEVectors.widths (Function)

The width of each run of an RLEVector. The full RLEVectors description appears under the starts entry above.

# RLEVectors.ends (Function)

The index of the end of each run of an RLEVector. The full RLEVectors description appears under the starts entry above.

# GenomicVectors.genopos (Function)

Given chromosomes, positions on those chromosomes, and a description of the chromosomes (a GenomeInfo object), calculate the corresponding positions in the linear genome.
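
As a rough illustration of this calculation, the plain-Julia sketch below adds each chromosome's offset to the position on that chromosome. The helper `sketch_genopos` and its argument list are hypothetical, and the offset convention (the total length of the preceding chromosomes) is an assumption, not the package's confirmed behavior.

```julia
# Hypothetical helper; takes names and lengths directly instead of a GenomeInfo.
function sketch_genopos(chrpos, chrs, chr_names, chr_lengths)
    # Assumed convention: offset = total length of the chromosomes before this one.
    offsets = cumsum(chr_lengths) .- chr_lengths
    lookup = Dict(zip(chr_names, offsets))
    return [lookup[c] + p for (c, p) in zip(chrs, chrpos)]
end

demo_names = ["chr1", "chr2", "chrX"]
demo_lengths = Int64[300_000, 200_000, 10_000]
demo_gpos = sketch_genopos([30_000, 4_200, 190_000, 10_000],
                           ["chr2", "chr1", "chr2", "chrX"],
                           demo_names, demo_lengths)
# demo_gpos holds 330000, 4200, 490000, 510000
```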

# GenomicVectors.chrpos (Function)

Given positions in the linear genome, calculate the position on the relevant chromosome.

# GenomicVectors.chromosomes (Function)

Given positions in the linear genome, determine the chromosome containing each position.
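
Going from the linear genome back to a chromosome and a position on it can be sketched as a search over the chromosome ends. This is a plain-Julia illustration with a hypothetical helper name, assuming 1-based positions and that a chromosome's last base has a genopos equal to its end. It is not the package's implementation.

```julia
# Hypothetical helper returning (chromosome names, positions on chromosome).
function sketch_chromosomes_chrpos(gpos, chr_names, chr_lengths)
    chr_ends = cumsum(chr_lengths)
    # Index of the first chromosome whose genopos end is >= the position.
    idx = [searchsortedfirst(chr_ends, g) for g in gpos]
    offsets = chr_ends .- chr_lengths
    return chr_names[idx], gpos .- offsets[idx]
end

back_chrs, back_pos = sketch_chromosomes_chrpos(
    [330_000, 4_200, 490_000, 510_000],
    ["chr1", "chr2", "chrX"],
    Int64[300_000, 200_000, 10_000])
# back_chrs holds chr2, chr1, chr2, chrX; back_pos holds 30000, 4200, 190000, 10000
```

Because the chromosome ends are sorted, a binary search such as `searchsortedfirst` finds the containing chromosome without scanning every chromosome.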

Modifying positions

slide
slide!

Querying positions

As in Bioconductor, location query operations discriminate between exact and overlapping matches. Besides the difference between exact and overlapping coordinates, exact matching takes strand into account, while overlap matching does not. In GenomicVectors.jl, the standard set operations use exact matching, and custom overlap functions are defined for AbstractGenomicVector.

indexin
findin
overlap
overlaps
hasoverlap
overlapin
overlapindex
nearest
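
The difference between exact and overlap matching on coordinates can be sketched in plain Julia for closed ranges. The function `sketch_overlaps` is a hypothetical name, not one of the package functions listed above, and the sketch ignores strand entirely.

```julia
# Closed ranges as (start, stop) tuples in genome location units.
# Two closed ranges overlap when each one starts no later than the other ends.
sketch_overlaps(a, b) = a[1] <= b[2] && b[1] <= a[2]

r1 = (100, 120)
r2 = (120, 150)
r3 = (121, 150)

touching = sketch_overlaps(r1, r2)   # true: both ranges include position 120
disjoint = sketch_overlaps(r1, r3)   # false: r3 starts after r1 ends
exact = r1 == (100, 120)             # exact matching compares both endpoints
```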