The RLEVectors Types and Methods

Types

# RLEVectors.RLEVector - Type.

RLEVectors

RLEVectors is an alternate implementation of the Rle type from Bioconductor's IRanges package by H. Pages, P. Aboyoun and M. Lawrence. RLEVectors represent a vector with repeated values as the ordered set of values and repeat extents. In the field of genomics, data of various types measured across the ~3 billion letters in the human genome can often be represented in a few thousand runs. It is useful to know the bounds of genome regions covered by these runs, the values associated with these runs, and to be able to perform various mathematical operations on these values.

RLEVectors can be created from a single vector, or from a vector of values and a vector of run ends. In either case, runs of repeated values and zero-length runs are compressed out. An RLEVector can be expanded to a full vector with collect.

Aliases

Several aliases are defined for specific types of RLEVector (or collections thereof).

FloatRle              RLEVector{Float64,UInt32}
IntegerRle            RLEVector{Int64,UInt32}
BoolRle               RLEVector{Bool,UInt32}
StringRle             RLEVector{String,UInt32}
RLEVectorList{T1,T2}  Vector{ RLEVector{T1,T2} }

Constructors

RLEVectors can be created by specifying either a vector to compress or the run values and run ends.

x = RLEVector([1,1,2,2,3,3,4,4,4])
x = RLEVector([4,5,6],[3,6,9])

Describing RLEVector objects

RLEVectors implement the usual descriptive functions for an array as well as some that are specific to the type.

  • length(x) The full length of the vector, uncompressed
  • size(x) The dimensions as a tuple, (length(x),), as for any other vector
  • size(x,dim) Returns length(x) for dim == 1 and 1 for any other dim
  • starts(x) The index of the beginning of each run
  • widths(x) The width of each run
  • ends(x) The index of the end of each run
  • values(x) The data value for each run
  • isempty(x) Returns boolean, as for any other vector
  • nrun(x) Returns the number of runs represented in the array
  • eltype(x) Returns the element type of the runs
  • endtype(x) Returns the element type of the run ends
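
The run boundaries listed above are related by simple arithmetic on the run ends. The following is a minimal sketch in plain Julia, not the package's implementation; run_widths and run_starts are hypothetical helper names used only for illustration.

```julia
# Hypothetical helpers (not part of RLEVectors.jl): derive run widths and
# run starts from a vector of run ends.
run_widths(runends) = diff(vcat(0, runends))
run_starts(runends) = runends .- run_widths(runends) .+ 1

runends = [2, 4, 6, 9]   # run ends for [1,1,2,2,3,3,4,4,4]
run_widths(runends)      # [2, 2, 2, 3]
run_starts(runends)      # [1, 3, 5, 7]
```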

# RLEVectors.RLEDataFrame - Type.

An RLEDataFrame extends DataFrame and contains a collection of like-length and like-type RLEVectors. In a way, this creates a type like an RLE matrix. However, we deliberately avoid the complexity of matrix operations, such as factorization. It is expected that most operations will be column-wise. Based on RleDataFrame from Bioconductor's genoset package (also by Peter Haverty).

Constructors

RLEDataFrame(columns::Vector{RLEVector}, names::Vector{Symbol})
RLEDataFrame(kwargs...)

Examples

RLEDataFrame( [RLEVector([1, 1, 2]),  RLEVector([2, 2, 2])], [:a, :b] )

Standard Vector API methods

Working with runs

RLEVectors.jl has a collection of functions for working with runs in standard vectors. These are mostly for internal use, but are exported as they may be of general use.

# RLEVectors.numruns - Function.

numruns(x)

Count the number of runs of repeated values present in a vector.
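
As an illustration of what this count means, here is a minimal plain-Julia sketch. It is not the package's implementation; count_runs is a hypothetical name, and the sketch assumes a standard 1-based vector.

```julia
# Hypothetical reference version (not part of RLEVectors.jl): a new run
# starts wherever an element differs from its predecessor.
function count_runs(x::AbstractVector)
    isempty(x) && return 0
    n = 1
    for i in 2:length(x)
        x[i] != x[i-1] && (n += 1)
    end
    return n
end

count_runs([1, 1, 2, 2, 3, 3, 4, 4, 4])   # 4
```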

numruns(runvalues, runends)

Given run values and run ends for an RLEVector, determine the number of runs that would be present if it were re-compressed. RLEVectors.jl does this operation after modifying an RLEVector, for example.

# RLEVectors.ree - Function.

ree(x)

Run End Encode a vector.

Like run-length encoding (RLE), but returns (runvalues, runends) rather than (runvalues, runlengths).
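
To make the encoding concrete, the sketch below run-end-encodes a vector in plain Julia. It is an illustrative reference version, not the package's code, and run_end_encode is a hypothetical name.

```julia
# Hypothetical reference version (not part of RLEVectors.jl).
function run_end_encode(x::AbstractVector)
    runvalues = eltype(x)[]
    runends = Int[]
    for (i, v) in enumerate(x)
        if !isempty(runvalues) && runvalues[end] == v
            runends[end] = i        # extend the current run
        else
            push!(runvalues, v)     # start a new run
            push!(runends, i)
        end
    end
    return runvalues, runends
end

run_end_encode([1, 1, 2, 2, 3, 3, 4, 4, 4])   # ([1, 2, 3, 4], [2, 4, 6, 9])
```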

ree(runvalues, runends)

Tidy up an existing (mostly) Run End Encoded vector, dropping zero length runs and fixing any adjacent identical values. RLEVectors.jl does this operation after modifying an RLEVector, for example.
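
The tidying step can be sketched as follows. This is a plain-Julia illustration under the assumption that run ends are non-decreasing, so a zero-length run shows up as a repeated run end; tidy_ree is a hypothetical name, not the package's function.

```julia
# Hypothetical reference version (not part of RLEVectors.jl).
function tidy_ree(runvalues::AbstractVector, runends::AbstractVector{<:Integer})
    newvalues = eltype(runvalues)[]
    newends = eltype(runends)[]
    prev_end = 0
    for (v, e) in zip(runvalues, runends)
        e == prev_end && continue               # zero-length run: drop it
        if !isempty(newvalues) && newvalues[end] == v
            newends[end] = e                    # same value as previous run: merge
        else
            push!(newvalues, v)
            push!(newends, e)
        end
        prev_end = e
    end
    return newvalues, newends
end

# The first two runs share a value and the third run has zero length.
tidy_ree([1, 1, 2, 3], [2, 4, 4, 7])   # ([1, 3], [4, 7])
```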

# RLEVectors.inverse_ree - Function.

inverse_ree(rle)

Uncompress the run values and run ends of an RLEVector into a full vector.

Examples

collect(rle)
inverse_ree( runvalues(rle), runends(rle) )
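
The expansion itself is straightforward. The following plain-Julia sketch shows the idea; expand_runs is a hypothetical name, and the sketch assumes tidy, strictly increasing run ends.

```julia
# Hypothetical reference version (not part of RLEVectors.jl).
function expand_runs(runvalues::AbstractVector, runends::AbstractVector{<:Integer})
    n = isempty(runends) ? 0 : Int(runends[end])
    out = Vector{eltype(runvalues)}(undef, n)
    start = 1
    for (v, e) in zip(runvalues, runends)
        fill!(view(out, start:e), v)   # write one run
        start = e + 1
    end
    return out
end

expand_runs([4, 5, 6], [3, 6, 9])   # [4, 4, 4, 5, 5, 5, 6, 6, 6]
```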

Working with run boundaries / ranges

We define some functions for comparing bins defined by our run end values.

# RLEVectors.disjoin - Function.

disjoin(x, y)

Takes the run ends from two RLEVectors and makes one new vector of run ends that breaks the pair into non-overlapping runs. Basically, this is an optimized sort!(unique(vcat(x, y))). This is useful when comparing two RLEVector objects. The values corresponding to each disjoint run in x and y can then be compared directly.

Returns

An integer vector, of a type that is the promotion of the eltypes of the runends of x and y.

Examples

x = [2, 4, 6]
y = [3, 4, 5, 6]
disjoin(x,y)
5-element Array{Int64,1}:
 2
 3
 4
 5
 6
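
One way to get this result without sorting is a two-pointer merge of the two already-sorted inputs. The sketch below is an assumption about how such an optimization could look, not the package's actual code; merge_runends is a hypothetical name.

```julia
# Hypothetical sketch (not part of RLEVectors.jl): merge two strictly
# increasing run-end vectors, keeping each distinct value once.
function merge_runends(x::AbstractVector{<:Integer}, y::AbstractVector{<:Integer})
    out = promote_type(eltype(x), eltype(y))[]
    i, j = 1, 1
    while i <= length(x) && j <= length(y)
        if x[i] < y[j]
            push!(out, x[i]); i += 1
        elseif x[i] > y[j]
            push!(out, y[j]); j += 1
        else
            push!(out, x[i]); i += 1; j += 1   # shared boundary, keep once
        end
    end
    append!(out, x[i:end])   # at most one of these two is non-empty
    append!(out, y[j:end])
    return out
end

merge_runends([2, 4, 6], [3, 4, 5, 6])   # [2, 3, 4, 5, 6]
```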

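disjoin(rle_x, rle_y)

For two RLEVectors, returns a tuple of the disjoint run ends, the values of rle_x on each disjoint run, and the values of rle_y on each disjoint run.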

Examples

x = RLEVector([1,1,2,2,3,3])
y = RLEVector([1,1,1,2,3,4])
disjoin(x,y)
([2,3,4,5,6],[1,2,2,3,3],[1,1,2,3,4])

# RLEVectors.disjoin_length - Function.

disjoin_length(x, y)

Take two runends vectors (strictly increasing integers) and find the number of unique values for the disjoin operation. This is essentially an optimized length(unique( vcat(x, y) )).

split and tapply-like operations

An RLEVector can be used like R's factor type to apply a function over (contiguous) sections of another vector. For example, here we break a vector into 5 groups and take the average of each group. In the second example, we also scale each mean by the RLE run value corresponding to each group.

# RLEVectors.tapply - Function.

tapply(data_vector, rle, function)
tapply(data_vector, factor_vector, function)

Apply a function to blocks of a vector, like tapply in R. The first and second arguments must be of the same length. When the second argument is a standard vector, it need not be sorted.

Examples

using Statistics   # provides mean
factor = repeat( ["a","b","c","d","e"], inner=20 )
rle = RLEVector( factor )
x = collect(1:100)
tapply( x, factor, mean )
tapply( x, rle, mean )
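
For the RLE case, the grouping amounts to applying the function to the slice of data covered by each run. A minimal plain-Julia sketch of that idea follows; apply_by_runs is a hypothetical name and this is not the package's implementation.

```julia
using Statistics   # provides mean

# Hypothetical sketch (not part of RLEVectors.jl): apply f to the block of
# data covered by each run, given the run ends.
function apply_by_runs(f, data::AbstractVector, runends::AbstractVector{<:Integer})
    starts = vcat(1, runends[1:end-1] .+ 1)
    return [f(view(data, s:e)) for (s, e) in zip(starts, runends)]
end

x = collect(1:100)
runends = [20, 40, 60, 80, 100]    # five runs of 20, as in the example above
apply_by_runs(mean, x, runends)    # [10.5, 30.5, 50.5, 70.5, 90.5]
```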

Summaries on RLEVectors

Often we want to summarize sections of our RLEVectors. For example, if the RLEVector represents data along a genome, what are the average values associated with each of a set of regions/genes?

# RLEVectors.rangeMeans - Function.

rangeMeans(ranges::Vector{UnitRange{T}}, rle::RLEVector)

Subset an RLEVector by one or more ranges, returning the average value within each range. Essentially an optimized [ mean(rle[r]) for r in ranges ].
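
The mean over a range can be computed from the runs directly, without expanding the vector: each run contributes its value weighted by its overlap with the range. The sketch below illustrates this with a simple linear scan. It is an assumption about the general approach, not the package's algorithm; range_mean is a hypothetical name, and the range is assumed to lie within 1:runends[end].

```julia
# Hypothetical sketch (not part of RLEVectors.jl).
function range_mean(runvalues, runends, r::UnitRange{<:Integer})
    total = 0.0
    run_start = 1
    for (v, e) in zip(runvalues, runends)
        lo = max(run_start, first(r))   # overlap of this run with r
        hi = min(e, last(r))
        hi >= lo && (total += v * (hi - lo + 1))
        run_start = e + 1
    end
    return total / length(r)
end

# Runs: 4 on 1:3, 5 on 4:6, 6 on 7:9. Range 2:5 covers 4, 4, 5, 5.
range_mean([4, 5, 6], [3, 6, 9], 2:5)   # 4.5
```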

Utility Functions

We also define some utility functions for working with repeated values and binary searching in bins/sorted integers like our run end values.

# RLEVectors.rep - Function.

rep(x::Union{Any,Vector}; each::Union{Int,Vector{Int}} = ones(Int,length(x)), times::Int = 1)

Construct a vector of repeated values, just like R's rep function. We do not have a length_out argument at this time.
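
For comparison, Base Julia's repeat covers the scalar cases of times and each through its outer and inner keywords. The per-element form shown last is an assumption about how a vector-valued each behaves (one repeat count per element, as its default of ones(Int, length(x)) suggests).

```julia
# Base Julia equivalents, for comparison only.
repeat(["Go", "Fight", "Win"], outer=2)   # like rep(x, times=2)
repeat(["A", "B", "C"], inner=3)          # like rep(x, each=3)

# Assumed meaning of a vector-valued `each`: one repeat count per element.
counts = [1, 2, 3]
per_element = reduce(vcat, fill.(["A", "B", "C"], counts))
# ["A", "B", "B", "C", "C", "C"]
```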

Examples

rep(["Go", "Fight", "Win"], times=2)

# output
6-element Array{String,1}:
 "Go"
 "Fight"
 "Win"
 "Go"
 "Fight"
 "Win"

rep(["A", "B", "C"], each=3)

# output
9-element Array{String,1}:
 "A"
 "A"
 "A"
 "B"
 "B"
 "B"
 "C"
 "C"
 "C"

# Base.Sort.searchsortedfirst - Method.

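RLEVectors.jl defines new methods for this binary search function. The method for two vectors is similar to R's findInterval. The position of each element of x in v is determined by searching within the index bounds lo and hi, inclusive. Each returned index i is the first position in v whose value is greater than or equal to the corresponding element of x, so v[i-1] < x <= v[i]. If x is greater than every value in v, the result is length(v) + 1.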

This operation is helpful for finding the RLE run corresponding to each of a set of indices or general tasks such as binning values for empirical density functions.

Examples

v = [2, 4, 6, 8, 10]
x = [1, 3, 4, 8, 11]
searchsortedfirst(v, x)
5-element Array{Int64,1}:
 1
 2
 2
 4
 6
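
The same result can be reproduced with Base Julia alone by broadcasting the scalar searchsortedfirst over x, which may help when checking the semantics. The second part sketches the run lookup use case; the run values and run ends there are illustrative data, not taken from the package.

```julia
v = [2, 4, 6, 8, 10]
x = [1, 3, 4, 8, 11]
idx = searchsortedfirst.(Ref(v), x)   # [1, 2, 2, 4, 6]

# Finding the run that covers each index of a run-end-encoded vector.
runvalues = [4, 5, 6]
runends = [3, 6, 9]
runs = searchsortedfirst.(Ref(runends), [1, 4, 9])   # [1, 2, 3]
runvalues[runs]                                      # [4, 5, 6]
```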

# Base.Sort.searchsortedfirst - Method.

The four-argument version substitutes a hard-coded '<' for customized ordering. This is a silly optimization that I hope to get rid of soon.
