The RLEVectors Types and Methods
Index
RLEVectors.RLEDataFrame
RLEVectors.RLEVector
Base.Sort.searchsortedfirst
Base.Sort.searchsortedfirst
RLEVectors.disjoin
RLEVectors.disjoin_length
RLEVectors.inverse_ree
RLEVectors.numruns
RLEVectors.rangeMeans
RLEVectors.ree
RLEVectors.rep
RLEVectors.tapply
Types
RLEVectors.RLEVector
— Constant.
RLEVectors
RLEVectors is an alternate implementation of the Rle type from Bioconductor's IRanges package by H. Pages, P. Aboyoun and M. Lawrence. RLEVectors represent a vector with repeated values as the ordered set of values and repeat extents. In the field of genomics, data of various types measured across the ~3 billion letters in the human genome can often be represented in a few thousand runs. It is useful to know the bounds of genome regions covered by these runs, the values associated with these runs, and to be able to perform various mathematical operations on these values.
RLEVectors can be created from a single vector or from a vector of values and a vector of run ends. In either case, adjacent runs with identical values are merged and zero-length runs are dropped. RLEVectors can be expanded to a full vector with collect.
Aliases
Several aliases are defined for specific types of RLEVector (or collections thereof).
FloatRle RLEVector{Float64,UInt32}
IntegerRle RLEVector{Int64,UInt32}
BoolRle RLEVector{Bool,UInt32}
StringRle RLEVector{String,UInt32}
RLEVectorList{T1,T2} Vector{ RLEVector{T1,T2} }
Constructors
RLEVectors can be created by specifying a vector to compress or the run values and run ends.
x = RLEVector([1,1,2,2,3,3,4,4,4])
x = RLEVector([4,5,6],[3,6,9])
Describing RLEVector objects
RLEVectors implement the usual descriptive functions for an array as well as some that are specific to the type.
length(x): The full length of the vector, uncompressed
size(x): Same as length, as for any other vector
size(x,dim): Returns (length(x),1) for dim == 1
starts(x): The index of the beginning of each run
widths(x): The width of each run
ends(x): The index of the end of each run
values(x): The data value for each run
isempty(x): Returns boolean, as for any other vector
nrun(x): Returns the number of runs represented in the array
eltype(x): Returns the element type of the runs
endtype(x): Returns the element type of the run ends
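The relationship between run starts, widths, and ends can be sketched in plain Julia without the package. The variable names below (`run_values`, `run_ends`, and so on) are illustrative assumptions, not fields or functions of RLEVectors.jl; the data mirrors the `RLEVector([4,5,6],[3,6,9])` constructor example.

```julia
# Plain-Base sketch; these names are illustrative, not part of RLEVectors.jl.
run_values = [4, 5, 6]
run_ends   = [3, 6, 9]     # index of the last element of each run

# Width of each run: difference between consecutive run ends.
run_widths = diff(vcat(0, run_ends))

# First index of each run, derived from its end and its width.
run_starts = run_ends .- run_widths .+ 1

# Uncompressed length is the last run end (0 for an empty encoding).
full_length = isempty(run_ends) ? 0 : last(run_ends)
```

Storing only the run ends is enough to recover the other two quantities, which is presumably why the constructor accepts run ends rather than widths.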
RLEVectors.RLEDataFrame
— Constant.
An RLEDataFrame extends DataFrame and contains a collection of like-length and like-type RLEVectors. In a way, this creates a type like an RLE matrix, but we deliberately avoid the complexity of matrix operations, such as factorization. It is expected that most operations will be column-wise. Based on RleDataFrame from Bioconductor's genoset package (also by Peter Haverty).
Constructors
RLEDataFrame(columns::Vector{RLEVector}, names::Vector{Symbol})
RLEDataFrame(kwargs...)
Examples
RLEDataFrame( [RLEVector([1, 1, 2]), RLEVector([2, 2, 2])], [:a, :b] )
Standard Vector API methods
Working with runs
RLEVectors.jl has a collection of functions for working with runs in standard vectors. These are mostly for internal use, but are exported as they may be of general use.
RLEVectors.numruns
— Function.
numruns(x)
Count the number of runs of repeated values present in a vector.
numruns(runvalues, runends)
Given run values and run ends for an RLEVector, determine the number of runs that would be present if it were re-compressed. RLEVectors.jl does this operation after modifying an RLEVector, for example.
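A minimal sketch of both counting operations in plain Julia follows. `count_runs` is a hypothetical helper written for illustration and is not the package's implementation; it assumes run ends are non-decreasing, so a run whose end equals the previous end has zero length.

```julia
# Hypothetical helper: count runs of equal adjacent values in a plain vector.
function count_runs(x::Vector)
    isempty(x) && return 0
    n = 1
    for i in 2:length(x)
        if x[i] != x[i-1]      # a new run starts wherever the value changes
            n += 1
        end
    end
    return n
end

# Hypothetical helper: count the runs that would remain after re-compressing
# a (values, ends) pair that may hold zero-length or duplicate-valued runs.
function count_runs(run_values::Vector, run_ends::Vector{<:Integer})
    n = 0
    prev_end = 0
    prev_value = nothing
    have_prev = false
    for (v, e) in zip(run_values, run_ends)
        e == prev_end && continue            # zero-length run adds nothing
        if !have_prev || v != prev_value     # value changed: a new run
            n += 1
        end
        prev_value = v
        have_prev = true
        prev_end = e
    end
    return n
end
```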
RLEVectors.ree
— Function.
ree(x)
Run End Encode a vector
Like RLE, but returns (runvalues,runends) rather than (runvalues,runlengths)
ree(runvalues, runends)
Tidy up an existing (mostly) Run End Encoded vector, dropping zero length runs and fixing any adjacent identical values. RLEVectors.jl does this operation after modifying an RLEVector, for example.
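To make the two forms concrete, here is a rough plain-Julia sketch of run end encoding and of the tidy-up step. `run_end_encode` is a hypothetical name chosen for this illustration; the package's own `ree` may be implemented quite differently.

```julia
# Hypothetical helper: run-end encode a plain vector as (values, ends).
function run_end_encode(x::Vector{T}) where {T}
    run_values = T[]
    run_ends = Int[]
    for (i, v) in enumerate(x)
        if !isempty(run_values) && run_values[end] == v
            run_ends[end] = i          # same value: extend the current run
        else
            push!(run_values, v)       # new value: start a run ending at i
            push!(run_ends, i)
        end
    end
    return run_values, run_ends
end

# Hypothetical helper: tidy a nearly encoded pair by dropping zero-length
# runs and merging adjacent runs that share a value.
function run_end_encode(run_values::Vector{T}, run_ends::Vector{<:Integer}) where {T}
    out_values = T[]
    out_ends = Int[]
    prev_end = 0
    for (v, e) in zip(run_values, run_ends)
        e == prev_end && continue      # zero-length run: skip it
        if !isempty(out_values) && out_values[end] == v
            out_ends[end] = e          # merge with the previous run
        else
            push!(out_values, v)
            push!(out_ends, e)
        end
        prev_end = e
    end
    return out_values, out_ends
end
```

Under these assumptions, `run_end_encode([1,1,2,2,3,3,4,4,4])` yields the values `[1,2,3,4]` with ends `[2,4,6,9]`.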
RLEVectors.inverse_ree
— Function.
inverse_ree(rle)
Uncompress the runs and runends of an RLEVector.
Examples
collect(rle)
inverse_ree( runvalues(rle), runends(rle) )
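The decoding direction can be sketched in a few lines of plain Julia. `run_end_decode` is a hypothetical helper for illustration only, and it assumes the run ends are strictly increasing and start from a 1-based vector.

```julia
# Hypothetical helper: expand (values, ends) back into a full vector.
function run_end_decode(run_values::Vector{T}, run_ends::Vector{<:Integer}) where {T}
    n = isempty(run_ends) ? 0 : Int(last(run_ends))
    out = Vector{T}(undef, n)
    start = 1
    for (v, e) in zip(run_values, run_ends)
        for i in start:e       # fill every position covered by this run
            out[i] = v
        end
        start = e + 1          # the next run begins right after this one
    end
    return out
end

decoded = run_end_decode([4, 5, 6], [3, 6, 9])
```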
Working with run boundaries / ranges
We define some functions for comparing bins defined by our run end values.
RLEVectors.disjoin
— Function.
disjoin(x, y)
Takes the run ends from two RLEVectors and makes one new set of run ends that breaks the pair into non-overlapping runs. Basically, this is an optimized sort!(unique(vcat(x, y))). This is useful when comparing two RLEVector objects. The values corresponding to each disjoint run in x and y can then be compared directly.
Returns
An integer vector, of a type that is the promotion of the eltypes of the runends of x and y.
Examples
x = [2, 4, 6]
y = [3, 4, 5, 6]
disjoin(x,y)
5-element Array{Int64,1}:
2
3
4
5
6
disjoin(rle_x, rle_y)
Examples
x = RLEVector([1,1,2,2,3,3])
y = RLEVector([1,1,1,2,3,4])
disjoin(x,y)
([2,3,4,5,6],[1,2,2,3,3],[1,1,2,3,4])
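As a rough illustration of what the two forms compute, the sketch below merges two sorted run-end vectors in a single pass and then looks up the value each input holds at every merged run end. `merge_run_ends` and the variable names are assumptions made for this example; they are not the package's internals.

```julia
# Hypothetical helper: merge two strictly increasing run-end vectors,
# keeping each distinct value once.
function merge_run_ends(x::Vector{<:Integer}, y::Vector{<:Integer})
    T = promote_type(eltype(x), eltype(y))
    out = T[]
    i, j = 1, 1
    while i <= length(x) && j <= length(y)
        if x[i] < y[j]
            push!(out, x[i]); i += 1
        elseif x[i] > y[j]
            push!(out, y[j]); j += 1
        else                            # shared boundary: emit it once
            push!(out, x[i]); i += 1; j += 1
        end
    end
    append!(out, x[i:end])              # at most one of these is non-empty
    append!(out, y[j:end])
    return out
end

x_ends, x_values = [2, 4, 6], [1, 2, 3]
y_ends, y_values = [3, 4, 5, 6], [1, 2, 3, 4]

merged = merge_run_ends(x_ends, y_ends)

# The run covering position e is the first run whose end is >= e.
x_on_merged = [x_values[searchsortedfirst(x_ends, e)] for e in merged]
y_on_merged = [y_values[searchsortedfirst(y_ends, e)] for e in merged]
```

With both inputs expressed on the same run boundaries, element-wise comparison of `x_on_merged` and `y_on_merged` is straightforward.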
RLEVectors.disjoin_length
— Function.
disjoin_length(x, y)
Take two runends vectors (strictly increasing integers) and find the number of unique values for the disjoin operation. This is essentially an optimized length(unique( vcat(x, y) )).
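One way such a count could avoid allocating the merged vector is a two-pointer walk over both inputs. This is a sketch under the stated assumption that both vectors are strictly increasing; `merged_length` is a hypothetical name, not the package's function.

```julia
# Hypothetical helper: number of distinct values across two strictly
# increasing vectors, computed without building the merged result.
function merged_length(x::Vector{<:Integer}, y::Vector{<:Integer})
    i, j, n = 1, 1, 0
    while i <= length(x) && j <= length(y)
        if x[i] < y[j]
            i += 1
        elseif x[i] > y[j]
            j += 1
        else                   # equal values count once
            i += 1
            j += 1
        end
        n += 1
    end
    # Whatever remains in either vector is distinct from everything seen.
    return n + (length(x) - i + 1) + (length(y) - j + 1)
end

n_disjoint = merged_length([2, 4, 6], [3, 4, 5, 6])
```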
split and tapply-like operations
An RLEVector can be used like R's factor type to apply a function over (contiguous) sections of another vector. For example, here we break a vector into 5 groups and take the average of each group. In the second example, we also scale each mean by the RLE run value corresponding to each group.
RLEVectors.tapply
— Function.
tapply(data_vector, rle, function)
tapply(data_vector, factor_vector, function)
Map a function over blocks of a vector, like tapply in R. The first and second arguments must be of the same length. When the second argument is a standard vector, it need not be sorted.
Examples
using Statistics   # provides mean
using RLEVectors
factor = repeat( ["a","b","c","d","e"], inner=20 )
rle = RLEVector( factor )
x = collect(1:100)
tapply( x, factor, mean )
tapply( x, rle, mean )
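The grouping idea can be sketched without the package. Both helpers below are hypothetical and only illustrate the concept: `apply_by_run` slices contiguous blocks using run ends, and `apply_by_label` groups by an unsorted label vector. The return shapes (a vector and a `Dict`) are choices made for this sketch and may not match what `tapply` returns.

```julia
using Statistics   # provides mean

# Hypothetical helper: apply `f` to the block of `data` covered by each run.
function apply_by_run(data::Vector, run_ends::Vector{<:Integer}, f)
    starts = vcat(1, run_ends[1:end-1] .+ 1)
    return [f(data[s:e]) for (s, e) in zip(starts, run_ends)]
end

# Hypothetical helper: group by label (labels need not be sorted), then apply `f`.
function apply_by_label(data::Vector, labels::Vector, f)
    groups = Dict{eltype(labels), Vector{eltype(data)}}()
    for (d, l) in zip(data, labels)
        push!(get!(groups, l, eltype(data)[]), d)
    end
    return Dict(l => f(v) for (l, v) in groups)
end

x = collect(1:100)
block_means = apply_by_run(x, collect(20:20:100), mean)
label_sums = apply_by_label([1, 2, 3, 4], ["b", "a", "b", "a"], sum)
```

The run-based version never inspects the labels at all, which is the advantage of having the grouping already compressed into runs.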
Summaries on RLEVectors
Often we want to summarize sections of our RLEVectors. For example, if the RLEVector represents data along a genome, what are the average values associated with each of a set of regions/genes?
RLEVectors.rangeMeans
— Function.
rangeMeans(ranges::Vector{UnitRange{T}}, rle::RLEVector)
Subset an RLEVector by one or more ranges, returning the average value within each range. Really, an optimized `[ mean(rle[ r ]) for r in ranges ]`.
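To show why this can be cheaper than expanding each range, the sketch below computes range means directly from run values and run ends, weighting each run value by how much of the run falls inside the range. `range_means` is a hypothetical helper; it assumes every range lies within `1:last(run_ends)` and says nothing about how the package actually does it.

```julia
# Hypothetical helper: mean of the encoded vector over each range,
# computed from runs without decoding.
function range_means(ranges::Vector{UnitRange{Int}},
                     run_values::Vector{<:Real},
                     run_ends::Vector{Int})
    out = Vector{Float64}(undef, length(ranges))
    for (k, r) in enumerate(ranges)
        run = searchsortedfirst(run_ends, first(r))   # run holding first(r)
        pos = first(r)
        total = 0.0
        while pos <= last(r)
            stop = min(run_ends[run], last(r))        # last covered index in this run
            total += run_values[run] * (stop - pos + 1)
            pos = stop + 1
            run += 1
        end
        out[k] = total / length(r)
    end
    return out
end

means = range_means([1:3, 2:5, 1:9], [4, 5, 6], [3, 6, 9])
```

The work per range is proportional to the number of runs it overlaps, not to the number of positions it spans.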
Utility Functions
We also define some utility functions for working with repeated values and binary searching in bins/sorted integers like our run end values.
RLEVectors.rep
— Function.
rep(x::Union{Any,Vector}; each::Union{Int,Vector{Int}} = ones(Int,length(x)), times::Int = 1)
Construct a vector of repeated values, just like R's rep function. We do not have a length_out argument at this time.
Examples
rep(["Go", "Fight", "Win"], times=2)
# output
6-element Array{String,1}:
"Go"
"Fight"
"Win"
"Go"
"Fight"
"Win"
rep(["A", "B", "C"], each=3)
# output
9-element Array{String,1}:
"A"
"A"
"A"
"B"
"B"
"B"
"C"
"C"
"C"
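For comparison, the same two results can be produced with Base Julia's `repeat`, whose `outer` and `inner` keywords play roles similar to `times` and `each`. The last line assumes that a vector-valued `each` means one repeat count per element, which the default `ones(Int, length(x))` suggests but the text above does not state outright.

```julia
# Base-only equivalents; no RLEVectors functions are used here.
cheer = repeat(["Go", "Fight", "Win"], outer=2)   # whole vector, twice
letters3 = repeat(["A", "B", "C"], inner=3)       # each element, three times

# Per-element repeat counts (assumed meaning of a vector `each`).
items = ["A", "B", "C"]
counts = [1, 2, 3]
ragged = reduce(vcat, [fill(v, n) for (v, n) in zip(items, counts)])
```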
Base.Sort.searchsortedfirst
— Method.
RLEVectors.jl defines new methods for this binary search function. The method for two vectors is like R's findInterval. The index of the position of each x in v is determined by searching within the index bounds lo and hi, inclusive. Each returned index i is the first position in v whose value is greater than or equal to the corresponding element of x, so v[i-1] < x <= v[i].
This operation is helpful for finding the RLE run corresponding to each of a set of indices or general tasks such as binning values for empirical density functions.
Examples
v = [2, 4, 6, 8, 10]
x = [1, 3, 4, 8, 11]
searchsortedfirst(v, x)
5-element Array{Int64,1}:
1
2
2
4
6
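The same result can be reproduced with Base Julia alone by broadcasting the scalar `searchsortedfirst` over the query vector, which also shows how a position is mapped to its run. The `run_values`, `run_ends`, and `positions` names are illustrative assumptions.

```julia
v = [2, 4, 6, 8, 10]
x = [1, 3, 4, 8, 11]

# Ref(v) keeps v fixed while broadcasting over the elements of x.
idx = searchsortedfirst.(Ref(v), x)

# Finding the run that covers each position of an encoded vector:
run_values = [4, 5, 6]
run_ends = [3, 6, 9]
positions = [1, 3, 4, 9]
values_at = run_values[searchsortedfirst.(Ref(run_ends), positions)]
```

A query larger than every element of `v` returns `length(v) + 1`, which is why the last index in the example is 6.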
Base.Sort.searchsortedfirst
— Method.
The four-argument version substitutes a hard-coded '<' for customized ordering. This is a silly optimization that I hope to get rid of soon.