template<typename T, typename CountT>
vt::util::adt::HistogramApprox struct

A bounded-size, online/streaming approximate histogram.

This is an implementation of the paper "A Streaming Parallel Decision Tree Algorithm" by Ben-Haim and Tom-Tov:

http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf

The "histogram sketch" (as the authors call it) approximates a histogram with a fixed/maximum number of of centroids (value, count) sorted in ascending order. Each time a new value, V, is added, it is either added to an existing centroid (if the value matches exactly) by increasing the count; or, a new centroid (V, 1) is created. After creation of a new centroid, if at the centroid count limit, the best pair of centroids (ones with the minimum distance) are merged, weighted by their values and associated counts. Quantiles and counts are estimated by finding bordering centroids and using them to estimate the area of a trapezoid.

Public types

using CountType = CountT
using CentroidType = detail::Centroid<T, CountType>

Constructors, destructors, conversion operators

HistogramApprox() defaulted
HistogramApprox(CountType in_max_centroids) explicit
Construct a new approximate histogram with a bound on storage.

Public functions

auto getMax() const -> T
Get the maximum value ever inserted in the histogram.
auto getMin() const -> T
Get the minimum value ever inserted in the histogram.
auto getCount() const -> CountType
Get the total number of values inserted.
void add(T value, CountType num_times = 1)
Add a value to the histogram.
auto quantile(double const p) -> double
Estimate the p'th quantile defined as the smallest data point, d, such that p*total_count_ <= d.
auto estimateNumValues(double p) -> double
Estimate the number of values <= p in the histogram.
void mergeIn(HistogramApprox<T, CountType> const& in_hist)
Merge in another histogram.
auto computeFixedBuckets(int n_buckets) -> std::vector<CountType>
Build a histogram with fixed buckets, each bucket containing the number of values in that range.
auto getCentroids() const -> std::vector<CentroidType> const &
Get the actual centroid array that makes up the histogram.
void dump()
Dump the histogram for debugging.
auto buildContainerString() -> std::string
Build a readable string that contains all internal data for testing.
template<typename SerializerT>
void serialize(SerializerT& s)
Serialize the centroid.
auto getMaxCentroids() const -> CountType
Get the max number of centroids.
auto getNumCentroids() const -> CountType
Get the number of centroids that are actually used.

Function documentation

template<typename T, typename CountT>
vt::util::adt::HistogramApprox<T, CountT>::HistogramApprox(CountType in_max_centroids) explicit

Construct a new approximate histogram with a bound on storage.

Parameters
in_max_centroids in max number of centroids (bounds space)

template<typename T, typename CountT>
T vt::util::adt::HistogramApprox<T, CountT>::getMax() const

Get the maximum value ever inserted in the histogram.

Returns the max value inserted

template<typename T, typename CountT>
T vt::util::adt::HistogramApprox<T, CountT>::getMin() const

Get the minimum value ever inserted in the histogram.

Returns the min value inserted

template<typename T, typename CountT>
CountType vt::util::adt::HistogramApprox<T, CountT>::getCount() const

Get the total number of values inserted.

Returns the number of values inserted

template<typename T, typename CountT>
void vt::util::adt::HistogramApprox<T, CountT>::add(T value, CountType num_times = 1)

Add a value to the histogram.

Parameters
value in the value
num_times in how many times to add it (default 1)

template<typename T, typename CountT>
double vt::util::adt::HistogramApprox<T, CountT>::quantile(double const p)

Estimate the p'th quantile defined as the smallest data point, d, such that p*total_count_ <= d.

Parameters
in the quantile between [0.0, 1.0]
Returns data point, d

template<typename T, typename CountT>
double vt::util::adt::HistogramApprox<T, CountT>::estimateNumValues(double p)

Estimate the number of values <= p in the histogram.

Parameters
in the value
Returns number of values <= p

This is implemented based on the math in the referenced paper in algorithm 3 "Sum Procedure".

template<typename T, typename CountT>
void vt::util::adt::HistogramApprox<T, CountT>::mergeIn(HistogramApprox<T, CountType> const& in_hist)

Merge in another histogram.

Parameters
in_hist in the histogram to merge in

template<typename T, typename CountT>
std::vector<CountType> vt::util::adt::HistogramApprox<T, CountT>::computeFixedBuckets(int n_buckets)

Build a histogram with fixed buckets, each bucket containing the number of values in that range.

Parameters
n_buckets in the number of buckets

If n_buckets == 4, the buckets segments will be: { [0, 0.25); [0.25, 0.5); [0.5, 0.75); [0.75, 1.0) }

Then, for each one of these bucket segments, the number of values that fall in that proportion of the range of values.

template<typename T, typename CountT>
std::vector<CentroidType> const & vt::util::adt::HistogramApprox<T, CountT>::getCentroids() const

Get the actual centroid array that makes up the histogram.

Returns the centroids

template<typename T, typename CountT> template<typename SerializerT>
void vt::util::adt::HistogramApprox<T, CountT>::serialize(SerializerT& s)

Serialize the centroid.

Parameters
in the serializer

template<typename T, typename CountT>
CountType vt::util::adt::HistogramApprox<T, CountT>::getMaxCentroids() const

Get the max number of centroids.

Returns the max centroids allowed

template<typename T, typename CountT>
CountType vt::util::adt::HistogramApprox<T, CountT>::getNumCentroids() const

Get the number of centroids that are actually used.

Returns the number of centroids being used