template<typename T, typename CountT>
HistogramApprox struct
A bounded-size, online/streaming approximate histogram.
This is an implementation of the paper "A Streaming Parallel Decision Tree Algorithm" by Ben-Haim and Tom-Tov:
http:/
The "histogram sketch" (as the authors call it) approximates a histogram with a fixed/maximum number of of centroids (value, count) sorted in ascending order. Each time a new value, V, is added, it is either added to an existing centroid (if the value matches exactly) by increasing the count; or, a new centroid (V, 1) is created. After creation of a new centroid, if at the centroid count limit, the best pair of centroids (ones with the minimum distance) are merged, weighted by their values and associated counts. Quantiles and counts are estimated by finding bordering centroids and using them to estimate the area of a trapezoid.
Public types
- using CountType = CountT
-
using CentroidType = detail::
Centroid<T, CountType>
Constructors, destructors, conversion operators
- HistogramApprox() defaulted
- HistogramApprox(CountType in_max_centroids) explicit
- Construct a new approximate histogram with a bound on storage.
Public functions
- auto getMax() const -> T
- Get the maximum value ever inserted in the histogram.
- auto getMin() const -> T
- Get the minimum value ever inserted in the histogram.
- auto getCount() const -> CountType
- Get the total number of values inserted.
- void add(T value, CountType num_times = 1)
- Add a value to the histogram.
- auto quantile(double const p) -> double
- Estimate the p'th quantile defined as the smallest data point, d, such that p*total_count_ <= d.
- auto estimateNumValues(double p) -> double
- Estimate the number of values <= p in the histogram.
- void mergeIn(HistogramApprox<T, CountType> const& in_hist)
- Merge in another histogram.
- auto computeFixedBuckets(int n_buckets) -> std::vector<CountType>
- Build a histogram with fixed buckets, each bucket containing the number of values in that range.
- auto getCentroids() const -> std::vector<CentroidType> const &
- Get the actual centroid array that makes up the histogram.
- void dump()
- Dump the histogram for debugging.
- auto buildContainerString() -> std::string
- Build a readable string that contains all internal data for testing.
-
template<typename SerializerT>void serialize(SerializerT& s)
- Serialize the centroid.
- auto getMaxCentroids() const -> CountType
- Get the max number of centroids.
- auto getNumCentroids() const -> CountType
- Get the number of centroids that are actually used.
Function documentation
template<typename T, typename CountT>
vt:: util:: adt:: HistogramApprox<T, CountT>:: HistogramApprox(CountType in_max_centroids) explicit
Construct a new approximate histogram with a bound on storage.
Parameters | |
---|---|
in_max_centroids in | max number of centroids (bounds space) |
template<typename T, typename CountT>
T vt:: util:: adt:: HistogramApprox<T, CountT>:: getMax() const
Get the maximum value ever inserted in the histogram.
Returns | the max value inserted |
---|
template<typename T, typename CountT>
T vt:: util:: adt:: HistogramApprox<T, CountT>:: getMin() const
Get the minimum value ever inserted in the histogram.
Returns | the min value inserted |
---|
template<typename T, typename CountT>
double vt:: util:: adt:: HistogramApprox<T, CountT>:: quantile(double const p)
Estimate the p'th quantile defined as the smallest data point, d, such that p*total_count_ <= d.
Parameters | |
---|---|
p in | the quantile between [0.0, 1.0] |
Returns | data point, d |
template<typename T, typename CountT>
double vt:: util:: adt:: HistogramApprox<T, CountT>:: estimateNumValues(double p)
Estimate the number of values <= p in the histogram.
Parameters | |
---|---|
p in | the value |
Returns | number of values <= p |
This is implemented based on the math in the referenced paper in algorithm 3 "Sum Procedure".
template<typename T, typename CountT>
void vt:: util:: adt:: HistogramApprox<T, CountT>:: mergeIn(HistogramApprox<T, CountType> const& in_hist)
Merge in another histogram.
Parameters | |
---|---|
in_hist in | the histogram to merge in |
template<typename T, typename CountT>
std::vector<CountType> vt:: util:: adt:: HistogramApprox<T, CountT>:: computeFixedBuckets(int n_buckets)
Build a histogram with fixed buckets, each bucket containing the number of values in that range.
Parameters | |
---|---|
n_buckets in | the number of buckets |
If n_buckets
== 4, the buckets segments will be: { [0, 0.25); [0.25, 0.5); [0.5, 0.75); [0.75, 1.0) }
Then, for each one of these bucket segments, the number of values that fall in that proportion of the range of values.
template<typename T, typename CountT>
std::vector<CentroidType> const & vt:: util:: adt:: HistogramApprox<T, CountT>:: getCentroids() const
Get the actual centroid array that makes up the histogram.
Returns | the centroids |
---|
template<typename T, typename CountT>
template<typename SerializerT>
void vt:: util:: adt:: HistogramApprox<T, CountT>:: serialize(SerializerT& s)
Serialize the centroid.
Parameters | |
---|---|
s in | the serializer |
template<typename T, typename CountT>
CountType vt:: util:: adt:: HistogramApprox<T, CountT>:: getMaxCentroids() const
Get the max number of centroids.
Returns | the max centroids allowed |
---|
template<typename T, typename CountT>
CountType vt:: util:: adt:: HistogramApprox<T, CountT>:: getNumCentroids() const
Get the number of centroids that are actually used.
Returns | the number of centroids being used |
---|