Open Source Release: postgresql-hll

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0, which we feel strikes an excellent balance between permissive usage and liability limitation.

What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve provided aggregate and debugging functions that make using and understanding them a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should provide a clear path for upgrading the algorithm as new engineering insights come up.
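
For a quick taste, here’s a minimal sketch of those basic operations in SQL (a sketch only; exact signatures are in REFERENCE.markdown):

-- create a table with an hll column and seed it with an empty sketch
CREATE TABLE helloworld (id integer, set hll);
INSERT INTO helloworld (id, set) VALUES (1, hll_empty());

-- insert: hash raw values and fold them into the sketch
UPDATE helloworld SET set = hll_add(set, hll_hash_text('hello')) WHERE id = 1;
UPDATE helloworld SET set = hll_add(set, hll_hash_text('world')) WHERE id = 1;

-- cardinality: the # operator returns the estimated number of distinct values
SELECT #set FROM helloworld WHERE id = 1;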

A quick overview of what’s included in the release:

  • C-based extension that provides the hll data structure and algorithms
  • Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
  • Full storage specification in STORAGE.markdown
  • Full function reference in REFERENCE.markdown
  • .spec file for rpmbuild
  • Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.
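
In practice the hashing step is just another function call. A small sketch (the wrapper names below follow the integer/bytes/text convention listed above; see REFERENCE.markdown for the full set):

-- hash wrappers for common input types
SELECT hll_hash_integer(12345);
SELECT hll_hash_text('user-123');

-- hash first, then add to a sketch
SELECT hll_add(hll_empty(), hll_hash_text('user-123'));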

Why did you build it?

The short answer is to power these two UIs:

On the left is a simple plot of the number of unique users seen per day, alongside the cumulative number of unique users seen over the days of the month. The SQL behind this is very straightforward:

SELECT report_date,
       #users as by_day,
       #hll_union_agg(users) OVER (ORDER BY report_date ASC) as cumulative_by_day
FROM daily_uniques
WHERE report_date BETWEEN '2013-01-01' AND '2013-01-31'
ORDER BY report_date ASC;

where daily_uniques is basically:

   Column    | Type | Modifiers 
-------------+------+-----------
 report_date | date | 
 users       | hll  |

Briefly, # is the cardinality operator, here applied to the hll result of the hll_union_agg window aggregate, which unions each day’s hll with those of all the preceding days.
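
For completeness, a table like daily_uniques can be populated from a raw fact table with the hll_add_agg aggregate. A sketch, where the facts table and its columns are hypothetical:

INSERT INTO daily_uniques (report_date, users)
SELECT report_date, hll_add_agg(hll_hash_bigint(user_id))
FROM facts
GROUP BY report_date;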

On the right is a heatmap of the percentage of an inventory provider’s users that overlap with another inventory provider’s. Essentially, we’re doing interactive set intersection of operands with millions or billions of entries, in milliseconds. The intersection is computed using the inclusion-exclusion principle as applied to hlls:

SELECT ip1.id as provider1,
       ip2.id as provider2,
       (#ip1.users + #ip2.users - #hll_union(ip1.users, ip2.users))/#ip1.users as overlap
FROM inventory_provider_stats ip1, inventory_provider_stats ip2
WHERE ip1.id <> ip2.id;

where inventory_provider_stats is basically:

   Column    |  Type   | Modifiers 
-------------+---------+-----------
 id          | integer | 
 users       | hll     |
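
Spelled out, the query leans on the inclusion-exclusion identity $|A \cap B| = |A| + |B| - |A \cup B|$: each term on the right is estimated from the hlls, and the result is normalized by #ip1.users to express the overlap as a fraction of provider 1’s users.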

(Some of you may note that the diagonal is labeled “exclusive reach” and is not represented in the query’s result set. That’s because the SQL above is a simplification of what’s happening. There’s some extra work done that replaces the useless diagonal entries with the percentage of the inventory provider’s users that are seen only on that inventory provider.)

We’ve been running this type of code in production for over a year now and are extremely pleased with its performance, ease of use, and expressiveness. Everyone from engineers to researchers to ops people to analysts has been using hlls in their daily reports and queries. We’re seeing product innovation coming from all different directions in the organization as a direct result of having these powerful data structures in an easily accessed and queried format. Dynamic COUNT(DISTINCT ...) queries that would have taken minutes or hours to compute from a fact table, or would have been impossible in traditional cube aggregates, return in milliseconds. Combine that speed with PostgreSQL’s window and aggregate functions and you have the ability to present interactive, rich distinct-value reporting over huge data sets. I’ll point you to the README and our blog posts on HyperLogLog for more technical details on storage, accuracy, and in-depth use cases.
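
To make that speedup concrete, here is the before-and-after shape of such a query (reusing the hypothetical facts table from the sketch above):

-- before: scan every matching row of the fact table
SELECT COUNT(DISTINCT user_id) FROM facts WHERE report_date = '2013-01-15';

-- after: read one pre-aggregated row and apply the cardinality operator
SELECT #users FROM daily_uniques WHERE report_date = '2013-01-15';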

I believe that this pattern of in-database probabilistic sketching is the future of interactive analytics. As our VP of Engineering Steve Linde said to me, “I can’t emphasize enough how much business value [sketches] deliver day in and day out.”

Our Commitment

Obviously we’re open-sourcing this for both philanthropic and selfish reasons: we’d love for more people to use this technology so that they can tell us all the neat uses for it that we haven’t thought of yet. In exchange for their insight, we’re promising to stay active in terms of stewardship and contribution of our own improvements. Our primary tool for this will be the GitHub Issues/Pull Request mechanism. We’d considered a mailing list but that seems like overkill right now. If people love postgresql-hll, we’ll figure something out as needed.

Please feel free to get in touch with us about the code on GitHub and about the project in general in the comments here. We hope to release additional tools that allow seamless Java application integration with the raw hll data in the future, so stay tuned!


Update

Looks like Dimitri Fontaine wrote up a basic “how-to” post on using postgresql-hll here, and another on unions here. (Thanks, Dimitri!) He brings up the issue that hll_add_agg() returns NULL when aggregating over an empty set, when it should probably return an empty hll. Hopefully we’ll have a fix for that soon. You can follow the progress of the issue here.
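
Until then, a hedged workaround is to coalesce the NULL into an empty hll yourself (again using the hypothetical facts table):

SELECT COALESCE(hll_add_agg(hll_hash_bigint(user_id)), hll_empty()) AS users
FROM facts
WHERE report_date = '2013-02-01';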

Comments

  1. Have you considered the application of HyperLogLog to database statistics? It’s been suggested that it could be a suitable replacement for the existing PostgreSQL ndistinct estimator (which is Haas and Stokes in IBM Research Report RJ 10025 – see src/backend/commands/analyze.c). HyperLogLog certainly isn’t suitable as a drop-in replacement, because it isn’t sampling-based, but some people are currently doing crazy things like calculating ndistinct using regular SQL, and then updating the ndistinct value for particular columns using DDL: ALTER TABLE foo ALTER COLUMN foocol SET (n_distinct = 5.0). It certainly seems to me like this could be helpful there, at the very least.

    • Hi Peter,

      We’ve run into many papers that offer query planning/optimization as a primary use case for their distinct value estimator. In fact, the LogLog paper mentions it in the first sentence of the introduction. I’ve been poking around analyze.c, and it’s clear that a good bit of work would need to be done to adapt the methodology to something like HyperLogLog or KMV, but based on the errors reported in the IBM paper you referenced, it seems that almost any “modern” DV estimator would perform better. I think there are two real issues at hand: the sampling model and stream deletions.

      Cardinality estimation of a column of values that only grows is a “simple” problem and HLL is well-suited to it, assuming you can get around the sampling model and allow the sketch to observe every value (say at insert time). However, if you allow deletions of rows, then you’re into a more complicated multiset or “bag” model. There are sketches which support multiset models, like KMV, which can be used but at a non-trivial cost.

      I can’t speak to why the current model was chosen, but if you could move to a “whole stream” model you’d likely have a broader, more accurate set of tools at your disposal.

      Best,
      Timon

  2. Very cool. For the overlap heatmap, how do you address the increase in relative error when comparing inventory providers of significantly different cardinality? I’ve read your post on HLL Intersections and was thinking of building something very similar to your heatmap, but for us many providers differ by 100x or more. Do you have a cardinality ratio threshold at which you stop estimating overlap? Do you only estimate overlap for the top N providers?

    Thanks again for another great post!

    • Hi Sean,

      Great question! The providers are sorted in descending order by size, from left to right and top to bottom. This orientation tends to focus on overlaps among comparable inventory sources, which is the main use of the tool. However, if a user moves the viewing window to the top-right or bottom-left corners, they might encounter overlaps that would violate the heuristic rules set out in the post you mention. In this case our approach is to blank out that cell in the table. Since this happens in the marginal areas of the chart, if at all, this limitation doesn’t really affect our users in practice.

      Glad you’re enjoying the blog!

      -Timon

  3. Just came across your blog. Great stuff; I look forward to reading more.
    Question: What tools do you use to generate the graphs, diagrams, and equations? Nice aesthetic. This question relates to many of the AK blog posts, not just this one.

    Thanks.

    • Hi Girish!

      Speaking for my posts, my primary tools for:

      • plotting: R and ggplot2
      • diagrams: Google Drawings
      • equations: WordPress’ latex plugin

      Hope this helps!

      Best,
      Timon

  4. Nam Nguyen says:

    Hello,
    I just tried this extension on Postgres as a demo, counting distinct visitors to my website. It is easy to understand and use, with a full feature set, but compared to Redis or a Python set it is very slow: about 500 insert queries per second. Below is my Python code:
    cur.execute('TRUNCATE TABLE daily_uniques')
    for id in time_id:
        cur.execute('INSERT INTO daily_uniques(id, visitors) VALUES ({0}, hll_empty())'.format(id))
        for i in range(self.total_visit):
            visit_id = random.randint(0, self.unique_visit)
            rs.sadd(id, visit_id)            # Redis set, for comparison
            self.hloglog.add(str(visit_id))  # my own hloglog implementation, also for comparison
            cur.execute("UPDATE daily_uniques SET visitors = hll_add(visitors, hll_hash_text('{0}')) WHERE id = {1}".format(visit_id, id))
    In your documentation, I read:
    “Empirically, the insertion rate of `EMPTY`, `EXPLICIT`, and `SPARSE` representations is measured in the 200k/s – 300k/s range, while the throughput of the `FULL` representation is in the millions of inserts per second on relatively new hardware (’10 Xeon).”
    So would you mind telling me the cause of this low performance?

    • Hi Nam,

      I’m afraid your test isn’t really measuring the throughput of the hll implementation but rather the speed of psycopg2’s submission of statements to Postgres.

      I wrote up a quick test to demonstrate my point: https://gist.github.com/timonk/5207225

      From Python, I can do:

      • no-op statements (SELECT 1) at a rate of 8k/sec,
      • BIGINT increments (UPDATE ... SET ct = ct + 1 ...) at a rate of 2500/sec, and
      • hll insertions with text hashing (UPDATE ... SET visitors = hll_add(visitors, hll_hash_text(...)) ...) at a rate of 1400/sec.

      When I stated that the FULL representation of hll could sustain millions of inserts per second, I was measuring its performance with the aggregate function hll_add_agg.

      You can see the last two tests measuring this with and without BIGINT hashing. Respectively, they showed throughput of 5m/sec and 8m/sec, which is in line with my statement.

      In general, the hll_add function should be used sparingly since it incurs the overhead of packing and unpacking the data structure for each insert. hll_add_agg only unpacks and packs the structure once, performing all the insertions in between.
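
      For example, here’s a sketch of the batched pattern, with visits_staging standing in for wherever your raw visit ids (as text) land, e.g. via a bulk COPY:

      -- unpack the target hll once, perform all insertions, pack it once
      UPDATE daily_uniques
      SET visitors = hll_union(visitors,
                               (SELECT hll_add_agg(hll_hash_text(visit_id)) FROM visits_staging))
      WHERE id = 0;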

      Best,
      Timon

  5. Hello,
    I read in the documentation the benchmark of the hll extension’s performance on Postgres:
    “Empirically, the insertion rate of `EMPTY`, `EXPLICIT`, and `SPARSE` representations is measured in the
    200k/s – 300k/s range, while the throughput of the `FULL` representation is in the millions of inserts
    per second on relatively new hardware (’10 Xeon).”
    But when I ran a demo test counting distinct visitors on my laptop (Core i5, Ubuntu 12.04 64-bit), I got only about 500 insert queries per second for
    approximately 50000 total visitors, which is not good performance.

    • Here is my code in Python; my apologies for the inconvenience:
      import psycopg2
      import random
      import time

      con = psycopg2.connect(host='localhost', user='postgres', port='5432', database='novanet')
      cur = con.cursor()

      K = 10 ** 3

      total_visit = 10 * K
      unique_visit = 10 ** 6
      print 'Total visitor: ' + str(total_visit)
      print 'Unique visitor: ' + str(unique_visit)

      cur.execute('TRUNCATE TABLE daily_uniques')
      cur.execute('INSERT INTO daily_uniques(id, visitors) VALUES (0, hll_empty())')

      tic = time.time()
      print 'Start:\t' + str(tic)
      for id in range(total_visit):
          visit_id = random.randint(0, unique_visit)
          cur.execute("UPDATE daily_uniques SET visitors = hll_add(visitors, hll_hash_text('{0}')) WHERE id = 0".format(visit_id))

      toc = time.time()
      print 'End:\t' + str(toc)
      print 'Time:\t' + str(toc - tic)

      con.commit()  # commit the updates before closing
      con.close()

      And the result is:
      Total visitor: 10000
      Unique visitor: 1000000
      Start: 1363685361.25
      End: 1363685374.26
      Time: 13.015775919

      • Hello!

        I’m afraid you ran into the same problem as the prior commenter. Please see my response to him above.

        Best,
        Timon

  6. Thanks for your answer; using hll_add_agg() improves performance a lot. We should handle visitors block by block instead of one at a time, right? That way we can insert a block of visitor ids into a fact table and aggregate them into daily_uniques. But have you thought about hll_add(hll, [hash_value_1, hash_value_2, … hash_value_n]), like a multi-value insert query “INSERT INTO table(col1, col2) VALUES (), (), …”?

    • My pleasure. We hadn’t considered the option of adding a version of hll_add that accepted an array. I’ll bring it up for discussion with the team. Thanks for your interest!

  7. Hello, this is AWESOME stuff!

    Do you know if anyone has built an Oracle PL/SQL-based implementation of a hyperloglog aggregate function like the one for PostgreSQL?

    Thanks!

    • Forgot to request follow-up comment notification via email…

      • Hi Philip, we’re not aware of any HLL libs for Oracle, unfortunately. I know our Postgres implementation may not lend itself to easy translation, but Armon Dadgar has a solid implementation of Google’s HLL++, here. It doesn’t have all of the extensions the Google folks present in the paper, but it’s still quite useful as it contains all of the relevant original HLL features. You may have better luck porting that to Oracle-friendly code than the stuff we have in our repo.

        Best,
        Timon

  8. This is some awesome stuff, amazing library to be able to use!

    Quick question, are you guys thinking of adding functionality for intersection of two sets, similar to your implementation of union, or is there a way to use the current union operations to get the intersection outcome?

    Thanks!

  9. This looks pretty cool. Are there Windows binaries available?

    • Thanks! There are not, unfortunately. We don’t really have any experience with Windows environments. Sorry!

