Thoughts Out

Technical blog to improve your Rails application. Follow me onGitHub

How to represent a world tree structure

Apr 25, 2015

In our main product (BRIDGE), a Store Locator, the SEO has to be really good. As such, we have to store a world tree structure in the database.

Use Case

A point of interest is attached to a world_node_id which has to be a leaf.

In a perfect world, every country would share the same structure, but that’s definitely not the case.

Just to take a few, in France, we have 2 sub-levels:

  • the region (Rhône-Alpes)

  • the department (Savoie) :metal:

In USA, there is only one sub-level (New-York). The United Kingdom, well… it’s complicated :smile:

So what if we need to efficiently retrieve the country name for all the points of interest in a single query?

Bonus: What if we need to count the number of points of interest per country or attached node/level?

The MySQL way

Let’s take as an example how it would look like with awesome_nested_set (the gem we were actually using).

Schema

This is how the structure would look like:

                                         Table "public.world_nodes"
      Column       |            Type             |                        Modifiers
-------------------+-----------------------------+----------------------------------------------------------
 id                | integer                     | not null default nextval('world_nodes_id_seq'::regclass)
 parent_id         | integer                     |
 name              | character varying(255)      |
 created_at        | timestamp without time zone |
 updated_at        | timestamp without time zone |
 lft               | integer                     |
 rgt               | integer                     |
 depth             | integer                     |
  • parent_id retrieves the closest parent.
  • depth identifies how deep we are in the tree.
  • rgt and lft are there to query the relationship over the nodes.

rgt and lft are used in a lot of queries like for ancestors here, descendants here or even leaf here.

JOINs vs CTEs

Back to our use case.

As a node is only aware about its closest parent, we need some JOINs and for the case of France it would look like this:

  SELECT countries.name
  FROM point_of_interests
  JOIN world_nodes AS departments ON departments.id = point_of_interests.world_node_id
  JOIN world_nodes AS regions ON regions.id = departments.parent_id
  JOIN world_nodes AS countries ON countries.id = regions.parent_id;

But as the number of levels is different over the countries, we can’t do that. This is where a RECURSIVE CTE would be helpful (This is one of the reason I :heart: PostgreSQL).

  WITH RECURSIVE country_ids (point_of_interest_id, country_id, depth) AS (
    SELECT id, world_node_id, 0
    FROM point_of_interests
    WHERE id IN (?, ?, ?, ?, ?)

    UNION ALL

    SELECT country_ids.point_of_interests_id, wn.parent_id, country_ids.depth + 1
    FROM world_nodes wn
    JOIN country_ids ON wn.id = country_ids.country_id
    WHERE wn.parent_id IS NOT NULL
  )

  SELECT point_of_interests.name, countries.name
  FROM
    point_of_interests,
    world_nodes AS countries,
    (
      SELECT DISTINCT ON (point_of_sale_id) point_of_sale_id, country_id
      FROM country_ids
      ORDER BY point_of_sale_id, depth DESC
    ) pos_countries
  WHERE world_nodes.id = pos_countries.country_id
  AND point_of_interests.id = pos_countries.point_of_interest_id;

The WITH RECURSIVE part will create a temporary table with the point_of_interest_id and the country_id as an output. We will then be able to use that table as any other table into another query.

That works, but at some point, we had to admit it was painful.

Slow is slow

As seen before, awesome_nested_set maintains a rgt and lft for every node.

This has a pretty negative impact when it comes to add more nodes to the tree. This because we have to rebuild all this stuff to make it usable.

The PostgreSQL way

What could be better than a ltree datatype to represent a tree?

The gem ltree_hierarchy (again wrote by the awesome Rob Worley :heart:) has been built to answer that question.

ltree_hierarchy shares the same API with awesome_nested_set.

Let’s see how it works and how it changes the thing up.

Schema

The structure would now look like this:

                                         Table "public.world_nodes"
      Column       |            Type             |                        Modifiers
-------------------+-----------------------------+----------------------------------------------------------
 id                | integer                     | not null default nextval('world_nodes_id_seq'::regclass)
 parent_id         | integer                     |
 name              | character varying(255)      |
 tree_path         | ltree                       |
  • tree_path, as its name suggests, will store the whole path of the current node.
  • parent_id will only store the closest parent.

tree_path changes everything.

JOINs the return

Back to the use case, here is a query that retrieves all the country names for every point of interest:

  SELECT countries.name
  FROM point_of_interests
  JOIN world_nodes AS leafs ON leafs.id = point_of_interest.world_node_id
  JOIN world_nodes AS countries ON subpath(leafs.tree_path, 0, 1) = countries.tree_path;

This is almost equivalent to the query that we would do for France in the previous case, but this time it’s compatible with everything.

Fast is fast

As we don’t have to compute and maintain the rgt/lft columns for every node when we update a single node, it’s really faster to populate the tree.

Benchmarks

I wanted to measure the impact on a script that adds 1200+ nodes in the existing tree.

With awesome_nested_set, the script runs in 8 minutes. With ltree_hierarchy, the script runs in 1 minutes.

For this test, I only switched the library from awesome_nested_set to ltree_hierarchy. 8 times faster… not bad.

Conclusion

Few months after the switch from awesome_nested_set to ltree_hierarchy and so far so good. We were massively caching the tree structure to answer faster to some queries. Every time we were performing an update in the tree we had to rebuild the cache.

The main problem is that we were not able to easily retrieve the informations we needed.

SQL query + Cache together aren’t good friends, are they?

We came back to the basics and we just got rid of the extra cache layer. We were also able to provide some JSON out of the database to reduce the impact of not relying on a caching layer anymore.

Note: the 14th of October, Rob transfered the ownership of his 2 gems (ltree_hierarchy and hstore_translate) to Leadformance. The idea of these 2 gems has been discussed internally when we were exchanging about how to improve the things. Out of these discussions, he developed the idea and open sourced them (faster than me :smile:).

:heart: Rob.