It's That Time of the Year Again to Work on Your Snowplow
Back in 2014 we published a series of blog posts on using Snowplow event data in the graph database Neo4j. Three years on, they're still among our most popular blog posts. (See below for links to the original posts.)
A lot has changed since then. Neo4j has strengthened its position as a leading graph database solution. Its query language, Cypher, has grown with the platform, to the point where some of the queries from the original posts no longer work verbatim. And we've come up with a more straightforward model to fit Snowplow data into a graph environment.
At a recent hackathon in Vienna we had a chance to dive back into the topic. We discovered our old blog posts could do with an update. This is it.
With Snowplow, we want to empower our users to get the most out of their data. Where your data lives has big implications for the types of query and analysis you can run on it.
Most of the time, we're analyzing data with SQL. This is great for a whole class of OLAP-style analytics. It enables us to slice and dice different combinations of dimensions and metrics. We can aggregate to the level of the user, session, page or other entity that we care about.
However, when we're doing event analytics, we often want to understand the sequence of events. We want to know, for example:
- How long does it take users to get from point A to point B on our website or mobile app?
- What are the different paths that people take to get to point C?
- What are the different paths that people take from point D?
This type of path analysis is not well supported by traditional SQL databases. It results in many table scans, and the window functions we use to first order events and then sequence them are expensive.
Graph databases represent a different approach to storing and querying data. We've started experimenting with using them to try and answer some of the questions above. In this blog post, we'll cover the basics of graph databases. We'll share some of the experimentation we've done with Neo4j. Then we'll show how to load event-level Snowplow data into Neo4j. Finally, we'll demonstrate how to perform the path analysis mentioned above.
Modeling event data in graph databases
Social networks use graph databases to model data where relationships are important. (Facebook has a search tool called 'Graph Search'.) A graph database consists of:
- nodes, which we can consider to be objects,
- and directed edges or relationships, which connect nodes.
So on Facebook, both you and your friends are nodes; photos are nodes as well. Various relationships connect all these objects to each other. Adding a photo is a relationship between a user node and a photo node. So is liking a photo. And friendship is a relationship between two user nodes.
To find out who liked a particular photo, we first need to identify the node representing that photo. Then we follow its incoming [:LIKED] relationships and see where they end up. By contrast, in Redshift we would need to do a full table scan of all photos to identify the one that interests us. Then we would have to scan another table of likes to identify the users linked to the photo. Finally, we would have to scan the full users table to identify details about the user that liked the photo.
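As a minimal sketch in Cypher (the node labels, property names and values here are illustrative, not taken from the post), the photo example could be modeled and queried like this:

// Toy graph: two users and a photo
CREATE (alice:User {name: "Alice"}),
       (bob:User {name: "Bob"}),
       (photo:Photo {id: "photo-123"}),
       (alice)-[:ADDED]->(photo),
       (bob)-[:LIKED]->(photo);

// Who liked photo-123? Start at the photo node and follow its incoming [:LIKED] relationships
MATCH (liker:User)-[:LIKED]->(photo:Photo {id: "photo-123"})
RETURN liker.name;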
For our needs, we want to describe a user's journey through a website, so we need to decide how to model that journey as a graph. To start off, we want to model page_view events. For this we will need some User nodes (domain_userid) and some Page nodes (page_urlpath, let's say). We can add a simple (User)-[:VIEWS]->(Page) relationship between them. But that will not let us track the order of the events: pages do not 'happen' one after another, they all exist at the same time. For path analysis, we need the actual page_view event to be the Page node.
In the example below there are two nodes: a User node and a Page node. The Page node has been given the additional property {url: "Page 1"}, which allows us to see on which page the event happened.

(This is a departure from how we did things last time around. Then, the page view event was a node that sat in between the User node and the Page node. Using the concept of event grammars, we set up a generalizable relationship: (User)-[:VERB]->(View)-[:OBJECT]->(Page).)
We can link the Page nodes together to put these events in order. The diagram below shows a user who has visited Page 1, then Page 2, then Page 1 again, and finally Page 3. And we are not limited to page views: we can model page pings, link clicks, add-to-basket events and so on by adding extra nodes and relationships.
We will be using Neo4j for our experiments because it has two features in particular that stand out:

- It has a browser-based interface which automatically creates visualizations of the graph like the ones in this post. This is a real help in building our initial graphs and developing queries against them.
- We get to use Cypher, Neo4j's expressive query language, to ask questions of our data. By way of example, to create the relationships between the user and the first two page views above, we first CREATE the nodes we're interested in and then describe the relationships between them:
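A minimal sketch of those statements, with illustrative property values standing in for real IDs and URLs, might look like this:

CREATE (u:User {id: "some-domain-userid"}),
       (p1:Page {url: "Page 1"}),
       (p2:Page {url: "Page 2"}),
       (u)-[:VIEWS]->(p1),
       (u)-[:VIEWS]->(p2),
       (p1)-[:NEXT]->(p2);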
Loading Snowplow data into Neo4j
Let's now walk through taking Snowplow data from Redshift and loading it into Neo4j.
We'll start by figuring out how to transform and fetch the data out of our Snowplow Redshift database. Then, we'll look at how to import it into Neo4j.
Our graph data model (visualized above) contains two types of nodes:

- User nodes
- Page nodes (for now our interest is in page views, but that can be any event)

and two types of relationships:

- VIEWS relationships that link users to the events that they have performed
- NEXT relationships that order the events that have occurred for a specific user. They run from each event to the next event for that user.
Nodes and relationships can also have properties. For our experiment, the relationships don't need any properties. We'll assign the User nodes their unique domain_userid. The Page nodes will need a bit more detail:

- event_id, to identify each event
- page_urlhost and page_urlpath combined into a single property, to identify the page
- derived_tstamp, so we know when it occurred
- domain_sessionidx, to distinguish sequences of events for a specific user between sessions
- refr_urlhost and refr_urlpath combined into a single property, to infer how users move from one page to the next
- domain_userid, as a shortcut so we can identify which user an event belongs to, without walking the graph
Getting the data out of Redshift
The following SQL query fetches one year's worth of data for our Page nodes:
{% highlight sql linenos %}
-- Select the data for the properties we want to populate
WITH step1 AS (
  SELECT event_id AS event_id, CONCAT(page_urlhost, page_urlpath) AS page_url,
         derived_tstamp AS derived_tstamp, domain_sessionidx AS domain_sessionidx,
         CONCAT(refr_urlhost, refr_urlpath) AS refr_url, domain_userid AS domain_userid
  FROM atomic.events
  WHERE derived_tstamp::DATE BETWEEN '2016-07-02' AND '2017-07-02'
    AND page_urlpath IS NOT NULL AND domain_userid IS NOT NULL AND event_name = 'page_view'
  GROUP BY 1, 2, 3, 4, 5, 6
),

-- Deduplicate the events
step2 AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY derived_tstamp) AS n
  FROM step1
)

-- Only select the fields from step 1
SELECT event_id, page_url, derived_tstamp, domain_sessionidx, refr_url, domain_userid
FROM step2
WHERE n = 1;
{% endhighlight %}
This query gives us one line per page_view event. We also have all the domain_userid values that we need for the User nodes. However, in many cases there are many events for each domain_userid. To speed up the process, it's better to give Neo4j a deduplicated list of users, so it does not have to check for duplicates when adding the nodes.
To fetch a unique list of users, run this query:
{% highlight sql linenos %}
SELECT domain_userid
FROM atomic.events
WHERE derived_tstamp::DATE BETWEEN '2016-07-02' AND '2017-07-02'
  AND page_urlpath IS NOT NULL AND domain_userid IS NOT NULL AND event = 'page_view'
GROUP BY 1;
{% endhighlight %}
You could also include any extra information you want to capture about users, e.g. location or email. The fastest way to do this is while you build the database, but you can still add properties to nodes later.
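For instance, assuming you had an email address to hand for a given user, a later update could look like the sketch below (the id and email values are placeholders):

MATCH (u:User {id: "some-domain-userid"})
SET u.email = "user@example.com";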
We'll also need to pull some data to help us build the NEXT relationships between the different page view events. To do this, we can use a window function to identify each event's follow-up page view. We need to partition our data by domain_userid and domain_sessionidx. We also need to order it by a timestamp that preserves the original order of events, such as the dvce_created_tstamp or the derived_tstamp (not the collector_tstamp!):
{% highlight sql linenos %}
WITH step1 AS (
  SELECT event_id,
         LEAD(event_id, 1) OVER (PARTITION BY domain_userid, domain_sessionidx ORDER BY derived_tstamp) AS next_event_id,
         derived_tstamp -- to deduplicate
  FROM atomic.events
  WHERE derived_tstamp::DATE BETWEEN '2016-07-02' AND '2017-07-02'
    AND page_urlpath IS NOT NULL AND domain_userid IS NOT NULL AND event = 'page_view'
),

-- Deduplicate the events
step2 AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY derived_tstamp) AS n
  FROM step1
),

-- Only select relevant fields
step3 AS (
  SELECT event_id, next_event_id
  FROM step2
  WHERE n = 1
)

SELECT *
FROM step3
WHERE next_event_id IS NOT NULL; -- filter out events that have no follow-up page view (because they are the last event in the session)
{% endhighlight %}
(A side note. In step2 of the above query we're deduplicating the results to ensure that we end up with a list of unique page_view events and their follow-up events, if any. We do this to make the query universally applicable. However, it's much better if there are no duplicates in your atomic tables to begin with. Since Snowplow R88 we take care of almost all duplicates during processing, so they don't usually make it into Redshift, provided you have cross-batch natural deduplication turned on, which is something we do by request as it has an associated cost. For historical, pre-R88 duplicates, you can use the deduplication queries that we released in R72.)
Let's save the results from the above queries as three .csv files: page_nodes.csv, user_nodes.csv and next_relationships.csv.
Getting the nodes and relationships into Neo4j
Previously, we got data into Neo4j by writing CREATE statements directly in the browser console. This was fine for a few nodes, but doing it for hundreds of thousands of nodes is somewhat unwieldy. Now that we have our data in .csv format, there is a better option: Cypher's LOAD CSV clause.
If you are loading the .csv files from your local machine (as we are), make sure they are in the Neo4j/default.graphdb/import folder (or its equivalent on your computer).
Let's start by loading the User nodes. We can do it by running the following Cypher query:
LOAD CSV WITH HEADERS FROM "file:///user_nodes.csv" AS line CREATE (u:User {id: line.domain_userid});
Then, let's load our Page nodes:
LOAD CSV WITH HEADERS FROM "file:///view_nodes.csv" Every bit line CREATE (p:Page {id: line.event_id , user: line.domain_userid, page: line.page_url, tstamp: line.derived_tstamp, referrer: line.refr_url, session: line.domain_sessionidx});
Before we go on, let's also create a uniqueness constraint on the two labels we already have: User and Page. That will serve as a check to ensure that all nodes have unique id properties; if they don't, Neo4j won't let us create the constraint. (In this instance, the id property in Neo4j corresponds to domain_userid or event_id from Redshift, depending on the type of node.) More importantly, the constraint will ensure that we won't be able to accidentally introduce duplicates if we decide to add more nodes.
CREATE CONSTRAINT ON (user:User) ASSERT user.id IS UNIQUE; CREATE CONSTRAINT ON (page:Page) ASSERT page.id IS UNIQUE;
Next up, let's link our User nodes to our Page nodes by creating the relationships between them. We don't need to use the LOAD CSV clause, since all the information we need has already been loaded into Neo4j:
MATCH (u:User), (p:Page) WHERE u.id = p.user CREATE (u)-[:VIEWS]->(p);
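On a large graph this MATCH can be slow, because every User id has to be compared against the user property of every Page node. One option, using the index syntax of the Neo4j 3.x releases current when this post was written, is to index that property first:

CREATE INDEX ON :Page(user);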
If you open the Database Information tab (top left on your screen), you can see lists of all the node labels, relationship types and property keys in your database. Clicking on the VIEWS relationship type will auto-execute a query that shows you 25 such relationships. The results will look similar to this:
Finally, let's create the relationships between the different page view events that belong to the same user:
LOAD CSV WITH HEADERS FROM "file:///next_relationships.csv" Every bit line Lucifer (current:Page), (next:Page) WHERE line.event_id = current.id AND line.next_event_id = next.id CREATE (current)-[:Adjacent]->(side by side);
Now all our page views are linked to the user who visited the page, and also to each other if they were part of a series of visits.
On the left above we can see the history of a user who visited five pages during their first session. On the right, another user is seen visiting four pages and then hitting the refresh button on the last one. (We've chosen to visualize the session index in this case, but any of the node's properties could be surfaced, such as page URL, referrer, etc.)
Visualizing the data
The Neo4j browser console does a great job of visualizing the data in the database. We can use it to search for some patterns that we expect to find, using the LIMIT clause to avoid being inundated. For example:
MATCH (u:User)-[:VIEWS]->(p:Page) RETURN u, p LIMIT 10;
shows us some 'user views page' relationships:
And we can check that our NEXT relationships are doing what we expect with:

MATCH p = (:Page)-[:NEXT*1..5]->(:Page) RETURN p LIMIT 10;

[:NEXT*1..5] tells Neo4j to follow between 1 and 5 relationships when walking the graph. This results in:
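We can also anchor the same pattern on a single visitor rather than a random sample; the id value in this sketch is a placeholder for a real domain_userid:

MATCH p = (u:User {id: "some-domain-userid"})-[:VIEWS]->(:Page)-[:NEXT*1..5]->(:Page)
RETURN p LIMIT 10;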
Using Neo4j to perform path analysis
In this section, we'll answer some questions about the journeys that users take through our own website. We'll start by answering some easy questions to get used to working with Cypher; some of these simpler queries could easily be written in SQL, and we're just interested in checking out how Cypher works at this stage. Later on, we'll move on to answering questions that cannot be easily answered using SQL.
We'll answer the following questions:
- How many visits were there to our homepage?
- What page were users on before arriving at the 'About' page?
- What journeys do users take from the homepage?
- What are the most common journeys that end on a particular page?
- How long does it take users to get from one page to another?
- What are some common user journeys?
How many visits were there to our homepage?
We start by finding the type of journey we are interested in ('user views homepage') in the MATCH clause. We've named our variables user, r (short for 'relationship') and home. We don't end up using the user variable, but it's in the query just to make it friendlier.

Then, we return the page attribute for the nodes that match the home variable (in this case it always has the same value: snowplowanalytics.com/), and a count of the incoming relationships from among the matching patterns.
MATCH (user:User)-[r:VIEWS]->(home:Page {page: "snowplowanalytics.com/"}) RETURN home.page AS url, count(r) AS visits;
This returns a table that tells us the number of views of the home page:
| url | visits |
|---|---|
| snowplowanalytics.com/ | 82026 |
Now we can look for 'bounces': visitors who only went to the homepage and then left the site. For this, we start by matching the same patterns, but then limit them with a WHERE clause and the NOT operator.
MATCH (user:User)-[r:VIEWS]->(home:Page {page: "snowplowanalytics.com/"}) WHERE NOT (home)-[:NEXT]->() RETURN home.page AS url, count(r) AS visits;
| url | visits |
|---|---|
| snowplowanalytics.com/ | 31116 |
And then, of the 82,026 homepage views in that period, 31,116 were non followed past some other page view inside the same session.
Now let's consider a more interesting question…
What page were users on before arriving at the 'About' page?
Let's say that we're interested in our 'About' page because it has our contact details. We want to find out how users arrive at this page. That means we need to follow the NEXT relationships backwards from the events in our Page nodes to identify the pages that were viewed before the 'About' page.
We start by specifying a pattern that ends in the 'About' page. Then we aggregate the results:
MATCH (about:Page {page: "snowplowanalytics.com/company/"})<-[:NEXT]-(prev:Page) RETURN prev.page AS previous_page, count(prev) AS visits ORDER BY count(prev) DESC LIMIT 10;
This time, we've asked Neo4j to order the results in descending order and limit them to the top 10.
| previous_page | visits |
|---|---|
| snowplowanalytics.com/ | 358 |
| snowplowanalytics.com/company/ | 77 |
| snowplowanalytics.com/products/snowplow-insights/ | 76 |
| snowplowanalytics.com/services/ | 73 |
| snowplowanalytics.com/products/snowplow-open-source/ | 47 |
| snowplowanalytics.com/products/snowplow-react/ | 45 |
| snowplowanalytics.com/customers/trusted-by/ | 42 |
| snowplowanalytics.com/products/ | 42 |
| snowplowanalytics.com/company/contact-us/ | 39 |
| snowplowanalytics.com/customers/ | 20 |
You will notice that a lot of people seem to have visited the 'About' page two times in a row. In fact, more people have done so than have come from any other page on the website apart from the homepage. These cases can be explained as page refreshes. Since they don't tell us a lot about the user's behavior, let's exclude them from the results:
MATCH path = (about:Page {page: "snowplowanalytics.com/company/"})<-[:NEXT]-(prev:Page) WHERE NOT prev.page = about.page RETURN prev.page AS previous_page, count(prev) AS visits ORDER BY count(prev) DESC LIMIT 10;
| previous_page | visits |
|---|---|
| snowplowanalytics.com/ | 358 |
| snowplowanalytics.com/products/snowplow-insights/ | 76 |
| snowplowanalytics.com/services/ | 73 |
| snowplowanalytics.com/products/snowplow-open-source/ | 47 |
| snowplowanalytics.com/products/snowplow-react/ | 45 |
| snowplowanalytics.com/customers/trusted-by/ | 42 |
| snowplowanalytics.com/products/ | 42 |
| snowplowanalytics.com/company/contact-us/ | 39 |
| snowplowanalytics.com/customers/ | 20 |
| discourse.snowplowanalytics.com/ | 16 |
It is easy to extend our query so it finds the page users were on two steps before they got to the 'About' page. We only have to add an extra NEXT relationship in the MATCH clause:

MATCH (about:Page {page: "snowplowanalytics.com/company/"})<-[:NEXT]-()<-[:NEXT]-(prev:Page) WHERE NOT prev.page = about.page RETURN prev.page AS previous_page, count(prev) AS visits ORDER BY count(prev) DESC LIMIT 10;

As a shortcut, we can instruct Neo4j to follow two relationships by writing [:NEXT*2]:

MATCH (about:Page {page: "snowplowanalytics.com/company/"})<-[:NEXT*2]-(prev:Page) WHERE NOT prev.page = about.page RETURN prev.page AS previous_page, count(prev) AS visits ORDER BY count(prev) DESC LIMIT 10;
In either case the result is the same:
| previous_page | visits |
|---|---|
| snowplowanalytics.com/ | 182 |
| snowplowanalytics.com/products/snowplow-insights/ | 48 |
| snowplowanalytics.com/services/ | 44 |
| snowplowanalytics.com/products/snowplow-react/ | 36 |
| snowplowanalytics.com/products/snowplow-open-source/ | 31 |
| snowplowanalytics.com/products/ | 27 |
| snowplowanalytics.com/customers/trusted-by/ | 23 |
| snowplowanalytics.com/company/careers/ | 20 |
| snowplowanalytics.com/customers/ | 14 |
| snowplowanalytics.com/blog/ | 13 |
We can go back even further and find the page users were on five steps before the 'About' page:
MATCH (about:Page {page: "snowplowanalytics.com/company/"})<-[:NEXT*5]-(prev:Page) WHERE NOT prev.page = about.page RETURN prev.page AS previous_page, count(prev) AS visits ORDER BY count(prev) DESC LIMIT 10;
This is the kind of search that would be hard in SQL, because it would involve a full table scan for every step back we want to take from our destination page. Neo4j handles this type of query very comfortably, because executing it is simply a matter of identifying journeys that end on the page and then walking the graph just for those journeys. It returned the results of this particular query in 812ms.
What journeys do users take from the homepage?
In the last section we identified journeys that lead to a particular page. Now let's take a page as a starting point, and see how journeys progress from there.
For this example, we'll start on our homepage. Let's identify the three steps that a user takes from the homepage, as a sequence (rather than individual steps as we did in the previous example). We'll use the EXTRACT function to return just the URL attached to the events in the path, rather than the nodes themselves. That's because we're not looking for user IDs, timestamps, etc, so this will give us some cleaner results.
MATCH path = (home:Page {page: "snowplowanalytics.com/"})-[:NEXT*3]->(:Page) RETURN EXTRACT(p IN NODES(path)[1..LENGTH(path)+1] | p.page) AS path, COUNT(path) AS users ORDER BY COUNT(path) DESC LIMIT 10;
This query gives us the 10 most common paths from the homepage:
| path | users |
|---|---|
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/services/", "snowplowanalytics.com/guides/"] | 1025 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/product/", "snowplowanalytics.com/services/"] | 698 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/services/", "snowplowanalytics.com/product/"] | 676 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/", "snowplowanalytics.com/"] | 542 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/", "snowplowanalytics.com/product/"] | 526 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/product/", "snowplowanalytics.com/product/"] | 464 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/trial/", "snowplowanalytics.com/product/"] | 429 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/trial/", "snowplowanalytics.com/guides/"] | 288 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/guides/", "snowplowanalytics.com/guides/"] | 288 |
| ["snowplowanalytics.com/product/", "snowplowanalytics.com/", "snowplowanalytics.com/"] | 240 |
(A quick side note to explain what NODES(path)[1..LENGTH(path)+1] does.

NODES(path) produces a list of all nodes that are part of the pattern we called path. [1..LENGTH(path)+1] then selects only certain elements from that list: everything from the 1st element to the LENGTH(path)+1th element. The index is 0-based. Since we do not want to include the starting page (the homepage) in the results, we want to exclude the 0th element and start with the 1st one. On the other end, we want to see the last page in the pattern, so we do not want to exclude anything. Furthermore, the square bracket notation in Cypher extracts from the start index up to but not including the end index. So if we want to see the last node, we must add 1 to the end index.

The LENGTH() function counts the number of relationships in the pattern. In this case we're following up to 3 relationships, so the result of the LENGTH() function will be in the range from 1 to 3. A pattern with length 3 has 4 nodes: (n1)-[r1]->(n2)-[r2]->(n3)-[r3]->(n4). If we want to see (n4) in the results of the NODES() function, the end index must then be LENGTH(path)+1.)
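To see the slicing behaviour in isolation, here is a small self-contained illustration; the literal list stands in for the four nodes of a three-step path:

WITH ["home", "page A", "page B", "page C"] AS nodes_in_path
RETURN nodes_in_path[1..4] AS pages_after_home;
// returns ["page A", "page B", "page C"]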
What are the most common journeys that end on a particular page?
This time we'll look at paths that lead to the 'About' page. The only changes we need to make from our previous example are to change the target page and reverse the path order. But just to keep things varied, let's also exclude paths that include the 'About' page before the end.
MATCH path = (:Page)-[:NEXT*3]->(about:Page {page: "snowplowanalytics.com/company/"}) WHERE NONE(visit IN NODES(path)[0..LENGTH(path)] WHERE visit.page = about.page) RETURN EXTRACT(p IN NODES(path)[0..LENGTH(path)] | p.page) AS path, COUNT(path) AS users ORDER BY COUNT(path) DESC LIMIT 10;
| path | users |
|---|---|
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-insights/", "snowplowanalytics.com/services/"] | 7 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/", "snowplowanalytics.com/"] | 5 |
| ["snowplowanalytics.com/products/snowplow-insights/", "snowplowanalytics.com/products/snowplow-react/", "snowplowanalytics.com/products/snowplow-open-source/"] | 5 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-insights/", "snowplowanalytics.com/products/snowplow-open-source/"] | 4 |
| ["snowplowanalytics.com/products/snowplow-insights/", "snowplowanalytics.com/services/", "snowplowanalytics.com/products/snowplow-insights/"] | 3 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-insights/", "snowplowanalytics.com/request-demo/"] | 3 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/services/", "snowplowanalytics.com/products/"] | 3 |
| ["snowplowanalytics.com/products/snowplow-react/", "snowplowanalytics.com/products/snowplow-insights/", "snowplowanalytics.com/products/snowplow-open-source/"] | 3 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-open-source/", "snowplowanalytics.com/services/"] | 3 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-react/", "snowplowanalytics.com/products/snowplow-insights/"] | 3 |
How many pages do users visit to go from one specific page to another?
In order to understand how users are using a website, we may want to measure how many pages they viewed to get from one specified page to another specified page.
First, we need to match the pages we're interested in, as well as the pattern that joins them.
Then, we'll want to exclude journeys that have either the start or the end page as an intermediate step. There are two good reasons for doing this. Consider a user who arrives at the homepage, reads some of the pages in the 'Services' section of the site, then returns to the homepage and goes straight to the blog. According to our matching rules, this user would be counted twice: once for their first visit to the homepage, and again for their second visit. It also seems reasonable to rule out the longer journey: after all, maybe they weren't looking for the blog when they first arrived at the home page.
MATCH path = (home:Page {page: "snowplowanalytics.com/"})-[:NEXT*..10]->(blog:Page {page: "snowplowanalytics.com/blog/"}) WHERE NONE(visit IN NODES(path)[1..LENGTH(path)] WHERE visit.page = home.page OR visit.page = blog.page) RETURN LENGTH(path) AS steps_from_homepage_to_blog, COUNT(LENGTH(path)) AS users ORDER BY LENGTH(path) LIMIT 10;
| steps_from_homepage_to_blog | users |
|---|---|
| 1 | 2012 |
| 2 | 635 |
| 3 | 443 |
| 4 | 312 |
| 5 | 273 |
| 6 | 145 |
| 7 | 129 |
| 8 | 67 |
| 9 | 42 |
| 10 | 24 |
The above table shows that the most common route from the homepage to the blog page is to go there directly, but that it is not uncommon to make this journey in two, three, four or five steps.
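If a single summary figure is more useful than the full distribution, a variant of the same query (a sketch, not from the original post) can average the path lengths instead:

MATCH path = (home:Page {page: "snowplowanalytics.com/"})-[:NEXT*..10]->(blog:Page {page: "snowplowanalytics.com/blog/"})
WHERE NONE(visit IN NODES(path)[1..LENGTH(path)] WHERE visit.page = home.page OR visit.page = blog.page)
RETURN avg(LENGTH(path)) AS avg_steps_from_homepage_to_blog, count(path) AS journeys;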
What are some common user journeys?
So far, we've been specifying pages to start or finish at. But we can also ask Neo4j to find common journeys of a given length from anywhere to anywhere on the website. Let's look for journeys of up to three steps, excluding repeat visits to the same page. Let's also make sure that we only count journeys from a page where a visit actually started to a page where a visit actually ended, i.e. not count partial journeys.
MATCH (start:Page), (end:Page), path = (start)-[:NEXT*..3]->(end) WHERE NOT (:Page)-[:NEXT]->(start) AND NOT (end)-[:NEXT]->(:Page) AND NONE(p IN NODES(path)[1..LENGTH(path)+1] WHERE p.page = start.page) AND NONE(p IN NODES(path)[0..LENGTH(path)] WHERE p.page = end.page) RETURN EXTRACT(p IN NODES(path)[0..LENGTH(path)+1] | p.page) AS path, COUNT(path) AS users ORDER BY COUNT(path) DESC LIMIT 10;
| path | users |
|---|---|
| ["snowplowanalytics.com/", "snowplowanalytics.com/product/"] | 6117 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/guides/"] | 952 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/services/"] | 883 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/product/", "snowplowanalytics.com/services/"] | 743 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-insights/"] | 599 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/product/", "snowplowanalytics.com/trial/"] | 580 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/trial/"] | 522 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/products/snowplow-open-source/"] | 506 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/product/", "snowplowanalytics.com/guides/"] | 436 |
| ["snowplowanalytics.com/", "snowplowanalytics.com/blog/"] | 345 |
Since we were only interested in journeys of up to three steps, it was easy to exclude paths where the start or end page was repeated, and that meant that no other pages were repeated either. For longer journeys, though, we need a different approach. Nicole White's GraphGist explains how we can use the UNWIND clause to count the number of distinct pages visited. By comparing the number of distinct pages to the length of the path, we can exclude paths that have loops. (Nicole's example is based on the older version of this blog post.)
Now we can find the 10 most common journeys of between five and six steps without repetitions:
MATCH (start:Page), (end:Page), path = (start)-[:NEXT*5..6]->(end) WHERE NOT (:Page)-[:NEXT]->(start) AND NOT (end)-[:NEXT]->(:Page) WITH path, EXTRACT(p IN NODES(path) | p.page) AS pages UNWIND pages AS views WITH path, COUNT(DISTINCT views) AS distinct_views WHERE distinct_views = LENGTH(NODES(path)) RETURN EXTRACT(p IN NODES(path)[0..LENGTH(path)+1] | p.page) AS path, COUNT(path) AS users ORDER BY COUNT(path) DESC LIMIT 10;
Summary
In this post, we experimented with using Neo4j to answer increasingly open-ended questions about how users travel through our website. This is very different from the traditional web analytics approach of defining a particular funnel and then seeing how many people make it through that funnel. Instead, we're exploring how people actually behave, in a manner that doesn't limit our analysis with our own preconceptions about how people should behave.
The results of these experiments have been very promising. We've seen how we can use Neo4j to perform open-ended path analysis on our granular, event-level Snowplow data. Such analysis would be impossible or very hard in SQL.
Source: https://snowplowanalytics.com/blog/2017/07/17/loading-and-analysing-snowplow-event-data-in-Neo4j/