The aim of this paper is to study the search behavior of users, based on their Google search query terms, and to find similarities between the search behaviors of a pool of users. We want to identify the types of searches that are central to other searches. These searches would ideally lead to searches of other kinds, so it would be worthwhile to invest in Google ads for searches of this type. We can also trace how users who begin their search history in a particular Category move across other Categories. A rough hypothesis here is that different pools of users, separated by their interests, will exhibit different search behavior: users interested in particular topics will tend to search for a set of topics that differs from the set searched for by another pool of users.
Search query terms are divided into broad level categories, with the aim of segregating query terms into identifiable groups. Some examples of Categories are Automotives, Transport, Home decor, Pet animals, etc.
While this analysis can be pursued using several different techniques, we chose to use Social Network Analysis, a technique that relies on a network of nodes that communicate with each other.
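To make the idea of a network of Categories concrete, here is a minimal sketch of how one might be built. The text does not say how connections between Categories were derived, so the rule used here (two Categories are linked whenever a user searches them back to back) is an assumption, and the function name, user names, and search histories are hypothetical.

```python
from collections import defaultdict

def build_category_graph(sessions):
    """Link two categories whenever one user searches them back to back.

    Assumption: this linking rule is illustrative only; the original
    analysis does not state how its edges were formed.
    """
    graph = defaultdict(set)
    for categories in sessions.values():
        for src, dst in zip(categories, categories[1:]):
            if src != dst:              # skip repeated searches in one category
                graph[src].add(dst)
                graph[dst].add(src)     # undirected: the link counts both ways
    return {node: sorted(neighbors) for node, neighbors in graph.items()}

# Hypothetical search histories, already mapped from query terms to categories.
sessions = {
    "user_1": ["Shopping", "Navigation", "Restaurants"],
    "user_2": ["Automotives", "Navigation", "Shopping"],
    "user_3": ["Nutrition", "Restaurants", "Restaurants"],
}

graph = build_category_graph(sessions)
for category, neighbors in sorted(graph.items()):
    print(category, "->", neighbors)
```

The result is a plain adjacency mapping from each Category to its neighbors, which is the kind of input the centrality measures discussed in the next section operate on.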
Brief Introduction to Social Network Analysis
Social Network Analysis (SNA) is the study of the channels of communication in a group, with the intention of understanding how communication flows. If a group of people were to be studied, each person would be a ‘node’, and each node would contribute in some way to the communication in the network.
Nodes exhibit different degrees of connectivity; some nodes have more connections than others. Well-connected nodes often act as common points of passage for information. Such nodes then have a certain form of control over the communication, since it passes through them.
This kind of centrality is referred to as ‘betweenness centrality’ in SNA terminology. It reflects how often a node lies on the shortest paths between other nodes. The higher the betweenness centrality, the greater the node’s control over the flow of information.
Another important parameter for our analysis is ‘closeness centrality’, where we study the relative distance between pairs of nodes, distance being the number of nodes that the information has to pass through. Nodes that are relatively ‘closer’ have more access to one another, with few nodes on the path connecting them.
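As an illustration, betweenness centrality can be computed by counting, for every pair of nodes, how many of the shortest paths between them pass through each other node. The sketch below uses Brandes' algorithm for an unweighted, undirected graph. It is not necessarily the tool used for this analysis, and the function name and the four-node chain are hypothetical.

```python
from collections import deque

def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    `graph` maps each node to an iterable of its neighbors.
    Returns raw scores: the number of shortest paths between other
    node pairs that pass through each node (split when paths tie).
    """
    scores = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in graph}
        path_count = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_count[neighbor] += path_count[node]
                    predecessors[neighbor].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: score / 2.0 for node, score in scores.items()}

# Hypothetical chain of four nodes: w - x - y - z
chain = {"w": ["x"], "x": ["w", "y"], "y": ["x", "z"], "z": ["y"]}
chain_scores = betweenness_centrality(chain)
print(chain_scores)
```

In the chain, the two inner nodes each sit on two shortest paths and the end nodes sit on none, which matches the intuition that information must pass through the middle.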
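A common way to turn this idea into a number is to divide the count of other reachable nodes by the sum of the shortest distances to them. The sketch below does this with a breadth-first search. It measures distance in hops (links traversed), which is always one more than the number of intermediate nodes and gives the same ranking. The function name and the small example graph are hypothetical.

```python
from collections import deque

def closeness_centrality(graph):
    """Closeness = (number of reachable nodes) / (sum of hop distances to them)."""
    scores = {}
    for source in graph:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total = sum(distance.values())
        # An isolated node reaches nothing, so its closeness is set to 0.
        scores[source] = (len(distance) - 1) / total if total else 0.0
    return scores

# Hypothetical graph: a hub with three neighbors, one of which has a tail.
example = {
    "hub": ["p", "q", "r"],
    "p": ["hub"],
    "q": ["hub"],
    "r": ["hub", "tail"],
    "tail": ["r"],
}
closeness = closeness_centrality(example)
for node in sorted(closeness, key=closeness.get, reverse=True):
    print(node, round(closeness[node], 3))
```

The hub comes out as the closest node and the tail as the most distant one, since every path from the tail has to travel through two other nodes before it reaches the rest of the graph.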
We can now use this logic and apply it to our pool of search Categories, and see which ones exhibit a higher betweenness centrality, and which ones demonstrate better closeness.
Categories with a higher betweenness centrality will be more central to the whole ‘network’. These categories are good candidates for placing paid ads, since they can drive traffic towards searches in other categories. Using the instance of the simple network diagram, nodes B and C would enjoy a higher betweenness: each is connected to three other nodes, whereas the other three nodes (A, D, and E) are connected to just two nodes each. Strictly speaking, the number of direct connections is a node’s degree; betweenness depends on how many shortest paths run through the node, and in this small example the two measures point to the same nodes.
Using the same instance to study ‘closeness’, B and D are closer to each other than A and E are: B and D are connected directly, with no node in between, whereas A and E have one node (C) in their path of communication.
We can also measure how concentrated betweenness centrality is across the whole network, and gauge whether the network is mostly centralized or not. This parameter is referred to as the ‘network centralization’. If most nodes are connected to a centrally located node, the network centralization would be higher. In this case, a single node would dictate the information flow, and its failure could result in the failure of the network as a whole. If the network is spread out, with centrality distributed amongst a number of nodes, the network centralization would be lower. In such a relatively decentralized network, a number of nodes would exhibit a high centrality and control over the network. It will be interesting to see whether our network of sample searches is mostly centralized or not.
The centralization measure of the whole network in question is 0.32, which means that the centrality of the network is distributed amongst a number of nodes (Search Categories), instead of being concentrated in just a few. This makes intuitive sense.
For our analysis, this means that there are a number of categories that exhibit strong betweenness centrality measures. There is no single category that is searched the most and from which all other searches stem. This seems completely realistic. We, as humans, are diverse individuals. We share quite a few similar personality traits, but also have our own unique tendencies. How could we all search for the same Category at the beginning of our search activities? This diversity in personalities results in a low network centralization. The topics of interest of the pool of users are vast; hence, the communicational power is distributed amongst a number of nodes.
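The two observations about the five-node example can be checked with a few lines of code. The edge list below is an assumption: it is one layout that matches the description (B and C with three connections each, A, D, and E with two each, B and D directly linked, C sitting between A and E), not a transcription of the diagram. The helper name is hypothetical.

```python
from collections import deque

# Hypothetical edge list, chosen only to match the description in the text.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]

graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def intermediaries(graph, start, goal):
    """Number of nodes strictly between start and goal on a shortest path."""
    if start == goal:
        return 0
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return distance[node] - 1
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None  # goal cannot be reached from start

# Count of direct connections per node (the node's degree).
degree = {node: len(neighbors) for node, neighbors in sorted(graph.items())}
between_b_and_d = intermediaries(graph, "B", "D")
between_a_and_e = intermediaries(graph, "A", "E")

print(degree)
print("B to D:", between_b_and_d, "| A to E:", between_a_and_e)
```

Note that counting direct connections gives each node's degree, which is related to but not the same as betweenness; under this assumed layout, B and C lead on both measures.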
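One widely used definition of this parameter is Freeman's centralization: sum the gap between the highest node score and every other node's score, then divide by the largest value that sum could take, which for normalized betweenness is the number of nodes minus one. The text does not state which formula produced its figure, so the sketch below is an assumption about the method rather than a reproduction of it. It expects betweenness scores already normalized to the range 0 to 1 (for an undirected graph with n nodes, the raw score divided by (n - 1)(n - 2) / 2), and the function name and example scores are hypothetical.

```python
def betweenness_centralization(normalized_scores):
    """Freeman-style centralization from normalized betweenness scores.

    Returns 1.0 for a perfect star (one node controls every path) and
    0.0 when every node has the same score.
    """
    n = len(normalized_scores)
    if n < 3:
        return 0.0  # betweenness is not meaningful for fewer than 3 nodes
    top = max(normalized_scores.values())
    return sum(top - score for score in normalized_scores.values()) / (n - 1)

# Hypothetical scores for two extreme five-node networks.
star = {"hub": 1.0, "a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}
ring = {node: 1 / 6 for node in "abcde"}

print(betweenness_centralization(star))
print(betweenness_centralization(ring))
```

The star and the ring mark the two ends of the scale; a real network of search Categories would be expected to land somewhere between them.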
Here are the Betweenness Centrality measures for the Search Categories:
As we see, Shopping enjoys the highest Betweenness Centrality measure across all categories. It is, however, closely followed by Restaurants. Navigation, Nutrition, and Automotives are next, with somewhat similar centrality measures. This indicates that these groups of Searches play similar roles in the communication across the network.
Here are the Closeness Centrality measures for the Categories:
This chart lays out the categories in terms of their ‘closeness’ to other nodes. As we see, Navigation is the most accessible: information needs to pass through the fewest nodes to reach Navigation. This also makes intuitive sense, since users searching in any other Category are likely to search for navigation as well. Nutrition, on the other hand, exhibits the lowest Closeness Centrality. This might be an indicator that it is used by just a certain niche of users, which also makes intuitive sense.
The beauty of Data Science is that different and seemingly disparate techniques can be applied to unearth findings hidden in the data. It is important, however, to know that there is a high risk of violating the assumptions that particular methods depend on. An emphasis needs to be placed on the cross-validation of such techniques.
As with any research and analysis, the hypothesis should be validated from several angles. A robust method of verifying hypotheses in the case of our particular analysis would be to use curated datasets of searches controlled by geolocation, where the search tendency of the users is already known. Another method would be to use datasets of various sizes and from cohorts of various durations. This would act as a buffer against seasonal trends that might be embedded in the dataset.
As much effort should be invested in verifying the analysis as in conducting it. It is worth remembering that no finding is still a finding.