A Tricky Merge Executed Simply in R

R has a vast array of useful and versatile functions that excel at statistical data analysis. These functions can come in quite handy for data wrangling as well, contrary to the public notion of R being un-suitable for manipulating data. This article demonstrates the use of the merge function through a case example in R.

While R is not as intuitive at data manipulation as other applications, such as SQL, it is surprisingly adept when the right functions are used with the right techniques.

Summary of the Dataset

The dataset in question comprises of roughly 50,000 rows of search query terms. Search query terms are the query keywords that the end user inputs in an attempt to search the internet. The other attributes of the dataset are the userid, the query term, and the category that the search query belongs to. Here’s a snippet of the dataset:

fake dataset

 

 

 

 

The userid is a unique identifier of every person who has made a search query. Userids are repeated every time that a user makes a search, hence we have several userids that appear numerous times. The dataset holds the search histories of approximately 8,000 users, with total searches amounting to approximately 50,000. The search query term attribute holds a record of every search query made by the users. The search query terms are divided across roughly 100 search categories, that are a sub – categorization of search query terms based on their nature. A few examples of search categories would be Automotives, Nutrition, and Healthcare.

Summary of the Task at Hand

We want to study the search behavior of particular users who began their search activity from a specific search category, Automotive being the category in our instance. We want to track search behaviors of users who began their activity by searching for automobiles. We want to see what other categories are being searched by these users.

While this sounds relatively simple, it is a bit more than just a where clause (in SQL terms). We want to get users whose first search began under Automotives, and then we want all their searches from there on. We want to ignore all search history for users whose first search was under any category other than Automotives.

An Overview of the Determined Approach

We want to begin by creating a subset of just the first searches for all userids involved, where the first search was for under the category Automotives. Once we create this subset, we can merge this with the original subset where the userid for both the subsets match. The resulting dataset will get us the search history of users who began their activity by searching for automobiles.

Step 1: Creating the subset

subset

 

 

 

 

We begin by assigning a chronological count number to every query made by every user. This way, we can sort out the first, or the third query from the rest of the dataset, as per the requirements of the analysis. Our analysis needs us to separate the first query from the rest.

The query index column was added in the following way:

datasetset_ordered$queryindex = 1

for (i in 1:nrow(datasetset_ordered)) {
  if (datasetset_ordered[i+1," userid "] ==
    datasetset_ordered[i," userid "]) {
 datasetset_ordered[i+1,"queryindex"] = datasetset_ordered[i, "queryindex"] + 1
 }
 }

We first made a new column in our dataset, called queryindex, and made all the observations as 1. We then wrote an if statement, where, if the next (i + 1) userid was equal to the current userid (i), then add a 1 to the queryindex. The loop sifts through every userid and adds a 1 to the queryindex if the previous userid is similar to the current one.

In this way, every subsequent query from the same userid gets assigned an index numbers that helps us identify the sequence of searches.

We now know the sequences of searches for every user. We can easily create a subset of the first searches of all users where the first search is from the category Automotives.

The code for separating first searches for Automotives:

subset = dataset_ordered[(dataset _ordered$category == Automotives’ & dataset _ordered$queryindex == 1),]

The code literally means create an object called subset, where the category column in dataset _ordered is Automotives and the queryindex is 1.

Step 2: Executing the merge

We want to merge the subset with the original dataset (dataset_ordered) in such a way that we get the observations from dataset_ordered only where the userid matches the userid from the subset.

Let’s take a look at the code for merging the subset dataset with subsequent queries based on the common column userid

mergedset = dataset_ordered[dataset_ordered [,"userid"] %in% subset[,"userid"],]

Here, we are creating an object called mergedset, and are taking only those observations from dataset_ordered where the userids in dataset ordered and the userids in subset are the same. We use the same syntax as we saw earlier, hence, we have the control of userid for both dataset_merged and subset inside brackets ([]).

The newly created object mergedset thus attains our goal of creating a dataset of queryterms where the user began his lifetime searching for a queryterm under the category Automotives.

The biggest takeaway here is that we executed the entire operation using ordinary R allocators. What’s important is knowing that the right functions need to be married to the right techniques.

Apurv currently works as a statistical analyst for Five Element Analytics, an analytics firm based in NY. He graduated from Hofstra University in 2015 with an MBA in Business Analytics.