This is one part of an analysis I did for a talk, “Data Science meets SEO”, at the SEO Campixx in Berlin on March 1st, 2018. My main focus was on looking at a larger amount of data and applying basic data science approaches. The whole series (in German) is available on my homepage.
While Google uses more than 200 ranking signals according to their own blog, only a fraction of these is available to us:
This data relies on 500 search queries, i.e. 4,890 results, given that I have removed Google domains from the results. The dataset is obviously limited, mainly because access to APIs containing backlink data costs money. The backlink data for this set comes from the beautiful German tool Sistrix. 500 queries are a tiny sample, and it is questionable whether this is enough to derive reliable conclusions. This is more about the approach itself, and some insights seem to be valid anyway.
This document has been written as a Markdown document in RStudio using R.
library(tidyverse)
library(digest)
all_data <- read.csv("/home/tom/data-science-seo/data-science-seo/data/all_data.csv")
(dataset <- all_data %>%
  select(hash, position, secure2, year, SPEED, backlinks_raw, backlinks_log,
         title_word_count, desc_word_count, content_word_count,
         content_tf_idf, content_wdf_idf, totalResults, SwarchVolume) %>%
  group_by(hash, position) %>%
  arrange(hash, position))
dataset$backlinks_log[!is.finite(dataset$backlinks_log)] <- NA
pairs(dataset[2:12])
Visualization is always great for identifying patterns, but as we can see here, there are no interesting patterns. A correlation matrix may disclose more. However, it shows all correlation coefficients, even those where p > 0.05.
cor(dataset[2:13], use = "complete.obs", method="pearson")
## position secure2 year SPEED
## position 1.000000000 -0.038356221 0.020731298 -0.02759370
## secure2 -0.038356221 1.000000000 -0.056481291 0.08849541
## year 0.020731298 -0.056481291 1.000000000 0.14196196
## SPEED -0.027593704 0.088495407 0.141961962 1.00000000
## backlinks_raw 0.018421465 0.087199717 0.027382804 0.05877407
## backlinks_log -0.059347963 0.235114256 -0.430682752 0.00779184
## title_word_count -0.010990383 -0.025079491 -0.004333514 -0.02713053
## desc_word_count -0.005155549 -0.055839650 0.013637362 -0.03073151
## content_word_count -0.005152130 -0.025663786 0.039421840 -0.06848420
## content_tf_idf -0.020095666 -0.037562046 0.030942858 0.02560561
## content_wdf_idf -0.017440558 -0.053175664 0.023868711 0.03538900
## totalResults -0.009186989 -0.007294347 -0.052269211 -0.03102022
## backlinks_raw backlinks_log title_word_count
## position 0.0184214648 -0.05934796 -0.010990383
## secure2 0.0871997167 0.23511426 -0.025079491
## year 0.0273828038 -0.43068275 -0.004333514
## SPEED 0.0587740667 0.00779184 -0.027130527
## backlinks_raw 1.0000000000 0.36591714 -0.213687201
## backlinks_log 0.3659171390 1.00000000 0.024943806
## title_word_count -0.2136872013 0.02494381 1.000000000
## desc_word_count -0.1125314088 -0.08507592 0.083838293
## content_word_count -0.0327708234 -0.05269331 0.009694726
## content_tf_idf -0.0253931259 -0.07992594 -0.058933273
## content_wdf_idf -0.0205034149 -0.06646933 -0.046400601
## totalResults -0.0002951899 0.03921995 0.001780916
## desc_word_count content_word_count content_tf_idf
## position -0.005155549 -0.005152130 -0.020095666
## secure2 -0.055839650 -0.025663786 -0.037562046
## year 0.013637362 0.039421840 0.030942858
## SPEED -0.030731506 -0.068484196 0.025605607
## backlinks_raw -0.112531409 -0.032770823 -0.025393126
## backlinks_log -0.085075921 -0.052693313 -0.079925941
## title_word_count 0.083838293 0.009694726 -0.058933273
## desc_word_count 1.000000000 0.179998982 -0.004068675
## content_word_count 0.179998982 1.000000000 -0.012556464
## content_tf_idf -0.004068675 -0.012556464 1.000000000
## content_wdf_idf -0.033608496 -0.021117337 0.941572002
## totalResults 0.019239501 -0.008309859 0.016201171
## content_wdf_idf totalResults
## position -0.01744056 -0.0091869894
## secure2 -0.05317566 -0.0072943470
## year 0.02386871 -0.0522692110
## SPEED 0.03538900 -0.0310202206
## backlinks_raw -0.02050341 -0.0002951899
## backlinks_log -0.06646933 0.0392199532
## title_word_count -0.04640060 0.0017809161
## desc_word_count -0.03360850 0.0192395009
## content_word_count -0.02111734 -0.0083098595
## content_tf_idf 0.94157200 0.0162011709
## content_wdf_idf 1.00000000 0.0127050011
## totalResults 0.01270500 1.0000000000
For those not looking at such numbers every day, correlation coefficients are numbers between -1 and +1, where 0 means no linear correlation at all and -1 or +1 mean a perfect correlation. Correlation is not cause and effect but at least something to look into. In this matrix, we see correlations where it would be strange not to see them, e.g., obviously, there should be a correlation between WDF/IDF in content and TF/IDF in content. However, we are mainly interested in the first column, the correlation of position with all other signals that we have here. Total Results and Search Volume are of course not ranking signals but are interesting to look at later.
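Since `cor()` reports every coefficient regardless of significance, a p value per pair can help when reading such a matrix. The following is a minimal sketch in base R, not the method used in the talk: `cor_with_p` is a hypothetical helper, and `toy` is made-up data standing in for the numeric columns of `dataset`.

```r
# Toy data standing in for the numeric columns of `dataset` (all values are made up).
set.seed(42)
toy <- data.frame(
  position = rep(1:10, times = 20),
  speed    = round(runif(200, 0, 100)),
  words    = round(rlnorm(200, meanlog = 5.5, sdlog = 1))
)
toy$tf_idf  <- runif(200)
toy$wdf_idf <- toy$tf_idf + rnorm(200, sd = 0.05)  # deliberately correlated with tf_idf

# Hypothetical helper: pairwise Pearson correlations together with their p values.
cor_with_p <- function(df) {
  vars <- names(df)
  k <- length(vars)
  r <- matrix(NA_real_, k, k, dimnames = list(vars, vars))
  p <- r
  diag(r) <- 1
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      ct <- cor.test(df[[i]], df[[j]])   # drops incomplete pairs for this pair only
      r[i, j] <- r[j, i] <- unname(ct$estimate)
      p[i, j] <- p[j, i] <- ct$p.value
    }
  }
  list(r = r, p = p)
}

res <- cor_with_p(toy)

# Keep only coefficients with p < 0.05; everything else (and the diagonal) becomes NA.
significant <- res$r
significant[is.na(res$p) | res$p >= 0.05] <- NA
round(significant, 3)
```

Note that `cor.test()` removes missing values per pair, whereas `use = "complete.obs"` in `cor()` removes every row with any missing value, so the coefficients can differ slightly between the two approaches.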
summary(dataset$SPEED)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 60.00 73.00 69.18 81.00 100.00 15
Pages in this sample have a mean speed value of 69.18, but the mean seems to be pulled down by outliers since the median is 73. Unfortunately, R does not provide a built-in function for the statistical mode (mode() returns the storage type of an object), so we have to use our own function for this.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Mode(dataset$SPEED)
## [1] 80
This means that the most frequent page speed value is 80. Let’s look at a histogram:
hist(dataset$SPEED, breaks=50)
Now, is there a correlation between speed and position on a Google SERP?
boxplot(position~SPEED, data=dataset)
Looks like there is nothing in it. Let’s do the correlation test with cor.test:
cor.test(dataset$position, dataset$SPEED, method="pearson")
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$SPEED
## t = -2.0287, df = 4929, p-value = 0.04255
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0567504633 -0.0009718484
## sample estimates:
## cor
## -0.02888364
The correlation between speed and position is statistically significant (p = 0.043) but negligible in size (r ≈ -0.03), so in practice we don’t see a relationship between speed and position on a Google SERP. One reason for this could be that having a slow page is like not having showered: everyone will leave, while no one notices when you have showered. In other words, a fast page will not help you, but a slow page may hurt you. However, we only look at the top 10 results. Maybe there would be a higher impact in the top 100.
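To put the test output above into perspective, the t statistic and p value can be recomputed from the coefficient and the degrees of freedom alone. This small sketch only uses the two numbers printed by `cor.test()` above; it shows that with almost 5,000 observations even a tiny coefficient crosses the 0.05 threshold, while the share of explained variance stays below 0.1 percent.

```r
r  <- -0.02888364   # sample estimate reported by cor.test() above
df <- 4929          # degrees of freedom reported above (n - 2)

t_stat <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic for H0: true correlation is 0
p_val  <- 2 * pt(-abs(t_stat), df)       # two-sided p value
r_sq   <- r^2                            # share of variance in position "explained" by speed

# t and p should reproduce the cor.test() output above
round(c(t = t_stat, p = p_val, r_squared = r_sq), 5)
```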
cor.test(dataset$position,dataset$secure2)
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$secure2
## t = -3.214, df = 4944, p-value = 0.001318
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07343806 -0.01781376
## sample estimates:
## cor
## -0.0456613
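So is HTTPS a strong ranking signal in this sample? No. It is not: the correlation is statistically significant (p = 0.0013) but very weak (r ≈ -0.05).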
hist(dataset$backlinks_raw, main="Histogram Backlinks", xlab="# of Backlinks")
As we can see, most sites have comparatively few backlinks, while very few sites have a very large number (the site with the most backlinks according to Sistrix is YouTube). Let’s plot this against positions:
plot(dataset$position,dataset$backlinks_raw, main="Position versus Backlinks", xlab = "Position", ylab="Number of Backlinks")
The outlier YouTube has positions everywhere on the SERP but is a bit less frequent on #1. Maybe YouTube has an artificial ranking, so we should remove this outlier. But there are also other non-Google properties with many links.
cor.test(dataset$position,dataset$backlinks_raw)
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$backlinks_raw
## t = 1.1236, df = 4938, p-value = 0.2612
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01190534 0.04385469
## sample estimates:
## cor
## 0.01598711
The result is not statistically significant since p is above 0.05. However, the data also seems to be skewed due to outliers, so we will use logarithms to normalize the backlink data.
hist(dataset$backlinks_log, main="Histogram of Backlinks (Log)", xlab="# of Backlinks (Log)")
This looks much more “normal”, close to a normal distribution that we would probably expect for backlinks.
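As a rough illustration of what the transformation does, here is a sketch with made-up backlink counts. How `backlinks_log` was actually computed (including the base of the logarithm) is not stated in this document, so `log10` is an assumption. The sketch also shows why the non-finite values had to be replaced with NA earlier: the logarithm of zero is -Inf.

```r
backlinks <- c(0, 3, 12, 250, 18000, 2500000)  # made-up counts, including a page without links

log_plain <- log10(backlinks)            # log10(0) is -Inf
log_plain[!is.finite(log_plain)] <- NA   # same clean-up as applied to backlinks_log above

log_shift <- log10(backlinks + 1)        # alternative: keeps pages without links as 0

data.frame(backlinks, log_plain, log_shift)
```

Replacing -Inf with NA removes pages without backlinks from any later correlation test, whereas the shifted variant keeps them. Which of the two is more appropriate depends on whether missing backlink data really means zero links.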
plot(dataset$position,dataset$backlinks_log, main="Position versus Backlinks (Log)", xlab="Position", ylab="Backlinks (Log)")
Unfortunately, we still don’t see a pattern here.
cor.test(dataset$position,dataset$backlinks_log)
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$backlinks_log
## t = -3.7327, df = 4785, p-value = 0.0001916
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08208686 -0.02559257
## sample estimates:
## cor
## -0.05388283
This time, we do see a very low correlation but we would have expected much more, right? Let’s look at this again later.
hist(dataset$year, breaks=25, main="Histogram Domain Ages", xlab="Year in which a domain was seen first")
Obviously, there are by far more older domains in the dataset which could result in the interpretation that they are more likely to rank. However, that is not the case.
boxplot(position~year,data = dataset, varwidth=TRUE, main="Boxplot of domain age and position", xlab="Domain Age", ylab="Position")
The boxplot displays the range of 50 percent of all results in a box with the median as a line. As we can see, the median of the oldest domains here is at position 6 whereas a domain from 2017 has a median of position 5. However, there are far fewer domains from 2017. Domains from 2018 rank worst.
cor.test(dataset$position,dataset$year)
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$year
## t = 1.4669, df = 4929, p-value = 0.1425
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.007027885 0.048772927
## sample estimates:
## cor
## 0.02088879
The correlation is not statistically significant, with a p value of 0.1425. But the plot alone shows that it is impossible to spot a relationship. For readers of the Google blog, this is not news: Google said years ago that domain age doesn’t play a role. Having said that, there are some other interesting observations.
plot(dataset$year,dataset$backlinks_log, main="Domain Age and Backlinks", xlab="Domain Age", ylab="Backlinks (Log)")
We could draw a line here that shows a relationship between the age of a domain and the number of backlinks, so the next step is to investigate whether we have a correlation.
cor.test(dataset$year,dataset$backlinks_log)
##
## Pearson's product-moment correlation
##
## data: dataset$year and dataset$backlinks_log
## t = -33.47, df = 4779, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4584541 -0.4125188
## sample estimates:
## cor
## -0.4357702
Looks like we do have a correlation for age of a domain and the number of total links! Actually, this is not completely surprising since the older a site, the more time it has to collect links :) Obviously, we do not really know the direction of relationship, but in this case, it would not make sense that sites get older, the more links they have.
plot(dataset$year,dataset$SPEED, main="Speed versus Domain Age", xlab="Domain Age", ylab="Speed")
We could also try to draw a line here although it is not as obvious as the one before.
cor.test(dataset$year,dataset$SPEED)
##
## Pearson's product-moment correlation
##
## data: dataset$year and dataset$SPEED
## t = 10.87, df = 4915, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1258000 0.1803932
## sample estimates:
## cor
## 0.1532135
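There is a weak positive correlation here (r ≈ 0.15): newer domains tend to have slightly higher speed values. The correlation matrix at the beginning also shows a very weak negative correlation between year and secure2 (about -0.06), so, assuming secure2 is 1 for HTTPS, we could conclude that website owners with old domains are slightly more likely to get an SSL certificate for their site.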
A study by Backlinko said that the average number 1 document on Google has 1,890 words. With my (much smaller) dataset, I cannot reproduce this. But let’s look at a plot first.
plot(dataset$position,dataset$content_word_count, main="Number of words in document and ranking position", xlab = "Position in SERP", ylab = "Number of words")
Looks like we cannot draw a line here.
cor.test(dataset$position,dataset$content_word_count)
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$content_word_count
## t = -0.49707, df = 4944, p-value = 0.6192
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03493254 0.02080510
## sample estimates:
## cor
## -0.007069213
The test did not deliver a significant result either. So let’s look at the raw data:
summary(dataset$content_word_count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 77.0 273.0 690.3 839.0 140910.0
The “average” document has 690 words but the median is at 273 words; unfortunately, Backlinko did not disclose what “average” they used. Also, they did not disclose how they got the word count. For this study, Python’s BeautifulSoup was used for scraping and extracting the content. I have also only looked at documents in Germany, so maybe more words are used in the US. But this seems unlikely. So let’s look at number 1 documents only:
summary(dataset$content_word_count[dataset$position==1])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 98.75 313.00 640.53 733.75 10469.00
The mean is lower whilst the median is higher. Looks like there is something in there, right?
summary(dataset$content_word_count[dataset$position==11])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 201.2 565.5 848.0 1222.8 4288.0
But if I write a bit more, then it seems to be more likely for me to get to position 11. In other words, writing more does not mean that I will get higher on the Google SERP. Let’s look at a log(wordcount):
boxplot(log(content_word_count)~position, data = dataset)
To sum up, in my small dataset, I cannot conclude that writing more will bring you to a higher position on a Google results page. And I cannot reproduce the number from backlinko.
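The comparison of single positions above can be extended to all positions at once. The following sketch uses made-up, log-normally distributed word counts (so the numbers do not reflect the real dataset) and only illustrates how median and mean per position could be tabulated with base R; with skewed data like this, the median is usually the more robust of the two.

```r
# Toy data: 10 positions with 50 documents each; word counts are simulated, not real.
set.seed(1)
toy <- data.frame(
  position = rep(1:10, each = 50),
  content_word_count = round(rlnorm(500, meanlog = 5.6, sdlog = 1.3)) + 1
)

medians <- tapply(toy$content_word_count, toy$position, median)
means   <- tapply(toy$content_word_count, toy$position, mean)

by_position <- data.frame(
  position = as.integer(names(medians)),
  median   = as.vector(medians),
  mean     = as.vector(means)
)
by_position
```

With the real data, `toy` would be replaced by `dataset`; a visible trend in the median column would be a better hint than a difference between two hand-picked positions.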
WDF/IDF has been one of the hottest things for many SEOs in the last years, so what if we look at exact match WDF/IDF?
plot(dataset$position,dataset$content_wdf_idf)
Does not look like we can see a pattern here.
cor.test(dataset$position,dataset$content_wdf_idf)
##
## Pearson's product-moment correlation
##
## data: dataset$position and dataset$content_wdf_idf
## t = -1.0225, df = 4745, p-value = 0.3066
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04327262 0.01361174
## sample estimates:
## cor
## -0.01484245
And our test also does not deliver a significant result. Based on this data, I would be hesitant to recommend WDF/IDF, also because we are missing all the stemming and other linguistic methods applied to text.
Why do we see so little in the aggregated data? Mainly because ranking is dynamic. Backlink data is not available for all documents, and in such cases, other ranking signals become more dominant. So we may want to find out for which keywords a ranking signal is statistically reliable.
cor_per_keyword_backlinks <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,backlinks_raw), p = cor.test(position,backlinks_raw)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
  distinct()
hist(cor_per_keyword_backlinks$cor, breaks=100, main="Histogram of Correlations of Backlinks/Positions on a Keyword Level", xlab="Correlation Coefficients")
This is the distribution of the correlation coefficients for all raw backlink data. We would expect a negative correlation (the larger the position number, i.e. the worse the ranking, the lower the number of backlinks), but we also see keywords where the opposite is the case. Unfortunately, the correlation coefficient alone is not reliable; we also need the p value. So let’s look only at results where p < 0.05:
cor_per_keyword_backlinks <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,backlinks_raw), p = cor.test(position,backlinks_raw)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
  distinct()
hist(cor_per_keyword_backlinks$cor[cor_per_keyword_backlinks$p<0.05], breaks=100, main="Histogram of Correlations of Backlinks/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")
We now see more valid (statistically significant) negative correlation coefficients than before, but also some positive ones. It is unlikely that having more backlinks will punish you in the top 10 on Google, so again we see that correlation is not cause and effect :)
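One more reason for caution with the filtered histogram: when hundreds of keywords are tested separately, some will pass p < 0.05 by chance alone. The sketch below is a base R illustration with simulated data in which backlinks are unrelated to position by construction; the keyword names, the helper logic and the guard against groups that are too small are all assumptions for illustration, not the code used for the talk.

```r
set.seed(7)
# Toy SERP data: 200 hypothetical keywords with 10 positions each.
# Backlink counts are random, so any "significant" correlation here is a false positive.
toy <- data.frame(
  hash = rep(sprintf("kw%03d", 1:200), each = 10),
  position = rep(1:10, times = 200),
  backlinks_raw = round(rlnorm(2000, meanlog = 6, sdlog = 2))
)

# One row per keyword; keywords with fewer than 3 complete pairs or constant values get NA.
per_keyword <- do.call(rbind, lapply(split(toy, toy$hash), function(d) {
  ok <- complete.cases(d$position, d$backlinks_raw)
  if (sum(ok) < 3 || sd(d$backlinks_raw[ok]) == 0) {
    return(data.frame(hash = d$hash[1], cor = NA_real_, p = NA_real_))
  }
  ct <- cor.test(d$position[ok], d$backlinks_raw[ok])
  data.frame(hash = d$hash[1], cor = unname(ct$estimate), p = ct$p.value)
}))

# Share of keywords that look "significant" although there is no relationship at all
mean(per_keyword$p < 0.05, na.rm = TRUE)

# Signs of the coefficients among those false positives
table(sign(per_keyword$cor[which(per_keyword$p < 0.05)]))
```

If the share of significant keywords in the real data is not clearly above what such a simulation produces, the positive coefficients in the histogram may simply be noise.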
Let’s do the same for speed:
cor_per_keyword_speed <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,SPEED), p = cor.test(position,SPEED)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
  distinct()
hist(cor_per_keyword_speed$cor[cor_per_keyword_speed$p<0.05], breaks=100, main="Histogram of Correlations of Speed/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")
Again, it is unlikely that having a slow page will improve your ranking on Google. So please take this with a grain of salt.
cor_per_keyword_https <- dataset %>%
  group_by(hash) %>%
  mutate(cor = cor(position,secure2), p = cor.test(position,secure2)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
  distinct()
hist(cor_per_keyword_https$cor[cor_per_keyword_https$p<0.05], breaks=100, main="Histogram of Correlations of HTTPS/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")
Same here.
cor_per_keyword_wdf_idf <- dataset %>%
  group_by(hash) %>%
  filter(!is.na(content_wdf_idf)) %>%
  mutate(cor = cor(position,content_wdf_idf), p = cor.test(position,content_wdf_idf)$p.value) %>%
  arrange(hash,position) %>%
  select(hash,cor,p) %>%
  distinct()
hist(cor_per_keyword_wdf_idf$cor[cor_per_keyword_wdf_idf$p<0.05], breaks=100, main="Histogram of Correlations of WDF-IDF/Positions on a Keyword Level", xlab="Correlation Coefficients below p<0.05")
There are only very few results with a significant correlation. Compared to the other results, I would be extremely careful about using exact match-based WDF/IDF tools for an analysis.
Now, let’s look at only one keyword, “player update”:
player_update <- dataset[ which(dataset$hash == "002849692a74103fa4f867b43ac3b088"),]
cor(player_update[2:12])
## position secure2 year SPEED
## position 1.000000000 -0.14213381 0.68232305 0.66909158
## secure2 -0.142133811 1.00000000 -0.12309149 0.03240948
## year 0.682323045 -0.12309149 1.00000000 0.13563725
## SPEED 0.669091578 0.03240948 0.13563725 1.00000000
## backlinks_raw -0.680702773 0.47885649 -0.36529823 -0.68277810
## backlinks_log -0.266761578 -0.11383705 -0.41150574 -0.34237645
## title_word_count -0.051752817 -0.86063153 0.06519164 -0.24155722
## desc_word_count 0.004887624 -0.41838105 0.33015892 -0.09250187
## content_word_count 0.236537896 -0.87914035 0.34344837 -0.27081845
## content_tf_idf -0.058025885 -0.40824829 -0.20100756 0.19626152
## content_wdf_idf -0.058025885 -0.40824829 -0.20100756 0.19626152
## backlinks_raw backlinks_log title_word_count
## position -0.6807028 -0.26676158 -0.05175282
## secure2 0.4788565 -0.11383705 -0.86063153
## year -0.3652982 -0.41150574 0.06519164
## SPEED -0.6827781 -0.34237645 -0.24155722
## backlinks_raw 1.0000000 0.60288490 -0.12780842
## backlinks_log 0.6028849 1.00000000 0.39313530
## title_word_count -0.1278084 0.39313530 1.00000000
## desc_word_count 0.0220935 0.04238384 0.44923620
## content_word_count -0.3810450 0.10522050 0.76857971
## content_tf_idf -0.1984321 0.03267363 0.35135135
## content_wdf_idf -0.1984321 0.03267363 0.35135135
## desc_word_count content_word_count content_tf_idf
## position 0.004887624 0.2365379 -0.05802589
## secure2 -0.418381048 -0.8791404 -0.40824829
## year 0.330158915 0.3434484 -0.20100756
## SPEED -0.092501869 -0.2708185 0.19626152
## backlinks_raw 0.022093496 -0.3810450 -0.19843212
## backlinks_log 0.042383844 0.1052205 0.03267363
## title_word_count 0.449236202 0.7685797 0.35135135
## desc_word_count 1.000000000 0.2758416 0.69725202
## content_word_count 0.275841618 1.0000000 0.12353875
## content_tf_idf 0.697252022 0.1235387 1.00000000
## content_wdf_idf 0.697252022 0.1235387 1.00000000
## content_wdf_idf
## position -0.05802589
## secure2 -0.40824829
## year -0.20100756
## SPEED 0.19626152
## backlinks_raw -0.19843212
## backlinks_log 0.03267363
## title_word_count 0.35135135
## desc_word_count 0.69725202
## content_word_count 0.12353875
## content_tf_idf 1.00000000
## content_wdf_idf 1.00000000
Whilst we can see correlations here, cor() does not offer p values. However, we can see that on a keyword level, signal impact looks different, and by looking at keywords of the same region (i.e. similar correlations for one signal), we may find other, more robust signals for that region.
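Even without p values, we can estimate how large a coefficient has to be for a single keyword to count as significant. The sketch below assumes ten results for the keyword (the actual number of rows in `player_update` is not shown here) and derives the critical Pearson coefficient for a two-sided test at the 5 percent level.

```r
n  <- 10                                  # assumed number of results for one keyword
df <- n - 2                               # degrees of freedom of the correlation test
t_crit <- qt(0.975, df)                   # two-sided 5% critical t value
r_crit <- t_crit / sqrt(t_crit^2 + df)    # smallest |r| that reaches p < 0.05
round(r_crit, 3)
```

Under this assumption, only coefficients with an absolute value above roughly 0.63 would be significant, so the correlations of position with year, SPEED and backlinks_raw in the matrix above (all close to 0.68) would just clear that bar, while most other entries in the first column would not.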
Looking at this small dataset, it was not possible to prove that some of the common SEO practices really have an impact as long as they are regarded as general advice that will always work. Having said that, this is a really small data set. Nevertheless, the reactions to the presentation were emotional, to pick my words carefully.
Unlike other participants, I do not sell SEO tools, and I also do not earn money selling SEO services. For some in the audience, it was like telling a Christian that Jesus never existed. But that’s not what I had said; I only said that based on the data I currently have, I don’t see that WDF/IDF has an impact on SEO. This does not mean that SEO consultants cannot be successful using WDF/IDF: my dataset is small. But these SEO consultants may also do a lot of other great stuff for a website and then think it was WDF/IDF that moved the needle. And of course, as soon as you believe in something, you only look at the data that supports your opinion (confirmation bias). Unfortunately, people love easy explanations for things they see. But there are no easy answers to complex questions.