
Wrong Number Of Results In Google Scrape With Python

I am learning web scraping and have run into a strange issue. My task is to search Google for news on a topic within a certain date range and count the number of results, but my script reports the wrong count.

Solution 1:

There are a couple of things causing this issue. First, Google wants the day and month parts of the date as two digits; it also expects a User-Agent string from some popular browser. The following code should work:

import requests, bs4

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
}
payload = {'as_epq': 'James Clark', 'tbs':'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015', 'tbm':'nws'}
r = requests.get("https://www.google.com/search", params=payload, headers=headers)

soup = bs4.BeautifulSoup(r.content, 'html5lib')
print(soup.find(id='resultStats').text)
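To see exactly what query string requests will send from that payload, you can preview it with the standard library's urlencode (a quick sanity check, not part of the original answer):

```python
from urllib.parse import urlencode

payload = {'as_epq': 'James Clark',
           'tbs': 'cdr:1,cd_min:01/01/2015,cd_max:01/01/2015',
           'tbm': 'nws'}

# requests builds the query string the same way urlencode does
url = "https://www.google.com/search?" + urlencode(payload)
print(url)
```

Spaces become `+` and the colons/commas in the `tbs` value are percent-encoded, which is fine; Google decodes them on its end.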

Solution 2:

To add to Vikas' answer, Google will also fail to use 'custom date range' for some user-agents. That is, for certain user-agents, Google will simply search for 'recent' results instead of your specified date range.

I haven't detected a clear pattern in which user-agents break the custom date range. Including a language code (such as fr-FR) seems to be a factor.

Here are some examples of user-agents that break cdr:

Mozilla/5.0 (Windows; U; Windows NT 6.1; fr-FR) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27

Mozilla/4.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)
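As Solution 1 notes, the `cd_min`/`cd_max` values must be zero-padded. A small helper (the name `cdr_tbs` is my own, not from the original answers) can build the `tbs` string from `datetime.date` objects so the padding is never wrong:

```python
from datetime import date

def cdr_tbs(start: date, end: date) -> str:
    # Google's custom date range expects zero-padded MM/DD/YYYY values
    fmt = "%m/%d/%Y"
    return f"cdr:1,cd_min:{start.strftime(fmt)},cd_max:{end.strftime(fmt)}"

print(cdr_tbs(date(2015, 1, 1), date(2015, 1, 1)))
# cdr:1,cd_min:01/01/2015,cd_max:01/01/2015
```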


Solution 3:

There's no need for Selenium; you're looking for this:

soup.select_one('#result-stats nobr').previous_sibling
# About 10,700,000 results

Code and full example:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": 'James Clark',  # query
    "hl": "en",          # lang
    "gl": "us",          # country to search from
    "tbm": "nws",        # news filter
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# without the "nobr" selector and previous_sibling it would also include the time taken: (0.41 seconds)
number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 10,700,000 results
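The extracted string still contains the "About … results" wording; if you want an integer count, a small parsing step works (a hypothetical helper, not part of the original answer):

```python
import re

def parse_result_count(text: str):
    # pull the first comma-grouped number out of e.g. "About 10,700,000 results "
    match = re.search(r"[\d,]+", text)
    return int(match.group(0).replace(",", "")) if match else None

print(parse_result_count("About 10,700,000 results "))
# 10700000
```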

Alternatively, you can achieve the same thing by using Google News Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to figure out which selectors get the job done, or why some of them don't return the data they should, bypass blocks from search engines, or maintain the parser over time.

Instead, you only need to iterate over structured JSON and get the data you want, fast.

Code to integrate for your case:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": 'James Clark',
  "tbm": "nws",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

number_of_results = results['search_information']['total_results']
print(number_of_results)

# 14300000

Disclaimer, I work for SerpApi.

