CopyPastor

Detecting plagiarism made easy.

Score: 1; Reported for: Exact paragraph match

Possible Plagiarism

Reposted on 2023-03-23
by Denis Skopa

Original Post

Original - Posted on 2023-02-05
by Denis Skopa



            

If your request gets blocked, you can try to solve the issue by adding `headers` that specify your [user-agent](https://www.whatismybrowser.com/detect/what-is-my-user-agent/). This helps Google recognize the request as coming from a user rather than a bot, so it doesn't get blocked:

```python
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
```

An additional step could be to [rotate user-agents](https://serpapi.com/blog/how-to-reduce-chance-of-being-blocked-while-web/#rotate-user-agents).
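For illustration, a minimal sketch of user-agent rotation with the standard `random` module; the strings in the pool below are only examples, any set of real browser user-agents will do:

```python
import random
import requests

# example pool of user-agent strings; extend it with real desktop/mobile values
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0",
]

# pick a different user-agent for every request
headers = {"User-Agent": random.choice(user_agents)}
html = requests.get("https://www.google.com/search", params={"q": "cars"}, headers=headers, timeout=30)
print(html.status_code)
```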
Another option would be to use a CAPTCHA solver, for example, [2captcha](https://2captcha.com/?ref=serpapi.com).
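As a rough sketch of how such a solver plugs in (this assumes the `2captcha-python` client; the API key, sitekey, and page URL below are placeholders you would take from your account and from the blocked page):

```python
# pip install 2captcha-python
from twocaptcha import TwoCaptcha

solver = TwoCaptcha("YOUR_2CAPTCHA_API_KEY")  # placeholder API key

# placeholder sitekey and URL: take them from the CAPTCHA shown on the blocked page
result = solver.recaptcha(
    sitekey="SITE_KEY_FROM_THE_PAGE",
    url="https://example.com/page-with-captcha"
)

# the solved token is then submitted with the form that triggered the CAPTCHA
print(result["code"])
```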
If you plan to continue using `selenium`, then use [selenium-stealth](https://pypi.org/project/selenium-stealth/). It helps bypass many bot-detection checks, including Cloudflare's CAPTCHA.
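A minimal sketch of how `selenium-stealth` is wired into a Chrome driver (the argument values follow the project's README; adjust them to your environment):

```python
# pip install selenium selenium-stealth
from selenium import webdriver
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options)

# patch the driver so common bot-detection checks (navigator.webdriver, etc.) pass
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://www.google.com/search?q=cars")
print(driver.title)
driver.quit()
```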
Using [non-token based pagination](https://python.plainenglish.io/pagination-techniques-to-scrape-data-from-any-website-in-python-779cd32bd514#5a76), you can dynamically extract results from every available page, no matter how many pages there are.
Check the code in the [online IDE](https://replit.com/@denisskopa/scrape-google-search-pagin-bs4#main.py).

```python
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Why do I only see the first 4 results?",  # query example
    "hl": "en",      # language
    "gl": "uk",      # country of the search, UK -> United Kingdom
    "start": 0,      # page number, starts from 0
    # "num": 100     # maximum number of results to return per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 5
page_num = 0

data = []

# pagination
while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # stop the loop when the page limit is reached
    if page_num == page_limit:
        break

    # go to the next page if there is a next page button, otherwise stop
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
```

Example output:

```json
[
  {
    "title": "How to Show Up on the First Page of Google Today - Neil Patel",
    "snippet": "And if you take a look at Google's first page results for “SEO Guide,” you'll ... it's the only way to understand how truly effective this strategy can be.",
    "links": "https://neilpatel.com/blog/first-page-google/"
  },
  {
    "title": "Cervical screening results - NHS",
    "snippet": "This means your risk of getting cervical cancer is very low. You do not need any further tests to check for abnormal cervical cells, even if you have had these ...",
    "links": "https://www.nhs.uk/conditions/cervical-screening/your-results/"
  },
  other results ...
]
```
____
Alternatively, you can use a third-party API like the [Google Search Engine Results API](https://serpapi.com/search-api) from SerpApi. It's a paid API with a free plan.
The difference is that it handles blocks (including CAPTCHA) from Google, so there's no need to create and maintain the parser yourself.
Code example:

```python
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

params = {
    "api_key": "...",                               # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",                             # serpapi parser engine
    "q": "Why do I only see the first 4 results?",  # search query
    "num": "100"                                    # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    # follow SerpApi's pagination link until there are no more pages
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
```

Output: exactly the same as in the bs4 solution.
Google Search can be parsed with the [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) web scraping library without `selenium`, since the data is not loaded dynamically via JavaScript. It also runs much faster than `selenium`, because there's no need to render the page in a browser.
To get information from all pages, you can paginate with an [infinite `while` loop](https://python.plainenglish.io/pagination-techniques-to-scrape-data-from-any-website-in-python-779cd32bd514#74dc). Try to avoid `for i in range()` pagination: it hardcodes the number of pages and therefore isn't reliable. If the page count changes (say, from 5 to 20), the pagination breaks.
Since the while loop is infinite, you need to set conditions for exiting it. You can use two conditions:

* the exit condition is the presence of a button to switch to the next page (it is absent on the last page); the presence can be checked by its CSS selector (in our case - `".d6cvqb a[id=pnnext]"`):

```python
# condition for exiting the loop in the absence of the next page button
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
```

* another solution is to add a limit on the number of pages to scrape, if there is no need to extract all of them:

```python
# condition for exiting the loop when the page limit is reached
if page_num == page_limit:
    break
```
When you request a site, it may decide that the request comes from a bot. To prevent this, send `headers` that contain a [`user-agent` in the request](https://serpapi.com/blog/how-to-reduce-chance-of-being-blocked-while-web/#user-agent); the site will then assume you are a user and return the page.
The next step could be to [rotate the `user-agent`](https://serpapi.com/blog/how-to-reduce-chance-of-being-blocked-while-web/#rotate-user-agents), for example switching between PC, mobile, and tablet, as well as between browsers (e.g. Chrome, Firefox, Safari, Edge, and so on). The most reliable approach is to combine rotating proxies, rotating user-agents, and a CAPTCHA solver.
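For illustration, a minimal sketch of rotating proxies together with user-agents using `requests`; the proxy endpoints below are hypothetical placeholders you would replace with your own pool:

```python
import random
import requests

# hypothetical proxy endpoints: replace with your own pool
proxies_pool = [
    "http://user:pass@111.111.111.111:8080",
    "http://user:pass@222.222.222.222:8080",
]

# example user-agent strings covering desktop and mobile
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.3 Mobile/15E148 Safari/604.1",
]

proxy = random.choice(proxies_pool)
headers = {"User-Agent": random.choice(user_agents)}

# route the request through the chosen proxy with the chosen user-agent
html = requests.get(
    "https://www.google.com/search",
    params={"q": "cars"},
    headers=headers,
    proxies={"http": proxy, "https": proxy},
    timeout=30,
)
print(html.status_code)
```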
Check the full code in the [online IDE](https://replit.com/@denisskopa/scrape-google-search-page-limit-bs4#main.py).

```python
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "cars",     # query example
    "hl": "en",      # language
    "gl": "uk",      # country of the search, UK -> United Kingdom
    "start": 0,      # page number, starts from 0
    # "num": 100     # maximum number of results to return per page
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit for example
page_num = 0

data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

    # condition for exiting the loop in the absence of the next page button
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
```

Example output:

```json
[
  {
    "title": "Cars (2006) - IMDb",
    "snippet": "On the way to the biggest race of his life, a hotshot rookie race car gets stranded in a rundown town, and learns that winning isn't everything in life.",
    "links": "https://www.imdb.com/title/tt0317219/"
  },
  {
    "title": "Cars (film) - Wikipedia",
    "snippet": "Cars is a 2006 American computer-animated sports comedy film produced by Pixar Animation Studios and released by Walt Disney Pictures. The film was directed ...",
    "links": "https://en.wikipedia.org/wiki/Cars_(film)"
  },
  {
    "title": "Cars - Rotten Tomatoes",
    "snippet": "Cars offers visual treats that more than compensate for its somewhat thinly written story, adding up to a satisfying diversion for younger viewers.",
    "links": "https://www.rottentomatoes.com/m/cars"
  },
  other results ...
]
```
____
Also, you can use the [Google Search Engine Results API](https://serpapi.com/search-api) from SerpApi. It's a paid API with a free plan. The difference is that it handles blocks (including CAPTCHA) from Google, so there's no need to create and maintain the parser yourself.
Code example:

```python
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json

params = {
    "api_key": "...",    # serpapi key from https://serpapi.com/manage-api-key
    "engine": "google",  # serpapi parser engine
    "q": "cars",         # search query
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "num": "100"         # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

page_limit = 10
organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "title": result.get("title"),
            "snippet": result.get("snippet"),
            "link": result.get("link")
        })

    # condition for exiting the loop when the page limit is reached
    if page_num == page_limit:
        break

    # follow SerpApi's pagination link until there are no more pages
    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
```

Output:

```json
[
  {
    "title": "Rally Cars - Page 30 - Google Books result",
    "snippet": "Some people say rally car drivers are the most skilled racers in the world. Roger Clark, a British rally legend of the 1970s, describes sliding his car down ...",
    "link": "https://books.google.co.uk/books?id=uIOlAgAAQBAJ&pg=PA30&lpg=PA30&dq=cars&source=bl&ots=9vDWFi0bHD&sig=ACfU3U1d4R-ShepjsTtWN-b9SDYkW1sTDQ&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgcEAM"
  },
  {
    "title": "Independent Sports Cars - Page 5 - Google Books result",
    "snippet": "The big three American auto makers produced sports and sports-like cars beginning with GMs Corvette and Fords Thunderbird in 1954. Folowed by the Mustang, ...",
    "link": "https://books.google.co.uk/books?id=HolUDwAAQBAJ&pg=PA5&lpg=PA5&dq=cars&source=bl&ots=yDaDtQSyW1&sig=ACfU3U11nHeRTwLFORGMHHzWjaVHnbLK3Q&hl=en&sa=X&ved=2ahUKEwjPv9axu_b8AhX9LFkFHbBaB8c4yAEQ6AF6BAgaEAM"
  },
  other results...
]
```

        