Building a Search Engine Scraper with Streamlit

Web scraping is an effective technique to acquire data from the web with minimal manual effort

Web scraping enables us to acquire data from the web with minimal manual effort. However, the technique must be used with caution so that it doesn't degrade the performance of a website. Before proceeding, we'll cover a few tips that protect us from being blacklisted by a website as a consequence of degrading its performance.

In this article, we'll build a search engine scraper* using BeautifulSoup and Streamlit. The app takes a search string as input, scrapes the search results from the first page of Bing, and displays them. The app also returns a data frame of the search results. The Python code of the Streamlit app is below and is explained using comments.

* This app is built only to illustrate the process of web scraping. It is highly recommended to use the search engine’s API for extracting search results. Web scraping should only be the last resort.

The above code can be run by executing the following command in the terminal of a local machine.
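Assuming the code is saved as app.py (an illustrative filename), a Streamlit app is launched with:

```shell
streamlit run app.py
```

Streamlit then serves the app locally, typically at http://localhost:8501.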

Below is the output of the app in a local browser after executing the command. A data frame of the results can also be seen at the bottom of the output.

Image by author

All the tasks mentioned above demand a lot of manual effort, and that effort grows with the number of entities to search. Web scraping and APIs reduce this manual effort to a great extent; however, a human in the loop is still required to validate the results.

We’ll discuss a few questions that may arise in the mind of a beginner in web scraping.

1. What is requests.get()?

The 'requests' package enables us to handle HTTP requests in Python. Its get() function sends a GET request to a web server and returns a Response object.
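As a quick sketch, preparing a request without sending it shows the URL that get() would build from a base URL and query parameters:

```python
import requests

# requests.get(url, params=...) sends an HTTP GET request and returns a Response.
# Preparing the request (without sending it) shows the URL get() would build:
prepared = requests.Request(
    "GET", "https://www.bing.com/search", params={"q": "web scraping"}
).prepare()
print(prepared.url)

# Actually sending it (requires network access) would look like:
# response = requests.get("https://www.bing.com/search", params={"q": "web scraping"})
# response.text then holds the HTML of the results page
```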

2. What is a user agent?

A user agent describes the browser and operating environment being used. It enables the web server to send the content that best suits our browser or device. A browser on a mobile device has a different user agent from that of Google Chrome on a Windows system. Rotating user agents makes our requests appear to come from multiple devices, which protects us from being blacklisted.
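A minimal sketch of rotating user agents — the strings below are illustrative examples of a desktop and a mobile browser:

```python
import random

# Illustrative user-agent strings: desktop Chrome on Windows vs. mobile Safari on iPhone
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
]


def random_headers():
    # Pick a different user agent per request so traffic doesn't all look identical
    return {"User-Agent": random.choice(USER_AGENTS)}


# These headers would be passed as requests.get(url, headers=random_headers())
print(random_headers()["User-Agent"])
```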

3. What is BeautifulSoup?

BeautifulSoup parses HTML/XML documents, enabling us to extract data from web pages. We pass the content or text of a request's response to the BeautifulSoup class for it to parse; we can then extract elements and tags from the page's HTML.
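In a scraper, the input would be response.text; a literal HTML string keeps this sketch self-contained:

```python
from bs4 import BeautifulSoup

# A tiny HTML document standing in for a response's .text
html = "<html><body><h2>Hello</h2><p>Some text</p></body></html>"

soup = BeautifulSoup(html, "html.parser")  # parse the HTML
print(soup.find("h2").get_text())          # text of the first <h2> tag
print(soup.find("p").get_text())           # text of the first <p> tag
```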

4. How do we select the right HTML tag?

We can identify the right tag by inspecting the source code of a web page. The 'Inspect' functionality is available in all popular web browsers. We'll discuss how to inspect a web page's source code.

Step 1: We'll right-click on the element of the web page that we want to scrape and select 'Inspect'. For the search engine scraper, we want to scrape the title, URL, and description of each search result, so we'll right-click on the URL of the first search result.

Image by author

We shouldn't select only <h2> (the title of a search result) or <p> (the description of a search result). Doing so may pick up other <h2>s and <p>s on the page that aren't associated with the search results, and it may also make it difficult to map a list of <h2>s to a list of <p>s.
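Selecting the parent container of each result and reading the <h2>, <a>, and <p> from inside it keeps each title paired with its own description. A sketch, assuming each result lives in a <li class="b_algo"> container as Bing's markup did at the time of writing:

```python
from bs4 import BeautifulSoup

# Two results plus an unrelated <p>, mimicking a results page
html = """
<li class="b_algo"><h2><a href="https://a.example">First</a></h2><p>Desc A</p></li>
<li class="b_algo"><h2><a href="https://b.example">Second</a></h2><p>Desc B</p></li>
<p>Unrelated footer text</p>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for result in soup.find_all("li", class_="b_algo"):
    # Searching inside each container keeps title and description paired,
    # and skips tags that aren't part of a search result
    pairs.append((result.find("h2").get_text(), result.find("p").get_text()))
print(pairs)
```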

Image by author

Hence, web scraping helps us reduce the manual effort required to acquire data from the web. We can reuse the same code to extract data from a website as long as the structure of the website stays the same. The code to fetch the HTML of a web page is similar for all websites; what differs is the process of cleaning and extracting the required data from that HTML.

However, I highly recommend using search engine APIs for extracting search results; web scraping should be a last resort. Even when you opt for web scraping, it must be done without degrading the performance of a website and should be used only to extract publicly available information.
