By Mark Andrews, Alibaba Cloud Tech Share Author. Tech Share is Alibaba Cloud's incentive program to encourage the sharing of technical knowledge and best practices within the cloud community.
Today we will be writing our own basic headless web scraping "bot" in Python with Beautiful Soup. Headless generally means web browsing with no GUI (Graphical User Interface). In this lesson, we will be doing everything through the terminal command line.
We will deploy an Alibaba Cloud Elastic Compute Service (ECS) burstable type t5 nano instance running CentOS 7. We will be utilizing the Requests and Beautiful Soup 4 modules.
You should be familiar with launching an Alibaba Cloud instance running CentOS for this tutorial. If you are not sure how to set up an ECS instance, check out this tutorial. If you have already purchased one, check out this tutorial to configure your server accordingly.
I have deployed a CentOS instance for this lesson as it is a very lightweight OS. For this project, the less bloat the better. Basic terminal command line knowledge is recommended as we are not going to be using a GUI on this project.
It's always a good idea to update everything on a particular instance. First, let's update all the packages to the latest versions so we won't run into any issues down the road.
sudo yum update
We will be using Python for our basic web scraping "bot". I admire the language for its relative simplicity and of course the wide variety of modules available to play around with. In particular, we will be using the Requests and Beautiful Soup 4 modules.
CentOS 7 ships with Python 2.7 as its system Python, so if Python 3 is not already on your instance, install Python 3 and Pip. First we are going to install IUS, which stands for Inline with Upstream Stable. A community project, IUS provides Red Hat Package Manager (RPM) packages for some newer versions of select software. Then move forward with installing python36u and python36u-pip.
sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install python36u
sudo yum install python36u-pip
Pip is a package management system used to install and manage software packages, such as those found in the Python Package Index. Pip is a replacement for easy_install.
I've run into some headaches in the past with installing pip rather than python36u-pip, so be aware that the pip package is for Python 2.7 and python36u-pip is for Python 3.
Nano is a basic text editor that is useful in applications such as this. Let's install Nano.
sudo yum install nano
Now we need to install the Python packages we will be using today, Requests and Beautiful Soup 4.
We will install these through pip.
pip3.6 install requests
pip3.6 install beautifulsoup4
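Because a machine can carry both a Python 2 and a Python 3 pip, it can be worth confirming that the packages landed in the interpreter you plan to run. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of the original tutorial.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Show which Python is running this check (major, minor)
print(sys.version_info[:2])

# "bs4" is the import name of the beautifulsoup4 package
for name in ("requests", "bs4"):
    print(name, "installed" if is_installed(name) else "missing")
```

Run it with the same interpreter you will use for the bot (python3.6 in this lesson). If either module is reported as missing, the install most likely went to a different Python.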
Requests is a Python module that allows us to navigate to a web page with the Requests .get method.
Requests allows you to send HTTP/1.1 requests, all programmatically through the Python script. There's no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic. We will be focusing on the Requests .get method today to grab a web page source.
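To make the point about query strings and form encoding concrete, here is a small sketch that builds two requests without sending them. The example.com endpoints and the parameter names are assumptions chosen purely for illustration; preparing a request this way only shows what Requests would put on the wire.

```python
import requests

# Hypothetical endpoint; prepare() builds the request but sends nothing
search = requests.Request(
    "GET",
    "http://example.com/search",
    params={"q": "beautiful soup", "page": 2},
).prepare()
# The params dictionary has been encoded into the URL's query string
print(search.url)

# Hypothetical form fields; the data dictionary is form-encoded into the body
form = requests.Request(
    "POST",
    "http://example.com/login",
    data={"user": "mark", "note": "a b"},
).prepare()
print(form.body)
print(form.headers["Content-Type"])
```

In the bot itself we only need the simpler requests.get call, but the same params argument is accepted there as well.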
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
We will be using Beautiful Soup 4 with the standard html.parser in Python to parse and organize the data from the web page source we will be getting with Requests. In this tutorial we will use the Beautiful Soup "prettify" method to organize our data in a more human readable way.
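Before pointing Beautiful Soup at a live page, it may help to see what it does with a tiny document. This is a sketch only: the sample_html string is a hand-written stand-in for a downloaded page source, not anything from a real site.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for a downloaded source (assumption)
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello <a href='/about'>about</a></p></body></html>"
)

# Parse with the html.parser that ships with Python
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the markup as a string with one tag per line, indented
pretty = soup.prettify()
print(pretty)

# The parse tree can also be navigated and searched
title_text = soup.title.string
link_targets = [a.get("href") for a in soup.find_all("a")]
print(title_text, link_targets)
```

The bot in this lesson only uses prettify(), but the last few lines hint at the searching that the same soup object makes possible.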
Let's make a folder called "Python_apps". Then, we will change our present working directory to Python_apps.
mkdir Python_apps
cd Python_apps
Now comes the fun part! We can write our Python Headless Scraper Bot. We will be using Requests to go to a URL and grab the page source. Then we will use Beautiful Soup 4 to parse the HTML source into a semi readable format. After doing this we will save the parsed data to a local file on the instance. Let's get to work.
To save the result, we will use Python's built-in open() function and the file object's write() method to store the page data on our local hard drive.
Open up Nano or a text editor of your choice in the terminal and make a new file named "bot.py". I find Nano to be perfectly adequate for basic text editing functions.
First add our imports.
############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup
The code below defines several global variables.
####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("WHAT URL WOULD YOU LIKE TO SCRAPE? (INCLUDE http:// OR https://) ")
####### REQUEST GET METHOD for URL
r = requests.get(url)
####### DATA FROM REQUESTS.GET
data = r.text
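Whether the scheme is typed by the user or added by the script, it should end up in the URL exactly once. The sketch below shows one way to handle both cases and to avoid hanging on a slow server. The helper names normalize_url and fetch_page, the default scheme, and the 10 second timeout are assumptions for illustration, not part of the original script.

```python
from urllib.parse import urlparse

import requests


def normalize_url(raw_url, default_scheme="http"):
    """Return raw_url with a scheme, adding default_scheme only when none is present."""
    cleaned = raw_url.strip()
    if urlparse(cleaned).scheme in ("http", "https"):
        return cleaned
    return default_scheme + "://" + cleaned


def fetch_page(raw_url, timeout=10):
    """Download a page and return its text, raising on HTTP error statuses."""
    response = requests.get(normalize_url(raw_url), timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With these helpers, data = fetch_page(url) could stand in for the two lines that call requests.get and read r.text.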
Now let's turn that global var "data" into a BS4 object so we can format it with the BS4 prettify method.
####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()
Let's print these variables in the terminal first. This shows us what data is going to be written to the local file before we actually write it.
print(source)
You will get the source in a big chunk of text first. This is very hard for a human to decipher, so we are going to turn to Beautiful Soup for some help with formatting. Calling the prettify() method puts each tag on its own indented line, which makes the source much easier to read. Then we print the source after the BS4 prettify() method.
print(pretty_source)
After running the code, you should get a prettified format of the HTML source of the page you entered in the terminal at this point.
Now let's save that file to our local hard drive on the Alibaba Cloud ECS instance. For this we need to first open the file in write mode.
To do this we pass the string "w" as the second argument to the open() function.
####### BUILD THE FILE NAME: REMOVE THE SCHEME, REPLACE ANY SLASHES
file_name = url.replace("https://", "").replace("http://", "").replace("/", "_") + "_scrapped.txt"
####### OPEN A NEW FILE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
local_file = open(file_name, "w")
### GET RID OF ENCODING ISSUES: USE THIS OPEN() LINE INSTEAD ############
#local_file = open(file_name, "w", encoding="utf-8")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
####### CLOSE FILE
local_file.close()
In the above block of code we made a variable that creates and opens a file named after the URL we entered earlier, with "_scrapped.txt" concatenated on. The first argument to open() is the file name on the local disk. We remove the "https://" and "http://" prefixes from the file name; if we leave them in, the file name is invalid because it contains slashes. The second argument is the mode, in this case write.
We then write to the file through the "local_file" variable with the .write method, passing the "pretty_source" variable as an argument. If the text needs to be written as UTF-8 to save to the local file properly, use the commented-out line instead. Then we close the local text file.
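One thing to keep in mind when building the file name: str.strip() treats its argument as a set of characters rather than a prefix, so "https://stackoverflow.com".strip("https://" + "http://") returns "ackoverflow.com". A sketch of an alternative follows, using urllib.parse from the standard library and a with block that closes the file automatically. The helper names url_to_filename and save_text are hypothetical.

```python
from urllib.parse import urlparse


def url_to_filename(url, suffix="_scrapped.txt"):
    """Build a file name from a URL's host and path, without the scheme."""
    parsed = urlparse(url)
    # netloc is empty when the URL has no scheme, so fall back to the raw string
    name = (parsed.netloc + parsed.path) if parsed.netloc else url
    # Slashes would be read as directory separators, so swap them out
    return name.strip("/").replace("/", "_") + suffix


def save_text(file_name, text):
    """Write text to file_name as UTF-8; the with block closes the file for us."""
    with open(file_name, "w", encoding="utf-8") as local_file:
        local_file.write(text)
```

In the bot, save_text(url_to_filename(url), pretty_source) would then cover the open, write, and close steps in one call.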
Let's run the code and see what happens.
python3.6 bot.py
You will be asked to enter a URL to scrape. Let's try https://www.wikipedia.org. Let the bot work its magic for a minute. We should now have the decently formatted source code from that website saved in our local working directory as a .txt file.
The final code for this project should look like this.
print("*" * 30 )
print("""
#
# SCRIPT TO SCRAPE AND PARSE DATA FROM
# A USER INPUTTED URL. THEN SAVE THE PARSED
# DATA TO THE LOCAL HARD DRIVE.
""")
print("*" * 30 )
############################################################ IMPORTS
import requests
from bs4 import BeautifulSoup
####### REQUESTS TO GET PAGE : BS4 TO PARSE DATA
#GLOBAL VARS
####### URL FOR SITE TO SCRAPE
url = input("ENTER URL TO SCRAPE (INCLUDE http:// OR https://): ")
####### REQUEST GET METHOD for URL
r = requests.get(url)
####### DATA FROM REQUESTS.GET
data = r.text
####### MAKE DATA VAR BS4 OBJECT
source = BeautifulSoup(data, "html.parser")
####### USE BS4 PRETTIFY METHOD ON SOURCE VAR NEW VAR PRETTY_SOURCE
pretty_source = source.prettify()
print(source)
print(pretty_source)
####### BUILD THE FILE NAME: REMOVE THE SCHEME, REPLACE ANY SLASHES
file_name = url.replace("https://", "").replace("http://", "").replace("/", "_") + "_scrapped.txt"
####### OPEN A NEW FILE IN WRITE MODE WITH "W" TO VAR LOCAL_FILE
local_file = open(file_name, "w")
### GET RID OF ENCODING ISSUES: USE THIS OPEN() LINE INSTEAD ############
#local_file = open(file_name, "w", encoding="utf-8")
####### WRITE THE VAR PRETTY_SOURCE TO FILE
local_file.write(pretty_source)
####### CLOSE FILE
local_file.close()
We have learned how to build a basic headless web scraping "bot" in Python with Beautiful Soup 4 on an Alibaba Cloud Elastic Compute Service (ECS) instance with CentOS 7. We used Requests to get a particular web page's source code and then parsed the data with Beautiful Soup 4, which let us format the text for better human readability. Finally, we saved the scraped web page source code to a local text file on the instance.