Crawl Website in Python
This page explains how to use Python to extract the title information from a live website. The code provided is fairly simple; to use it we only need to be comfortable with a terminal and have basic programming knowledge. Resources and libraries used:
A terminal window
Python3 installed and accessible via the terminal window
PIP, the official Python package manager
requests - a popular and simple HTTP library
Beautiful Soup - a library used to parse HTML and extract information with ease
10 minutes to understand and type the commands
Let's start writing code.
Check Python is installed
Python is installed by default on macOS and Linux systems and must be downloaded and installed on all Windows versions. Once it is properly installed, we can start the Python console by typing python in the terminal.
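A quick check in the terminal (a minimal sketch; the exact version string depends on your installation):

```
$ python --version
Python 3.x.x

$ python    # start the interactive Python console
>>>
```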
Install libraries
Requests - a simple HTTP library for Python, built for human beings.
Beautiful Soup - Python library for pulling data out of HTML and XML files.
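Both libraries can be installed with PIP from the terminal (a minimal sketch, assuming pip points to your Python3 installation):

```
$ pip install requests
$ pip install beautifulsoup4
```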
Write code in Python Console
The first step is to import the libraries used in our code:
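A minimal sketch of the imports, assuming both packages are installed as above:

```python
import requests
from bs4 import BeautifulSoup
```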
Once the libraries are imported, we can use all the helpers they expose. The following code snippet defines a variable that holds the website address and downloads the page using the requests library.
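A sketch of that step; the URL below is only a placeholder and can be replaced with any live address:

```python
# Placeholder address - replace with the site you want to crawl
url = 'https://www.python.org'

# Download the page; the raw HTML is available in page.text
page = requests.get(url)
```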
At this point, the page content can be loaded into BeautifulSoup4 and used to extract the title.
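A minimal parsing sketch, continuing from the page object defined above:

```python
# Parse the downloaded HTML and print the <title> tag content
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.title.string)
```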
This simple tutorial should encourage curious minds to search for other hot Python topics and try to code more useful things. Here is a short list of suggestions (a starting sketch follows the list):
List all images of a web page
List the inner links (other pages on the same domain)
List the outer links (external websites)
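A possible starting point for these suggestions, reusing the soup and url objects from the snippets above (this helper logic is an assumption for illustration, not part of the original tutorial):

```python
from urllib.parse import urljoin, urlparse

# List all images of the page
for img in soup.find_all('img'):
    print(urljoin(url, img.get('src', '')))

# Split the links into inner (same domain) and outer (external) ones
domain = urlparse(url).netloc
for a in soup.find_all('a', href=True):
    link = urljoin(url, a['href'])
    if urlparse(link).netloc == domain:
        print('inner link:', link)
    else:
        print('outer link:', link)
```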
Links & Resources
Python - the official website
Python Cheatsheet - this site should make you curious
Join AppSeed - for support and production-ready starters