Introduction to Python crawlers

First, what is a crawler

Crawler: a program that automatically fetches information from the Internet and extracts the parts that are valuable to us.

Second, the Python crawler architecture

The Python crawler architecture consists of five main parts: the scheduler, the URL manager, the web page downloader, the web page parser, and the application (the valuable data that was crawled).

Scheduler: the equivalent of a computer's CPU; it coordinates the work of the URL manager, the downloader, and the parser.

URL manager: keeps track of the URLs waiting to be crawled and the URLs already crawled, so that no URL is crawled twice and the crawler does not go around in circles. It can be implemented in three main ways: in memory, in a database, or in a cache database.
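
The in-memory variant can be sketched with two sets, one for URLs waiting to be crawled and one for URLs already crawled. This is only a minimal sketch, written for Python 3; the class name `UrlManager` and its method names are hypothetical and do not come from any library.

```python
class UrlManager:
    """In-memory URL manager (illustrative; all names are assumptions)."""

    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        """Queue a URL unless it is empty or already known."""
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        """Queue several URLs, skipping duplicates."""
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        """Return True while there is still something to crawl."""
        return len(self.new_urls) > 0

    def get_new_url(self):
        """Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Because a URL moves from the first set to the second when it is handed out, adding it again later has no effect, which is what prevents repeated and circular crawling.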

Web page downloader: downloads a web page from the URL passed in and converts it to a string. The options are urllib2 (a Python 2 standard library module), which supports logins, proxies, and cookies, and requests (a third-party package).

Web page parser: parses the web page string and extracts the information we need, either by string matching or by walking a DOM tree. The available parsers are regular expressions (intuitive: the page is treated as a string and the valuable information is pulled out by fuzzy matching, but this becomes very difficult once the document is complicated), html.parser (built into Python), Beautiful Soup (a third-party library that can use either Python's built-in html.parser or lxml as its parsing backend, which makes it more powerful than the others), and lxml (a third-party library that can parse both XML and HTML). html.parser, Beautiful Soup, and lxml all parse the page as a DOM tree.
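
To make the contrast concrete, the following sketch extracts links once with a regular expression and once with the standard library's html.parser. It assumes Python 3 (where the module is `html.parser`); the names `extract_hrefs_regex`, `LinkCollector`, and `extract_links` are made up for this illustration.

```python
import re
from html.parser import HTMLParser


def extract_hrefs_regex(html):
    """Fuzzy matching: grab double-quoted href values from <a> tags."""
    return re.findall(r'<a\s[^>]*href="([^"]*)"', html)


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Structured parsing: feed the page to the parser and return its links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The regular expression is shorter but only copes with one quoting style and gives no access to the link text, while the parser-based version follows the tag structure of the page.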

Application: An application composed of useful data extracted from a web page.

Let's use a diagram to explain how the scheduler works:
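
The same flow can also be sketched in code. The example below is a simplified, assumption-laden illustration in Python 3: the downloader and parser are stand-ins that work on a small hypothetical in-memory "site" (`FAKE_PAGES`) rather than the real network, and the function names are invented for this sketch.

```python
import re

# Hypothetical in-memory "site": URL -> HTML, standing in for real downloads.
FAKE_PAGES = {
    "/index": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/index">home</a>',
    "/b": "no links here",
}


def download(url):
    """Downloader stand-in: return the page as a string, or None on failure."""
    return FAKE_PAGES.get(url)


def parse(html):
    """Parser stand-in: return the URLs linked from the page."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(root_url, max_pages=10):
    """Scheduler: pull a URL, download it, parse it, queue the new URLs."""
    pending = [root_url]   # URLs to be crawled, in discovery order
    seen = {root_url}      # every URL ever queued, to avoid repeats and loops
    crawled = []           # the "application" data: here just the visited URLs
    while pending and len(crawled) < max_pages:
        url = pending.pop(0)
        html = download(url)
        if html is None:
            continue       # skip pages that failed to download
        crawled.append(url)
        for link in parse(html):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return crawled
```

Calling `crawl("/index")` visits each of the three pages once even though they link back to one another, because the scheduler only queues URLs it has not seen before.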

Third, three ways to download web pages with urllib2


#!/usr/bin/python
# -*- coding: UTF-8 -*-
import cookielib
import urllib2

# Address of the page to download (left empty in the source; set it before running)
url = ""

# First method: pass the URL straight to urlopen
response1 = urllib2.urlopen(url)
print "First method"
# Get the status code; 200 indicates success
print response1.getcode()
# Get the length of the web page content
print len(response1.read())

# Second method: build a Request object so that headers can be added
print "The second method"
request = urllib2.Request(url)
# Simulate a Mozilla browser for crawling
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib2.urlopen(request)
print response2.getcode()
print len(response2.read())

# Third method: install an opener that can handle cookies
print "The third method"
cookie = cookielib.CookieJar()
# Give urllib2 the ability to handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()
print len(response3.read())
# Print the cookies collected during the request
print cookie
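
The script above targets Python 2. In Python 3, urllib2 was split into urllib.request and urllib.error, and cookielib was renamed http.cookiejar. A rough sketch of the same three approaches under Python 3 might look as follows; the function names are hypothetical, and the third function calls the opener directly instead of installing it globally, which is a deliberate simplification rather than a literal translation.

```python
import http.cookiejar
import urllib.request


def fetch_plain(url):
    """Method 1: call urlopen directly."""
    with urllib.request.urlopen(url) as response:
        return response.getcode(), response.read()


def fetch_with_header(url, user_agent="Mozilla/5.0"):
    """Method 2: build a Request so a User-Agent header can be attached."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    with urllib.request.urlopen(request) as response:
        return response.getcode(), response.read()


def fetch_with_cookies(url):
    """Method 3: use an opener that stores cookies in a CookieJar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    with opener.open(url) as response:
        return response.getcode(), response.read(), jar
```

Each function returns the status code and the raw bytes of the page, so `len(body)` gives the content length as in the original script; the third one also returns the cookie jar for inspection.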

Fourth, installing the third-party library Beautiful Soup
Beautiful Soup: a third-party Python library for extracting data from XML and HTML. The official website address is

1. Install Beautiful Soup

Open cmd (the command prompt), change to the Scripts folder in the Python installation directory (Python 2.7 here), and enter dir to check that pip.exe is present. If it is, Beautiful Soup can be installed with the pip command that ships with Python by entering the following command:

pip install beautifulsoup4

2. Test whether the installation succeeded
Write a Python file containing:

import bs4
print bs4

Run the file. If it prints the bs4 module information without an error, the installation succeeded.
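
On Python 3, print is a function, so the check above would be written as `print(bs4)`. As an alternative that does not raise an error when the package is missing, the sketch below uses the standard library's importlib; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("bs4 installed:", is_installed("bs4"))
```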

Fifth, parsing HTML with Beautiful Soup


#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
from bs4 import BeautifulSoup
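# The tags and link addresses of this sample were lost in the source.
# They are restored here on the assumption that it is the standard
# "three sisters" sample from the Beautiful Soup documentation.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""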
# Create a BeautifulSoup parse object
soup = BeautifulSoup(html_doc, "html.parser", from_encoding="utf-8")

# Get all links
links = soup.find_all('a')
print "all links"
for link in links:
    print link.name, link['href'], link.get_text()

# Find the link with one specific address
# (the href value is empty in the source; set it to one of the link addresses in html_doc)
print "Get specific URL address"
link_node = soup.find('a', href="")
print link_node.name, link_node['href'], link_node['class'], link_node.get_text()

# Find the first link whose href matches a regular expression
print "regex match"
link_node = soup.find('a', href=re.compile(r"ti"))
print link_node.name, link_node['href'], link_node['class'], link_node.get_text()

# Find the first <p> tag with class "story" (class_ avoids the Python keyword)
print "Get the text of the P paragraph"
p_node = soup.find('p', class_='story')
print p_node.name, p_node['class'], p_node.get_text()