Crawling a Website with Python / by Robert Walker

Today I had a seemingly simple request: map all of the Cintas locations along with the Geosyntec locations.  Okay - easy!  I'll go to Cintas' website, click on a map of their locations, grab the JSON source, and map it.  

But of course Cintas doesn't actually have a map!  What do they have?  A nested series of URLs that require a minimum of three clicks in a browser to reach the address of a single location (see example).  With locations in most states and provinces, that was not a task I wanted to do manually.  So... time to program!

I wrote this script to crawl their URLs and grab the information I needed.  At first I was getting some errors... after all, I was making hundreds of requests per minute (which can look a lot like a DDoS attack).  The website sometimes became unresponsive for a minute, or a link was dead (404) or erroring out (500), so I included a very handy retry wrapper from SaltyCrane.

After using this script I had a very useful list of Cintas locations to use for geocoding.

So let's review the script!

Import Libraries

import urllib.request
from bs4 import BeautifulSoup
import re

I used urllib to make the URL request; BeautifulSoup to parse the returned HTML; and regular expressions for cleaning up some of the content.
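
If you haven't used these three together before, here is a tiny smoke test of my own (not part of the Cintas script) showing the same fetch / parse / clean pattern against example.com:

import urllib.request
from bs4 import BeautifulSoup
import re

# Fetch a page, parse it, and clean up the text we pull out of it.
resp = urllib.request.urlopen('http://example.com')
soup = BeautifulSoup(resp.read(), 'html.parser')
title = soup.title.string                      # e.g. "Example Domain"
clean = re.sub(r'\s+', ' ', title).strip()     # collapse stray whitespace
print(clean)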

Create Configurations

baseURL = 'http://www.cintas.com/local/'
locations = {
    'canada': {
        'provinces': ['ab', 'bc', 'on', 'qc']
    },
    'usa': {
        'states': ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA',
                   'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
                   'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT',
                   'VA', 'WA', 'WV', 'WI', 'WY']
    }
}
# file to store results
f = open('CintasLocations.txt', 'w')

I listed all 50 states - Cintas didn't have locations in all of them, but this way I wouldn't miss any.  Requesting a state with no locations sometimes returned a 404, which the retry/error-handling wrapper below takes care of.  We also open a text file to save the results.
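
Just to show what one of those misses looks like to urllib before we get to the wrapper, here is a quick illustration of my own (the two-letter code is deliberately fake):

import urllib.request

try:
    urllib.request.urlopen('http://www.cintas.com/local/usa/XX')  # hypothetical state code with no page
except urllib.request.HTTPError as e:
    print('Skipping - server responded with', e.code)  # e.g. 404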

Use Wrapper and URL Request Functions for Error Handling

import time
from functools import wraps

def retry(ExceptionToCheck, tries=4, delay=3, backoff=2, logger=None):
    """Retry calling the decorated function using an exponential backoff.

    http://www.saltycrane.com/blog/2009/11/trying-out-retry-decorator-python/
    original from: http://wiki.python.org/moin/PythonDecoratorLibrary#Retry

    :param ExceptionToCheck: the exception to check. may be a tuple of
        exceptions to check
    :type ExceptionToCheck: Exception or tuple
    :param tries: number of times to try (not retry) before giving up
    :type tries: int
    :param delay: initial delay between retries in seconds
    :type delay: int
    :param backoff: backoff multiplier e.g. value of 2 will double the delay
        each retry
    :type backoff: int
    :param logger: logger to use. If None, print
    :type logger: logging.Logger instance
    """
    def deco_retry(f):

        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck as err:
                    msg = "%s, Retrying in %d seconds..." % (str(err), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print(msg)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)

        return f_retry  # true decorator

    return deco_retry

@retry(urllib.request.URLError, tries=4, delay=3, backoff=2)
def urlopen_with_retry(url_to_try):
    return urllib.request.urlopen(url_to_try)

I did not write this wrapper - but I did update it for Python 3.  It was very useful!
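
To see the wrapper in action outside of the crawl, here is a little demonstration of my own (not from SaltyCrane's post) - a deliberately flaky function that fails twice, gets retried with a growing delay, and then succeeds on the third attempt:

# Assumes the retry decorator defined above is in scope.
attempts = {'count': 0}

@retry(ValueError, tries=4, delay=1, backoff=2)
def flaky():
    # Simulate a transient failure on the first two calls.
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ValueError('simulated transient failure')
    return 'ok'

print(flaky())  # prints two "Retrying in ..." messages, then 'ok'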

Build the URLs, Requests, and Data Extraction

for k, v in locations.items():
    for i, j in v.items():
        for a in j:
            url = baseURL + k + "/" + a
            print(url)

            try:
                doc = urlopen_with_retry(url)
                parsed_doc = BeautifulSoup(doc.read(), 'html.parser')
                # print(parsed_doc.body.find('ul', attrs={'class': 'locations'}))

                for child in parsed_doc.body.find('ul', class_='locations').find_all('li'):
                    loc_link = child.find('a')['href']
                    print(loc_link)
                    loc_doc = urlopen_with_retry(loc_link)
                    parsed_loc_doc = BeautifulSoup(loc_doc.read(), 'html.parser')
                    address = parsed_loc_doc.body.find(string="Address").parent.parent
                    if len(address.span.contents) > 1:
                        street = address.span.contents[0]
                    else:
                        street = ""
                    city = address.contents[-1]
                    city = re.sub(r'\t+', '', city).strip()  # remove stray tab characters

                    print(street + ", " + city)
                    f.write(street + ", " + city + '\n')
            except urllib.request.HTTPError as e:
                print(e.args)

f.close()

First I loop through my configuration and build a URL to request.  Then I parse the HTML and return just the content I need.  From the first request I want the URLs of the individual locations; from the second request I want the address of each location.
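
The trickiest part is the find(string="Address").parent.parent hop on the location page.  Here is a toy reconstruction using simplified markup of my own (the real Cintas page had more wrapping elements) that shows why the street comes out of the span and the city out of the container's last text node:

from bs4 import BeautifulSoup
import re

# Simplified stand-in for a location page (illustrative only).
html = """
<div>
  <h3>Address</h3>
  <span itemprop="streetAddress">123 Example Way<br/>Suite 200</span>
\t\tCincinnati, OH 45202</div>
"""

parsed = BeautifulSoup(html, 'html.parser')

# The string "Address" sits inside the <h3>; two .parent hops reach the <div>
# that also holds the street <span> and the trailing city text.
address = parsed.find(string="Address").parent.parent
street = address.span.contents[0]                        # '123 Example Way'
city = re.sub(r'\t+', '', address.contents[-1]).strip()  # 'Cincinnati, OH 45202'
print(street + ", " + city)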

All Together

import urllib.request
from bs4 import BeautifulSoup
import re

baseURL = 'http://www.cintas.com/local/'
locations = {
    'canada': {
        'provinces': ['ab', 'bc', 'on', 'qc']
    },
    'usa': {
        'states': ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA',
                   'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ',
                   'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT',
                   'VA', 'WA', 'WV', 'WI', 'WY']
    }
}

# WORKFLOW
# Create Links
# Get list (ul, class="locations")
# Follow link in list
# Get Address (span, itemprop="streetAddress"); requires going up to parent also

# file to store results
f = open('CintasLocations.txt', 'w')

import time
from functools import wraps

def retry(ExceptionToCheck, tries=4, delay=3, backoff=2, logger=None):
    """Retry calling the decorated function using an exponential backoff.

    http://www.saltycrane.com/blog/2009/11/trying-out-retry-decorator-python/
    original from: http://wiki.python.org/moin/PythonDecoratorLibrary#Retry

    :param ExceptionToCheck: the exception to check. may be a tuple of
        exceptions to check
    :type ExceptionToCheck: Exception or tuple
    :param tries: number of times to try (not retry) before giving up
    :type tries: int
    :param delay: initial delay between retries in seconds
    :type delay: int
    :param backoff: backoff multiplier e.g. value of 2 will double the delay
        each retry
    :type backoff: int
    :param logger: logger to use. If None, print
    :type logger: logging.Logger instance
    """
    def deco_retry(f):

        @wraps(f)
        def f_retry(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return f(*args, **kwargs)
                except ExceptionToCheck as err:
                    msg = "%s, Retrying in %d seconds..." % (str(err), mdelay)
                    if logger:
                        logger.warning(msg)
                    else:
                        print(msg)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            return f(*args, **kwargs)

        return f_retry  # true decorator

    return deco_retry

@retry(urllib.request.URLError, tries=4, delay=3, backoff=2)
def urlopen_with_retry(url_to_try):
    return urllib.request.urlopen(url_to_try)

for k, v in locations.items():
    for i, j in v.items():
        for a in j:
            url = baseURL + k + "/" + a
            print(url)

            try:
                doc = urlopen_with_retry(url)
                parsed_doc = BeautifulSoup(doc.read(), 'html.parser')
                # print(parsed_doc.body.find('ul', attrs={'class': 'locations'}))

                for child in parsed_doc.body.find('ul', class_='locations').find_all('li'):
                    loc_link = child.find('a')['href']
                    print(loc_link)
                    loc_doc = urlopen_with_retry(loc_link)
                    parsed_loc_doc = BeautifulSoup(loc_doc.read(), 'html.parser')
                    address = parsed_loc_doc.body.find(string="Address").parent.parent
                    if len(address.span.contents) > 1:
                        street = address.span.contents[0]
                    else:
                        street = ""
                    city = address.contents[-1]
                    city = re.sub(r'\t+', '', city).strip()  # remove stray tab characters

                    print(street + ", " + city)
                    f.write(street + ", " + city + '\n')
            except urllib.request.HTTPError as e:
                print(e.args)

f.close()

Conclusion

This little script grabbed all 417 addresses for me to geocode in just under an hour!  Super easy, and the results were very good!