Here’s a breakdown of each component we used to get the title:
Beautiful Soup is powerful because our Python objects match the nested structure of the HTML document we are scraping.
To get the text of the first <a> tag, enter this:
soup.body.a.text # returns '1'
To get the title within the HTML’s body tag (denoted by the “title” class), type the following in your terminal:
soup.body.p.b # returns <b>Body's title</b>
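To pull out just the string inside that element, rather than the tag itself, we can append .text (a small sketch, assuming the same doc.html as above):

soup.body.p.b.text # returns "Body's title"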
For deeply nested HTML documents, navigation can quickly become tedious. Luckily, Beautiful Soup comes with a search function, so we don't have to navigate to retrieve HTML elements.
Searching for Elements by Tag
The find_all() method takes an HTML tag as a string argument and returns a list of elements that match the given tag. For example, if we want all the a tags in doc.html, we can do the following:
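A minimal sketch; the exact list returned depends on the contents of doc.html:

all_a_tags = soup.find_all("a")  # returns a list of every <a> Tag in the document
print(all_a_tags[0].text)  # '1', matching the first <a> tag we navigated to earlier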
We've covered the most popular ways to get tags and their attributes. Sometimes, especially for less dynamic web pages, we just want the text from the page. Let's see how we can get it!
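One option is get_text(), which collects all the human-readable text in the document with the tags stripped out. A quick sketch; the separator and strip keyword arguments are optional:

soup.get_text()  # the document's text as one string
soup.get_text(separator="\n", strip=True)  # each string on its own line, whitespace trimmed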
By inspecting the HTML, we can learn how to access the book's URL, cover image, title, rating, price, and other fields. Let's write a function that scrapes a book item and extracts its data:
# Imports used by the full script
import csv
import re
import time

import requests
from bs4 import BeautifulSoup


def scrape(source_url, soup):  # Takes the base URL and the parsed page as params
    # Find all the book entries, each wrapped in an <article class="product_pod"> tag
    books = soup.find_all("article", class_="product_pod")
    # Iterate over each book article tag
    for each_book in books:
        info_url = source_url + "/" + each_book.h3.find("a")["href"]
        cover_url = source_url + "/catalogue" + \
            each_book.a.img["src"].replace("..", "")
        title = each_book.h3.find("a")["title"]
        # can also be written as: each_book.h3.find("a").get("title")
        rating = each_book.find("p", class_="star-rating")["class"][1]
        price = each_book.find("p", class_="price_color").text.strip().encode(
            "ascii", "ignore").decode("ascii")
        availability = each_book.find(
            "p", class_="instock availability").text.strip()
        # Invoke the write_to_csv function
        write_to_csv([info_url, cover_url, title, rating, price, availability])
def write_to_csv(list_input):
    # The scraped info will be written to a CSV here.
    try:
        # newline="" prevents blank rows on Windows
        with open("allBooks.csv", "a", newline="") as fopen:  # Open the csv file.
            csv_writer = csv.writer(fopen)
            csv_writer.writerow(list_input)
    except OSError:
        # Signal failure if the file can't be opened or written
        return False
def browse_and_scrape(seed_url, page_number=1):
    # Extract the base URL - we will use it to build the image and info routes
    url_pat = re.compile(r"(http://.*\.com)")
    source_url = url_pat.search(seed_url).group(0)
    # page_number from the argument gets formatted into the URL and fetched
    formatted_url = seed_url.format(str(page_number))
    try:
        html_text = requests.get(formatted_url).text
        # Prepare the soup
        soup = BeautifulSoup(html_text, "html.parser")
        print(f"Now Scraping - {formatted_url}")
        # This if clause stops the recursion when there is no next page
        if soup.find("li", class_="next") is not None:
            scrape(source_url, soup)  # Invoke the scrape function
            # Be a responsible citizen by waiting before you hit again
            time.sleep(3)
            page_number += 1
            # Recursively invoke the same function with the incremented page number
            browse_and_scrape(seed_url, page_number)
        else:
            scrape(source_url, soup)  # The script exits here
            return True
        return True
    except Exception as e:
        return e
As the final piece of the puzzle, we kick off the scraping flow. We define the seed_url and call browse_and_scrape() to get the data. This happens under the if __name__ == "__main__": block:
if __name__ == "__main__":
    seed_url = "http://books.toscrape.com/catalogue/page-{}.html"
    print("Web scraping has begun")
    result = browse_and_scrape(seed_url)
    if result is True:
        print("Web scraping is now complete!")
    else:
        print(f"Oops, That doesn't seem right!!! - {result}")
If you'd like to learn more about the if __name__ == "__main__": block, check out a guide on how it works.
Run the script in your terminal and you'll get output like this:
$ python scraper.py
Web scraping has begun
Now Scraping - http://books.toscrape.com/catalogue/page-1.html
Now Scraping - http://books.toscrape.com/catalogue/page-2.html
Now Scraping - http://books.toscrape.com/catalogue/page-3.html
.
.
.
Now Scraping - http://books.toscrape.com/catalogue/page-49.html
Now Scraping - http://books.toscrape.com/catalogue/page-50.html
Web scraping is now complete!
After the run completes, the scraped data sits in allBooks.csv, one row per book:

http://books.toscrape.com/a-light-in-the-attic_1000/index.html,http://books.toscrape.com/catalogue/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg,A Light in the Attic,Three,51.77,In stock
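If you want to double-check the results, a few lines of Python can read the file back. This helper is a hypothetical addition, not part of the tutorial script:

import csv

with open("allBooks.csv") as fopen:
    for row in csv.reader(fopen):
        if not row:  # skip any blank lines
            continue
        info_url, cover_url, title, rating, price, availability = row
        print(f"{title} | {price} | {availability}")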
In this tutorial, we learned the ethics of writing good web scrapers. We then used Beautiful Soup to extract data from an HTML file using Beautiful Soup's object properties and its various methods like find(), find_all(), and get_text(). We then built a scraper that retrieves a book list online and exports it to CSV.
Web scraping is a useful skill that helps in various activities, such as extracting data much like an API would, performing QA on a website, checking a website for broken URLs, and more. What's the next scraper you're going to build?