The Project
In the digital age, data is often considered the new gold. For avid readers, knowing which books have left a lasting impact in the 21st century is invaluable. Goodreads, a leading platform for book enthusiasts, offers a treasure trove of information, including user-generated rankings and reviews for books spanning various genres.
Our journey begins with a simple yet ambitious goal: to compile a comprehensive list of the top 1000 books of the 21st century, as rated by Goodreads users. This list will not only serve as a valuable resource for readers seeking their next literary adventure but will also provide insights into the literary landscape of the past two decades.
To achieve this, we’ll employ simple but effective Python data scraping code, a technique that allows us to extract structured data from websites. We’ll combine the HTML from the Goodreads website with a popular scraping library, BeautifulSoup, to navigate multiple levels of web pages within Goodreads, locate the information we need, and systematically pull the desired data on these literary gems into a dataframe we can clean and analyze later.
But data scraping is not merely about extracting data; it’s a journey of discovery. Along the way, we’ll encounter challenges, refine our techniques, and build a dataset that reflects the collective wisdom of Goodreads’ vast community of readers. We’ll explore the intricacies of Goodreads’ book pages, collecting not only titles and authors but also ratings, reviews, page counts, genres, and publication details. Let’s see how this testament to the power of data and the love of literature comes together.
The Objectives
Considerations Before Scraping Data
Before diving into a data scraping project, it’s essential to address several key questions. Some of these are functional in nature, such as determining whether a public API is available for data retrieval. Ethical considerations also come into play: it’s crucial to assess whether the data is publicly accessible or owned by a third party, as overlooking these aspects can lead to complications. While ethical scrutiny is vital for all data scraping projects, in our case we’re working with a public website containing user-generated preferences about external content, so the level of scrutiny can be less stringent than what’s required for corporate or business-related scraping endeavors.
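As a quick functional check, one lightweight way to see what a site permits is to read its robots.txt before scraping anything. Here is a minimal sketch using the standard library’s urllib.robotparser (illustrative only; it does not replace reviewing the site’s terms of service):
from urllib.robotparser import RobotFileParser
# Read the site's robots.txt and check whether the list pages may be fetched
rp = RobotFileParser("https://www.goodreads.com/robots.txt")
rp.read()
list_url = "https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century"
print(rp.can_fetch("*", list_url))  # True if a generic user agent is allowed to crawl this path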
Defining Data Extraction Strategy
Now, let’s outline our strategy for data extraction. Our initial focus will be on retrieving fundamental book details, which include:
- Book title
- Author
- Average Goodreads rating
Moving beyond these basics, we’ll consider the availability of other data elements that can enhance future analysis. It’s important to keep in mind that our scraping approach is flexible: we can continuously refine it, capturing any missed data during subsequent iterations with minimal adjustments to our code. After careful deliberation, we’ve identified the following additional elements from Goodreads that we aim to collect:
- Number of ratings
- Number of reviews
- Page count
- Genres
- Publication date
- Counts of quotes, discussions, and questions
Now that we’ve identified the data we intend to extract from each book, let’s delve into the execution phase, where we’ll detail how our code will obtain this information.
Importing Python Libraries
To begin writing our code, we first need to import the libraries we’ll be utilizing. We’ll use the following Python libraries in Google Colaboratory to complete our project:
from bs4 import BeautifulSoup
import requests
import pandas as pd
Book List Webpage Collection
We’ll first need to collect all of the URLs that lead to the individual webpage for each book. Goodreads lists the books in batches of 100 (1-100, 101-200, etc.), so to collect our 1,000 books we’ll need to gather the 100 book URLs from each of the first 10 pages of the “Best Books of the 21st Century” list.
To do this, we’ll use two loops. The first loop captures the book webpage links from the 21st century book list, pages 1 through 10. Inside it, we’ll embed a second loop to extract the books (1-100) from each list page. First, we provide the URL, with a variable element for the page number, allowing for our initial loop:
# Loop for Initial Scraping
gr_pages = 10
i = 99
j = 0
GoodReadsBest = []
for page_num in range(1, gr_pages + 1):
    # Construct the URL for the current page
    good_url = f"https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century?page={page_num}"
    # Send a GET request to the page
    good_webpage = requests.get(good_url, headers=HEADERS)
    good_soup = BeautifulSoup(good_webpage.content, 'html.parser')
    # Find all the book links on the current page
    good_links = good_soup.find_all("a", attrs={"class": "bookTitle"})
The variables i and j defined at the outset of our code will be used in the next loop. We’re also adding the HEADERS element, which supplies my User-Agent with each request and provides a record of my data scrape. I’ve also pulled the necessary HTML attributes to gather the titles and individual URLs from each page we’re scraping.
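For reference, HEADERS and base_url are defined once, ahead of the loops. A minimal sketch of what they might look like follows; the User-Agent string is a placeholder to replace with your own, and base_url is assumed to be the Goodreads root used to complete the relative book links:
# Placeholder request headers and root URL (example values, not the originals)
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; goodreads-top-1000-scraper)"}
base_url = "https://www.goodreads.com"  # relative hrefs from the list pages are appended to this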
Now, we deploy a second loop, nested inside the first, that calls each individual book URL, parses it again, and lines the books up for individual data extraction:
# Loop through the book links and extract information from individual book pages
for i in range(j, i + 1):
    page_url = base_url + good_links[i].get("href")
    new_goodpage = requests.get(page_url, headers=HEADERS)
    # Retry the request until we receive a successful (status 200) response
    while new_goodpage.status_code != 200:
        new_goodpage = requests.get(page_url, headers=HEADERS)
    new_goodsoup = BeautifulSoup(new_goodpage.content, "html.parser")
I added a check in the loop to make sure that we’re able to connect with the URLs. If we’re connected, our HTTP request will come back with status code 200; if it doesn’t, the loop will try again until we’re able to connect and extract the page for a given book. Using “href” pulls the URL for us, not just the link text. Since our dataframe will index the data starting at 0, I’ve made sure to incorporate this into our loop so we’ll be able to better clean the data once it’s collected.
Individual Book Data Collection
We now have the code to collect the links from each of the list pages (1-10), and to use that data to get to the individual book pages (rank 1-100, index 0-99). Now we need to get the data for the fields we want extracted for each book in our top 1,000 list, embedded in our second loop.
For Title:
# Extract information from the book page
book_title_elem = new_goodsoup.find("h1", attrs={"class": "Text Text__title1"})
if book_title_elem:
    book_title = book_title_elem.text
else:
    book_title = None
For Author:
author_elem = new_goodsoup.find("span", attrs={"class": "ContributorLink__name"})
if author_elem:
    author = author_elem.text
else:
    author = None
Contingent Loop
When I first wrote the full loop, I encountered two problems that needed solutions. The first was that, on occasion, the request made for an individual data point would return no text; in other words, the output wouldn’t resolve to a text element, which would disrupt the loop and halt the scrape. Additionally, I ran this code in a rural area, with internet that would intermittently drop out. That goes unnoticed during regular web surfing or streaming, but when code runs for more than a few minutes straight, the bandwidth can become unreliable. To combat this, I added an if/else that returns a None value if the request doesn’t return a text element, or if the internet drops out while trying to extract the information.
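That pattern can also be wrapped in small helpers. The sketch below is purely illustrative (safe_get and safe_text are hypothetical functions, not part of the original loop); it shows how a dropped connection and a missing element can both fall back to None:
import requests
from bs4 import BeautifulSoup

# Hypothetical helpers illustrating the None-on-failure pattern
def safe_get(url, headers):
    # Return a parsed page, or None if the connection drops mid-request
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        return BeautifulSoup(resp.content, "html.parser")
    except requests.exceptions.RequestException:
        return None

def safe_text(soup, tag, class_name):
    # Return the element's text, or None if the page or element is missing
    if soup is None:
        return None
    elem = soup.find(tag, attrs={"class": class_name})
    return elem.text if elem else None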
While the initial code collected the majority of the data, it also left us with None values in a number of the records collected. To overcome this issue, I wrote another loop to go back through the data and re-collect any of the books that were identified as having None values. I included a range of 3, so the code will make up to three additional passes to collect the missing data without retrying indefinitely:
# Loop for Revisiting Pages with None Values
for revisit_count in range(3):
    none_indices = [index for index, value in enumerate(GoodReadsBest) if None in value]
    for index_to_revisit in none_indices:
        page_num = (index_to_revisit // 100) + 1
        adjusted_index = index_to_revisit % 100
        good_url = f"https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century?page={page_num}"
        good_webpage = requests.get(good_url, headers=HEADERS)
        good_soup = BeautifulSoup(good_webpage.content, 'html.parser')
        good_links = good_soup.find_all("a", attrs={"class": "bookTitle"})
        page_url = base_url + good_links[adjusted_index].get("href")
        new_goodpage = requests.get(page_url, headers=HEADERS)
        if new_goodpage.status_code == 200:
            new_goodsoup = BeautifulSoup(new_goodpage.content, "html.parser")
            # Extract information from the book page
            book_title_elem = new_goodsoup.find("h1", attrs={"class": "Text Text__title1"})
            if book_title_elem:
                book_title = book_title_elem.text
            else:
                book_title = None
            author_elem = new_goodsoup.find("span", attrs={"class": "ContributorLink__name"})
            if author_elem:
                author = author_elem.text
            else:
                author = None
Unfortunately, this final loop did add extra time to the run. However, the extra time (minutes, not hours) was worth it to gather the totality of the data that we wanted. Additionally, it overcame weaknesses in both the initial code and the internet connection over which I was performing our data scrape.
Finally, we want to put this data somewhere that we can utilize later. So, we’ll use Pandas to build a dataframe we can export to a csv file in the future:
# Regular df load in
df = pd.DataFrame(GoodReadsBest, columns=['Title', 'Author', 'Rating', 'No. of Ratings',
                                          'No. of Reviews', 'Pages', 'Genres',
                                          'Published', 'Quote-Disc-Quest Raw'])
df.head()
This gave us our top 1,000 books of the 21st century from goodreads.com, with a dataframe shape of (1000, 9).
Index & Numeric Data
During our data scrape, we identified the need to reset the index so it correctly reflects the Goodreads rankings. By undertaking this initial step in the data cleaning process, we pave the way for a more precise and well-structured dataset that will be the foundation for our subsequent analyses and visualizations. There is also a leftover column holding the old index, so we’ll be sure to drop that column as well.
# Reset index to 1-1000 so it matches the Goodreads ranking
df.reset_index(drop=True, inplace=True)  # Reset the index and drop the old index column
df.index = df.index + 1
columns_to_drop = ['Unnamed: 0']
df.drop(columns=columns_to_drop, inplace=True)
Now that our index is aligned with the Goodreads rankings, our next focus is extracting numerical data. All the information we collected from the HTML was in text format, making it unusable for any numerical calculations or analyses. To address this, we’ll specifically target the number of ratings, reviews, and pages, and parse the numeric values within our ‘Quote-Disc-Quest Raw’ field; that last step will yield three additional columns of numerical data, essential for statistical analysis.
Our approach involves straightforward formulas and functions to remove text characters and commas from the Ratings and Reviews fields. After this transformation, we’ll have both the numeric value and its corresponding text description. Our goal is to extract the numeric part and convert it into a float data type for further analysis.
# Extract the numeric portion of the Ratings and Reviews fields, strip commas,
# convert to float, then re-apply thousands separators for readability
def format_float(value):
    return '{:,.0f}'.format(value)

# Ratings
df[['No. of Ratings', 'Ratings']] = df['No. of Ratings'].str.extract(r'([0-9,]+)\s*(\D*)')
df['No. of Ratings'] = df['No. of Ratings'].str.replace(',', '').astype(float)
df['No. of Ratings'] = df['No. of Ratings'].apply(format_float)

# Reviews
df[['No. of Reviews', 'Reviews']] = df['No. of Reviews'].str.extract(r'([0-9,]+)\s*(\D*)')
df['No. of Reviews'] = df['No. of Reviews'].str.replace(',', '').astype(float)
df['No. of Reviews'] = df['No. of Reviews'].apply(format_float)
Now, let’s address the number of pages for each of our top 1,000 books. Currently, this data is also in text form, containing a numerical value, the word ‘pages,’ and the book’s format. This step will introduce an additional categorical data field for potential future use.
Our first task is to create the new ‘Format’ column, positioning it right after the ‘Pages’ field. Following that, we’ll split the text we extracted, specifically at the comma after ‘pages,’ and remove the word itself. Once we convert the page numbers into a float data type, we’ll have both the numeric value and a fresh categorical field ready for analysis.
insert_location = 6 # Index where you want to insert the new column
column_name = 'Format'
df.insert(insert_location, column_name,"")
df[['Pages', 'Format']] = df['Pages'].str.split(', ', expand=True)
df['Pages'] = df['Pages'].str.replace(' pages', '').astype(float)
Another column within our dataset holds valuable numeric data for potential analysis. Due to the structure of Goodreads’ book pages in HTML, we extracted the totals for Quotes, Discussions, and Questions for each book as a single text string. To make this data more usable, we’ll follow a similar process to what we’ve done previously: isolating the numeric values. Since these numbers are currently in a single column, we’ll create three new columns, each dedicated to one of these fields. Additionally, we’ll convert any empty or null values to zeros, transforming these text strings into integers. Lastly, we’ll remove the original column from which this data was initially pulled.
# Extract numerical values using regular expressions
df['Quotes'] = df['Quote-Disc-Quest Raw'].str.extract(r'(\d+)quotes')
df['Discussions'] = df['Quote-Disc-Quest Raw'].str.extract(r'(\d+)discussions')
df['Questions'] = df['Quote-Disc-Quest Raw'].str.extract(r'(\d+)questions')
# Fill missing values with zeros
df['Quotes'].fillna(0, inplace=True)
df['Discussions'].fillna(0, inplace=True)
df['Questions'].fillna(0, inplace=True)
# Convert the new columns to numeric data type and round to integers
df['Quotes'] = pd.to_numeric(df['Quotes']).astype(int)
df['Discussions'] = pd.to_numeric(df['Discussions']).astype(int)
df['Questions'] = pd.to_numeric(df['Questions']).astype(int)
# Drop the original 'Quote-Disc-Quest Raw' column
df.drop(columns=['Quote-Disc-Quest Raw'], inplace=True)
Next, we’ll handle our categorical data, and the single column that contains datetime information.
Genres & DateTime Data
The titles and author names were efficiently extracted with our loop, eliminating the need for any cleaning in those columns. However, there are some tasks ahead for refining our genres column and retrieving the publication date for each book.
In the genre column, each book is associated with several descriptive genres. Our data scrape collected all of these genres as a single, unseparated string. Given that we won’t perform extensive numerical operations on this data, we only need the top 5 genres for each book. Before proceeding, though, we must address an issue in the genre column. Two possible genres in our top 1,000 book list are ‘World War I’ and ‘World War II.’ If we split this text at capital letters, these two genres would be indistinguishable. To resolve this, we will search through the data and replace the Roman numerals with standard Arabic numerals.
Once the genre text is cleaned, we can divide the complete string into five new columns, each representing one of the genres we intend to extract. To accomplish this, we’ll split the string at capital letters that aren’t preceded by a space, keep only the first five genres, and loop over the dataframe to make sure all the desired data is properly parsed.
# Convert 'Genres' column to string type
df['Genres'] = df['Genres'].astype(str)
# Replace "World War I" with "World War 1" and "World War II" with "World War 2"
df['Genres'] = df['Genres'].str.replace('World War II', 'World War 2').str.replace('World War I', 'World War 1')
# Trim the first six characters of the raw genre string
df['Genres'] = df['Genres'].str[6:]

# Split the 'Genres' column and create separate genre columns
def extract_genres(text):
    genres = []
    genre = ""
    for char in text:
        if char.isupper() and (not genre or genre[-1] != ' '):
            if genre:
                genres.append(genre.strip())
            genre = char
        else:
            genre += char
    if genre:
        genres.append(genre.strip())
    return genres[:5]

genre_columns = ['Genre 1', 'Genre 2', 'Genre 3', 'Genre 4', 'Genre 5']
for idx, genre_text in df['Genres'].items():
    if isinstance(genre_text, str):
        extracted_genres = extract_genres(genre_text)
        for i, genre in enumerate(extracted_genres):
            df.at[idx, genre_columns[i]] = genre

# Drop the original 'Genres' column, along with the leftover 'Ratings' and 'Reviews' text columns
df.drop(columns=['Genres', 'Ratings', 'Reviews'], inplace=True)
Now, let’s address the solitary piece of datetime data. We have the date of the first publication of each book in text format within our dataset. Our objective is to extract the date portion in the format ‘month-day-year’. Following this, we’ll convert it into a numeric datetime format for potential use in future analyses, establishing a new column labeled ‘First Published’. Subsequently, we will eliminate the original ‘Published’ column from our dataset.
# Extract date portion and convert to date, then format as MM-DD-YYYY
date_extracted = df['Published'].str.extract(r'First published (\w+ \d+, \d{4})')[0]
df['First Published'] = pd.to_datetime(date_extracted).dt.strftime('%m-%d-%Y')
# Drop the original 'Published' column
df.drop(columns=['Published'], inplace=True)
With our data cleaned and processed, it’s easy to see how the raw data scrape can now be picked up and used by other analysts.
Exporting Data to csv
Finally, we’re going to export our data to a csv file that we can use in the future. With the data from our scrape cleaned and processed, it can be utilized for a myriad of analyses and insights.
df.to_csv('GoodReadsTop1000(clean).csv', index=True)
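As a quick illustration of how another analyst might pick the exported file up later, here is a hypothetical follow-on step, assuming the column names produced above:
# Hypothetical downstream use of the exported csv
import pandas as pd

books = pd.read_csv('GoodReadsTop1000(clean).csv', index_col=0)
# A couple of simple summaries on the cleaned columns
print(books['Genre 1'].value_counts().head(10))           # most common primary genres
print(books.groupby('Format')['Pages'].mean().round(0))   # average page count by format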