- Mon 30 April 2018
- Web Scraping
- William Miller
- #web scraping, #BeautifulSoup, #TED.com, #TED talks
In preparation for a data analysis project predicting how people are likely to rate TED talks, I decided to scrape the metadata and transcripts for every TED talk. Since there are currently over 2700 TED talks, this script takes quite a long time to run, but it is effective. I will provide brief explanations of what I'm doing in each cell of code, in addition to the normal documentation.
Note that everything below is based on the state of their website as of April 30th, 2018; it may stop working if they make any significant changes to their HTML or JavaScript.
Import libraries
from bs4 import BeautifulSoup, Comment
import json
import requests
import re
from time import sleep
from random import randint
import pandas as pd
from dateutil import parser
import sys
Create toggles
It can take a really long time to scrape all this data, so I'm creating a dashboard of sorts at the outset so I can toggle various parts of it on or off. The general data does not take very long at all, and some fun analysis can be done with just that, so unless I decide to modify this further, I'm going to separate it out from the rest. As I also note in the comment below, scraping specific data relies on the general data, so the general scrape must be run at least once before update_specific_data is set to True.
#Set to "False" to avoid updating data. Set to "True" to update, which can be a lengthy process.
#Scraping specific data relies on general data, so it must be run at least once before specific is set to "True"
update_general_data = False
update_specific_data = True
#If a toggle is off, see if the corresponding file already exists and whether data can be pulled from it.
if update_general_data == False:
    try:
        TED_gen_df = pd.read_csv('TEDGeneral.csv')
    except FileNotFoundError:
        print('No general data to read, set "update_general_data" to "True" to acquire data')
if update_specific_data == False:
    try:
        pd.read_csv('TEDSpecific.csv')
    except FileNotFoundError:
        print('No specific data to read, set "update_specific_data" to "True" to acquire data')
Create function to show progress
We're going to want to see how much work has been done and how much remains at various points, so I'm going to create a very basic, easy function for that.
def show_progress(part, whole):
"""
Input:
part = element of list
whole = list
---------
Function:
    Find the position of element "part" within the list "whole"
---------
Output:
Return the string "[nth list element] of [list length]"
"""
path = len(whole)
step = whole.index(part) + 1
progress = str(step) + " of " + str(path)
return progress
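A quick sanity check of the output format:
print(show_progress('b', ['a', 'b', 'c']))  # prints: 2 of 3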
Create URL building functions
We'll need a couple of functions: one to find how many pages of talks there are, and another to build the URLs to access those pages.
def get_num_pages(start_page_soup):
"""
Input:
    start_page_soup = The HTML of page 1 of the talks listing on www.TED.com, as processed by BeautifulSoup
---------
Function:
Search the HTML of www.TED.com/talks for the number of pages of talks.
---------
Output:
The number of pages of talks on www.TED.com/talks
"""
    # The hyperlinks used to navigate between pages of TED talks have the class "pagination"; find that element
pagination_str = start_page_soup.find(class_ = 'pagination')
page_num_list = []
    # Step through the children of the pagination element, pull out any page numbers, and add them to a list
    for child in pagination_str.contents:
        page_num = re.search(r'\d+', str(child))
        if page_num is not None:
            page_num_list.append(int(page_num[0]))
    # Return the largest page number found
return max(page_num_list)
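To see the mechanics, here is the function run against a simplified, made-up stand-in for the pagination markup (not TED.com's actual HTML). The function simply picks the largest number it can find among the pagination links:
# Illustrative only: a minimal mock of a pagination element
mock_pagination = BeautifulSoup(
    '<div class="pagination"><a href="?page=1">1</a>'
    '<a href="?page=2">2</a><a href="?page=77">77</a></div>',
    'html.parser')
print(get_num_pages(mock_pagination))  # prints: 77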
def build_page_urls(start_url, num_pages):
"""
Input:
start_url = The base URL that lists TED talks
num_pages = the number of pages of TED talks
---------
Function:
Build a list of valid URLs of pages listing TED talks
---------
Output:
The list of URLs
"""
url_list = []
# Create strings of valid URLs according to the pattern observed on TED.com
for i in range(1, num_pages + 1):
url_list.append(re.sub(r'\d+', str(i), start_url))
return url_list
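For example, building three page URLs from the starting URL used later in this post:
print(build_page_urls('https://www.ted.com/talks?page=1', 3))
# ['https://www.ted.com/talks?page=1', 'https://www.ted.com/talks?page=2', 'https://www.ted.com/talks?page=3']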
Create functions to request a webpage and handle request errors
It is very easy to call a website too frequently and get shut out, so it is necessary to insert some wait time into this function. After experimenting, I found that waiting a random interval of between 10 and 25 seconds seemed to be ideal. Transcript pages do not exist for some talks, so I need to handle the errors that result from requesting pages that do not exist. Occasionally requests will also time out for no apparent reason, so this must be handled as well.
Getting this right is probably the most delicate and tricky part of scraping a large amount of data.
def request_webpage(url):
"""
Input:
url = The url to request
---------
Function:
Request the html from a URL while avoiding calling TED.com too frequently
and while turning any errors that may occur over to an error handler.
---------
Output:
The html returned from a URL request
"""
    # If we receive a timeout, we'll want to try again; timeout_tries is the maximum number of attempts
    timeout_tries = 5
    tries = 0
    # If every attempt fails, return 'error' so the caller can deal with it later
    page = 'error'
    # Count the number of tries, retrying until we've tried enough
    while tries <= timeout_tries:
        #Wait a random interval to keep from calling the website too often.
        sleep(randint(10, 25))
        #Try to get the page; if it returns a code that is not a success code, call the error handler
        try:
            page = requests.get(url, timeout=60)
            page_request_code = str(page.status_code)
            if page_request_code[:1] != '2':
                page = request_errorhandle(page_request_code, url)
            break
        # If the request times out (or fails for another transient reason), count the attempt and try again
        except requests.exceptions.RequestException:
            tries = tries + 1
    return page
def request_errorhandle(code, url):
    """
    Input:
    code = The status code of a failed URL request
    url = The URL whose request failed
    ---------
    Function:
    Handle the status codes of any failed URL request that we are likely to receive
    ---------
    Output:
    The response from retrying the URL request, an empty string for an invalid URL,
    or 'error' if the retry also fails
    """
    # If we requested an invalid URL, return null data
    if code == '404':
        return ''
    # Most codes thrown by valid URLs can be handled by waiting a while and trying again
    else:
        sleep(randint(300, 360))
        try:
            page = requests.get(url, timeout=60)
        #If that doesn't work, make the returned page data equal "error". I can then go back and fix it.
        except requests.exceptions.RequestException:
            page = 'error'
        return page
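With both pieces in place, fetching a page is just a matter of calling request_webpage and checking for the two sentinel values ('' and 'error') before using the response:
# Example call against the talks listing page used below
page = request_webpage('https://www.ted.com/talks?page=1')
if page not in ('', 'error'):
    print(page.status_code)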
Set up functions to retrieve data
I've set up three functions for retrieving data: "data_from_url" retrieves general information about each TED talk, "extract_data_script" retrieves much more specific information for each talk, and "get_transcript" retrieves the transcript for each talk, if one exists.
def data_from_url(page_soup, data_dict):
"""
Input:
page_soup = The BeautifulSoup data from a page of TED talks
data_dict = A dictionary of the data from TED talks that I will append to
---------
Function:
Extract the data I'm interested in and append it to a dictionary
---------
Output:
A dictionary containing the metadata I'm interested in from a page listing TED talks
"""
    # Save data for all talks on the page to page_data
page_data = page_soup.find_all(class_ = 'talk-link')
# Iterate over that data and save the bits we're interested in
for talk in page_data:
        media_data = talk.find('span', {'class': 'thumb thumb--video thumb--crop-top'})
message_data = talk.find(class_ = 'media__message')
title_url_data = message_data.find(class_ = ' ga-link', attrs = {'data-ga-context': 'talks'})
meta_data = message_data.find('div', {'class':'meta'})
#get title
talk_title = title_url_data.contents[0].strip()
#get speaker
talk_speaker = message_data.find(class_ = 'h12 talk-link__speaker').contents[0].strip()
#get url
talk_url = 'https://www.ted.com' + title_url_data['href'].strip()
#get date posted
talk_date = meta_data.find(class_ = 'meta__item').find(class_ = 'meta__val').contents[0].strip()
#get rated bool and rating
if meta_data.find(class_ = 'meta__row'):
talk_rated_bool = True
talk_ratings = meta_data.find(class_ = 'meta__row').find(class_ = 'meta__val').contents[0].strip()
else:
talk_rated_bool = False
talk_ratings = None
# get duration
talk_duration = media_data.find('span', {'class':'thumb__duration'}).contents[0].strip()
data_dict[talk_url] = ({'title': talk_title, 'speaker': talk_speaker, 'date': talk_date,
'rated_bool': talk_rated_bool, 'ratings': talk_ratings,
'duration': talk_duration})
return data_dict
def extract_data_script(page_data, unique_start):
"""
Input:
page_data = The BeautifulSoup data from the page of a specific TED talk
unique_start = Any unique string that marks the beginning of some javascript to extract
---------
Function:
Extract some javascript containing data I'm interested in and append it to a dictionary
---------
Output:
A string of raw javascript that contains data I'm interested in.
(Further processing will be required to finalize data extraction.)
"""
# Create a string from the BeautifulSoup page data.
page_str = str(page_data)
    # Find where the unique_start string is located within the page
start_idx = page_str.find(unique_start)
    # Find the opening and closing tags of every script element on the page
num_script_start = re.finditer('<script', page_str)
num_script_end = re.finditer('</script>', page_str)
    # Make lists of the locations of opening and closing script tags, then pair them up
start_list = [s.start() for s in num_script_start]
end_list = [e.end() for e in num_script_end]
script_list = list(zip(start_list,end_list))
# Narrow that list of locations down to just the ones that begin after unique_start
narrowed_script_list = [s for s in script_list if s[0] >= start_idx ]
    # Ensure that we have selected the first opening tag and first closing tag after
    # unique_start without getting mixed up somewhere along the way.
    if len(narrowed_script_list) < 2 or narrowed_script_list[0][1] < narrowed_script_list[1][0]:
        script_idx = narrowed_script_list[0]
    # If the tags appear out of order, fall back to everything from unique_start to the end of the page
    else:
        script_idx = (start_idx, len(page_str))
# Extract the javascript string immediately after unique_start
data_extract = page_str[script_idx[0]:script_idx[1]]
return data_extract
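The string that comes back still has the script tags and the q("talkPage.init", ...) call wrapped around the JSON payload. A rough sketch of the "further processing" mentioned in the docstring might look like the following, using the json library imported earlier. The helper name parse_talk_page_json is hypothetical, and the closing text ')</script>' is an assumption about TED.com's markup rather than something verified here:
def parse_talk_page_json(data_extract):
    # Hypothetical helper; assumes the payload is wrapped as <script>q("talkPage.init", {...})</script>
    prefix = '<script>q("talkPage.init", '
    start = data_extract.find(prefix) + len(prefix)
    end = data_extract.rfind(')</script>')
    return json.loads(data_extract[start:end])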
def get_transcript(transcript_page_soup, start_transcript, end_transcript):
"""
Input:
transcript_page_soup = The BeautifulSoup data from the page of a specific TED talk transcript
start_transcript = Comment indicating the start of a transcript
end_transcript = Comment indicating the end of a transcript
---------
Function:
Extract the transcript from a TED talk, if it exists
---------
Output:
A string containing the transcript of a TED talk.
"""
# Initialize a list to store each element of a transcript
transcript_list = []
# In case there is no closing comment for a transcript, set a maximum
# number of times we'll look for it.
timeout = 300
timeit = 0
# Find all comments within the transcript page
for comment in transcript_page_soup.find_all(text=lambda text:isinstance(text, Comment)\
and text.strip() == start_transcript):
# Find a comment that denotes the start of a transcript
if str(comment).strip() == start_transcript:
# Get the transcript until a comment is reached that denotes the end of a transcript
while True:
timeit = timeit + 1
comment = comment.next_element
if str(comment).strip() == end_transcript:
break
transcript_list.append(comment)
# If we've gone on for a while without finding a closing comment, end the loop
if timeit >= timeout:
break
# Join the list of elements that contained the transcript into a string
transcript = ''.join(str(s) for s in transcript_list)
return transcript
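To see how it works without hitting the site, here is the function run against a tiny, made-up snippet that mimics the start and end comments (the real transcript pages are, of course, far more complex):
mock_transcript_html = ('<div><!-- Transcript text -->'
                        'Hello, and welcome to this talk.'
                        '<!-- /Transcript text --></div>')
mock_soup = BeautifulSoup(mock_transcript_html, 'html.parser')
print(get_transcript(mock_soup, 'Transcript text', '/Transcript text'))
# prints: Hello, and welcome to this talk.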
Retrieve general data
Having set up all of my functions, it's now time to use them. General data retrieval comes first and, as I mentioned, it is much faster than retrieving all of the specific data. This is because the general data resides in the pages that list all of the TED talks (of which there are currently 77), rather than in the pages for each individual TED talk (of which there are currently over 2770).
# Check to see if I've toggled update_general_data on.
if update_general_data == True:
# Retrieve the first page of TED talks
start_url = 'https://www.ted.com/talks?page=1'
start_page = request_webpage(start_url)
start_page_soup = BeautifulSoup(start_page.text, 'html.parser')
# Use that page to find the total number of pages
num_talk_pages = get_num_pages(start_page_soup)
# Build a list of URLs for all the pages listing TED talks
talks_url_list = build_page_urls(start_url, num_talk_pages)
# Check to see if I've toggled update_general_data on.
if update_general_data == True:
# Set up dictionary for temporary storage
TED_talk_dict = {}
for url in talks_url_list:
# Call webpage request function
page = request_webpage(url)
        # Convert the page request to BeautifulSoup
page_soup = BeautifulSoup(page.text, 'html.parser')
        # Pull data from the soup using data_from_url and add it to the dict
TED_talk_dict = data_from_url(page_soup, TED_talk_dict)
# Show progress
sys.stdout.write('\r'+ 'step ' + show_progress(url, talks_url_list))
# Make dataframe of all data from URLs
TED_gen_df = pd.DataFrame.from_dict(TED_talk_dict, orient='index').reset_index()
    # Process the dataframe, moving the unique URL of each talk (from the reset index) into a 'url' column
TED_gen_df['url'] = TED_gen_df['index']
TED_gen_df = TED_gen_df.drop('index', axis=1)
# Show a sample
TED_gen_df.sample(15)
# Check to see if I've toggled update_general_data on.
if update_general_data == True:
    #If so, save our newly retrieved general data to a CSV
TED_gen_df.to_csv('TEDGeneral.csv', index=False)
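Retrieve specific data and transcripts
Now for the slow part. With the general data in hand, the code below visits each talk's individual page and its transcript page, extracts the raw data and the transcript using the functions defined above, and saves everything to a second CSV, TEDSpecific.csv.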
# Add two empty columns to our dataframe
TED_gen_df['raw_data'] = ''
TED_gen_df['transcript'] = ''
# For each TED talk in the dataframe, retrieve its URL (stored in the 'url' column)
for idx in TED_gen_df.index:
page_url = TED_gen_df.iloc[idx]['url']
# Make the url for the transcript page
transcript_url = page_url + '/transcript'
# Request web pages for the specific talk data and the transcript data.
talk_page = request_webpage(page_url)
transcript_page = request_webpage(transcript_url)
# If the talk page is not blank, and an error was not returned, extract data from it
if talk_page != '' and talk_page != 'error':
talk_page_soup = BeautifulSoup(talk_page.text, 'html.parser')
extracted_data = extract_data_script(talk_page_soup, '<script>q("talkPage.init", ')
    # If the page was blank or returned an error, set the data to an empty string
else:
extracted_data = ''
# If the transcript page is not blank, and an error was not returned, extract transcript from it
if transcript_page != '' and transcript_page != 'error':
transcript_page_soup = BeautifulSoup(transcript_page.text, 'html.parser')
extracted_transcript = get_transcript(transcript_page_soup, 'Transcript text', '/Transcript text')
    # If the page was blank or returned an error, set the transcript to an empty string
else:
extracted_transcript = ''
# Write that extracted data to the dataframe
TED_gen_df.at[idx, 'raw_data'] = extracted_data
TED_gen_df.at[idx, 'transcript'] = extracted_transcript
#Show progress. Blank string is to ensure complete overwrite of previous string after carriage return.
sys.stdout.write('\r' + 'step ' + show_progress(idx, list(TED_gen_df.index)) + ': ' + transcript_url\
+' ')
# Save the updated dataframe to a csv (different from csv for general data)
TED_gen_df.to_csv('TEDSpecific.csv', index=False)