A Gentle Introduction to Scraping a Table by Hand

AKA A Harsh Introduction to Scraping a Table the Easy Way

Jeffrey Hanif Watson
4 min read · Nov 7, 2021

Introduction

I recently started a project aimed at creating a suite of equity option web apps, inspired by Harshit Tyagi’s article on building a financial newsfeed app. I wanted to test things out by starting small and focusing on the S&P 500 index. While recreating his headline reader, I quickly found that S&P had stopped releasing a file containing all of the stocks in the index. After a quick web search, I read that scraping the list from Wikipedia was the easiest way around the problem, so I pulled out an old lesson on web scraping and got to work.

Amusingly, after coding out the solution by hand, I did some googling and found a Pandas method that creates data frames directly from html (re-read that last sentence and imagine the sad trombone sound playing as you read). However, the whole episode was a good learning experience on a few different levels.

First, there is no better way to learn new skills, or refresh old ones, than by jumping in and coding. Digging into the project forced me to recall the foundations of web scraping, html tags, etc. I haven’t used those skills in a while, and I’ve actually come away with a better understanding of them.

Second, there is probably a method, package or library that can do what you want to do quicker and easier than you coding a solution from scratch.

Scraping with Requests and Beautiful Soup

Aside from Pandas, the libraries we’ll be using are:

  • Requests: An easy-to-use HTTP library for Python that we’ll use to get the page html.
  • Beautiful Soup: A Python library that simplifies web scraping, which we’ll use to parse the html and extract the table.

In the code below we get the html from Wikipedia, parse the data, and print out the first row of the stock information.

# importing the libraries
import requests
from bs4 import BeautifulSoup

# getting the S&P 500 Wikipedia page
r = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
# parsing the html
soup = BeautifulSoup(r.text, 'lxml')
# extracting the table
table = soup.find('table', {'class': 'wikitable sortable'})
# printing the row containing the first stock
print(table.findAll('tr')[1:2])

Output:

[<tr>
<td><a class="external text" href="https://www.nyse.com/quote/XNYS:MMM" rel="nofollow">MMM</a>
</td>
<td><a href="/wiki/3M" title="3M">3M</a></td>
<td><a class="external text" href="https://www.sec.gov/edgar/browse/?CIK=66740" rel="nofollow">reports</a></td>
<td>Industrials</td>
<td>Industrial Conglomerates</td>
<td><a href="/wiki/Saint_Paul,_Minnesota" title="Saint Paul, Minnesota">Saint Paul, Minnesota</a></td>
<td>1976-08-09</td>
<td>0000066740</td>
<td>1902
</td></tr>]

Using Tags to Scrape the Data

The tag <tr> denotes a table row, and <td> demarcates the table data (i.e. the individual cells) displayed. Reading the Beautiful Soup output above, we can see that the first cell of this row contains a link to the NYSE 3M quote page wrapped around the ticker symbol text ‘MMM’.
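To illustrate navigating these tags, here is a minimal, self-contained sketch using a stand-in snippet of html shaped like the row above (the sample markup is abbreviated, not the full Wikipedia table). It pulls both the ticker text and the embedded quote link out of the first data cell:

```python
from bs4 import BeautifulSoup

# a stand-in for the Wikipedia table, trimmed to two columns
html = '''
<table class="wikitable sortable">
<tr><th>Symbol</th><th>Security</th></tr>
<tr>
<td><a href="https://www.nyse.com/quote/XNYS:MMM">MMM</a></td>
<td><a href="/wiki/3M" title="3M">3M</a></td>
</tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table')
first_row = table.findAll('tr')[1]        # index 0 is the header row
cells = first_row.findAll('td')
ticker = cells[0].text.strip()            # 'MMM'
quote_link = cells[0].find('a')['href']   # the NYSE quote URL
print(ticker, quote_link)
```

The same `.text` and `['href']` accessors work identically on the real table scraped from Wikipedia.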

Let’s save the ticker symbols to a list by scraping the text from the first cell of each row of stock data using a list comprehension.

# saving a list of ticker symbols
symbols = [row.findAll('td')[0].text for row in table.findAll('tr')[1:]]
# checking length
print(f'List Length: {len(symbols)} \n')
# checking first 5 symbols
print(f'First Five Symbols: {symbols[:5]}')

Output:

List Length: 505 

First Five Symbols: ['MMM\n', 'ABT\n', 'ABBV\n', 'ABMD\n', 'ACN\n']

The length of the list looks good, but we have extra characters we need to strip from the symbols.

Stripping new line characters:

# stripping new line character from the strings 
symbols = list(map(lambda s: s.strip(), symbols))
# checking first 5 symbols
print(symbols[:5])

Output:

['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']

Looks good. We can apply the same technique to the second cell of each row to scrape and save the company name, and to the fifth cell to grab the company’s industry.
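The article doesn’t show those two comprehensions, so here is a hedged sketch of what they might look like, run against a small hypothetical sample table so it is self-contained (on the real page you would use the `table` object from the earlier snippet instead):

```python
from bs4 import BeautifulSoup

# two hypothetical rows standing in for the full Wikipedia table
html = '''
<table class="wikitable sortable">
<tr><th>Symbol</th><th>Security</th><th>SEC filings</th><th>Sector</th><th>Industry</th></tr>
<tr><td>MMM</td><td>3M</td><td>reports</td><td>Industrials</td><td>Industrial Conglomerates</td></tr>
<tr><td>ABT</td><td>Abbott</td><td>reports</td><td>Health Care</td><td>Health Care Equipment</td></tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table')

rows = table.findAll('tr')[1:]
# second cell (index 1) holds the company name
names = [row.findAll('td')[1].text.strip() for row in rows]
# fifth cell (index 4) holds the industry
industries = [row.findAll('td')[4].text.strip() for row in rows]
print(names)       # ['3M', 'Abbott']
print(industries)  # ['Industrial Conglomerates', 'Health Care Equipment']
```

Note that `.strip()` is applied inside the comprehension here, which folds the newline-stripping step from above into the scrape itself.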

Combining the Lists into a Dictionary and Creating a Data Frame

After scraping and saving all the information we want to lists, we’ll combine them into a dictionary and form a data frame.

import pandas as pd

# making a data dictionary from the scraped lists
data = {'Company Name': names, 'Symbol': symbols, 'Industry': industries}
# creating a data frame from the data
stocks_df = pd.DataFrame.from_dict(data)
# checking the first five rows
stocks_df.head()

Output:

[Image by Author: the first five rows of the stock data frame]

Conclusion

Pretty nice! Not too much work, and we have full control over how we scrape the page and what gets added to our data frame.

However, if the page is properly formatted, we can scrape the table and save it to a data frame in three lines of code (sad trombone). To be continued…
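For reference, the three-liner the article alludes to is `pd.read_html`, which parses every `<table>` it finds and returns a list of data frames. Here is a minimal sketch run on an inline sample (hypothetical data) so it works without a network call; pointing it at the Wikipedia URL instead works the same way:

```python
from io import StringIO
import pandas as pd

# a tiny sample table; for the live page, pass the Wikipedia URL to read_html
html = '''
<table>
<tr><th>Symbol</th><th>Security</th><th>GICS Sector</th></tr>
<tr><td>MMM</td><td>3M</td><td>Industrials</td></tr>
</table>
'''
stocks_df = pd.read_html(StringIO(html))[0]
print(stocks_df.head())
```

Against the real page, the same pattern would be `pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]`, with the constituents table presumably first in the returned list.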


Jeffrey Hanif Watson

Data scientist with a background in education. Skilled at using data acquisition, analysis and machine learning to provide actionable insights.