Scraping Websites with Ruby

Breaking down a simple web scraper

TopTunes is a TinyApp that scrapes the Billboard Hot 100 songs and gets them ready to be manipulated, displayed, or filtered.

Hello, TopTunes 👋

The output of TopTunes isn't particularly exciting when you take a look at it below, but there's a lot of interesting stuff happening behind the scenes here.

[Screenshot: TopTunes output in the terminal]

By running a Ruby file called script.rb, which we'll break down below, we get the current top 5 songs on the Billboard Hot 100. For each song, we want to extract the artist, the title, and the rank. If we run this script again next week, the results will be different!

The Site of Scraping 🪧

Web scraping is the process of extracting data from websites. Lots of large applications use web scraping as a strategy to collect data from all over the internet. TopTunes is a very simple application, meant to illustrate the power and process of web scraping.

The website that we're scraping data from is the Billboard Hot 100: https://www.billboard.com/charts/hot-100. This site is a good one to scrape because its data sits in a list that's organized predictably.

Central to scraping is the concept that a web page is just text. A website is really just a big glob of HTML at the end of the day. Web scrapers leverage the organization of HTML to parse through a website's text and extract useful information from it.

With that introduction, let's get our pickaxes, and get started.

Starting Out ⛏

We'll need to get our tools ready before we start scraping. Since we're using Ruby to scrape the Billboard Hot 100, we're going to use Nokogiri, OpenURI, and Pry to help us out.

Let's start by creating our main script file: script.rb. In that file we'll require certain Ruby gems.

require 'nokogiri'
require 'open-uri'
require 'pry'
  • Nokogiri: A Ruby gem that parses HTML and makes it easy to search the parsed document for what we're looking for. This gem is central to this application.
  • OpenURI: Part of Ruby's standard library. It lets us open a URL and read its contents, much like opening a file.
  • Pry: A tool that lets us pause our Ruby code at a given point and inspect it. If you're familiar with JavaScript, binding.pry plays a similar role to the debugger statement.

If you run into trouble with requiring these gems, it's likely that they are not installed on your machine. OpenURI ships with Ruby, so there's nothing to install for it. You can easily install Nokogiri and Pry:

$ gem install nokogiri
$ gem install pry

Next, let's get the whole HTML document via our Ruby script:

charts_doc = Nokogiri::HTML(URI.open('https://www.billboard.com/charts/hot-100'))

Like an onion, there are a few layers to this line. Let's start at the inner one and work our way out.

We can see a URL that points to the Billboard Hot 100 page. It's passed to a method called URI.open, which fetches the page and lets Ruby read it as an HTML document. That HTML text is in turn passed to Nokogiri::HTML(), which tells Nokogiri to parse it for us.

It may not be clear why that's important right now, but Nokogiri will come in handy in a minute. For now, let's understand that charts_doc contains all of the HTML on the Billboard Hot 100 page.

Want to see for yourself? If you add a binding.pry below this line, you can access this variable when the Ruby script is run. To do this, head to your terminal, make sure that you cd into your project's directory and run the script:

$ ruby script.rb

With the binding.pry in your code, the script should pause at that line and drop you into a Pry console. By typing charts_doc you can see the current value of the variable. This process is useful for debugging code throughout the rest of this application's breakdown.

Fiddling with .css

We have the entire Billboard chart as a document parsed by Nokogiri, but it's not yet giving us the information that we want. If you recall from the intro of this article, we're looking to isolate each song's title, rank, and artist. Currently we just have a blob of text for the entire chart content.

It would be ideal to have all of the song data in a neat array. Using Nokogiri, we can drill down and extract the information that we want. The .css method allows us to search our HTML document using CSS selectors, such as class names.

If we inspect the Billboard Hot 100 web page, we can see that the songs being ranked are inside an <ol> element that has a class of chart-list__elements. That's a perfect hook to grab onto.

[Screenshot: the Billboard Hot 100 page in the browser inspector]

With this code, we set a new variable chart to an array of songs (which are <li> HTML elements). We access them by using .css on charts_doc and then selecting the children of the ordered list (<ol>).

chart = charts_doc.css('ol.chart-list__elements').children.select.with_index { |_,i| (i+1) % 2 == 0 }

You might notice something strange about the above code. What is this whole select method doing here? Why can't we just set chart to the children of the <ol>?

As I was building this TinyApp, I found that every other child of the <ol> was blank. I couldn't spot the reason when I inspected the page, but the most likely explanation is that .children returns every child node, including the whitespace-only text nodes that sit between the <li> elements in the page source. In order to make sure that we only pull children (<li>s) with actual data in them, I selected every other index, starting at 1.

children.select.with_index { |_,i| (i+1) % 2 == 0 }

select.with_index deserves some further explanation. It loops over every item in children (the <li>s and the blanks between them) with each item's index available, and keeps the items that meet the given criterion. Here, the criterion is (i+1) % 2 == 0: if the index plus one is divisible by two (aka even), keep that item. Remember, that's index plus one, so in reality we're keeping all of the odd-indexed items.
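To see the index math on its own, here's a sketch that uses a plain array of strings as a stand-in for the <ol>'s children. The sample values are made up.

```ruby
# Stand-in for the <ol>'s children: blanks at even indexes, data at odd ones.
children = ['', 'Song A', '', 'Song B', '', 'Song C']

# Same condition as in script.rb: keep items whose index plus one is even.
picked = children.select.with_index { |_, i| (i + 1) % 2 == 0 }
puts picked.inspect

# An equivalent, arguably more readable, spelling of the same condition.
picked_odd = children.select.with_index { |_, i| i.odd? }
puts picked_odd.inspect
```

Both lines print ["Song A", "Song B", "Song C"], since indexes 1, 3, and 5 are the ones that pass the test.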

Song Class

At this point, we have been able to separate each <li> into items in an array, and we've selected the <li>s that have actual data.

Now, we'll want to figure out what to do with this data once we can extract our key information: title, artist, and rank.

Ruby is an object-oriented language. Everything in Ruby is an object. This is the subject of a much deeper discussion, but for now, know that organizing code into objects is what Ruby is made for. Since we're trying to extract song data and potentially use it in a variety of ways, it makes sense to create a class for Song, and treat it as an object.

A new file in our folder, song.rb, will hold our Song class.

class Song
  attr_reader :title, :artist, :rank
  def initialize(title, artist, rank)
    @title = title
    @artist = artist
    @rank = rank
  end
end

Here, we're constructing a song (using def initialize) with a title, artist, and rank. The attr_reader helper makes title, artist, and rank readable from outside the class. It does this by creating a method for each (.title, .artist, and .rank), all of which can be called on a Song object to get these values. We'll see this in action soon.
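Here's a quick sketch of those reader methods in use. The class is repeated so the snippet runs on its own, and the song values are just sample data.

```ruby
class Song
  attr_reader :title, :artist, :rank

  def initialize(title, artist, rank)
    @title = title
    @artist = artist
    @rank = rank
  end
end

# Sample values for illustration.
song = Song.new('Butter', 'BTS', '1')

# attr_reader gives us a getter for each attribute...
puts song.title
puts song.artist

# ...but no setter, so the values can't be reassigned from outside.
has_setter = song.respond_to?(:title=)
puts has_setter
```

If we ever needed to change a song after creating it, attr_accessor would add the setters too, but read-only is a sensible default for scraped data.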

If we return to our script.rb file, we'll need to require_relative 'song' to make sure that we have access to the Song class.

Then, below our existing code, we can loop through chart, the array that holds the <li>s containing each song's information.

After inspecting the website code and using our handy dandy .css method with Nokogiri, we can extract rank, title, and artist, converting them to strings and creating a new Song object with each of them.

songs = []

chart.each do |song|
  rank = song.css('.chart-element__rank__number').children.to_s.strip
  title = song.css('.chart-element__information__song').children.to_s.strip
  artist = song.css('.chart-element__information__artist').children.to_s.strip

  songs.push(Song.new(title, artist, rank))
end

As they are created, we also push each newly created Song object into an array, songs.

Using the Data

Now that we have an array of Song objects that have all the data they need, we can use the data for all sorts of things! The options are unlimited.

As a very basic example of what we can do with this data, here's a single line of code that prints out the top 5 songs to the console.

songs[0...5].each { |song| puts "#{song.rank}. #{song.title} by #{song.artist}" }

We're using a range at the beginning, to select indexes 0-4 (aka the first five songs), and then using puts to print out song rank, title, and artist. When I run ruby script.rb, here's the sample output.

$ ruby script.rb
1. Butter by BTS
2. Good 4 U by Olivia Rodrigo
3. Levitating by Dua Lipa Featuring DaBaby
4. Stay by The Kid LAROI &amp; Justin Bieber
5. Kiss Me More by Doja Cat Featuring SZA
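A quick aside on the range used to pick the top five: three dots exclude the end index, while two dots include it. Here's a sketch with a hypothetical Track struct and made-up data, just to show the difference.

```ruby
# Hypothetical stand-in for Song, with made-up data.
Track = Struct.new(:rank, :title)
tracks = (1..8).map { |n| Track.new(n, "Song #{n}") }

# Three dots: indexes 0 through 4, so five items.
top_five = tracks[0...5]

# Two dots: indexes 0 through 5, so six items.
inclusive = tracks[0..5]

# first(5) says the same thing as 0...5 without any index math.
same_five = tracks.first(5)

puts top_five.length
puts inclusive.length
puts top_five == same_five
```

This prints 5, 6, and true, so songs.first(5) would be a drop-in alternative to songs[0...5].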

We've successfully created a script that scrapes data from a webpage and processes it to extract pertinent information, wrapped up neatly inside objects. There are so many powerful applications for scraping, and this has just scratched the surface!

Til next time!
