2021 Week 28 | Power BI: Web Scraping with Power Query

Introduction

July is “Power Query Month”, where the theme is focused less on visualization and more on data preparation. Every Power BI challenge requires Power Query to get and transform data, but now we’re going deeper in July! If you haven’t checked out Dave’s challenge from last week, make sure to check it out here.  

This week we are going to use Power Query to scrape data from a website utilizing a function to loop through the web scrape. We are then going to take that data and create the report that is embedded below.

Requirements

  • In Power Query, create a Web query that scrapes years 2000 through 2020 from the website: https://www.baseball-reference.com/leagues/majors/YYYY-standings.shtml
      •  First, create a list of the desired years using a Blank Query
      • Second, use another Blank Query to create a Function that cycles through the base website for every one of the desired years
      • Convert list of years to table, add a column using the Invoke Custom Function method
      • Expand table
      • Filter out  “Average” from the field [Tm]
      • Change data types as appropriate
  • Create a new column in Power Query from the standings query that is Games Above .500 = [W] – [L]
  • Create a second Web query that pulls in the data from the Team ID’s page from the follow website: https://www.baseball-reference.com/about/team_IDs.shtml
      • Filter out all rows that have a Last Year before the first year that you pulled the data for (2000 in my instance)
  • Create a data model relating the Tm field in the fact table to the Full Team Name in the dimension table
      • You will find that there is one team name in the fact table, the Los Angeles Angels, that is not found in the dimension table. The Franchise ID that this value needs to be assigned to is LAA. Feel free to solve for adding this value to the dimension table in whatever way you see fit. I duplicated the original query, filtered down to one value with the wrong name, replaced values, then appended to original query. 
  • Using bar charts, create a small multiple chart showing each of the team’s seasons results with the Year being the X axis, the amount of games over or under .500 being the Y axis, and the Franchise IDs being the Small Multiples
      • Differentiate seasons that teams won more than they lost by using different colors for the bars

Dataset

This week’s dataset will be scraped from the web from the following two websites:

  1. https://www.baseball-reference.com/leagues/majors/2020-standings.shtml 
  2. https://www.baseball-reference.com/about/team_IDs.shtml

Share

After you finish your workout, share on Twitter using the hashtags #WOW2021 and #PowerBI, and tag @JSBaucke@MMarie, @shan_gsd and @dataveldAlso make sure to fill out the Submission Tracker so that we can count you as a participant this week in order to track our participation throughout the year. 

Solution

9 thoughts on “2021 Week 28 | Power BI: Web Scraping with Power Query”

  1. Loved this challenge. Couldn’t work out how you managed to have all the axis showing without the scroll bar so to see the tip on using the custom page size and then expanding was really useful

  2. In addition to the missing Los Angeles Angels, I witnessed another problem.

    Cleveland Indians received a new name due to some controversies, and is now called Cleveland Guardians.

    I simply replaced all team name values in the fact table.

  3. M. A. R. Azri Boer

    Hi Team, I cannot access the detailed standings in my Power Query Editor. It can not be shown at the Web Page of the Power Query Editor. How can I fix that? Thanks for any comments!

    1. Shannon Lindsay

      Hi – I’d love to try to help! Can you elaborate? What do you mean when you say “detailed standings”? Is that a column in the data? Or something you’re referencing in the Power Query UI?

      1. M. A. R. Azri Boer

        Hi, Shannon. That is the data of all MLB team rank. I think the problem is how I reference the list to the website, and incorporate them in a custom function. When I put the custom function in the list, Power Query would show error. I just thought that I met my dead end of this problem. Sorry for the late response of mine.

  4. I have also tried the invoke function, and I think it’s the ‘website’ may have changed. Source{6} is ok but Source{7} will return nothing.

  5. Is this exercise still doable? @9:00 in the video, the formula includes “Data = Source{7}[Data] i.e. “Table no.7 in the HTML”.

    However, I can’t get it to work. It returns the Expression. Error: “There weren’t enough elements in the enumeration to complete the operation.”. Table 1-6 work fine.

    1. The exercise is still doable, but you may not be able to reference Source{7}. You can use the Power Query UI to help you get the table based upon selecting it by name in the UI while you create the function.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top