Previously: Why I Go All In With Data Science.
This is the first lesson I teach when introducing my unit on Data Science and Analyzing Big Data. My goal is to motivate the Big Idea of the unit: Data Tells a Story. I try to emphasize that while the data itself is objective, the interpretation is subjective, and that by looking for trends and patterns we can draw out points that go beyond the raw numbers. This lesson has 2 parts: introducing this big idea to frame the rest of the unit, and getting students into Excel and looking at Big Data as soon as possible.
To introduce this unit, I actually steal someone else’s enthusiasm: Hans Rosling. Have you seen his TED talk? If not, you should watch it – he makes an impassioned argument about the need to access data and represent it in a meaningful way. He does this with a piece of software called Gapminder that’s freely available on the web and as a program to download, but more on that later.
I start this lesson by letting students know that we’ve just spent a few weeks examining how computers communicate and represent information, but now we’re going to look at how humans communicate and represent information. I show them Hans Rosling’s TED talk from around 2:15 (where he says he discovered a real need to communicate data) to around 5:00 (the end of the Instant Replay). We talk about how excited Hans Rosling is and how cool the program is, but then I ask students which they would have preferred: an hour-long lecture explaining the shifts in third-world vs first-world countries over the last 40 years, or that 3-minute video demonstrating the same thing? My hope is they’ll pick the latter…
This helps answer our question of how humans communicate information: through visuals. We talk a bit about infographics and charts on the internet to tie in some personal experiences, and I ask students how their brains interpret STOP signs – which piece of information does your brain prioritize: reading the word STOP or seeing the red octagon? I then go back to Hans Rosling and fast-forward to around 15:00 (where he starts his metaphor of data sets being buried in the ground) and play through to around 16:30 (where he talks about needing a search function for the data). I re-frame this as a call to action: we have the data, we should be using it to communicate. But then I tell students that this talk happened in 2006 and ask how they think things have gone since then – is data more readily accessible? Is it being communicated effectively? Or are we in a wasteland of hidden data and bad representations?
To prime this discussion, I show students 4-5 images from this Code.org lesson on bad data visualizations (I especially like the bar chart of money, Obama, and Palin) and really try to make the argument that data gets misrepresented all the time. I make a big deal about this because it serves as the motivation for the whole rest of the unit: to use meaningful data and represent it accurately to tell a story. By this point, about half the class has gone by and we’re ready to start looking at real data and see how it can tell a story.
I tell students to go to Gapminder.org, scroll towards the bottom and find the section that says “Refresh Your World”, and choose one of the stories from that section to click on (e.g., “Wealth & Health of Nations”, “CO2 emissions since 1800”, etc). I show students how to ‘start the world’ and move around in time, how to see individual countries over time, and then tell them to mess around for a bit and see if they find anything interesting. The Wealth & Health of Nations section has the most opportunity for interesting stories, especially when looking at the life expectancy of certain countries in wartime (e.g., the US during the Civil War; Germany during WW2; the entire world during WW1). Having students play in this program really helps flex their questioning muscles – they see something interesting and want to know why it happens, so we talk a bit about what’s happening in history at that time that could explain the pattern they’re seeing. It’s a really great exercise for students to share questions and thoughts with each other. I have a piece of software that lets me share students’ screens with the class and I use it a lot in this lesson (Google Cast for Education is one way you might be able to do something similar). I make sure to highlight students who have found an interesting trend and a history-based explanation for it, or students who notice trends when comparing two countries, as if their data were tied together by some invisible force.
After playing around for a little while, I show students how to view the actual data they’re working with at Gapminder.org/data. I have them click a few of the links and look at the pop-up windows so they can preview the data itself. I purposefully want them to get overwhelmed looking at all the rows and columns so it motivates the question of “how do you even begin to analyze all of this stuff?”. I have them download the “Population, total” data set and open it on their computer with Excel so we can start manipulating it.
At this point, my goal is to introduce students to some basic cleaning & filtering techniques (especially since these are explicitly called out in the AP CS Framework). I tell students to delete all data from before 1970 (which requires deleting columns), then delete any countries that don’t have consistent data from 1970 onwards (which requires deleting rows). Then I show how to add a filter to the Countries column and I tell students to choose 2 countries to compare the populations of. I tell them to have a reason for picking their two countries to see if you can detect a relationship between them. Last year I made a video walking through most of these steps, but I didn’t use it this year because I just had students use the Recommended Charts instead of manually selecting the data.
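For anyone who would rather script these steps than click through them in Excel, here is a minimal Python sketch of the same cleaning and filtering. It is not what students do in class, and it rests on assumptions: a layout with one row per country, a `country` column, and one column per year. The sample table, the country names, the numbers, and the function names `clean_population` and `pick_countries` are all made up for illustration; the real Gapminder download may be laid out differently.

```python
import csv
import io

# Made-up table in the ASSUMED layout: one row per country, one column per year.
# These are placeholder numbers, not real population figures.
SAMPLE_CSV = """country,1960,1970,1980
Country A,10,12,15
Country B,20,,30
Country C,5,6,7
"""


def clean_population(rows, first_year=1970):
    """Drop year columns before first_year, then drop countries with blank cells."""
    cleaned = []
    for row in rows:
        kept = {"country": row["country"]}
        for key, value in row.items():
            # Deleting columns: keep only the years from first_year onwards.
            if key != "country" and int(key) >= first_year:
                kept[key] = value
        # Deleting rows: a blank cell means the data is not consistent.
        if all(value.strip() for key, value in kept.items() if key != "country"):
            cleaned.append(kept)
    return cleaned


def pick_countries(rows, names):
    """Stand-in for the column filter: keep only the chosen countries."""
    return [row for row in rows if row["country"] in names]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_population(rows)
pair = pick_countries(cleaned, {"Country A", "Country C"})
for row in pair:
    print(row)
```

In this sample, the 1960 column disappears, Country B is dropped because its 1970 value is blank, and the two remaining countries are the pair to compare.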
I really encourage students to be intentional with the countries they pick so they can see if they can detect a story with the data. I’ve had students pick the US and Mexico because they want to see if they can detect the results of illegal immigration; I’ve had students pick North Korea and South Korea to see what happened after the countries split; I’ve had students pick neighboring countries in the same geographic area to see if they can discern anything about their populations; etc etc. I tell them to make a chart representing their data and see if their data tells some sort of story. Here are some past results:
I try to have students share in a place where they can see each other’s results publicly so they can get little nuggets of inspiration or motivation from what other people are doing. In the past, I’ve made a Google Slide deck that everyone can edit and given each student their own slide to post their results, but that was tricky to manage this year. I’ve started to use Padlet instead for these types of projects – students can post their own updates and can see everyone else’s posts in real-time, but they can only edit their own posts. I just have to make sure students post their names as the title of their Padlet, otherwise I don’t know who the author of the post is.
Most of the time, the visualization students create isn’t really clear enough to be able to tell any sort of meaningful story, but that’s okay. The goal of all of this isn’t to start seriously analyzing data – it’s more just to get students into Excel, understanding filters, making charts, and seeing if they can write sentences trying to extrapolate meaning from the visualization they’ve created. I also emphasize that every chart needs a title and labeled axes, which is the beginning of a battle I’ll fight the entire unit: getting students to add labels to their data.
I’ve done this lesson for 2 years now and I really think Hans Rosling and Gapminder are two of the best resources to introduce the need to communicate with data and the excitement that can come from finding a worthwhile data story. The Gapminder software is also a really great introduction to the power of data visualization and how it can easily facilitate discussions of patterns and trends, which ties together the whole rest of the unit. Writing all of this out makes this lesson seem really long, but I’m usually able to cover all of this in an hour-long class if I’m really tight with my transitions. Hopefully by the end of the lesson students understand the motivations and goals for the unit and they’ve already dipped their feet into the pool of Big Data by opening up Excel and making a quick visualization, which helps to dispel some of the intimidation that comes with working with Big Data. Especially since the water only gets deeper after today.