Session 1: Introducing Python

You can find the intro slides here and a copy of the original Jupyter Notebook file here.

Here’s a video recording of the session:

1. Your notebook

We are going to have a tour of some of the things Python can do, from the simple to the complex. We will end with some data examples having to do with COVID-19.

Before we get into programming, let’s take a look at the computing environment we are using. Again, this is a Jupyter notebook running on Google’s Colaboratory. The name “Jupyter” is a mix of Julia, Python and R, the three languages the notebook originally supported.

Programming with the notebook is often referred to as literate computing — by that we mean that you code a little, have a look, write a little, come up with more ideas, code a little more, write a little more and so on. To support this, there are two kinds of “cells” that one can either write or program in — the former appears in so-called “text cells” and the latter in “code cells”. Near the top of this window, you see two buttons “+ Code” and “+ Text” to add new cells to your notebook.

The paragraphs you’ve been reading are in a text cell. The arithmetic expression below is in a code cell. You perform the computation by clicking in the cell, and then holding down the “shift” key while hitting the “enter” key. (The Google implementation of the notebook also has a little arrow on the left of the cell that you can click to run the code.)

Go ahead — the sum of 3 and 5 equals…

3+5

Your first expression in Python!

Oh and we’re back in a text cell. This paragraph, along with all the other commentary in this notebook, is written in “Markdown,” a kind of pre-language for creating HTML. If you double-click on this cell, your screen will split into two — the raw Markdown appears on the left and the page we were just looking at has moved to the right.

By comparing left and right, you can see that in Markdown we turn text bold by surrounding it with double asterisks, italicize it with single asterisks, and we turn text into a heading by typing a row of hyphens below it. And a list is made by starting consecutive lines with numbers:

  1. Attend a Python Tutorial
  2. Find a project involving computation
  3. Publish a story using my new skills

The idea is that instead of writing an HTML document and dealing with all the tagging syntax, you can use a set of almost graphical conventions that look good in the plain text on the left, and can be translated into HTML for viewing in the browser on the right. Now, click in this window, hold down the “shift” key and hit the “enter” key to render the Markdown as HTML and return the text to a single window.

After we’re done here today, take a moment and go through the Markdown Tutorial — it will be time well spent. You can find the full Markdown description here. If you are using the Colaboratory, there is a toolbar at the top of a text cell that inserts the asterisks and such to make the Markdown for you.

2. Some Python basics

So Python is capable of simple arithmetic.

12*(5+11)
max(3,5)+12
(3-7)/101

By default, the notebook will print out the result of the last computation you perform.

Instead of just seeing an answer, you can also assign a value to a “variable” — that is, we take the result of some expression or computation on the righthand side of the “equals” sign and let the name on the lefthand side refer to it. Here, p is associated with the sum of 5 and 30 and wherever we refer to p, that value of 35 is substituted. (The hashtag or pound sign or number sign is how you insert “comments” into code cells — Python ignores everything after a #).

# associate "p" with the sum of 5 and 30
p = 5+30

# what do you get when you add 12?
12+p

It seems like a lot of trouble to go through for computations you can perform on a calculator.

So let’s dig in a bit more. Working with Python is about creating and evolving “software objects”. For example, the number 35 is an object that, like objects in the real world, has things you can do with it (add it to or multiply it by another number, say) and various properties (for example, 35 is smaller than 38). Python’s creators designed a series of powerful objects that will help us do a lot of work, and, importantly, they left open a backdoor so you can make new kinds of objects. Why might we do that?

Community members have created objects to work with images and sound, to manipulate tabular data and not just single values like 35, to make requests for data across the web, or to suck the data out of PDF files. All of this becomes second nature as you work with the language. But for now, the important thing is that Python is an object-oriented language, meaning that software objects are used to organize data and computations.

You can get the type or “class” of any object by asking with the “function” type(). A function is a series of Python commands that are executed based on some input you provide. Remember SUM() and MAX() in our spreadsheet example? These are functions too, just Excel functions.

type() takes an object as input and then returns a short description of the kind of object it is. If there’s an object type that you don’t understand, there is plenty of online documentation to help you. The docs.python.org site has a nice introduction to the simple data types that come “built-in” with Python.

Here we execute type() for the number 35.

type(35)

In the output, int stands for “integer” which we (hopefully) remember from school as numbers like 1,2,3 and -10,-11,-12. Integers are “built-in” data types, classes that were part of Guido’s original invention. There are a few of these basic types — let’s look at some.

5.0/30.0 + 2.3
type(5.0/30.0 + 2.3)

Wait, “float”? What’s that? Hmm. Lucky thing Python even knows about more elaborate objects like YouTube videos.

from IPython.display import YouTubeVideo
YouTubeVideo('PZRI1IfStY0')

But we’re getting ahead of ourselves. The type “float” represents a “floating point number” which is a computer representation of numbers that have a decimal point.
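A minimal sketch of how integers and floats behave differently in practice. The variable names here are ours and are not part of the original notebook. Dividing with / always gives a float, while // keeps whole numbers whole, and because floats are stored in binary some decimal values are only approximated.

```python
# Division with "/" always returns a float, even when both inputs are integers
half = 7 / 2
print(half, type(half))

# Floor division with "//" keeps integers as integers
whole = 7 // 2
print(whole, type(whole))

# Floats are binary approximations, so some sums are very slightly off
total = 0.1 + 0.2
print(total == 0.3)

# Rounding before comparing is one common way around that
print(round(total, 2) == 0.3)
```

This is worth keeping in mind when comparing computed values, such as percentages, for exact equality.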

Numbers are certainly basic, but think about all the other sources of information we come across every day…

%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">LIGHT AT THE END OF THE TUNNEL!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1247130744260083712?ref_src=twsrc%5Etfw">April 6, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Chief on this list is sequences of characters or “strings”. These might represent people’s names or addresses, for example, or the text of a tweet. We create a string in Python by surrounding a series of characters with quotations. Here we assign the text of the President’s tweet to a variable named tweet. Then, we check out its type().

tweet = "LIGHT AT THE END OF THE TUNNEL!"
tweet
type(tweet)
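Because data from the web often arrives as text, it may help to see that a string of digits and a number are different types. The sketch below uses names of our own choosing (as_text, as_number) and the built-in functions int() and str() to convert between the two.

```python
# The characters "35" and the number 35 are different kinds of objects
as_text = "35"
as_number = 35
print(type(as_text), type(as_number))

# int() builds a number from a string of digits, so arithmetic works
print(int(as_text) + 5)

# str() goes the other way and builds a string from a number
back_to_text = str(as_number)
print(back_to_text == as_text)
```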

As we said, objects help us organize computations. If you add two numbers, you get their sum. Add two strings and you get…

tweet1 = "We are learning much about the Invisible Enemy."
tweet2 = " It is tough and smart, but we are tougher and smarter!"

tweet1+tweet2

… a concatenation! What about multiplication?

"Tweet "*10

There’s one more built-in data type we want to cover. It has only two values — True and False.

type(True)
type(False)

The data type above is called Boolean. You can create a Boolean value by typing the special sequences of characters True or False (without quotations because they are not strings).

But you will primarily encounter Booleans as the output of some logical expression. Here are some examples of expressions that return Boolean (True/False) data. Try the expressions below — each asks whether a relationship holds or not, is true or false.

Riddle me this: Is 3 bigger than 5?

3 > 5

Riddle me this: Are there 10 e’s in the President’s latest tweet?

tweet = "Vote today, Tuesday, for highly respected Republican, Justice Daniel Kelly. Tough on Crime, loves your Military, Vets, Farmers, & will save your 2nd Amendment. A BIG VOTE!"
tweet.count("e") == 10

Riddle me this: Is the letter ‘u’ in ‘Donald Trump’?

"u" in "Donald Trump"

Riddle me this: Is 10 larger than 5 and are there less than 20 e’s in the President’s latest tweet?

tweet = "Vote today, Tuesday, for highly respected Republican, Justice Daniel Kelly. Tough on Crime, loves your Military, Vets, Farmers, & will save your 2nd Amendment. A BIG VOTE!"

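10 > 5 and tweet.count("e") < 20

The symbols >, < and == compare two values, in tests membership, and the words and, or and not are so-called “logical operators” that combine or flip Booleans. All of them are used to form expressions that return True or False.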

The Boolean type will be important when we start to write code that “branches” its behavior depending on whether some condition is true or false — we might want to take one action if something is true, but another action if that thing is false. For example, we might want to analyze only the tweets coming from the President’s mobile device and would use a Boolean to separate out those cases. Or we might just want tweets that have been retweeted a large number of times.
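As a rough preview of branching, here is a sketch that uses Booleans to choose between two actions. The if/else syntax has not been introduced yet in this session, the device and retweet values are taken from the tweet used in this notebook, and the 50,000 threshold is an arbitrary choice for illustration.

```python
# Values for one tweet, matching the fields used in this notebook
device = "Twitter for iPhone"
retweets = 68452

# A Boolean built from two membership tests joined with "or"
from_mobile = "iPhone" in device or "Android" in device

# The Boolean decides which branch runs
if from_mobile:
    source_note = "posted from a mobile device"
else:
    source_note = "posted from somewhere else"

# The threshold of 50,000 is an assumption made for this example
if not retweets > 50000:
    reach_note = "not widely retweeted"
else:
    reach_note = "widely retweeted"

print(source_note, "and", reach_note)
```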

3. Methods

To access the data (“attributes”) and computations (“methods”) that belong to a particular object, we use so-called “dot” or “.” notation. The methods provided by Python for strings, say, were chosen because the operations have proven useful in working with data or in completing general programming tasks. In short, they are used often, so we want to make sure they are easy to execute on the object.

Here we use the methods lower() and upper() to, well, change the case of the string to all uppercase or all lowercase.

%%HTML
<blockquote class="twitter-tweet"><p lang="in" dir="ltr">USA STRONG!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1247130573010845708?ref_src=twsrc%5Etfw">April 6, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

tweet1 = "USA STRONG"
tweet1.lower()

tweet2 = "We are learning much about the Invisible Enemy."
tweet2.upper()

Why would we ever use this (aside from needing to yell in a tweet)?

In addition to case changes, we can count the number of times certain patterns occur in a string or find where the pattern starts. Here we count the number of “i”’s.

tweet2.count("i")
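A short sketch of the "find where the pattern starts" idea, using the string method find(). It also suggests one answer to why changing case is useful: count() is case-sensitive, so lowercasing first lets us count “i” and “I” together.

```python
tweet2 = "We are learning much about the Invisible Enemy."

# find() returns the index where the pattern first starts (counting from 0)
start = tweet2.find("Invisible")

# find() returns -1 when the pattern does not occur at all
missing = tweet2.find("virus")

# count() is case-sensitive; lowercasing first counts "i" and "I" together
lower_only = tweet2.count("i")
any_case = tweet2.lower().count("i")

print(start, missing, lower_only, any_case)
```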

And here we take our original string and replace all “t”’s with “g”’s. Again, why might this come in handy?

tweet2.replace("t","g")

We can also save the result of the computation in another variable for use later.

statement = "White House news conference at 6:45 P.M. Eastern. Thank you!"
yell = statement.upper()

yell

Notice that when we are taking action like translating something to uppercase or counting the number of “i”’s in the string, we end the method with parentheses. The same is true when we ask for an object’s type(). Think back to your algebra when you were introduced to functions — maybe y = f(x) on a graphing calculator, or our SUM() example from a spreadsheet.

It’s the same concept here. Ah but sometimes functions require “arguments” in the parentheses to specify what we want done (like when we replaced the “t”’s with “g”’s) and sometimes they do not (like when we turned the string to upper or lowercase).

Finally, methods can be (and likely will be) unique to the kind of object we are dealing with. The command below will toss up an error because it’s not clear how one turns a number into uppercase.

p = 40
p.upper()
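The error raised above is an AttributeError. As a sketch only (try/except is not covered in this session), the code below catches that error so the notebook keeps running, and uses the built-in hasattr() to check ahead of time whether an object offers a given method.

```python
p = 40

try:
    p.upper()
except AttributeError as err:
    # Integers have no upper() method, so Python raises an AttributeError
    message = str(err)
    print("That did not work:", message)

# hasattr() reports whether an object has a method or attribute of that name
print(hasattr(p, "upper"), hasattr("forty", "upper"))
```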

Python has a simple help facility to let you see what kinds of things you can do to an object and what kinds of data it has. help() is another function, by the way. (This means we’ve seen two kinds of functions — help() and type() are so-called “globals” that can be applied widely, whereas upper() and count() are associated with specific object types and are called with the dot notation.)

Here we ask for help about a string that we have stored in a variable called statement.

statement = "White House news conference at 6:45 P.M. Eastern. Thank you!"
help(type(statement))

And here you see all the things you can do to a float. Like, say, turn it into the ratio of two integers…

x = 1.5
help(type(x))
x.as_integer_ratio()

In addition to help(), there are plenty of online sources to help you on your journey learning the language, from cheatsheets to online tutorials. You will quickly find that the web is a great place to find examples of code to do what you need to do. So, suppose you can’t remember how to concatenate two strings…

<img src="https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/images/cc.jpeg" width="600" style="border:1px solid black">
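For reference, here is a small sketch of three common ways to concatenate strings in Python 3. The variable names are ours, and which form reads best is largely a matter of taste.

```python
first = "Donald"
last = "Trump"

# 1. The + operator, as used earlier in this notebook
with_plus = first + " " + last

# 2. The join() method of a string, which glues together a list of strings
with_join = " ".join([first, last])

# 3. An f-string, which substitutes variables written inside curly braces
with_fstring = f"{first} {last}"

print(with_plus, with_join, with_fstring)
```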

A word of caution: Make sure that when you find an answer on the web, it refers to Python version 3. There are two popular versions of the language in use, 2 and 3. This is a good example of eventually needing to move on from whatever technological environment you are used to. Apps have upgrades, your phone’s operating system asks to be upgraded, and programming languages evolve. I mean everything can get better, right? We are learning Python Version 3.
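If you are unsure which version is running, a quick check like the sketch below can help. It uses the standard sys module, and the comment points out one tell-tale sign of Python 2 code that you may run into online.

```python
import sys

# sys.version_info holds the version of the Python that is running this code
major = sys.version_info.major
print("Running Python", major)

# A quick tell when reading code found online: in Python 3, print is a
# function and needs parentheses. The Python 2 form, print followed by a
# bare string with no parentheses, is a SyntaxError in Python 3.
print("hello")
```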

4. A little more advanced

The development community around Python has been hard at work adding useful objects to Python. We share data and code through “packages”. You can find useful packages by searching the Python Package Index, PyPI — here are packages having to do with web scraping. The NICAR mailing list and the News Nerdery Slack channel are also good sources of recommendations.

How about a few examples? There are many amazing tools that build structure on strings that represent language. One simple package is TextBlob.

# Preparing for TextBlob -- it depends on another package, but ignore
# this for the moment. The TextBlob code asks you to enter this code
# when you first work with it. No magic here.

from nltk import download
download("brown")
download('punkt')

In the next code cell, we import a function from the textblob package. We use this mechanism to get access to the data and functions in the packages contributed by the Python development community. So we import a function, and that function takes a string and returns a new kind of object…

# import a new function
from textblob import TextBlob

# create a string from one of the President's recent tweets
tweet = "For humanitarian reasons, the passengers from the two CoronaVirus stricken cruise ships have been given medical treatment and, when appropriate, allowed to disembark, under strict supervision. Very carefully done. People were dying & no other countries would allow them to dock!"

# use that string as an input and create a new object...
blob = TextBlob(tweet)

type(blob)

This new object contains lots of data derived from the President’s text. For example…

blob.noun_phrases

This object also has methods to compute with the President’s text. Here we translate his sentences into Spanish.

blob.translate(to="es")

Another set of computations data journalists grapple with regularly has to do with dates and times. In the pandas package there is a handy function called to_datetime(). It takes as an input a string and then outputs an object representing, well, a time and date. We will see Pandas again shortly because it is primarily known for its Data Frame object, Python’s answer to a spreadsheet. But for the moment, let’s use to_datetime(). While you can specify the format your dates are in (is it month, day, year or year-month-day or something else), the function will also make an educated guess for you.

from pandas import to_datetime

day1 = to_datetime("3/10/20 9:00 PM")
day2 = to_datetime("4/1/20 3:00 AM")

day2

And then we can do things like date arithmetic.

day2-day1
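Guessing can go wrong for dates like "4/1/20", which could be read as April 1 or January 4. Here is a sketch of spelling out the layout with the format argument of to_datetime(); the name layout is ours, and the codes follow the usual strftime conventions.

```python
from pandas import to_datetime

# Spell out the layout instead of letting pandas guess:
# %m month, %d day, %y two-digit year, %I hour (12-hour clock), %M minutes, %p AM/PM
layout = "%m/%d/%y %I:%M %p"

day1 = to_datetime("3/10/20 9:00 PM", format=layout)
day2 = to_datetime("4/1/20 3:00 AM", format=layout)

# Subtracting two timestamps gives a Timedelta object
gap = day2 - day1

# The Timedelta can report whole days or the total number of seconds
print(gap, gap.days, gap.total_seconds() / 3600)
```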

Our point with these two objects is that the Python development community has built tools to extend the language from the built-in data types to dates and spatial data and language and YouTube videos. The language grows as the world of data grows. And with it, your reporting abilities!

5. Data structures

We rarely base our reporting on just a single value like a number or a string. Instead, we usually collect multiple pieces of information on a single entity, such as a person, a company or a county. Python has a couple of built-in data structures that help you associate a collection of single values.

Let’s start with a tweet again. What kinds of separate pieces of information do we have about a tweet?

%%HTML
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">LIGHT AT THE END OF THE TUNNEL!</p>&mdash; Donald J. Trump (@realDonaldTrump) <a href="https://twitter.com/realDonaldTrump/status/1247130744260083712?ref_src=twsrc%5Etfw">April 6, 2020</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Clicking through and looking at the tweet in Twitter’s web interface, we can identify some basic facts about the tweet — things like the number of favorites and retweets, and even what kind of device was used to post the tweet.

favorites = 371357
retweets = 68452
time = "07:55 AM"
day = 6
month = "Apr"
year = 2020
device = "Twitter for iPhone"
ID = 1247130744260083712
URL = "https://twitter.com/realDonaldTrump/status/1247130744260083712"
handle = "realDonaldTrump"
text = "LIGHT AT THE END OF THE TUNNEL!"

To associate these values with a single tweet, we will use a built-in container object, a dictionary.

Think back for a moment to how you used a literal dictionary (Websters?). Finding a definition meant specifying the word we were after. This is the idea behind a dictionary in Python — store data (“values”) according to a name (a word or some kind of “key”). The result is a collection of key-value pairs.

For example, at 5:18 AM on April 7, 2020, this tweet had 371,357 favorites.

activity = {"date":"05:18 AM, April 7, 2020", "favorites":371357}

# have a look at what we built
activity

The curly braces (not parentheses!) mean we are creating a dictionary, a set of key-value pairs. The names we give to the data (the word, if you will, we associate with the dictionary entry) are date and favorites (on the left of the colons) and the values, the data, are on the right.

If we want to look up or access data, we provide a name in square brackets.

activity["favorites"]

As a dictionary, activity also has methods. One handy method simply lists off all its keys, telling you what data are available. The method is called keys().

activity.keys()
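Two other dictionary tools tend to come up quickly: testing whether a key exists, and looking up a key that might be missing. The sketch below uses a small dictionary of our own (info), built from values listed earlier in this section.

```python
# A small dictionary built from values listed earlier in this section
info = {"handle": "realDonaldTrump", "device": "Twitter for iPhone"}

# "in" asks whether a key is present
has_device = "device" in info
has_place = "place" in info

# info["place"] would raise a KeyError because that key is missing;
# get() returns a fallback value instead
place = info.get("place", "unknown")

# values() lists the stored data, just as keys() lists the names
print(has_device, has_place, place, list(info.values()))
```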

Now, make a new version of activity but add the number of retweets under the key “retweets”.

# your code here


If you run across a tweet in the wild you will find a lot of data, data that is not obvious here - Twitter bundles a fair amount of “metadata” with each tweet, only some of which is shown in the display above. We can use the Application Programming Interface or API to pull the full tweet as a dictionary. (There are so many Python packages for accessing Twitter, my favorite being Twarc.) We can explore a dictionary visually by printing it out (too easy), or we can also ask for the kinds of data it contains with .keys().

tweet = {"created_at": "Mon Apr 06 11:55:14 +0000 2020", "id": 1247130744260083712, "id_str": "1247130744260083712", "full_text": "LIGHT AT THE END OF THE TUNNEL!", "truncated": False, "display_text_range": [0, 31], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": []}, "source": "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>", "in_reply_to_status_id": None, "in_reply_to_status_id_str": None, "in_reply_to_user_id": None, "in_reply_to_user_id_str": None, "in_reply_to_screen_name": None, "user": {"id": 25073877, "id_str": "25073877", "name": "Donald J. Trump", "screen_name": "realDonaldTrump", "location": "Washington, DC", "description": "45th President of the United States of America\ud83c\uddfa\ud83c\uddf8", "url": "https://t.co/OMxB0x7xC5", "entities": {"url": {"urls": [{"url": "https://t.co/OMxB0x7xC5", "expanded_url": "http://www.Instagram.com/realDonaldTrump", "display_url": "Instagram.com/realDonaldTrump", "indices": [0, 23]}]}, "description": {"urls": []}}, "protected": False, "followers_count": 76162614, "friends_count": 47, "listed_count": 118769, "created_at": "Wed Mar 18 13:46:38 +0000 2009", "favourites_count": 6, "utc_offset": None, "time_zone": None, "geo_enabled": True, "verified": True, "statuses_count": 50468, "lang": None, "contributors_enabled": False, "is_translator": False, "is_translation_enabled": True, "profile_background_color": "6D5C18", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": True, "profile_image_url": "http://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/874276197357596672/kUuht00m_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/25073877/1583212785", "profile_image_extensions_alt_text": None, 
"profile_banner_extensions_alt_text": None, "profile_link_color": "1B95E0", "profile_sidebar_border_color": "BDDCAD", "profile_sidebar_fill_color": "C5CEC0", "profile_text_color": "333333", "profile_use_background_image": True, "has_extended_profile": False, "default_profile": False, "default_profile_image": False, "following": True, "follow_request_sent": False, "notifications": False, "translator_type": "regular"}, "geo": None, "coordinates": None, "place": None, "contributors": None, "is_quote_status": False, "retweet_count": 68452, "favorite_count": 371357, "favorited": False, "retweeted": False, "lang": "en"}
tweet
tweet.keys()

Again, this tells you all the keys or words that are used to reference data. For example, under created_at we have the time the tweet was authored (in GMT). We access the information as we did above, providing the key or word to look up.

tweet['created_at']

Things start to get fun when you realize that you can store anything in a dictionary. Under the key user for example, we have another dictionary!

tweet["user"]

To extract data from this inner dictionary, we can just stack up the square brackets. The first set ["user"] pulls the user dictionary from our tweet dictionary, and the second set ["followers_count"] pulls the follower count from this user dictionary.

tweet["user"]["followers_count"]

If that makes your head hurt, we can do it in two steps.

# let u point to the "user" dictionary
u = tweet["user"]

# u is a dictionary so we pull data from u using the key "followers_count"
u["followers_count"]
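To make the stacking of brackets easier to see, here is a sketch with a cut-down, hand-made version of the tweet dictionary. The name small_tweet is ours and most keys are omitted.

```python
# A cut-down, hand-made version of the tweet dictionary (most keys omitted)
small_tweet = {
    "full_text": "LIGHT AT THE END OF THE TUNNEL!",
    "user": {"screen_name": "realDonaldTrump", "followers_count": 76162614},
    "entities": {"hashtags": [], "urls": []},
}

# Each set of brackets goes one level deeper
name = small_tweet["user"]["screen_name"]

# The inner object can be a list as well as a dictionary
n_hashtags = len(small_tweet["entities"]["hashtags"])

# type() confirms what sits at each level
print(name, n_hashtags)
print(type(small_tweet["user"]), type(small_tweet["entities"]["urls"]))
```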

A list is another built-in data structure used to group information. As its name suggests, it is simply an ordered collection of objects. It has a well-defined first entry, a second entry and a last entry. It can hold different kinds of objects in each position. It is constructed using square brackets [ ] (as opposed to the curly braces for a dictionary).

Here we have the counts of COVID-19 tests from the CDC web site, the data that we saw in the spreadsheet at the start of the lesson.

counts = [7,3,10,36,53,101,79,77,65,102,256]

type(counts)

Below we introduce a new “global function” called len(). This function returns the number of elements in a list, or its length. It is a global function because it can be called meaningfully on a lot of objects. For example, it will also tell you the length or number of characters in a string.

len(counts)
tweet = "For humanitarian reasons, the passengers from the two CoronaVirus stricken cruise ships have been given medical treatment and, when appropriate, allowed to disembark, under strict supervision. Very carefully done. People were dying & no other countries would allow them to dock!"

len(tweet)

As a container object (an object that holds or groups other objects), the most obvious set of operations you would like to perform should involve storing and retrieving data from the list. As we said, a list stores objects in a well-defined order. There is a first, a second, a third, and so on. You access these objects using an index. A small catch: Python refers to positions starting at 0 and not at 1. So the first object has index 0, the second has index 1 and so on.

Again, with a dictionary, you access data by name (key) and with a list you use numbers (entry order).

# the first element
counts[0]
# the third element
counts[2]
# the sixth element
counts[5]
# from the sixth to the end
counts[5:]
# from the beginning up to but not including the fourth element
counts[:3]
# from the sixth element up to but not including the ninth
counts[5:8]
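A few related tricks, sketched with the same counts list: negative indexes count back from the end, and built-in functions such as sum() and max() work on the whole list, much like SUM() and MAX() in a spreadsheet.

```python
counts = [7, 3, 10, 36, 53, 101, 79, 77, 65, 102, 256]

# Negative indexes count back from the end of the list
last = counts[-1]
last_three = counts[-3:]

# From index 5 up to but not including index 8 gives 8 - 5 = 3 entries
middle = counts[5:8]

# Built-in functions work on the whole list at once
total = sum(counts)
largest = max(counts)

print(last, last_three, middle, len(middle), total, largest)
```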

The colon notation is called a slice and lets you select a range of entries from the list. We’re now going to start looking at more complex data. We needed a little syntax so this phase would make sense. Let’s move on to a Data Frame!

6. The Pandas Data Frame

So we have seen lists and dictionaries, built-in structures that help us group data that are associated in some way. With dictionaries, we use names or keys to look up data. With lists, we use position to look things up. In many cases we actually need a mixture of both kinds of structures. The most common example is a table.

Think about a spreadsheet. The basic structure involves rows and columns. In many cases the rows refer to different objects in the real world and the columns represent things we measure or record about each object. For example, if instead of one tweet from Donald Trump, we had 100 or 1,000, we would have a series of rows. For each row, the first entry could be the date and time he tweeted, the second could be the text of the tweet, the third could be the tweet’s ID and so on. This happens so often that researchers have created a special Python object to emulate a spreadsheet.

To see it in action, let’s start with the counts of COVID-19 tests from the CDC web site, the data that we saw in the spreadsheet at the start of the lesson.

We will store each day as a dictionary, with one key for the day and another for the overall test count. Yes, that’s a list made up of dictionaries!

tests = [{"day":"1/20", "count":7},
         {"day":"1/21", "count":3},
         {"day":"1/22", "count":10},
         {"day":"1/23", "count":36},
         {"day":"1/24", "count":53},
         {"day":"1/25", "count":101},
         {"day":"1/26", "count":79},
         {"day":"1/27", "count":77},
         {"day":"1/28", "count":65},
         {"day":"1/29", "count":102},
         {"day":"1/30", "count":256}
        ]

type(tests)
# the number of "rows" or entries in the list
len(tests)
# the fourth row
tests[3]
# the counts from the fourth row
tests[3]["count"]

We will, from time to time, make our data sets “by hand” like this, so it’s worth seeing how it might be done. Our data format, the list of dictionaries, is trying really hard to create essentially a table. That is, a grid of data, where each row refers to a time period and then each column refers to either a date or a test count. For our simple data above, that would be a table with 11 rows and 2 columns.

Interacting with even this simple data in this format is a little cumbersome. We can appeal to a higher-level object to create a proper table for us. I’ve assumed you are familiar with Excel or some spreadsheet. These programs are all about tables. In Python, the answer to Excel (or a popular answer) is a so-called Pandas DataFrame. Pandas refers to a package contributed by a Python developer who wanted to make working with tabular data easier.
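To see why the list-of-dictionaries format gets cumbersome, consider the sketch below. It uses only the first three entries of tests and a for loop (not yet covered in this session) to total the counts and find the busiest day.

```python
# The first three entries of the tests list from above
tests = [{"day": "1/20", "count": 7},
         {"day": "1/21", "count": 3},
         {"day": "1/22", "count": 10}]

# To total the counts we have to visit each row and pull out one value
total = 0
for row in tests:
    total = total + row["count"]

# Finding the busiest day takes a second pass over the rows
busiest = tests[0]
for row in tests:
    if row["count"] > busiest["count"]:
        busiest = row

print(total, busiest["day"])
```

With a proper table object, each of these becomes a single method call.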

You can read more about Pandas here

And there are simple tutorials here

Again, Pandas is a package, which means its author has published data, functions and a host of new objects for the community to use. Whereas the built-in objects are basic and get us pretty far, often we need something special to make our lives easier. In the case of Pandas, an object of type DataFrame will help us manipulate (compute with, make graphs of, etc) simple tabular data.

We can use the tests object (the list of dictionaries) and turn it into a DataFrame using the function DataFrame(). (Yeah, that might be confusing: the type of the object is “DataFrame” and the name of the function to turn your data into an object of that type is also called “DataFrame”. We saw it with TextBlob as well. This is a fairly common naming convention, and functions like this are called “constructors.”) As arguments, it takes the data itself (the list of dictionaries).

We import the function DataFrame from the pandas package first. The import command is giving us super powers from the Pandas package to do things not built into the basic Python system. We will see this construction a lot.

from pandas import DataFrame

cdc = DataFrame(tests)
cdc

Notice that the way our data looks has changed. It’s much more like an actual table now with column headings and the like. The DataFrame has lots of wonderful things you can do to it — lots of ways to compute with the data contained in the underlying table.

One simple thing is just to get its size. How many rows and columns? This is an attribute, information, stored with the object that we can again access with “dot” notation. Because we are looking up information and not computing something (like making strings lowercase, say), we don’t need parentheses.

cdc.shape

We might also want to sort the table by count to see how much variation there was across these 11 days. ascending= is an argument that lets us specify ordering the rows from smallest to largest or the other way around.

cdc.sort_values("count",ascending=False)
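The sketch below uses a three-row stand-in for the cdc table (the name small is ours) to show the difference between an attribute like shape and a method like sort_values(), and to confirm that sorting returns a new table and leaves the original order alone.

```python
from pandas import DataFrame

# A three-row stand-in for the cdc table
small = DataFrame([{"day": "1/20", "count": 7},
                   {"day": "1/21", "count": 3},
                   {"day": "1/22", "count": 10}])

# shape is an attribute: a (rows, columns) pair, no parentheses needed
n_rows, n_cols = small.shape

# sort_values() is a method: it returns a new, sorted DataFrame
by_count = small.sort_values("count", ascending=False)

# The original table is left in its original order
print(n_rows, n_cols)
print(list(by_count["day"]), list(small["day"]))

# A single column, selected by name, supports quick summaries
print(small["count"].sum(), small["count"].max())
```

To keep the sorted version around, assign the result to a variable, as done with by_count here.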

Now, we import a function that lets us make easy line plots. With the command line() all we have to do is specify a DataFrame, the columns that will be our x- and y-axes and maybe give the plot a title. The underlying plotly functionality allows for fairly general plots, but its developers have also made the basic plots very easy. You can read about the so-called plotly “express” here.

from plotly.express import line

fig = line(cdc, x="day", y="count", title='CDC Tests')
fig

Notice that line() made an object that we called fig. When the notebook displays the result of the last computation, it shows the plot.

7. Pulling data from the web

Alright, so we worked pretty hard to get a plot of 11 points. That exercise was mostly so that you could begin to read Python expressions and understand what the code is doing. Let’s scale things up. Instead of DataFrame() that takes as input some structured data (like a list of dictionaries), another pandas function read_csv() takes a CSV file (or the URL for one).

CSV stands for “comma separated values”. Each line in a CSV file is a row in a table, with the entries for each column separated by commas. Here’s what our 11 point data set would look like in CSV.

count,day
7,1/20
3,1/21
10,1/22
36,1/23
53,1/24
101,1/25
79,1/26
77,1/27
65,1/28
102,1/29
256,1/30
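To try read_csv() without a file or URL, one option is to wrap CSV text in StringIO from the standard library so that it behaves like an open file. The sketch below uses the first few lines of the CSV above; the names csv_text and small are ours.

```python
from io import StringIO
from pandas import read_csv

# The first few lines of the CSV above, held in a string
csv_text = """count,day
7,1/20
3,1/21
10,1/22
"""

# StringIO makes the string behave like an open file, which read_csv() accepts
small = read_csv(StringIO(csv_text))

# The first line became the column names; each later line became a row
print(list(small.columns), small.shape)
print(small["count"].sum())
```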

Next, we are going to work with Johns Hopkins COVID data. They are updating a data set every day in the form of a CSV. You can read about their collection efforts here. You can preview the data set here.

Let’s load it in. (I have created a slightly altered version of the data just because time is tight.)

from pandas import read_csv

# read in the data
cases = read_csv("https://github.com/computationaljournalism/columbia2020/raw/master/data/jh_cases.csv",parse_dates=True)

# print out the first 25 rows
cases.head(25)
cases.shape

Given a table like this, one of the things we typically want to do is create subsets. We can isolate rows of a Data Frame with a Boolean expression. Below, we create a “list” with 14k elements, one for each row in the cases Data Frame. We will have a True if the Country/Region is the US and a False otherwise.

cases["Country/Region"]=="US"

We can then use this to subset our Data Frame. If we use square brackets (the way all subsetting has been done so far), cases[ ] will take a collection of Booleans and keep only those rows with a True.

cases[cases["Country/Region"]=="US"]
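If the Boolean step feels abstract, here is a small sketch on a made-up table with the same column names as cases. The four rows and their numbers are invented for illustration, and the names toy, is_us, us_rows and us_march_31 are hypothetical.

```python
from pandas import DataFrame

# a made-up table in the same shape as cases (the numbers are invented)
toy = DataFrame([
    {"Country/Region": "US", "Date": "2020-03-30", "Cases": 10},
    {"Country/Region": "Italy", "Date": "2020-03-30", "Cases": 20},
    {"Country/Region": "US", "Date": "2020-03-31", "Cases": 15},
    {"Country/Region": "Italy", "Date": "2020-03-31", "Cases": 25},
])

# one True or False per row
is_us = toy["Country/Region"] == "US"
print(is_us.tolist())

# keep only the rows where the mask is True
us_rows = toy[is_us]
print(us_rows)

# two conditions can be combined with & (note the parentheses)
us_march_31 = toy[is_us & (toy["Date"] == "2020-03-31")]
print(us_march_31)
```

Giving the mask a name like is_us is optional, but it can make a longer subsetting expression easier to read.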

Here we keep just the rows from the 31st of March. That is, the case counts for all the countries on March 31.

cases[cases["Date"]=="2020-03-31"]

So this returned a Data Frame which, as we have already seen, we can sort according to values in a column. Here we sort according to the Cases count.

cases[cases["Date"]=="2020-03-31"].sort_values("Cases",ascending=False)

Now, going back to the US subset, let’s pass the time series data to line() to track the progress of the US.

fig = line(cases[cases["Country/Region"]=="US"],x="Date",y="Cases")
fig

If instead of subsetting we used all the countries and told line() to group things by Country/Region we would get as many lines as there are countries in the dataset.

fig = line(cases,x="Date",y="Cases",color="Country/Region")
fig

The New York Times made their county-level COVID-19 data available to the public here. These are the data that are underneath their graphic tracking the virus. Let’s see if we can recreate their map.

First we read in their data. Again, I made a small alteration, adding the longitude and latitude of each county in their table — we need that information to center the bubbles. If you want to recreate what I did, you will need the .merge() method of a Data Frame. It combines data in different datasets that refer to the same place.
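As a rough sketch of what a merge looks like (this is not the code used to build the file, and both small tables below are made up), suppose one table has case counts and another has county centers. The column names mirror the ones in the data, but the join key "county" is an assumption made for illustration.

```python
from pandas import DataFrame

# made-up case counts (the numbers are invented)
counts = DataFrame([
    {"county": "A", "cases": 5},
    {"county": "B", "cases": 12},
    {"county": "A", "cases": 9},
])

# made-up county centers
centers = DataFrame([
    {"county": "A", "lat": 40.0, "lon": -74.0},
    {"county": "B", "lat": 41.0, "lon": -73.0},
])

# match rows on the shared "county" column;
# how="left" keeps every row of counts, in its original order
merged = counts.merge(centers, on="county", how="left")
print(merged)
```

Every row of counts picks up the lat and lon of its county, which is the information a bubble map needs.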

So let’s read the file and have a look.

from pandas import read_csv

covid_map = read_csv("https://github.com/computationaljournalism/columbia2020/raw/master/data/NYT_COVID.csv",parse_dates=True)
covid_map.head()
covid_map.shape

Each row refers to a county on a given day, recording the number of cases of COVID-19 reported that day. We can see where the largest counts are across both time and space by sorting the table on cases.

covid_map.sort_values("cases",ascending=False).head(25)

Now this was just dumb luck, but plotly.express has a function scatter_geo() that makes map-based bubble charts. I know, I was a little shocked myself.

from plotly.express import scatter_geo

fig = scatter_geo(covid_map,
                      lat = "lat",            # the latitude of the center of a county
                      lon = "lon",            # the longitude of the center of a county
                      hover_name="county",    # the name of a county shows up on "hover"
                      size="cases",           # size the bubble by the number of cases in a county
                      size_max=50,            # I eyeballed how big the biggest bubble should be
                      animation_frame="date", # let the map animate by day
                      scope="usa")            # make a map of the us, not the world or europe
fig

This has been a whirlwind introduction to Python. Hopefully some of the syntax is clearer now, along with the idea of object-oriented programming, the package structure of the language, and how you can bootstrap a project. The use of a programming language like Python means that your process has a fighting chance of being reproducible by others: share a notebook with them and let them check your work!

Python is an excellent first language because its design emphasizes sharing and readability. In my classes, computing is a team sport and Python supports that kind of activity perfectly.

8. And beyond

Let’s have a look at a significant journalistic project to document COVID testing in the US. The COVID Tracking Project is maintained by Alexis Madrigal of The Atlantic and an army of others. They produce data like this

https://covidtracking.com/api/us/daily.csv

This URL returns a CSV. Each row is a day and, among other things, records the number of people who had tested positive and negative for COVID-19 in the US up to that day.

When we download daily.csv we are, in fact, making an API call. API stands for Application Programming Interface and we mentioned Twitter’s API earlier. The idea is that we specify to the Tracking Project’s server the kind of data we want, and they hand back data in machine-readable form (a CSV or a JSON file). The API could return for us just the data from New York state, say

https://covidtracking.com/api/states/daily.csv?state=NY

Or all the data from New York from March 22nd.

https://covidtracking.com/api/states/daily.csv?state=NY&date=20200322
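The part of each URL after the "?" is a list of name=value pairs joined by "&". Here is a minimal sketch of building such a URL in Python with the standard library's urlencode(). The names base, params and url are introduced for illustration, and nothing is downloaded.

```python
from urllib.parse import urlencode

# the endpoint from the examples above
base = "https://covidtracking.com/api/states/daily.csv"

# the details of the request go in a dictionary ...
params = {"state": "NY", "date": "20200322"}

# ... and urlencode() turns them into the part after the "?"
url = base + "?" + urlencode(params)
print(url)
```

The resulting string could then be handed to read_csv(), as long as the service is still answering at that address.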

These data are unique because they are tracking the number of tests being given and their results. But they don’t come easily. The Tracking Project needs to cull information from each of the states. Read about each state and the quality of its data here. Consider two states, California and Connecticut.

California makes its data available on an HTML page and Connecticut embeds its data in a PDF. In both cases, we resort to “scraping” data. This is another task Python can handle gracefully.

# import some packages - "requests" for fetching web pages,
# "BeautifulSoup" for parsing HTML, and "re" for specifying
# patterns in text
from requests import get
from bs4 import BeautifulSoup
from re import compile, sub

# the URL of the California web page
url = 'https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/Immunization/ncov2019.aspx'

# the http request
response = get(url)

# run the HTML through BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# look for the COVID header, the piece of text containing "by the numbers",
# and take the next element. call its text "alert"
# (string= is the current name for the older text= argument)
header = soup.find(string=compile(r"COVID-19 by the Numbers"))
alert = header.next_element.string

# here is the string we want
print(alert)

# parse out the date, cases and deaths
# (note: this reuses the name "cases", replacing the Data Frame from earlier)
date = sub(r"^.*of\s([a-zA-Z]+\s+[0-9]+,\s+[0-9]+),.*$",r"\1",alert)
cases = sub(r"^.*\s+([0-9,]+)\spositive.*$",r"\1",alert)
deaths = sub(r"^.*\s+([0-9,]+)\s+deaths.*$",r"\1",alert)

# we would store these as a row in a CSV say
print("Date:", date)
print("Cases:", cases)
print("Deaths:", deaths)
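One caution about sub(): when a pattern does not match, it hands back the whole string unchanged, so a page redesign can slip by unnoticed. Here is a hedged sketch of an alternative using re.search() and capture groups. The sample sentence is invented to resemble the alert text and is not real data, and the variable names are hypothetical.

```python
import re

# an invented sentence in the style of the alert text (not real data)
sample = "As of April 1, 2020, there are a total of 1,234 positive cases and 56 deaths in the state."

# search() returns None when a pattern is missing, so a redesign is easy to spot
date_match = re.search(r"of\s([a-zA-Z]+\s+[0-9]+,\s+[0-9]+),", sample)
cases_match = re.search(r"\s([0-9,]+)\spositive", sample)
deaths_match = re.search(r"\s([0-9,]+)\s+deaths", sample)

# group(1) is the text captured by the parentheses in each pattern
date_text = date_match.group(1)

# strip the thousands separator before converting to a number
n_cases = int(cases_match.group(1).replace(",", ""))
n_deaths = int(deaths_match.group(1).replace(",", ""))

print(date_text, n_cases, n_deaths)
```

Converting the counts to integers at this stage means they can be summed or plotted later without further cleaning.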

Aside from something needed by TextBlob, the packages we’ve used were all installed by Google already. For Connecticut we need to install a package to handle PDFs. It’s called pdfplumber. Here’s how we install it and how we pull a table from the PDF.

!pip install pdfplumber
# Our imports
from requests import get
import pdfplumber

# the URL for the CT PDF
url = "https://portal.ct.gov/-/media/Coronavirus/CTDPHCOVID19summary4052020.pdf?la=en"

# make the http request
response = get(url)

# this is awkward, but save the CT PDF to a local file. the "with" block
# closes the file when it finishes, so everything is written before we read it
with open("ct.pdf","wb") as ct_file:
    ct_file.write(response.content)

# then open the PDF file in pdfplumber
pdf = pdfplumber.open("ct.pdf")

The pdf object keeps its pages in a list, pdf.pages, which we can subset to get different pages. Here’s the first, from which we extract_text().

page = pdf.pages[0]
page.extract_text()

And here we extract the table from the first page. It will be a list of lists, which we can load into a pandas Data Frame in the same way we loaded a list of dictionaries.

table = page.extract_table()
table
# imports
from pandas import DataFrame

# create the data frame, removing the first four elements or rows of data - they
# are not that useful as headers. we will provide our own column names below.
df = DataFrame(table[4:],columns=["County","Cases","Deaths"])
df

This can then be written to a growing CSV or put into a database.
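Here is one way the “growing CSV” idea might look, as a sketch rather than a finished pipeline. The small table stands in for df, its counts are invented, and the file name ct_counts.csv and the date are placeholders.

```python
from os.path import exists
from pandas import DataFrame, read_csv

# a made-up table standing in for df (the counts are invented)
today = DataFrame([["County A", "10", "1"], ["County B", "20", "2"]],
                  columns=["County", "Cases", "Deaths"])

# record when the rows were collected; the date and file name are placeholders
today["Date"] = "2020-04-05"
out_file = "ct_counts.csv"

# mode="a" appends to the file; write the header line only when the file is new
today.to_csv(out_file, mode="a", index=False, header=not exists(out_file))

# read the file back to check what has accumulated so far
print(read_csv(out_file))
```

Run once a day, this would add that day’s rows to the bottom of the same file.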