Save dannguyen/b5e7639888115cc8c9ad6a1220ee1226 to your computer and use it in GitHub Desktop.
Walkthrough of how to download the massive FEC individual contributions file for 2016, and filter it for spreadsheet usage
The 2016 U.S. Election is so hot and rich that even with a month to go, the Federal Election Commission has recorded 12.6 million individual campaign donations.
For comparison's sake, the 2014 cycle had just 2.1 million donors. And the last presidential cycle, 2012, had 3.3 million donors.
The bottom line is that we can't import all the data directly into Excel (nevermind Google Sheets). But we can at least import the subset of data that we need (under a million rows) without having to crack out a database.
This tutorial assumes you have csvkit installed and know your way around the command-line. The typing in Windows will probably be different. The parts with curl and unzip can be done with however you managed to download files and unzip them via point and click.
The steps here download the indiv16.zip file from the FEC, unzips it, combines it with headers, and makes a new comma-delimited file named all-indiv2016.csv . Which is too big for Excel, but we'll filter it later.
The FEC has separate URL for the headers and for the data, which is a zip file at:
And, for whatever reason, the headers are comma-delimited but the data file is pipe-delimted.
# Download zip file and save as indiv16.zip curl ftp://ftp.fec.gov/FEC/2016/indiv16.zip \ -o indiv16.zip # Unzip zip file unzip indiv16.zip # This produces itcont.txt # Create a new text file named `all-indiv2016.csv`, starting with # the headers that the FEC provides curl http://www.fec.gov/finance/disclosure/metadata/indiv_header_file.csv \ -o all-indiv2016.csv # Because the `itcont.txt` file is pipe-delimited, we use csvformat -d # to read it with pipes, and spit it out with commas # and then append w/ >> to all-indiv2016.csv csvformat -d '|' itcont.txt >> all-indiv2016.csv
Every candidate has one primary campaign committee, and up to several other committees. Let's keep things simple and focus on Republican candidate Donald Trump.
His primary committee, DONALD J. TRUMP FOR PRESIDENT, INC ., has an ID of C00580100 .
So, we want to filter all-indiv2016.csv by whatever column has committee IDs. Now's a good time to check out the dataset's data dictionary.
The very first column CMTE_ID seems to have what we want:
The csvgrep utility let's us specify a column name, e.g. CMTE_ID , and then, with the -r flag, specify a regex pattern that we want to match against the value of CMTE_ID
(if you just want a literal match, -m works fine too)
Here's the command to read all-indiv2016.csv and filter it for Trump's committee, and then to send the matching rows into a file named trump-indiv2016.csv :
csvgrep -c 'CMTE_ID' \ -r 'C00580100' all-indiv2016.csv \ > trump-indiv2016.csv
A couple of warnings: this will take awhile (a literal match might be faster), and it probably won't capture all of the donors to Trump, if there are any other committees affiliated with him.
Here's the command to read all-indiv2016.csv and filter it for Trump's committee, and then to send the matching rows into a file named trump-indiv2016.csv :
csvgrep -c 'CMTE_ID' \ -r 'C00580100' all-indiv2016.csv \ > trump-indiv2016.csv
A couple of warnings: this will take awhile (a literal match might be faster), and it probably won't capture all of the donors to Trump, if there are any other committees affiliated with him.
Using wc, I get about 73,000 records:
wc -l trump-indiv2016.csv
We could repeat the process for each of the other presidential candidates, which requires looking up each individual committee ID.
Here's how to use a regex to get them both in the same file:
csvgrep -c 'CMTE_ID' \ -r 'C00580100|C00575795' all-indiv2016.csv \ > trumphill-indiv2016.csv
Clinton's donors add more than 860,000 records to the file. Still small enough to open in Excel, but barely.
Knowing the data dictionary better will increase the ways you can filter the data to your interests.
It's pretty difficult to generate a list of all the names of possible entertainment-donors. So let's just search by name for celebrities we know have donated, and see what they look like in the data.
First, let's create a new file that contains only donors with a STATE of CA or NY :
csvgrep -c 'STATE' -r 'CA|NY' all-indiv2016.csv > nyca-indiv2016.csv
Those two states have 2.8 million records alone.
In this localized data, let's do a name search for a few people listed in the LAT story, just to see how they describe their occupations. In the example below, I chain a couple of other csvkit commands to trim the columns and make it easier to read:
csvgrep -c 'NAME' \ -r 'DEVITO, DAN|BRUCKHEIMER, JERRY|SPIELBERG, ST' \ nyca-indiv2016.csv \ | csvcut -c 'CMTE_ID,NAME,STATE,EMPLOYER,OCCUPATION,TRANSACTION_AMT' \ | csvlook
In the example below, I'm going to throw in a csvcut so that only a few columns are shown, and a csvlook to make things look nice.
Here's a sample of the results:
CMTE_ID | NAME | STATE | EMPLOYER | OCCUPATION | TRANSACTION_AMT |
---|---|---|---|---|---|
C00311944 | SPIELBERG, STEVEN | CA | FREELANCE | DIRECTOR | 1200 |
C00068353 | SPIELBERG, STEVEN | CA | SELF | DIRECTOR | 2700 |
C00068353 | SPIELBERG, STEVEN | CA | SELF | DIRECTOR | 2700 |
C00578658 | DEVITO, DANNY | CA | N/A | ACTOR | 2700 |
C00540195 | DEVITO, DANNY | CA | SELF EMPLOYED | ACTOR | 5000 |
C00492991 | SPIELBERG, STEVEN | CA | SELF-EMPLOYED | DIRECTOR | 2700 |
C00608497 | BRUCKHEIMER, JERRY | CA | JERRY BRUCKHEIMER FILMS | PRODUCER | 5400 |
C00264101 | SPIELBERG, STEVEN | CA | DREAMWORKS SKG ENTERTAINMENT | PRINCIPAL PARTNER | 2700 |
C00258475 | SPIELBERG, STEVEN | CA | DREAMWORKS SKG | FILM EXECUTIVE | 2700 |
C00346312 | SPIELBERG, STEVEN | CA | SELF EMPLOYED | DIRECTOR/PRODUCER | 2700 |
Looks like a variety of fields. Doing a complete search for entertainment-types will require some manual labor, but for now, here's one pattern for scooping many of them up. This time, we focus on OCCUPATION , and dump the results into fun-indiv2016.csv
In the code below, I pipe into csvcut so I can filter out unnecessary columns -- else, the full layout causes Google Sheets to choke:
csvgrep -c 'OCCUPATION' \ -r 'ACTOR|ACTRESS|DIRECTOR|PRODUCER|FILM' \ nyca-indiv2016.csv \ | csvcut -c 'CMTE_ID,NAME,CITY,STATE,ZIP_CODE,EMPLOYER,OCCUPATION,TRANSACTION_DT,TRANSACTION_AMT' \ > fun-indiv2016.csv
Looks like filtering on "DIRECTOR" was too loose of a filter, as we can see a few SPACEX folks in the data:
Let's do a pivot table that includes only "ACTOR/ACTRESS"-types, and then group along the CMTE_ID to get a summation of donations per committee:
So who do actors and actresses give their money to? You'll just have to look it up individually. Or learn a little SQL.
The individual contributions table doesn't have the "label" data that would make it easy to which committee belongs to which candidate. We can't even filter it by party. It's easy enough to just look the FEC data on committee IDs, but it's obviously not scalable for wide-range analysis. Experiencing this difficulty will help you understand why SQL is so powerful.