dannguyen / fec-indiv2016-simplified.md

Save dannguyen/b5e7639888115cc8c9ad6a1220ee1226 to your computer and use it in GitHub Desktop.

Walkthrough of how to download the massive FEC individual contributions file for 2016, and filter it for spreadsheet usage

Fitting 2016 FEC campaign donor data into your humble spreadsheet

The 2016 U.S. Election is so hot and rich that even with a month to go, the Federal Election Commission has recorded 12.6 million individual campaign donations.

For comparison's sake, the 2014 cycle had just 2.1 million donors. And the last presidential cycle, 2012, had 3.3 million donors.

The bottom line is that we can't import all the data directly into Excel (nevermind Google Sheets). But we can at least import the subset of data that we need (under a million rows) without having to crack out a database.

This tutorial assumes you have csvkit installed and know your way around the command-line. The typing in Windows will probably be different. The parts with curl and unzip can be done with however you managed to download files and unzip them via point and click.

Download the data, compile into a single CSV file

The steps here download the indiv16.zip file from the FEC, unzips it, combines it with headers, and makes a new comma-delimited file named all-indiv2016.csv . Which is too big for Excel, but we'll filter it later.

The FEC has separate URL for the headers and for the data, which is a zip file at:

And, for whatever reason, the headers are comma-delimited but the data file is pipe-delimted.

# Download zip file and save as indiv16.zip curl ftp://ftp.fec.gov/FEC/2016/indiv16.zip \ -o indiv16.zip # Unzip zip file unzip indiv16.zip # This produces itcont.txt # Create a new text file named `all-indiv2016.csv`, starting with # the headers that the FEC provides curl http://www.fec.gov/finance/disclosure/metadata/indiv_header_file.csv \ -o all-indiv2016.csv # Because the `itcont.txt` file is pipe-delimited, we use csvformat -d # to read it with pipes, and spit it out with commas # and then append w/ >> to all-indiv2016.csv csvformat -d '|' itcont.txt >> all-indiv2016.csv

Filter big data file by committee ID

Every candidate has one primary campaign committee, and up to several other committees. Let's keep things simple and focus on Republican candidate Donald Trump.

His primary committee, DONALD J. TRUMP FOR PRESIDENT, INC ., has an ID of C00580100 .

So, we want to filter all-indiv2016.csv by whatever column has committee IDs. Now's a good time to check out the dataset's data dictionary.

The very first column CMTE_ID seems to have what we want:

Filter CSVs by column with csvgrep

The csvgrep utility let's us specify a column name, e.g. CMTE_ID , and then, with the -r flag, specify a regex pattern that we want to match against the value of CMTE_ID

(if you just want a literal match, -m works fine too)

Making trump-indiv2016.csv

Here's the command to read all-indiv2016.csv and filter it for Trump's committee, and then to send the matching rows into a file named trump-indiv2016.csv :

csvgrep -c 'CMTE_ID' \ -r 'C00580100' all-indiv2016.csv \ > trump-indiv2016.csv

A couple of warnings: this will take awhile (a literal match might be faster), and it probably won't capture all of the donors to Trump, if there are any other committees affiliated with him.

Making trump-indiv2016.csv

Here's the command to read all-indiv2016.csv and filter it for Trump's committee, and then to send the matching rows into a file named trump-indiv2016.csv :

csvgrep -c 'CMTE_ID' \ -r 'C00580100' all-indiv2016.csv \ > trump-indiv2016.csv

A couple of warnings: this will take awhile (a literal match might be faster), and it probably won't capture all of the donors to Trump, if there are any other committees affiliated with him.

Using wc, I get about 73,000 records:

wc -l trump-indiv2016.csv

Getting Trump and Clinton together

We could repeat the process for each of the other presidential candidates, which requires looking up each individual committee ID.

Here's how to use a regex to get them both in the same file:

csvgrep -c 'CMTE_ID' \ -r 'C00580100|C00575795' all-indiv2016.csv \ > trumphill-indiv2016.csv

Clinton's donors add more than 860,000 records to the file. Still small enough to open in Excel, but barely.

Hollywood baby

Knowing the data dictionary better will increase the ways you can filter the data to your interests.

It's pretty difficult to generate a list of all the names of possible entertainment-donors. So let's just search by name for celebrities we know have donated, and see what they look like in the data.

First, let's create a new file that contains only donors with a STATE of CA or NY :

csvgrep -c 'STATE' -r 'CA|NY' all-indiv2016.csv > nyca-indiv2016.csv

Those two states have 2.8 million records alone.

In this localized data, let's do a name search for a few people listed in the LAT story, just to see how they describe their occupations. In the example below, I chain a couple of other csvkit commands to trim the columns and make it easier to read:

csvgrep -c 'NAME' \ -r 'DEVITO, DAN|BRUCKHEIMER, JERRY|SPIELBERG, ST' \ nyca-indiv2016.csv \ | csvcut -c 'CMTE_ID,NAME,STATE,EMPLOYER,OCCUPATION,TRANSACTION_AMT' \ | csvlook

In the example below, I'm going to throw in a csvcut so that only a few columns are shown, and a csvlook to make things look nice.

Here's a sample of the results:

CMTE_ID	NAME	STATE	EMPLOYER	OCCUPATION	TRANSACTION_AMT
C00311944	SPIELBERG, STEVEN	CA	FREELANCE	DIRECTOR	1200
C00068353	SPIELBERG, STEVEN	CA	SELF	DIRECTOR	2700
C00068353	SPIELBERG, STEVEN	CA	SELF	DIRECTOR	2700
C00578658	DEVITO, DANNY	CA	N/A	ACTOR	2700
C00540195	DEVITO, DANNY	CA	SELF EMPLOYED	ACTOR	5000
C00492991	SPIELBERG, STEVEN	CA	SELF-EMPLOYED	DIRECTOR	2700
C00608497	BRUCKHEIMER, JERRY	CA	JERRY BRUCKHEIMER FILMS	PRODUCER	5400
C00264101	SPIELBERG, STEVEN	CA	DREAMWORKS SKG ENTERTAINMENT	PRINCIPAL PARTNER	2700
C00258475	SPIELBERG, STEVEN	CA	DREAMWORKS SKG	FILM EXECUTIVE	2700
C00346312	SPIELBERG, STEVEN	CA	SELF EMPLOYED	DIRECTOR/PRODUCER	2700

Looks like a variety of fields. Doing a complete search for entertainment-types will require some manual labor, but for now, here's one pattern for scooping many of them up. This time, we focus on OCCUPATION , and dump the results into fun-indiv2016.csv

In the code below, I pipe into csvcut so I can filter out unnecessary columns -- else, the full layout causes Google Sheets to choke:

csvgrep -c 'OCCUPATION' \ -r 'ACTOR|ACTRESS|DIRECTOR|PRODUCER|FILM' \ nyca-indiv2016.csv \ | csvcut -c 'CMTE_ID,NAME,CITY,STATE,ZIP_CODE,EMPLOYER,OCCUPATION,TRANSACTION_DT,TRANSACTION_AMT' \ > fun-indiv2016.csv

Stars of the stars

Looks like filtering on "DIRECTOR" was too loose of a filter, as we can see a few SPACEX folks in the data:

Let's do a pivot table that includes only "ACTOR/ACTRESS"-types, and then group along the CMTE_ID to get a summation of donations per committee:

So who do actors and actresses give their money to? You'll just have to look it up individually. Or learn a little SQL.

The individual contributions table doesn't have the "label" data that would make it easy to which committee belongs to which candidate. We can't even filter it by party. It's easy enough to just look the FEC data on committee IDs, but it's obviously not scalable for wide-range analysis. Experiencing this difficulty will help you understand why SQL is so powerful.