2016-08-25

 

Getting old at 10

A few months ago, I started using Python to do any moderately serious data-crunching. A couple things drove the decision, but a big one was that the Google Sheets I was building, while effective, were bogging down. They would take forever to open, and sometimes would crash Chrome or just freeze up; small changes might do the same.

Python, especially using agate, fit the bill nicely. It was *fast*, for one thing, chewing through the datasets I was using — tens of thousands of records, in some cases — quickly. It’s also great because a Python script is reproducible: I can see exactly what I did, and changes last; I don’t have to re-do my work when I reimport data, or expand the dataset I’m using (e.g., from the Dow 30 to the S&P 500, or from five quarters of data to 13).

I’m still exporting my results to CSV and then importing into Google Sheets when I want to make a chart. I still find it easier to make certain kinds of pivot tables in Sheets as well. But the former will change, I suspect, once I get into leather, and the latter are probably easily doable in agate. I just need to bite the bullet and do it.

Anyway, I was pretty smug about the whole thing. Then a coworker came to me yesterday and asked for help. He often works with a particular dataset in Access. He’s familiar with the data and the software, and it works. But now the dataset has grown to 11.7 million records — 2.2 gigabytes of data — and Access chokes. He’s heard I use Python to crunch data. Maybe I can help.

Of course I can! So while the data are downloading, I fire up Pyzo and dash off a 23-line script — including plenty of blank lines for readability and a gratuitous four lines to calculate and display the progress of a for loop, so call it a scant dozen.

The script is great (see below). The only problem: 2.2GB of data is pushing it for my ancient 10 year old computer with a measly 8 GB of RAM. Even after I shut down every background app and utility I could find, I still ran out of memory and froze Python.

Luckily, my coworker got the hang of things, installed Pyzo and agate on his own (newer) machine, and got the script working.

Here it is, with a few changes to mask the actual data in question.

import agate

colNames = [//list of column names//]
colTypes = [//matching list of column types//]

table = agate.Table.from_csv('file.txt', column_names=colNames, column_types=colTypes, delimiter = '|')

lenAll = len(table.rows)    ## denominator for progress calculation

filterList = [//list of filter criteria//]
filteredTableList = []

progress = 0

for item in filterList:
    filteredTable = table.where(lambda row: row[fieldname] == item)
    filteredTableList.append(filteredTable)
    progress += len(filteredTable.rows)
    pctComplete = (progress/lenAll)*100
    print("Progress: " + str("{:.2f}".format(pctComplete)) + "%") # ugh

combo = agate.Table.merge(cmteTableList)

 

So I think I can declare victory: It worked, even if it didn’t work on my machine…


Previous post
Welcome to Hard of Thinking This is where I’m keeping track of my probably futile effort to abandon proprietary data formats and live my (digital) life in plain text. There’s
Next post
Lobbyists as Directors Test Rules for Corporate Boards October 04, 2016 03:23 PM Directors at some companies are paid to lobby for those firms or