Welcome to Lede 2017: Algorithms


  • Instructor: Jonathan Soma, js4571@columbia.edu
  • Dates: Mondays and Wednesdays, 7/17-8/30
  • Class: 10am-1pm
  • Location: 7/17-8/4: 601B, 8/7-8/30: Brown
  • Lab: 2pm-5pm
  • Slack channel: #algorithms

Course Overview

By the end of this course you’ll be able to understand and use techniques to analyze large datasets, from wielding servers to using regression and machine learning.


This is a rough outline, and is subject to change.

Week 1: Servers for high-memory or repeating tasks (7/17 + 7/19)

Tired of being bound by your computer’s limitations? In this first week we’ll look at how to set up a server to do tasks that either take too much memory or too much time. Topics covered include server setup, cron jobs, ssh, scp, diffing, and notifications.

We’ll be using Digital Ocean servers - if you use this referral link to sign up you’ll save $10.

We might also use Twilio or Mailgun for notifications.

Week 2: Text Processing, Part 1 (7/24 + 7/26)

In our second week we’ll begin analyzing large amounts of text. We’ll start with OpenRefine’s text-cleaning abilities, then try out Amazon’s Mechanical Turk as an alternative to pdf2txt. If you’d like more control than OpenRefine gives you, we’ll also look at libraries like fuzzywuzzy for Python.

Later in the week we’ll begin using scikit-learn with pandas, starting with an introduction to vectorization.
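
If you’re curious what vectorization means before we get there, here is a minimal sketch in plain Python: it turns each text into a row of word counts over a shared vocabulary. This is only an illustration and not how scikit-learn does it internally; the function name `vectorize` and the sample sentences are made up for this example.

```python
from collections import Counter


def vectorize(texts):
    """Turn each text into a list of word counts over a shared vocabulary.

    Hypothetical helper for illustration only; it splits on whitespace and
    lowercases, with no punctuation handling or stopword removal.
    """
    tokenized = [text.lower().split() for text in texts]
    # The vocabulary is every distinct word across all texts, in sorted order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One number per vocabulary word: how often it appears in this text
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = vectorize(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row lines up with the vocabulary, so the second text’s final 2 means “the” shows up twice. Once text is in this numeric form, it can be compared, clustered, or fed to a model.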

Week 3: Text Processing, Part 2 (7/31 + 8/2)

Continuing our work with sklearn, we’ll explore stemming and lemmatization, then head toward machine learning topics like topic modeling and clustering.

Week 4: Introduction to machine learning (8/7 + 8/9)

A more formal introduction to machine learning. What is machine learning, and what does it have to do with what we’ve been doing for the past few weeks?

Week 5: Regression (8/14 + 8/16)

Week 6: Entity Extraction and Network Analysis (8/21 + 8/23)

Week 7: Float (8/28 + 8/30)