Tuesday, August 28, 2012

Using R from within Python

I recommend scientists new to programming start by learning the Python programming language. It's a great, beginner-friendly, powerful general purpose language that's capable of everything from numeric computation to creating graphical and web applications.

Many scientists who have to do complex data analysis may already have practical training in R. R lacks some of the capabilities and ease of use of Python, but it has top-notch statistical libraries that, at this point, Python simply can't match.

If you already have a little R experience, I still recommend going with Python, as it's a stronger general purpose language. When you need statistical libraries beyond Python's capabilities, R code can be called from within Python using the RPy module.

Here's an example of calling R code from within Python:

from rpy import r

r('print("Hello World!")')

Evaluating R expressions will return the result, converted to the appropriate Python data type:

>>> r('cos(pi/2)')
6.123233995736766e-17
>>> r('c(1,2,3)')
[1.0, 2.0, 3.0]

Very simple. You can load R packages by calling R's library function through r, and you can write and call R functions from within Python:

>>> from rpy import r
>>> r('hello = function(name) { print(sprintf("Hello, %s!", name)) }')
<Robj object at 0x7fddb8745410>
>>> r.hello("Ben")
[1] "Hello, Ben!"
'Hello, Ben!'

Saturday, January 7, 2012

Top 10 Unix tools (that you might not know about)

You don't have to spend a lot of effort learning shell scripting for some of these tools to come in handy. This is a list of tools that everyone should know. First, a few basics, then a few that you might not have heard of if you haven't spent much time with command-line tools.

1. cd, ls

cd is used to change directories. cd .. returns to the parent directory and cd - returns to the previous directory. Note that "cd.." (no space) is valid on Windows but not on Unix systems, where a space is needed between the command name and the argument, so the correct form is "cd .."

Use ls to list the contents of the current directory. ls -lh will list it in a more readable format.

2. mv, cp

mv old_path new_path moves a file; cp old_path new_path copies a file. Simple as that. mv is also used for renaming files.

3. rm, rmdir

You might know about these - rm deletes files, rmdir deletes empty directories. A simple trick: rmdir is never necessary. There's an easy way to delete a directory, even if it's full of files. rm -rf directory_path deletes a directory and all of its contents, recursively. (Use with caution!)

4. touch

The purpose of touch is to update the "last modified" time of a file to the current time. This is not what I use it for. It doubles as a simple way to create a file if it doesn't exist. For example, touch test.txt - if test.txt doesn't exist, it will be created; if it does exist, its last modified time will be updated.

5. top

top helps monitor system processes and lists the amount of memory and CPU they're taking up, how long they've been running, etc.

6. screen

This is a great tool for running processes on a remote machine and coming back to them later. For example, to run the Python script "remote.py" after logging into a remote server:

screen python remote.py

then press CTRL+a, d to detach the process. You can log out, come back later, and return to the process with

screen -r

If you have multiple processes running under screen, you'll need to use screen -ls and refer to the one to reattach by process id.

7. ps

ps will list the processes running in your current terminal session along with their PIDs (process ids). ps -e will list every process running on the machine, for all users.

8. kill, pkill

kill allows you to kill a process by process id (which you could get from ps or top). pkill allows you to kill a process by name, e.g. pkill python will kill all currently running Python processes.

9. diff

diff file1 file2 is a useful way to quickly display the differences between two text files.

10. grep

grep pattern file_path will return only the lines from a file that contain pattern. grep can also be used with other Unix tools that return lists; for example, ls | grep .pdf will return all files in the current directory whose names contain ".pdf" (strictly, any character followed by "pdf", since . is a regular-expression wildcard); ps axu | grep python will return a list of running Python processes. There's a ton you can do with grep, and I've only just scratched the surface.
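
For the curious, the core idea behind grep can be sketched in a few lines of Python. This is only a rough illustration, not how grep itself is implemented; the grep_lines function and the file names are made up for the example. It also shows why the pattern deserves a little care: it is a regular expression, so an unescaped . matches any character.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical directory listing, as `ls` might print it
files = ["paper.pdf", "notes.txt", "xpdfrc", "thesis.pdf.bak"]

# An unescaped "." matches any character, so "xpdfrc" matches too
print(grep_lines(".pdf", files))

# Escaping the dot and anchoring with "$" keeps only names ending in ".pdf"
print(grep_lines(r"\.pdf$", files))
```

The same escaped pattern works with grep itself, as long as it is quoted so the shell leaves the backslash alone.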

Thursday, January 5, 2012

Resuming partial file downloads (with wget or scp)

File downloads in a web browser have spoiled me - if my connection goes out for a few seconds, I expect it to pick up where it left off. When using the Unix command-line tools wget and scp to download files, however, a break in the connection will kill the download. Very frustrating when a 10GB file stops downloading at 80%. So (mostly for my own reference) here's how to resume partially completed downloads.


When using wget, just add the -c flag to continue based on the size of the partially downloaded file:

wget -c (path-to-file)
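
If you ever need the same behavior inside a Python script, here is a minimal sketch of the idea for HTTP downloads: check how many bytes are already on disk, then ask the server for the rest with a Range header. The function names build_resume_request and resume_download are made up for this example, and it assumes the server supports range requests; it is not a drop-in replacement for wget.

```python
import os
import urllib.request

def build_resume_request(url, local_path):
    """Build a request that asks only for the bytes we don't have yet."""
    # Size of the partial download, or 0 if nothing has been saved so far
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # "bytes=N-" means "everything from byte N to the end of the file"
        request.add_header("Range", "bytes=%d-" % offset)
    return request, offset

def resume_download(url, local_path, chunk_size=64 * 1024):
    """Append the missing bytes to local_path (needs a server that honors Range)."""
    request, offset = build_resume_request(url, local_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server sent only the requested range; a plain 200
        # means it ignored the header and is sending the whole file again
        mode = "ab" if offset and response.status == 206 else "wb"
        with open(local_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling resume_download(url, path) again after a dropped connection picks up from the current file size. If the file is already complete, the server will typically answer with an error status, which this sketch does not handle.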


The scp tool doesn't seem to have this functionality. The rsync tool, however, does, and can be used to resume scp downloads:

rsync --partial --progress --rsh=ssh user@host:file_path local_file_path

Monday, January 2, 2012

Scraping Data with Python

In a perfect world, all the data you needed would be easily accessible online. We're not quite there yet. In the past couple months I've had to write several scrapers to acquire large datasets and avoid a lot of tedious point/clicking or copy/pasting. (I also scraped some NFL player data to help with my fantasy football picks next year - same concept.)

"Scraping" data basically means to retrieve data from the web, stored in a less convenient format like HTML tables, and copy it into a format you can use such as a CSV file or database. It can be somewhat tedious, but it usually beats the alternative of trying to copy data by hand. Python has some excellent tools for scraping which I will cover here.

If you're scraping data from HTML pages, you're going to need some basic knowledge of HTML, and you'll need to check out the structure of the page you're scraping (right click > View Page Source) to figure out how to get to the content you need. Once you have an idea, the following tools will be useful to parse out what you need.


If you're not familiar with jQuery, it's a JavaScript library that provides easy access to elements within an HTML page. PyQuery ports the same concept to Python, allowing you to use jQuery syntax to find specific elements from an HTML string. Here's an example that finds all links with "special" class attribute in a string of HTML:

>>> from pyquery import PyQuery
>>> p = PyQuery("<a href='abc.html'>Link 1</a> <a href='def.html' class='special'>Link 2</a>")
>>> p("a.special")
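
If you can't install PyQuery, the same lookup can be approximated with Python's built-in html.parser module. This is a swapped-in technique rather than anything PyQuery does internally, and the ClassLinkFinder class is a made-up name for this sketch. It collects the href of every link carrying a given class:

```python
from html.parser import HTMLParser

class ClassLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # The class attribute may hold several space-separated names
        classes = (attributes.get("class") or "").split()
        if tag == "a" and self.wanted_class in classes:
            self.hrefs.append(attributes.get("href"))

html = "<a href='abc.html'>Link 1</a> <a href='def.html' class='special'>Link 2</a>"
finder = ClassLinkFinder("special")
finder.feed(html)
print(finder.hrefs)  # ['def.html']
```

It's more typing than the one-line selector, which is a good argument for PyQuery once the selectors get complicated.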

Basic Automated Browsing

Python has several great libraries for automatically browsing web sites. A good one to start with is spynner. Spynner builds onto the mechanize library, which in turn is built on the urllib2 URL-opening library: urllib2 generates HTTP request and response objects, mechanize has a Browser object capable of navigating these requests and responses, and spynner adds some additional features including automatic form-filling and JavaScript support. Instead of working only with HTML, spynner can process jQuery and JavaScript to render pages as your usual browser would. These features and having to render the page do result in a slight performance hit, so if you don't need such an advanced tool, you can drop down to mechanize, which also has a Browser class with a similar interface.

Spynner and PyQuery can be used together. This example loads a page, then uses PyQuery to get a list of all <a> tags within <h3> tags:

>>> from spynner import Browser
>>> from pyquery import PyQuery
>>> browser = Browser()
>>> browser.set_html_parser(PyQuery)
>>> browser.load("https://www.google.com/search?hl=en&q=france")
>>> browser.soup("h3 a")
[<a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>]


In this example, we'll build a simple scraper to get a dictionary of state population sizes from the Wikipedia page http://en.wikipedia.org/wiki/List_of_U.S._states.

The first step is to take a look at the page's HTML. Notice that the data we want is in a table. There are several tables, but only one with class "wikitable" (you can use PyQuery to confirm this - PyQuery(html)('table.wikitable') should return only a single item). The state names are in the first column, and the population sizes in the seventh. Note that the state names are also within an <a> link tag and the population numbers have commas in them.

After our initial research, the scraper we write will look like this:

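from spynner import Browser
from pyquery import PyQuery

browser = Browser()
# use PyQuery to parse the loaded page, as in the earlier example
browser.set_html_parser(PyQuery)
browser.load("http://en.wikipedia.org/wiki/List_of_U.S._states")

# get the table of states
table = browser.soup("table.wikitable")
# skip the first row, which contains only column names
rows = table("tr")[1:]

pop_dict = {}
for row in rows:
    columns = row.findall("td")
    # the state name is inside a link in the first column
    state_name = columns[0].find('a').text
    # strip the thousands separators before converting to an integer
    population = int(columns[6].text.replace(',', ''))
    pop_dict[state_name] = population
print(pop_dict)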

Parting Advice

If you're scraping, you should've already exhausted more reasonable channels such as searching the page for more conveniently-formatted data or contacting the people in charge. If so, by definition, you're acquiring data that was not made available for easy download - and there may have been good reason. It's always smart to make sure that what you're doing is legal and that there are no strings attached with use of the data. (My NFL data can only be used for noncommercial purposes, for example.) Just because you can get it doesn't mean it's yours to freely use.

Since scraping can involve a lot of trial and error, it's usually best to save a page of sample data to your own machine and "practice" on that until you're sure the scraper will work. You don't want to make a lot of needless HTTP requests: this could alert the data owner to what you're doing and cause unnecessary trouble for you, and could even result in your IP address being blocked or the data being taken down. Make sure your scraper behaves reasonably by limiting how often requests are made and how much is downloaded in a given period. This helps you avoid drawing attention and wasting someone else's resources.
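
One simple way to limit how often requests are made is to enforce a minimum pause between them. Here's a minimal sketch; the Throttle class, the 0.2 second interval, and the page names are all made up for illustration, and a sensible interval depends on the site you're scraping.

```python
import time

class Throttle:
    """Make sure successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Pause if the previous request was too recent; return seconds slept."""
        delay = 0.0
        if self.last_call is not None:
            elapsed = self.clock() - self.last_call
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self.sleep(delay)
        self.last_call = self.clock()
        return delay

# Hypothetical usage: at most one request every 0.2 seconds
throttle = Throttle(0.2)
for page in ["page1.html", "page2.html", "page3.html"]:
    throttle.wait()
    # ... fetch and parse `page` here ...
```

Calling throttle.wait() right before each page load keeps the scraper from hammering the server, no matter how fast the parsing code runs. The clock and sleep arguments exist only so the class can be tested without real delays.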

Sunday, January 1, 2012

Programming for Biologists

Last semester I helped with the development of a programming course for biologists at Utah State University. There are beginning and advanced sections, and the course teaches Python and stresses practical knowledge over theory. There are code examples and exercises with biological data. Even if you're not a biologist, it's a useful introduction to scientific programming in Python.

All of the lectures, code examples, and exercises can be found on the course website, http://www.programmingforbiologists.org.

Which programming language should I learn?

If you want to begin programming but don't know where to start, this is the first question you'll need to address. Programming languages often are geared toward specific tasks, so the answer depends on what type of work you'll be doing most. Here's an overview of some of the most common programming languages and what they're best suited for to help you make your decision.

  • Python
Python is an interpreted, dynamically typed programming language. Python programs stress code readability, so even non-programmers should be able to decipher a Python program with relative ease. This also makes the language one of the easiest to learn and write code in quickly.

Python is very popular and has a strong set of libraries for everything from numerical and symbolic computing to data visualization and graphical user interfaces.

  • Ruby
Ruby is very similar to Python, but with different syntax and libraries. There's little reason to learn both, so unless you have a specific reason to choose Ruby (e.g. if this is the language your colleagues all use), I'd go with Python.

Ruby on Rails is one of the most popular web development frameworks out there, so if you're looking to do primarily web development you should compare Django (Python framework) and RoR first.

  • C
C is a low-level, statically typed, compiled language. The main benefit of C is its speed, so it's useful for tasks that are very computationally intensive. Because it's compiled into an executable, it's also easier to distribute C programs than programs written in interpreted languages like Python. The trade-off of increased speed is decreased programmer efficiency.

  • C++
C++ is C with some additional object-oriented features built in. It can be slower than C, but the two are pretty comparable, so it's up to you whether these additional features are worth it.

  • Haskell
This is probably the rarest language on my list; it's also my hands-down favorite. Haskell is a statically typed functional programming language with type inference. Basically, Haskell is closer to mathematics than other languages, and functions in Haskell are more like mathematical functions than step-by-step lists of instructions.

Many programmers find Haskell difficult to learn. Haskell avoids using changeable state, which is very different from programming in other languages. Those who have mastered this language gain invaluable insight into good programming practices that are useful in other languages as well.

  • Java
Java is a very popular language, which itself is a good reason to get acquainted with it. Originally it was primarily used for developing "applets" for websites. Today, it's a good cross-platform language that's very balanced: it's a little easier to learn and code in than C, and a little faster than Python. If you ever plan on developing mobile applications for Android, you'll need to pick up Java.

Java is often confused with JavaScript, a primarily browser-based web development language, but other than the name these languages have very little in common.

  • Visual Basic and C# (.NET)
These languages are mainly useful for Windows development and not commonly used for cross-platform applications. Visual Basic (VB for short) is a common beginner's language developed by Microsoft. C# as a language is very similar to Java; in fact, they're almost identical. C# does have some nice features that Java lacks, such as type inference and LINQ.

A strength that VB and C# share is Microsoft's .NET framework, which provides a common set of useful library functions to both languages. VB programs are generally developed in Microsoft Visual Studio, which includes an easy-to-use window designer for graphical applications.

  • R
R is a programming language mostly used for statistics and data visualization. While the R language by itself is not very interesting, R still has a wealth of great statistical and geospatial libraries that Python and Ruby just can't match.

  • PHP
PHP is a simple web scripting language. It can be used to put up websites with dynamic content quickly; it has little utility other than that. Since other languages can be used to construct websites as well, PHP probably isn't the best choice for a first language.


If you're only planning on learning a single language, I recommend learning Python - it's easy to learn and use. Otherwise, it's a good idea to add several diverse languages to your repertoire based on what type of development you typically do. For example, it's a good idea to supplement Python or Ruby knowledge with a low-level language such as C or C++ for writing fast, compiled applications. Finally, if you have some spare time, go ahead and give Haskell a shot - it'll change the way you approach programming in general.

For exposure to different types of programming languages, I recommend Seven Languages in Seven Weeks: A Pragmatic Guide to Learning Programming Languages (Pragmatic Programmers). You can spend a week (or less) getting exposed to each of seven programming languages of various different types. This book includes Haskell, Ruby, and other interesting languages that you might not hear about otherwise.


Interpreted vs. compiled languages: An interpreted language needs an interpreter to run; a compiled language is compiled into an executable (like a .exe file on Windows.) Compiled programs are typically smaller, faster and easier to distribute.

Static vs. dynamic typing: Dynamic typing means that the types of values that can be assigned to a variable are flexible. Static typing requires variables to be declared as a certain type (such as "integer"), and these types cannot be changed later. Static typing is often found in compiled languages and results in faster execution.

Type inference: Languages with type inference are statically typed, but in most cases will be able to infer the types of variables based on how they're used in the program without you needing to explicitly declare them. This saves effort and makes statically typed languages easier to work with.
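
To make the dynamic typing definition above concrete, here's a small Python sketch (the variable and function names are just for illustration). The same name can hold values of different types, and a function works on anything that supports the operations it uses:

```python
# Dynamic typing: the same name can refer to values of different types
count = 3
print(type(count).__name__)   # int
count = "three"
print(type(count).__name__)   # str

def double(value):
    # No declared parameter type: anything that supports "*" will work
    return value * 2

print(double(21))      # 42
print(double("ab"))    # abab
```

In a statically typed language, reassigning count to a string would be rejected before the program ever ran.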