tutorials

A collection of useful code chunks and workflows


Intro to Python and the Command Line for an R User

Some things are just faster on the command line. But if you’re not used to it, the command line is a dark and scary place. Thankfully, I had amoeba to help me through:

  1. Copying a file to another folder
  2. Unzipping a *.tar.gz file
  3. Running a Python (2) script
  4. Exploring the data in Python (with len)
  5. Turning all the variables generated by a script into a dataframe
  6. Saving the dataframe to a csv

  1. Copy a file to another folder

Start by accessing the datateam server in the Terminal by typing ssh USER_NAME@datateam.nceas.ucsb.edu, followed by your password. By running things on the server, you keep the load off your personal computer, so anything already running there can continue without slowing down.

Then, navigate to the folder that the file of interest is in. These functions are pretty handy for that:

  • pwd = print working directory (R translation: getwd())
  • cd = change directory (R translation: setwd())
  • ls = list directory contents (R translation: list.files(); note that R’s ls() lists objects in your environment, not files)

Since I started out in home/isteves, I navigated to (and looked at) the folder of the Coopman dataset that I was interested in, like this:

cd /
cd home/visitor/Coopman
ls

Once you’re in, it’s easy to grab the file you want. Just use cp {FROM} {TO}:

cp DATA_PM_FlexPart.tar.gz ~/

  2. Unzipping a *.tar.gz file

Using cd like before, navigate into the directory where the tar file was saved, and run:

tar xzvf DATA_PM_FlexPart.tar.gz

The tar command is short for tar e{x}tract g{z}ip {v}erbose {f}ile {the file}. Some (amoeba) call it “shorthand magic” and indeed it is.

Note: If you’re curious about a command line function, you can check out the details using man (R translation: ? or help()). For example, man tar tells you all the possible options you can use with tar. To quote amoeba: “A man page is like the shop manual for a car which is often overkill for [a beginner’s] line of inquiry.” An easier resource for deciphering command line code is tldr. Just search for the command to find the English translation!

  3. Running a Python (2) script

There are some details of running Python scripts that we skimmed over during our learning session (in particular, installing Python libraries), so the following assumes that you have the Python infrastructure ready to use.

Start by navigating into the folder of the script you want to check out. In my case, I used cd DATA_TEST to get to the folder that I had unzipped.

Then, use ipython to start a Python session. It should display something similar to the following:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

The file I’m interested in co-locates a bunch of sensor files. In other words, it takes data from a lot of different sources and synthesizes them. Specifically, the file contains one function, Read_PMFlexpart, that compiles the accompanying txt files into 30+ variables. I can thus read the function into Python and save the variables it defines like this:

from Read_PMFlexPart import Read_PMFlexpart
LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000 = Read_PMFlexpart('./')

Notice that unlike R, Python can import a single function from a package/library: from {PACKAGE/FILE} import {FUNCTION}. Otherwise, think of it as the R equivalent of source().
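As a quick illustration with the standard library (a toy example, not part of the Coopman workflow), compare importing a whole module with importing a single name from it:

import math
math.sqrt(9)            # 3.0 - functions are prefixed with the module name

from math import sqrt
sqrt(9)                 # 3.0 - the single imported function needs no prefix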

The long second line in the code chunk where we called Read_PMFlexpart is a “multiple return” - the function hands back a whole tuple of values, and Python unpacks them into separate variables all at once. It’s another feature of Python that doesn’t exist in R.
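Here’s a minimal sketch of what a multiple return looks like (a made-up function, not from the Coopman script):

def min_max(values):
    # return two values at once; Python bundles them into a tuple
    return min(values), max(values)

lo, hi = min_max([3, 1, 4, 1, 5])   # lo is 1, hi is 5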

  4. Exploring the data

Now that we have our variables, we can explore them a bit. To look at length, we can use the len() function (R translation: length()). To run len() on all our variables, we can use a list comprehension - a compact kind of for-loop:

vars = [LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000]
[len(var) for var in vars]

Alternatively, you can try the Python equivalent of *apply(), which is map(). (In Python 2, map() returns a list directly; in Python 3 you’d wrap the call in list() to see the results.)

map(len, [LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000])

From either of the code chunks above, we see that all variables have 1619368 observations, except for PZE_M, which has 0. We’ll want to take this into account for the next steps.

  5. Turning all the variables generated by a script into a dataframe

Python doesn’t have a built-in way of handling data frames, so that’s where the Pandas :panda_face: package comes in. If you have it installed, you can load it with import pandas as pd. We use pd as an abbreviation for Pandas, which makes for less typing when we call functions from it. To create a data frame, for example, you can use:

output = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

If we were to translate the syntax to R (where the native equivalent is simply data.frame()), it would be something along the lines of:

output <- pd::DataFrame(x = c(1, 2, 3), y = c(4, 5, 6))

To save a single one of our variables from before to a data frame, you would want to run:

output = pd.DataFrame({'LIN': LIN})

If you use brackets around the second LIN (i.e., [LIN]), you would instead get a one-row data frame, with the entire LIN variable squeezed into a single cell.
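To see the difference, here’s the same pattern with a small made-up list standing in for the real LIN data (pandas is assumed to be loaded as pd, as above):

LIN_demo = [10, 20, 30]

pd.DataFrame({'LIN': LIN_demo})     # three rows, one value per row
pd.DataFrame({'LIN': [LIN_demo]})   # one row, with the whole list in a single cell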

To try to get them all, you could manually type out all the variables using the above syntax. Alternatively, you could come up with a non-manual solution using some Python magic. Here’s how we did it for our system:

var_names = [var for var in dir() if var.upper() == var and not var.startswith('_') and var != "PZE_M"]
var_dict = dict(zip(var_names, [eval(var) for var in var_names]))
output = pd.DataFrame(var_dict)

The first line filters the names in our Python environment, keeping only the ones written entirely in capital letters and dropping anything that starts with an underscore, plus PZE_M (if you also want mixed-case variables like Std_PZE or Sea_Ice, you’d need to relax that first condition). The second line pairs each name with its value - eval() looks each variable up by its name, zip() pairs the names with the values, and dict() turns the pairs into a dictionary. Finally, the third line converts the variable dictionary to a dataframe, which is then saved to output.
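If you’re curious about those tricks, here is the same dict(zip(...)) pattern on a couple of tiny made-up variables (the names a and b are just for illustration):

a = [1, 2, 3]
b = [4, 5, 6]
names = ['a', 'b']
values = [eval(name) for name in names]   # look up each variable by its name
dict(zip(names, values))                  # {'a': [1, 2, 3], 'b': [4, 5, 6]}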

  6. Saving the dataframe to a csv

If you’ve made it this far, then the last step is a one-line breeze: output.to_csv('output.csv') (R translation: write.csv()).

Like in R, the output defaults to having numbered row names (row.names = TRUE). If you don’t want to add them, you can use index = False like so: output.to_csv("output.csv", index = False).

Check the folder you’ve been working in, and you should see it pop up in no time!
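
As a quick sanity check (assuming you’re still in the same IPython session and working folder), you can also read the file back in with Pandas and peek at it:

check = pd.read_csv('output.csv')
check.head()    # first five rows of the saved data
len(check)      # number of rows - should match the 1619368 observations from before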