Here’s the software and services stack we’ll be using.

This is by no means the only way to do open science or data science with open source software, and recommended packages are likely to change over time. The R-based toolchain below should be considered merely one of several consistent implementations of some best practices. However, once participants have mastered this toolchain, they should find it relatively easy to adapt to other ecosystems.

All of the software will be free and open source, but we will also be using some proprietary Software-as-a-Service (SaaS) offerings. For each of the proprietary services, there are open-source and/or self-hosted alternatives, but these are often much less convenient (e.g. self-hosted Jenkins vs Travis CI), or much less popular in the community and therefore less useful (e.g. GitLab vs GitHub). Relying on or promoting proprietary services, especially in an educational context, is always awkward, but the disadvantages can sometimes be outweighed by convenience and network effects. For some aspects of open source software development and open science, proprietary services (especially GitHub and the StackExchange network) are, for better or for worse, the only game in town. In any event, most of what students will learn in this class involves free and open source software, and the remaining proprietary usage should translate easily to other, competing or open services.

Introduction

For the introductory session, participants should watch this video:

Here’s the corresponding slide deck.

Installation

Participants should also install all of the software below and sign up for all of the services below before the first class.

The steps required for installation will depend on your platform and system setup.

Basic Computing Literacy

You should know, or be able to easily find, the answers to questions such as:

  • In what directory (absolute path) are the programs I use daily stored on my computer?
  • What is the OS and version on my computer?
  • In what directory (absolute path) do I store my files?
  • Do I have sufficient privileges to install software? If not, how can I get them?
  • Which file format is better suited to editing: a *.docx or a *.pdf?
  • Why do the search queries jaguar car and jaguar -car give different results on Google?
  • Can I name at least 10 file types?
  • What is the username I usually use on public-facing platforms?
  • What is two-factor authentication (2FA)?
  • How is my hard disk formatted?
  • How can I upgrade my OS and frequently used software?
  • How is the data on my computer protected from unauthorised access?
  • What is my backup plan?
  • What is a VPN client, and what do I need it for?

If you feel like you need to brush up on some basic computing skills, these resources might be helpful:

Software Carpentry

Project Management

Sign up to GitHub.com

GitHub is a collaboration platform, code repository and Git host (more on all of these below), and it comes with some helpful project management tools.

Tasks

Community & Help

Sign up to StackOverflow

Sign up to RStudio Community

Aside from Google, these are two great places to get help, and to get involved in the community.

A lot of volunteers spend a lot of time on these sites, so it is very important not to waste their efforts, and to only add quality content, as defined by these sites.

Tasks

Additional Resources

In addition to StackExchange and RStudio Community, there are a couple of other platforms where the (very friendly) R community hangs out:

Markup Language

GitHub Flavored Markdown Spec

The full GFM spec is just FYI; there’s nothing to install here.

Plain text has many advantages (more on that later), but one glaring disadvantage: it does not look very nice, and does not implement many of the typesetting conventions that have evolved since Gutenberg (say, bold face).

Markup syntaxes solve this problem. They are sets of conventions (as in *something* for highlighting) for structuring human-written text so that computers can operate on it, for example to format a piece of text.

There are many, many such markup languages out there, including HTML, Markdown and LaTeX.

We will focus on Markdown as a source language, and then use open source tools (especially Pandoc) to render our source documents to all sorts of other formats, including PDF (via LaTeX), HTML (such as this website) and Microsoft Word documents.

Markdown is a very lightweight markup language that was designed to be maximally human-readable, that is, to look meaningful without being compiled by a computer. Most of the syntax takes its cues from how people have already formatted plain text, such as enclosing a *word* in * for highlighting.

Technically, Markdown is both a convention for writing such files and a program to convert them into HTML, as is done, for example, for this website (which is written in a flavor of Markdown).

By convention, Markdown files use the .md file extension. It’s important to recognize that an .md file is still a plain text file. You could open it with any text editor, or even change the extension to *.txt, and nothing would change. The extension .md merely tells computers that the plain text inside is marked up in Markdown.
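
The point about file extensions can be checked at the command line. The following is a minimal sketch, assuming a Unix-style shell such as bash; the scratch directory and the file names notes.md and notes.txt are made up for illustration. It writes a small Markdown file, copies it under a .txt name, and checks that the two files are byte-for-byte identical.

```shell
# Work in a throwaway directory so nothing else on your machine is touched.
workdir=$(mktemp -d)

# Write a tiny Markdown file (hypothetical content).
printf '# A heading\n\nSome *highlighted* text.\n' > "$workdir/notes.md"

# Copy it under a different extension; only the name changes.
cp "$workdir/notes.md" "$workdir/notes.txt"

# cmp -s exits with status 0 when both files contain identical bytes.
if cmp -s "$workdir/notes.md" "$workdir/notes.txt"; then
  same_content="yes"
else
  same_content="no"
fi
echo "identical content: $same_content"
```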

Markdown was (originally) quite a minimal standard, and has since branched out into a few specialised “flavors”, offering additional features.

We will be using only two of these flavors: GitHub Flavored Markdown and Pandoc’s Markdown (more on that below).

GitHub, a leading code-hosting service, has extended the original Markdown spec with a couple of additional features. In addition to these formatting niceties, GitHub also implements some clever cross-referencing and autocompletion magic. When using GitHub for source control and collaboration, you really must use these in issues, comments, commit messages etc. (they work everywhere).

Resources

Additional Resources

If you like, you can also install a program on your computer to render Markdown to HTML. There are plenty of choices, including the free MarkdownPad for Windows, and Lightpaper for OS X. If you don’t want to install anything, GitHub (see above) also offers a Markdown preview in its browser-based editor. We will be using different programs going forward.

Source Control Management (SCM)

Install Git

Git itself is just a CLI program. The CLI offers all of Git’s functionality, but you may also install a graphical user interface (GUI) for Git.

There are plenty of those out there, but one of the easiest is the GitHub Desktop app from GitHub (available only for Windows and macOS).

Install GitHub Desktop

You also need to configure git on your machine, and wherever else you are using Git (such as a SaaS):
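
Configuring Git mostly means telling it who you are, so that your commits carry your name and email address. The sketch below sets both in a throwaway repository so that it is safe to run as written; the name and email address are placeholders, not real accounts.

```shell
# Create a scratch repository to experiment in.
scratch=$(mktemp -d)
git init --quiet "$scratch"

# Placeholder identity: replace with your own name and the email you use on GitHub.
git -C "$scratch" config user.name "Jane Doe"
git -C "$scratch" config user.email "jane.doe@example.com"

# Read the values back to check what Git has stored for this repository.
configured_name=$(git -C "$scratch" config user.name)
configured_email=$(git -C "$scratch" config user.email)
echo "$configured_name <$configured_email>"
```

To apply the same identity to every repository on a machine, run the two git config commands with --global instead of -C "$scratch".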

Additional Resources

Command Line Interface

Install Bash Shell

The bash shell is the standard Unix-style command-line interface (CLI), as opposed to a point-and-click graphical user interface (GUI). A lot of the programs that we’ll be using only run at the CLI, so it’s important to know how to use it. It is also often used for scripting (that is, automating) tasks.
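
To give a flavor of what scripting looks like, here is a small sketch that automates a repetitive task: making a backup copy of every Markdown file in a folder. The folder and the draft files are created by the script itself and are purely illustrative.

```shell
# Throwaway directory with three made-up draft files.
workdir=$(mktemp -d)
for n in 1 2 3; do
  printf 'Draft %s\n' "$n" > "$workdir/draft-$n.md"
done

# The automated task: copy every .md file to a .bak file next to it.
backup_count=0
for file in "$workdir"/*.md; do
  cp "$file" "$file.bak"
  backup_count=$((backup_count + 1))
done
echo "backed up $backup_count files"
```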

On macOS and Linux: nothing to install; both ship with bash.

On Windows:

  • Install Git for Windows, because it comes with at least a Git shell. Choose Git Bash emulation during installation.
  • If your version is >= Windows 10 Anniversary Update, you can also install the Windows Subsystem for Linux (WSL) and use the Windows 10 Bash Shell. However, this is a separate system inside your Windows installation, and the programs installed inside it cannot (as of 2019-01) be used by “normal” Windows GUI programs. If you don’t know what this means, do not install the WSL; it can be very confusing.

If you would like a fancier shell, you might want to look at the oh-my-zsh project, which has some pretty cool features. However, this is strictly optional and will not be supported in class.

Additional Resources

Package Management

  • Linux: already ships with apt
  • macOS: install Homebrew
  • Windows: install Chocolatey

Installing and upgrading a lot of command line tools and their dependencies gets old quickly. Package managers solve this problem; they provide a clean and elegant way to install (CLI) programs, and even allow you to quickly upgrade everything.
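
As a rough illustration, the sketch below checks which of the three package managers named in this section is available on your machine. The function name detect_package_manager is made up for this example, and the install commands in the closing comments are typical invocations shown for orientation only; nothing is installed when you run it.

```shell
# Report the first system package manager found on the PATH.
detect_package_manager() {
  for candidate in apt-get brew choco; do
    # command -v succeeds only if the program can be found.
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  echo "none"
}

manager=$(detect_package_manager)
echo "package manager found: $manager"

# Typical install commands (not run here), using pandoc as the example program:
#   apt-get: sudo apt-get install pandoc
#   brew:    brew install pandoc
#   choco:   choco install pandoc
```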

Notice that LaTeX, Atom and R (all below) each have their own internal package managers (as do many other software ecosystems). If you’re installing a package for any of those, use the corresponding ecosystem’s package manager, not your system-wide one (= brew, apt-get, chocolatey).

Text Editor

Install Atom

Whenever we write something in this class, it will be in plain text. Plain text, roughly speaking, consists directly and only of letters, encoded in an open standard.

This may seem antiquated, but has several advantages:

  • Plain text can easily be versioned by computer software such as git.
  • Plain text is transparent to the user: it is human-readable. For comparison, try opening a *.doc in a text editor, and see whether you can make out any meaning.
  • Plain text is lightweight and robust. File sizes and memory footprint are tiny.
  • Plain text files future-proof your work and data. A *.txt or, equivalently for data, a *.csv can be opened and edited on pretty much any computer today, could be 30 years ago, and will probably still be widely accessible in 30 years’ time.
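
Two of these advantages, transparency and small file size, are easy to see for yourself. The following sketch writes a tiny, made-up dataset to a *.csv and then inspects it with tools that ship with any Unix-style shell; the file name scores.csv and its contents are illustrative only.

```shell
workdir=$(mktemp -d)

# A tiny, made-up dataset written as comma-separated plain text.
printf 'id,score\n1,3.5\n2,4.0\n' > "$workdir/scores.csv"

# Any standard tool can read it: print the header row ...
header=$(head -n 1 "$workdir/scores.csv")
echo "header: $header"

# ... and count the bytes the file occupies on disk.
size_bytes=$(wc -c < "$workdir/scores.csv" | tr -d ' ')
echo "size in bytes: $size_bytes"
```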

Most operating systems ship with a text editor, but they are quite basic and can be cumbersome to use. Specialized text editors (or just editors) offer more functionality geared towards technical writing or software development.

There are many editors out there, and people have strong views on which is best. In some ways, this is surprising, because of all the software used in collaborative writing or development, editors are the tool that needs the least standardisation. Playing off the advantages of plain text files, everyone can use what works best for them, because they all output the exact same thing: a *.txt.

You are therefore free to choose your own text editor.

Atom has the advantage of being relatively easy to use, free and open source and relatively widely supported. It also comes with some nice Git(Hub) integration.

Atom, like most editors, has a modular design. Many of its features are factored out to separate packages, some of which are contributed by external volunteers.

Here’s a list of packages you might also want to install:

  • atom-beautify
  • atom-html-preview
  • document-outline
  • git-plus
  • language-knitr
  • language-latex
  • latex
  • language-markdown
  • merge-conflicts
  • minimap-split-diff

R Integrated Development Environment (IDE)

Install RStudio

Aside from text editors, there are also integrated development environments (IDE) (though this distinction has recently been blurring with the arrival of Atom-IDE and others). IDEs are a little like text editors, in that they mostly let you edit plain text files, but they offer a lot of “training wheels” for programming and are often geared towards particular programming languages.

The leading IDE for R is called RStudio, by, confusingly, a company called RStudio. We will be using the open source variant of RStudio (the IDE), but RStudio (the company) also sells commercial licenses to the IDE and other products.

If you are already deeply invested in an IDE or editor (especially vim or emacs), you may also trick out that program to support R. The Emacs Speaks Statistics project has great support for R, but Emacs has a steep learning curve.

For most everyone, RStudio will therefore be the strongly recommended choice.

Branching Model

GitHub Flow

A varied set of practices and tools has evolved on top of Git. Together with the powerful Git SCM, it is these practices and tools that make massively collaborative software development possible.

One of the simpler practices is GitHub Flow. We will use it to learn the branch and pull-request model.
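
The local half of that model can be tried out safely in a scratch repository. The sketch below is one possible walk-through, not an official recipe: the branch name fix-readme-typo, the file contents and the identity are all placeholders, and the final push is only shown as a comment because the scratch repository has no GitHub remote.

```shell
# Scratch repository with a placeholder identity, so this is safe to run.
repo=$(mktemp -d)
git init --quiet "$repo"
cd "$repo"
git config user.name "Jane Doe"
git config user.email "jane.doe@example.com"

# A first commit on the default branch gives the new branch a starting point.
echo "# My project" > README.md
git add README.md
git commit --quiet -m "Add README"

# 1. Create a descriptively named branch for one piece of work.
git checkout --quiet -b fix-readme-typo

# 2. Commit changes on that branch.
echo "A short description." >> README.md
git commit --quiet -am "Describe the project in the README"

current_branch=$(git rev-parse --abbrev-ref HEAD)
commit_count=$(git rev-list --count HEAD)
echo "on branch $current_branch with $commit_count commits"

# 3. With a GitHub remote named origin configured (none exists in this sketch),
#    you would publish the branch and then open a pull request on GitHub:
#    git push -u origin fix-readme-typo
```

After review and discussion in the pull request, the branch is merged back into the default branch.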

Document Conversion

Install Pandoc

We’ll often want to convert documents from and to different markup formats. For that purpose, we’ll use pandoc.

Pandoc is, originally, a kind of Swiss army knife for converting between text document formats, say, between Microsoft Word and HTML.

But as part of this work, Pandoc has also defined its own extension (flavor) of Markdown (largely compatible with GFM), including such features as footnotes, captions, references, and other aspects important for technical and scientific writing.

You should learn both to use Pandoc at the CLI and to write in the corresponding Pandoc’s Markdown style.
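
As a first taste of both, here is a minimal sketch that writes a short document in Pandoc’s Markdown (including a footnote) and converts it to HTML and to a Word document. The file names and the document text are made up for illustration, and the conversion only runs if pandoc is already installed.

```shell
workdir=$(mktemp -d)

# A small document in Pandoc's Markdown, including a footnote.
cat > "$workdir/report.md" <<'EOF'
# A short report

Plain text with *emphasis* and a footnote.[^1]

[^1]: Footnotes are part of Pandoc's Markdown.
EOF

# Convert only if pandoc is installed; otherwise just say so.
if command -v pandoc >/dev/null 2>&1; then
  # -s (--standalone) produces a complete HTML file rather than a fragment;
  # --metadata supplies the page title that a complete HTML file expects.
  pandoc -s --metadata title="A short report" "$workdir/report.md" -o "$workdir/report.html"
  # The same source can be rendered to Word; the output extension selects the format.
  pandoc "$workdir/report.md" -o "$workdir/report.docx"
  echo "wrote report.html and report.docx to $workdir"
else
  echo "pandoc is not installed yet"
fi
```

Writing to a file ending in .pdf works the same way, but additionally requires a LaTeX installation.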

Typesetting

Install LaTeX

(La)TeX is, strictly speaking, a typesetting program that can create beautiful documents. It has extensive support for all sorts of domain-specific typographic niceties, and is used a lot by academics, especially in math and the sciences.

However, because LaTeX is quite cumbersome to compose and tends to distract from writing with a lot of bells and whistles, we will not learn to write LaTeX directly “by hand”. Instead, we will be using Pandoc to compile our Pandoc Markdown source to PDF (via LaTeX), and, because LaTeX can be slow to compile, we will only do so rarely and towards the end of any given project.

Still, it is important to learn some of the basics of LaTeX to use it programmatically.

Bibliography Management

Install Pandoc Citeproc

Install a Bibliography Manager of Your Choice

Bibliography management is not the focus of this class.

It is also one of those tools where there is no strong reason to standardize on any one program, as long as the bibliography manager exports to one of the formats that pandoc can ingest.

Check if your bibliography manager can export to at least one of these formats.

If you have a choice, a BibTeX or BibLaTeX file (confusingly, both use the extension *.bib) is preferable.

Introductory R

“Base” R

Install R

Resources

Additional Resources

Literate Programming

Install knitr

Install RMarkdown

Resources

Intermediate R

Tidyverse R

Install dplyr, tidyr, readr, tibble, purrr and stringr

Resources

  • Documentation for the above packages.
  • List of resources here.
  • Hadley Wickham and Garrett Grolemund’s R for Data Science (read only the chapters that you are interested in).

Plots

Install ggplot2

Resources

Interactive R

HTML, JS & CSS (optional)

The below packages for (web) interactivity in R try to abstract away as much as possible the underlying web technologies (HTML, JavaScript and CSS). You can use them without knowing anything about this stack, but you can accomplish more and understand them in a deeper way if you have at least a cursory understanding of how these technologies work.

Covering them in any depth, or even listing good resources (of which there are gazillions) is beyond the scope of this class, so these should be considered mere starting points.

Interactive Webapps

Install shiny

Resources

Advanced R

Cloud Computing with R

Continuous Integration & Delivery (CI/CD)

Sign up to Travis CI

Resources

The Cloudyr Project

tba.

Containerisation

tba.

Reproducible Research

Defensive Computing

tba.

Storing Datasets

tba.

Publishing results

tba.

Dependency Management

Install packrat

Package development