Free and Open Source Software for Open Science (“FOSSOS”) is an ongoing series of seminars taught by Max Held at Friedrich-Alexander Universität Erlangen-Nürnberg (FAU) at the Department of Sociology, introducing students to R, as well as the broader open science ecosystem and free and open source best practices and tools.

This repository houses the resources for these classes, and all class-related activities are tracked in the above issues.

## Advanced R and Open Social Data Science

… because learning from hackers is learning to win?

#DataScience #rstats Git(Hub) #ReproducibleResearch

Image Credit: Red Alt CC BY 2.0 hjl

[Coding – ] it’s the next best thing we have to a superpower. – Drew Houston via code.org

The bad news is that whenever you learn a new skill, you’re going to suck. – Hadley Wickham

Computers … a bicycle for the mind. – Steven Jobs

Think of free speech, not free beer. – Richard Stallman

Most learning is not the result of instruction. It is rather the result of unhampered participation in a meaningful setting. – Ivan Illich (1971)

## Prerequisites

Everyone is welcome to this seminar. This is not a “proper” computer science class, and participants do not need any background in CS, statistics or math.

You should just be curious and ready to:

• learn to use specialised command-line software and open-source tools for collaboration,
• collaborate intensively using (perhaps unfamiliar) web-based tools.

No worries, we’ll bring everyone up to speed in no time.

You do not need to have completed a prior version of this class, or any other class. If you have some prior training, you will start the class at a different level.

## Time and Place (Summer Term 2019)

Students should attend at least three full days to guarantee successful participation. The dates below include backup dates.

The times and dates are as follows (also listed on univIS).

### Preparatory Meeting

Wednesday, April 24th, 2019 18:15-19:45 at the Nuremberg Campus of Technology (NCT) in Nuremberg (see below for directions).

### Main Dates: Ascension Day Extended Weekend (Christi Himmelfahrt)

Thursday, May 30th through Sunday, June 2nd, 10:00-18:00 at the Nuremberg Campus of Technology (NCT) in Nuremberg (see below for directions).

### Backup: Corpus Christi Extended Weekend (Fronleichnam)

Thursday, June 20th through Sunday, June 23rd, 10:00-18:00 at the Nuremberg Campus of Technology (NCT) in Nuremberg (see below for directions). The backup dates will be used only if the main dates have to be cancelled.

### Venue

Classes will take place here:

| Nuremberg Campus of Technology
| "Auf AEG" Haus 11
| Room 11.2.2
| Fürther Straße 246c
| 90429 Nuremberg

(The building can be hard to find; see [here](http://datascience.phil.fau.de/fossos/img/directions.pdf) for directions.)

## Language requirements

Depending on who attends the class, instruction may be in English or German. In any event, all of the readings and other course material are in English, and participants are expected to be proficient in reading and writing English technical documents.

## A Multi-Semester Series

It is obviously impossible (for most students) to cover all of the material in this course in one semester.

This course (with a slightly different name) will therefore be taught every semester, in a non-consecutive series. Students can join the class every semester, and take the class for however many semesters they wish (if they still have new things to learn). Do not be confused by the name this class takes in some semester (say, “Advanced R …”) – you can still join as a beginner. Depending on the listing (see below) students can also take this class for credit multiple times.

By implication, the group of students in the class in any given semester will be heterogeneous, working at different levels. For example, some students may already have taken a course in the series previously, while others are just starting out. Because the previous experiences and learning speed of students vary greatly anyway, this is not a significant (additional) hindrance. Tasks, expectations and material covered will accordingly differ for each student, depending on the background.

## Credits and Listings

You can generally take this class as an undergraduate (Bachelor) lower-division seminar (Proseminar) worth 5 ECTS points, or an upper-division seminar (Hauptseminar) worth 7.5 ECTS points. The workload will be adjusted accordingly.

Depending on your major, you may also take the class to fulfill requirements for a Master’s program. Please get in touch to discuss the details.

This class was/is listed as:

• 2018/2019 Winter Term: “Open Source Werkzeuge für die wissenschaftliche Datenverarbeitung” (the original FOSSOS), crosslisted in the following modules:
    • Bachelor Sociology
    • Bachelor Digital Humanities and Social Sciences (“BA Zweitfach”)
    • Elective (Wahlpflichtbereich FPO 2018)
    • Elective (Wahlpflichtbereich FPO 2016)
• 2019 Summer Term: “Advanced R and Open Social Data Science”
    • Bachelor Sociology
    • Bachelor Digital Humanities and Social Sciences (“BA Zweitfach”)
    • Elective (Wahlpflichtbereich FPO 2018)
    • Elective (Wahlpflichtbereich FPO 2016)
    • “Soft Skills” (Schlüsselqualifikationen)

## Course Description

Digitisation has created both new challenges and as yet unrealised potential for the empirical social sciences. Larger, and often streamed, datasets require more programmatic and dynamic statistical analyses. Existing commercial programs with graphical user interfaces (GUIs) are expensive, and analyses can easily become opaque, sometimes contributing to a crisis of reproducibility in the social sciences and beyond (e.g., Mair 2016) or even propagating outright bugs (e.g., Reinhart and Rogoff 2010).

Happily, the open source community has already pioneered a set of technologies and conventions for their software development efforts that have proven useful in solving these problems in many academic fields. Additionally, open source software offers new ways to analyse and visualize data, as well as to present interactive results.

Together, these tools promise a radically open and participatory approach to science, and productive yet skeptical use of emerging data streams.

Unfortunately, learning these tools takes more time than is usually available before any given project deadline.

The goal of this series of seminars is therefore to train participants in a coherent set of leading tools and best practices, including:

• Software Carpentry
    • Open source issue trackers to manage projects and their learning.
    • Using leading community resources and services to troubleshoot issues.
    • Writing text in a lightweight markup language (markdown).
    • The world of UNIX-style command-line interface (CLI) programs …
    • … and package managers, such as Homebrew or APT.
    • Establishing an efficient plain-text workflow using editors and an Integrated Development Environment (IDE), including Atom and RStudio.
    • Source control management (SCM) and massively collaborative development using Git and GitHub.
    • Separating content and presentation using plain-text formats for technical and scientific writing, including LaTeX, Pandoc Markdown and RMarkdown, and rendering results in a variety of formats (Word, HTML, PDF).
• Introductory R
    • Introduction to “base” R.
    • Literate programming in R.
• Intermediate R
    • Importing, transforming and modeling data using tools from the R tidyverse ecosystem.
    • Visualising data using ggplot2.
• Interactive R
    • Interactive visualisations using leading JavaScript libraries (via plotly, htmlwidgets).
    • Web dashboards using flexdashboard.
    • Interactive webapps using shiny.
• Types, functional programming, object oriented programming (only S3), metaprogramming and techniques, all following Hadley Wickham’s Advanced R.
• Cloud Computing
    • Using containerisation (docker).
    • Applying continuous integration and deployment (CI/CD) tools such as Travis CI.
• Reproducible Research
    • Improving code quality by applying assertions using checkmate.
    • Storing datasets in public repositories such as the Harvard dataverse.
    • Releasing, publishing and indexing finished research using GitHub releases and zenodo.
    • Other tools and practices for open and reproducible science.
    • Strengthening reproducibility and portability by using dependency management (packrat) and containerisation (docker).
• Package Development
    • Including documentation (roxygen2), defensive programming (checkmate), testing (testthat) and more best practices, all following Hadley Wickham’s R Packages.

Towards the end of each of the seminars, participants will be able to use (parts of) this toolchain to work on their own projects, or to contribute to existing free and open source software.

This course will not focus on math and statistics knowledge or substantive domain expertise, though both are essential for solid data science work. Rather, the emphasis is on what Drew Conway loosely called hacking skills in his Data Science Venn Diagram: simply getting these tools to work together, learning how to troubleshoot them, and – aspirationally – absorbing some best practices of open source development.

While the course is not a proper computer science class, it should also be valuable to students with coding experience or a CS background who may be interested in the tooling and practices covered.

We will not cover the scaling and efficiency issues of proper “Big Data”, but confine ourselves to in-memory problems. We also limit ourselves to the R ecosystem, though some tools and problems will be similar for other scripting languages such as Python.

An introduction to data science and open source may well open up new job opportunities, or serve as a first stepping stone to a career in tech, but that is arguably not the only reason why social scientists should be excited about it. Instead, to learn the way of open source is perhaps to update the ideals of the scientific process for the modern day: radical openness and rigorous reproducibility, maximal inclusivity and promised meritocracy, generous sharing and personal attribution. Open source may also be a worthwhile exercise in participant observation for social scientists: here is a real, if surely flawed utopia, massively coordinating individuals that is neither market nor state.

Less loftily, but not least, the seminar also promises a starter dose of gratification from having built something that actually works, and is of some immediate use to our fellow human – a good feeling sometimes hard to come by in the social sciences.

## Philosophy

This course is a little different from most seminars.

Teaching R (and the broader ecosystem) at FAU sociology (as at most other smaller, non-tech-focused institutions) faces a couple of important constraints:

• Participants will have vastly different levels of previous experience, and will learn at different speeds.
• Given the relatively small number of interested students and complicated timetables, strictly consecutive seminars are difficult to organize. Too few students would ever meet the requirements (and schedule) to attend the advanced seminars.
• There is already plenty of high quality teaching material out there, and there is little point in re-inventing (an inferior) wheel.

To meet these constraints, this course will be held as a non-consecutive multi-semester series of seminars, and will, for the most part, operate on a flipped classroom model.

## Flipped Classroom

Because students will learn at different speeds, and from different starting points – among other reasons – teacher-centered teaching will be minimal in this class.

Instead, students will study the assigned material outside of class, including online documents, videos and interactive learning applications.

As they encounter problems, or develop their own (small) projects, students will track such work on the issue tracker used in class. In class, students will work on their own problems or projects, in small groups and assisted by the instructor as necessary.

This class does not offer a one-size-fits-all set of pre-defined materials and assignments necessary for successful participation. What the class offers is:

• A carefully curated list of external learning resources, organised in a (somewhat) linear syllabus.
• A social setting (the class itself) and electronic fora (the GitHub repo) to stay organised and motivated, and to help one another.
• Guidance and assistance by the instructor for each individual student.

## Expectations

Happily, there are a lot of great resources for learning data science tools out there, many of them free, some of them even open source themselves. We will be reusing a lot of these resources, and I (the instructor) do not have to reinvent an (inferior) wheel. There is no one curriculum that’s quite right for us, so I have cobbled together material from different sources.

All resources are listed, in roughly advisable chronological order, along with the stack.

The good news is that there are no academic papers or books for this class and everything students need is available online. There is, however, still a lot of material to work through (to the tune of hours per week), though it is written in a hopefully more accessible style than many academic documents. The listed resources are guaranteed to cover everything you need to use the software, often including tutorials, videos and exercises. Students are not limited to the listed resources; they can also choose their own material, as long as it covers roughly the same ground. In fact, students are encouraged to share good additional resources with the rest of the class.

Students are expected to work through (not just read) all the material listed before the session in which the software is covered (see schedule). Additional resources are optional.

Whenever you run into a problem, or have a question, raise an issue on our GitHub issue tracker. Please also make sure that:

• the issue does not already exist (always search first!)
• the issue is properly labelled (so we can all navigate through the issues)
• the issue is answerable, actionable and closable. Good issues are framed in such a way that they can be closed.

For broader, more open-ended discussions, use the class discussion board.

## Schedule

Because students will learn at different speeds, and from different starting points, there is no fixed schedule for the class. We will, however, progress through the stack in the order in which software (and resources) are listed.

Students can work through this material at their own pace. Likewise, some students may wish to cover a lot of breadth (at shallow depth), while others want to dig in on a particular topic. This is all fine, but students should ensure that they learn something at a useful level to solve real-world problems, as will also be required for the assessment. If in doubt, ask the instructor for guidance.

Every student should first become competent in the practices and tools covered in “Software Carpentry”; these are required for all later topics.

As a lower bound, every student should cover at least one top-level heading (“Software Carpentry”, “Intermediate R”, etc.) per semester.

You can find out what will be worked on during the next session(s) by consulting the kanban board. During the earlier sessions, we will also cover some topics together in class. These topics will also be listed on the kanban board.

## Digital Places

We’re going to use a few digital tools to get organised in this class, all based on GitHub.

• Static information will be at https://datascience.phil.fau.de/fossos/, the class website. You can find all the resources (~ readings) and software links on https://datascience.phil.fau.de/fossos/stack.html.
• Pretty much all individual activity (i.e. to be done by one or a few students) is tracked as issues on our class repository issue tracker at https://github.com/soztag/fossos/issues. If you have a question, have an idea to work on, or are looking for inspiration for a task, this is your place. Issues are organised using labels and assignees. Milestones are currently not in use.
• A (currently relevant) subset of these issues is also listed on our Kanban board at https://github.com/soztag/fossos/projects/2. This board gives you an overview of what we will be working on in the current and coming sessions. You can move your “own” issues around the board as appropriate, and you can also add issues that you want to see addressed.

These venues are also linked from the top bar of the class website, so you can always easily find them.

You can (and should) cross-reference issues and discussions between all of these venues.

## Assessment

Assessments are an unfortunate, tedious and arguably needless part of teaching – but here we are, so we are going to make the best of it.

Instead of some make-believe work or hobby project, assignments in this class are, for the most part, designed to be actually useful to other people. This can be motivating, but it also means that other people are relying upon our work: it has to be delivered on time and at the expected quality.

You can work on pretty much anything you like – improving this very class (and its repo), some existing project that you like or even your own new (or existing) project. The only conditions are:

1. The work needs to be related to the tools and practices covered in class.
2. The work needs to be on GitHub or otherwise transparent.
3. The instructor needs to be able to assess the quality of the work, and advise you in your work.

We will begin with relatively easy, small tasks to serve other students in class, then address smaller issues with resources for the broader community, and eventually fix “real” bugs or enhance the functionality of open source data science software.

All tasks, big and small, are listed and tracked on the class GitHub repository issue tracker. Students should assign themselves to tasks they will be working on, and report / link to any progress on these tasks in the issue thread.

### Pass/Fail

All students, including those who just want a “Sitzschein” (pass/fail option), must contribute to a number of issues labelled as pass/fail. These are issues that are smaller in scale and scope.

There is no straightforward minimum metric (say, number of closed issues) to pass the class. Instead, students should display substantial contributions across a range of helpful activities, as recorded in the issue tracker.

Before working on these issues, students should assign themselves, to avoid us doing duplicate work.

Students who want to receive a grade for the class also have to complete a couple of issues tagged with graded-x.

The numbers next to the labels roughly indicate the estimated workload and difficulty of a task (also known as “story points” in agile development). Estimates are frequently wrong, and these points can be adjusted in consultation with the instructor, if some task turns out to be much harder or easier than expected. These story points correspond to ECTS credit points; if you are taking this as a “Proseminar”, you will need to have owned and closed issues worth 5 story points.

You will be graded based on how well you have adhered to the best practices and tooling covered in class, as well as (if applicable) the guidelines and standards of the external project (some other repo) or platform (Stack Overflow).

There are different kinds of graded issues:

#### Reproducible Example

Labels:

• community.rstudio, stack-overflow or bug report,
• and reprex, and question respectively.

Though it may also benefit yourself, a well-formulated question or bug report with a reproducible example can also serve the community. This is what we’re aiming for here.

A well-formulated question, in the context of open source development, is often a reproducible example, or reprex for short. This means that you should provide a code snippet (or, if not applicable, a very precise description of steps) that will allow any other user to reproduce the behavior in question, with no additional resources. Producing this can be harder than it sounds, and just narrowing down a problem like that may often help you solve it.
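
To make this concrete, here is a minimal sketch of what a reprex might look like in R. The problem shown (a missing value turning a mean into `NA`) and the names `scores`, `observed` and `attempted` are invented for illustration and are not taken from the class material. The only point is that the snippet runs as-is in a fresh R session, without any external files or packages.

```r
# Hypothetical question: "Why does mean() return NA for my data?"

# 1. Self-contained data, built inline so nobody needs your files
scores <- c(2, 4, NA, 8)

# 2. The smallest call that shows the behaviour in question
observed <- mean(scores)
print(observed)

# 3. What you already tried, so helpers do not repeat it
attempted <- mean(scores, na.rm = TRUE)
print(attempted)

# 4. Version information helps others match your setup
print(R.version.string)
```

One option for sharing such a snippet is the reprex package, which can render the code together with its output, ready to paste into a GitHub issue or a Stack Overflow question.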

Make sure to read and adhere to all the resources listed under community and help.

The three target platforms can be listed roughly in ascending order of precision of the question:

1. http://community.rstudio.com: Open to relatively open/vague questions, though you are absolutely expected to do your own research.
2. http://stackoverflow.com: Questions should be very precise and reproducible, and be definitively answerable. Not good for opinionated stuff. Consider the resources listed under community and help.
3. Bug report: If you’re absolutely sure that you have run into a bug, then it can be a good idea to raise it on the repository in question. For most things, you should raise it on S-O or community.rstudio first, to be sure that it really is a bug.

Here, as with all things open source, we must ensure that other people’s time is well spent engaging with our question (or bug report). To ensure that, please follow this procedure:

Sequence Chart for a Reprex

#### Answer on S-O or community.rstudio

Labels:

• community.rstudio, stack-overflow,
• reprex, and answer respectively.

Same process as for the above.

#### External Contribution

Labels: external documentation, external software.

These are improvements to external repos (typically also on GitHub), either other software (typically R repositories) or documentation and learning resources (typically those covered in class). The actual work (forking, raising a pull request, etc.) consequently occurs in the external target repository, and this activity is merely tracked in a placeholder issue in the class repository. Simply link to any relevant issues, commits or pull requests on the target repo in a placeholder issue.

This sounds quite challenging, but it can be quite doable, especially if you start by improving the documentation.

To start contributing to open source, you might also find these resources helpful:

For contributions to external documentation or software, it is very important that we do not burden the respective maintainers with sub-par work. To ensure that we deliver high quality work, you must follow this procedure:

Sequence Chart for an External Contribution

Grading criteria are listed for each of the issues. Generally, a good grade will require following the practices and standards appropriate for the type of contribution in question, and students will need to demonstrate adequate command of the toolchain covered in class. For an excellent grade, students will need to go (a bit) beyond the covered material, and work on an especially pressing or complicated problem.

#### Own Project

As an alternative to this (graded) assessment, if students already have some prior knowledge and a ready project they wish to work on, this can also be arranged. Students should contact the instructor, and also track their progress on their own project in a placeholder issue on the fossos issue tracker.

The graded tasks (see above) will be assessed using the rubric below. The grading rubric is taken from the University of British Columbia Master of Data Science program (CC BY-SA 3.0).

| Dimension | Poor | Unsatisfactory | Satisfactory | Good | Excellent |
|---|---|---|---|---|---|
| Accuracy 25% | Code fails to run, doesn’t have clear output, or performs the wrong task. | Code performs only some of the correct tasks, the output is not easily understandable and the methods used to achieve the result are inefficient if performance is a concern. | Code performs most of the correct tasks, the output is understandable, however the methods used to achieve the result are inefficient if performance is a concern. | Code performs the correct tasks, the output is reasonably easy to understand, however the methods used to achieve the result are not the most efficient if performance is a concern. | Code runs correctly without crashing, the output is very clear, and the intended or suitably correct methods are employed to achieve the correct result. Student has chosen the most efficient algorithm reasonable if performance is a concern. |
| Code Quality 25% | Code is difficult to read and understand due to many issues that affect readability. Code is also poorly organized. | Code is generally easy to read and understand with few non-reoccurring issues and at most two reoccurring issues that affect readability. | Code is generally easy to read and understand with few non-reoccurring issues and at most one reoccurring issue that affects readability. | Code is easy to read and understand with only 1-2 minor and non-reoccurring issues that affect readability. | Code is exceptionally easy to read and understand. For example, variable names are clear, an appropriate amount of whitespace is used to maximize visibility, tabs and spaces are not mixed for indentation, sufficient comments are given. Any coding sections of the assignment that were not completed have documentation explaining what a coded solution would look like. Overall, the code is extremely well organized and documented. |
| Mechanics 25% | Evaluator was unable to run/open/read assignment submission despite best efforts. This may be because the student forgot to include certain files in the submission or tailored the software to only work on their local machine (e.g. the code only works when run from a certain directory on the student’s machine, contains paths to files only on the student’s machine, etc.), or they did not submit their assignment correctly or completely, or it was unclear where the relevant parts of the assignment are included in the submission. | Evaluator had to spend some time to get the raw submission to work correctly. | Evaluator had to make an obvious, small, quick fix to get things working or the wrong file format was submitted. | | The submission is self-contained and works flawlessly; it just works in anybody’s hands. The student did not forget to include all the files in the submission. Any necessary libraries to install are either included or are installed by a script, or are made obvious that the evaluator must install them. Student used the asked for file format. All assignment instructions were followed. All files were put in a repository, in a reasonable place, with reasonable names; any source files (.tex, .Rmd) are rendered to a readable output format (e.g. .pdf), all figures are included, there is a README file indicating where to find the different aspects of the assignment, etc. |
| Robustness 25% | Multiple issues with code repetition exist, and several tests are absent and/or are of poor efficacy. | Some form of reoccurring code repetition exists, or tests efficacy is poor. | Some form of reoccurring code repetition exists, or tests efficacy is poor. | Code repetition is mostly minimized and effective tests are present for most functions. | Code repetition is minimized via the use of loops/mapping functions, functions or classes or scripts/files as needed without becoming overly complicated. Functions are short, concise, and cohesive without losing clarity; code can be easily modified. Tests are present to ensure functions work as expected. Exceptions are caught and thrown if necessary, once students have learned about exceptions. |

## Attendance

Attendance is not mandatory, as per university policy. However, students are strongly encouraged to attend the seminar regularly, and to thoroughly study the assigned material. It is highly unlikely that you will be able to receive passing grades on the assignments otherwise.

Even though technical in nature, this class is no “rocket science”, and we will get everyone up to speed, no matter the prior knowledge. However, you have to work hard and thoroughly, otherwise it is very possible that you will simply fail the class, or receive a poor grade.

## Technical Requirements

Unfortunately, FAU has no computer lab facilities suitable for teaching this class and participants will have to bring their own computers. This has the advantage that students will learn to set up their own development environments, but adds some unwelcome complexity (different OSes, etc.).

The class will assist students in installing software on their devices, but students are responsible for maintaining their computers. In particular, student laptops must:

• have a reasonably current operating system (macOS >= 10.13, Microsoft Windows >= Vista, Linux),
• have a current version of a web browser installed,
• not be virus-infested or in some other borked-up state,
• not be a mobile device (iOS or Android) (unless you can SSH into a Linux box or something),
• and have ready access to one of the WiFi networks at FAU: FAU-STUD, eduroam or FAU.fm. (If you need help setting up your WiFi, consult the RRZE Website.)

Emphatically, none of this requires a new, powerful or expensive device, let alone software. You can get a used laptop with / ready for Ubuntu Linux on eBay for well under €100 (if you buy a used computer, make sure that the hardware has good Linux support). With some tweaking, you can even use an inexpensive (x86) Google Chromebook (which runs on Linux). For more information, see stack.

If you are facing financial difficulties in obtaining a laptop for the class, please contact the instructor. We’ll figure something out for you.

### Operating System Maintenance

It is your responsibility to maintain your own computer and operating system (OS), as well as to figure out how to install the below software on your machine (though we will all help one another within reason).

### In-Browser Development

Using RStudio in the browser means that all the software you’re using won’t ever really be installed on your system, but only exist in a virtual image or online service. If you want to do serious development work or are facing edge cases, you may require a “real” installation on your client (see instructions in stack). However, in-browser development is a great way to have a standardized environment ready quickly.

You can run the RStudio IDE in your web browser in two ways:

### Linux

If you want to install the programs used in this class on your system, rather than use them through a (Docker) container, you may find it easier to do that on Unix-compatible operating systems, including macOS and Linux. Getting Windows to play nicely with open source software can be harder, and some convenient system utilities (such as a package manager) are often missing. It is technically possible to use most, if not all, of the tools above on Windows, but they may behave slightly differently, and supporting them may be more involved.

If you are using a Windows machine, you may consider the following alternatives to get a more Unix-compatible operating system, roughly ranked from easiest to most involved:

1. Replace your existing operating system with, say, Ubuntu, a frequently used Linux distribution. Before you do this, make sure that your hardware has good Linux support. This would also delete all of your data and applications, and you might have to choose and use new replacement applications.
2. Same as 1, but with a dual boot setup. This way, you can retain both your old operating system, and a new Linux install. However, you always have to restart to switch between the two systems.
3. Same as 2, but in a virtual machine which can run alongside and inside your Windows install. (Here are alternative instructions.) Apparently, if your computer and Windows 10 version support it, there is also now a fancier/more efficient way to do this via Hyper-V. This approach carries some performance penalty.
4. Install the Windows Subsystem for Linux (WSL). This solution is available only for recent versions of Windows 10. It seems pretty elegant, but has some limitations (no GUIs) and may be quite involved.
5. Buy an x86 Chromebook and use crouton or (better, but still in beta?) crostini to run Linux on your Chromebook.
6. Rent a virtual machine (VM, same as 3), but on a rented cloud host. You can access everything through a browser, but there is a (small) fee, depending on your setup.

There is no guarantee that any of these alternatives or links will work for you; you will have to research them on your own.

## Contributors

A big Thank You to all contributors (in alphabetical order by username):

## References

Illich, Ivan. 1971. Deschooling Society. New York: Harper & Row.

Mair, Patrick. 2016. “Thou Shalt Be Reproducible! A Technology Perspective.” Frontiers in Psychology 7 (July). doi:10.3389/fpsyg.2016.01079.

Reinhart, Carmen M., and Kenneth S. Rogoff. 2010. “Growth in a Time of Debt.” Working Paper 15639. National Bureau of Economic Research. doi:10.3386/w15639.