simgen/repro-projects.qmd at main · bodkan/simgen · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
# Structuring (R) projects

## Introduction

You've done a huge amount of programming already. Processing data,
filtering data, annotating data, summarizing data, but also plotting data
using the most powerful tool for scientific visualization _ggplot2_! I
said it before but it's worth repeating again---you have now learned 70% of
all the tools you might need to do data science on a day-to-day basis using R.
This is not an exaggeration.

Sure, you might not remember everything, but remembering comes from
repetition and practice. You've been exposed to the most important concepts, and
you know where to come back for more information when needed later.

---

**In this session on reproducibility, we will take things a bit easier than the
earlier, programming-heavy sessions earlier.**

**Instead
of focusing on specific programming techniques and R functions, we'll go through
guidelines on how to organize (and run) computational projects on a practical
basis---taking _what_ you've learned, and learning _how_ to do it in practice.**

In this session your tasks will be to take bits of code from previous exercises
on IBD and other data (currently scattered over multiple separate scripts and
files, or maybe even nowhere!) and learn how to organize them in a structure
suitable for maximizing reproducibility and effectiveness of doing computational
research.

## Exercise 1: Creating a formal R project

**So far, you've worked in an "anonymous" RStudio session, with scripts saved
somewhere on your filesystem, disconnected from the data, or with ad hoc
interactions with an R console (i.e., with commands not saved anywhere).**
This is not how reproducible research should be done to be sustainable over
weeks or months of working on a project.

**To move forward, set up a formal R project and proper project structure by
following the following steps:**

1. Click `File` `->` `New Project...`
2. In the new window, click on `New Directory` `->` `New Project`
3. Under `"Directory name"` type in `"simgen-project"` (the name of our course,
and the directory where all project files will be stored). Pick where you
would like to save that directory (this doesn't matter, just put the project
somewhere you can later find it). Check the `"Open in new session"` box, leave
the other options related to "git" and "renv" unchecked.

4. Then click on the "Create Project" button.

**This will create a new RStudio window. Your original RStudio window
(where you worked so far) is still open. The task for the following exercises
will be to convert the (probably disorganized) bits of code into a "proper
reproducible R project".**

**Notice the new `simgen-project.Rproj` file that is currently the only file
in your project directory! We will come back to it soon.**


## Exercise 2: Seting up a project file structure

What makes for a good computational project structure? There are endless
possibilities but a **good guidelines are**:

1. **Be consistent** -- put files that "belong together" in the same place (i.e.,
in the same directory). For example, an Excel table shouldn't be saved in a
directory with scripts, a PDF with a plot shouldn't be in a directory with
tables.

2. **Be predictable** -- even a person unfamiliar with your project should be
able to guess where is what just by looking at your folder. Remember, when
you publish your paper or your thesis, you will have to provide all your
data and scripts as supplementary materials, so others should be able to
understand all of them!

4. **Document everything** -- write a lot of `# comments in code`, describing what
different pieces of code do, and why you wrote things in a certain way.
Your future self will thank you!

---

**Now let's set up an example computational project structure for our IBD data
our ancient DNA metadata** (we'll leave the $f_4$-ratio Neanderthal estimates
for next chapter)**, just like you would do this for a real study.**

**Note:** This is all just an example! Again, as long as you follow the guidelines
numbered above, everything works!

**In the `Files` pane of RStudio, click on `"New Folder"` and create the
following directories in the "root of your project directory
`simgen-project/`":**

1. `code/`
2. `data/`, and within it create directories `raw/` and `processed/`
3. `figures/`
4. `results/`
5. `reports/`

**Note:** You can do all this manually, or you can play around with doing it
using code with the incredibly useful function `?dir.create`! For instance, a
single command to do all of the above could be the following.
(If you're curious about the `recursive = TRUE`, look at help of `?dir.create`.)

```{r}
#| eval: false
dir.create("code/")
dir.create("data/raw", recursive = TRUE)
dir.create("data/processed", recursive = TRUE)
dir.create("figures/")
dir.create("results/")
dir.create("reports/")
```

---

**When you hit the `"Refresh file listing"` circular arrow button in the top-right
of the `"Files"` pane, you should now see all the directories you just created.
Make sure you see them, and not just the `simgen-project.Rproj` file like before!**

## Exercise 3: Building an example reproducible pipeline

Let's get something out of the way first.
**Of course, if there are a million possible ways how to properly structure a
computational project, there are infinite ways how a "reproducible
research pipeline" should be actually organized.**

**Obviously, all research projects are different**, they focus on different kinds
of data (archaeological data, ancient DNA sequences, linguistic data, field
observation data, etc.), so naturally they will require different code which
will have to be structured in different ways.

**Still, there are some common workflows which practically every single computational
scientific research study does:**

1. **Data gathering** -- in our IBD and metadata examples, this involved downloading
data from the internet.

2. **Data processing** -- in our case, this involved filtering the data, processing
it to bin individuals based on their ages, joining IBD data with metadata, etc.

3. **Data analysis** -- this involves computing summary statistics, creating figures,
etc.

**The dirty secret of many scientific research studies (especially in the "old days")
is that all of these steps are often clumped together in huge scripts, and
its very difficult (even for the author) to sometimes tell where is what.**
This can be a big problem, especially if a bug needs to be fixed, a new step
of a processing pipeline needs to be added, etc.

It gets even worse, when after
a long time you come back to a project and need to remember some details about
where in your code is that one line that does something that needs changing!

**Let's demonstrate how you could organize your code in practice, and hopefully
you'll see how investing a bit more time and thinking into properly organizing
code in your research project makes your life a lot easier in the long run**
(and much easier for everyone who might pick up your project later too).

---

### 1. Data gathering

**Create a standalone R script (`File` `->` `New File` `->` `R Script`). Paste the
following code to that script, and then save it as `01_download_data.R` in the
root of your `simgen-project` project directory.**

**Don't copy it without skimming through it and understanding it!**
You should recognize these commands from our
earlier exercises on _tidyverse_? Ideally, you've written those commands
before yourself. That example code was a little messy and
random, because it was structured as an _ad hoc_ tutorial. What we're doing
here is showing how to organize computational research properly.

::: {.callout-note collapse="true" icon=false}
#### Click to see the solution

You can find the complete script
[here](https://github.com/bodkan/simgen/blob/main/files/repro/01_download_data.R).


:::


---

**When you have your script `01_download_data.R`, run it by calling
`source("01_download_data.R")` in your R console and observe what happens.
You can also press Cmd/Ctrl + Shift + S --- a very useful shortcut!**

**Then, click through your `"Files"`
panel and look in `data/raw` --- do you see files that your new script
should create?**


**Note:** Notice the `cat("some kind of message\n")` command in the script.
This is  extremely useful for printing log information about a (potentially)
log running process. If you're confused about what it does, write this
into your R console: `cat("Hello friend!\n")`.

---

Creating `01_download_data.R` is a first step towards reproducibility.
Downloading of all data set now happens in a dedicated script, which means:

1. We only have to run it once, and have all data available in `data/raw`
for later use.
2. If we have to include a new data set to be included, we just edit that
script `01_download_data.R` and run it again!
3. Except for running the script top to bottom, we don't do any "manual work".

**This doesn't sound like much, but it's absolutely crucial. Automation
is the most important aspect of reproducible research. Unless something
isn't fully automated, it's not reproducible.**

Now let's move to the next step!

---

### 2. Data processing

**Create a standalone R script (`File` `->` `New File` `->` `R Script`). Add the
following code to that script, and then save it as `02_process_data.R` in the
root of your `simgen-project` project directory.**

::: {.callout-note collapse="true" icon=false}
#### Click to see the solution

You can find the complete script
[here](https://github.com/bodkan/simgen/blob/main/files/repro/02_process_data.R).


:::

**Again, please don't just blindly copy it! Look at the script and try
to recognize the commands from our earlier exercises on _tidyverse_.** You
ran all of this yourself, manually, command-by-command, in the R console.

**When you have your script, close RStudio.**

...

_Let's pretend that some time passed, you're done for the day and went home._

...

**Find the location of your `simgen-project` directory on your computer,
go to that directory, and then double-click on the R project file
`simgen-project.Rproj`. You'll get your old session back, just like you
left it (including history!**

---

**Now press Cmd/Ctrl + Shift + S to run the processing script, and observe what
happens! Then click through your `"Files"` panel and look in `data/processed`
--- do you see your files, process and tidy, ready for analysis?**


---

### 3. Data analysis

Hopefully you're now starting to get an idea about what a well-organized,
reproducible pipeline means. It's about structuring the directory
where your project lives, and about partitioning your code into scripts
which represent logical components of your project --- from downloading
data (and saving it to a particular storage location), and processing it
(and again saving it to a proper storage location), and, finally, to answering
research questions based on this data.

**Your project structure should now look something like this. Notice that
there are already data files present, not just directories:**

```
├── 01_download_data.R           # data fetching code
├── 02_process_data.R            # data processing code
├── code                         # directory for future custom functions
├── data
│   ├── processed                # processed / filter / cleaned data
│   │   ├── ibd_segments.tsv
│   │   ├── ibd_sum.tsv
│   │   └── metadata.tsv
│   └── raw                      # raw unprocessed data
│       ├── ibd_segments.tsv
│       └── metadata.tsv
├── figures                      # location for PDF figures
├── results                      # various other output data (tables, cache files, etc.)
├── reports                      # location for slides and reports (next section!)
└── simgen-project.Rproj
```

**You have implemented the first two steps of a data processing pipeline.
Now it's your job to implement additional components of the project pipeline,
again using a couple of examples of what we've already done before, but this
time properly integrating it into a nice and tidy package.**

---

#### Generating figures

**Create a standalone R script called `03_plot_metadata.R`, which will:**

- Read the table in `data/processed/metadata.tsv` using `read_tsv()` function.

- Create the following figures using _ggplot2_ package and the function
`ggsave()` you've learned about earlier.

1. Count of samples in each `age_bin` of `metadata` (use `geom_bar()`).

2. A histogram of `coverage` across different `age_bin` categories, only
for individuals with `age > 0` (i.e., ancient individuals), with one facet
showing one histogram from a given `age_bin` category
(use `geom_histogram()` and `facet_wrap()`).

- Save the figure(s) in the `figures/` directory (whether you
save individual PDF for each figure, or put multiple figures into one PDF
is up to you).

**Note:** Remember that you can use the function `ggsave()` to save _ggplot2_
figures stored in normal R variables, or the `pdf()` & `dev.off()`
trick!

**Hint:** If you need help, we solved all of these in previous exercises.

::: {.callout-note collapse="true" icon=false}
#### Click to see the solution

You can find my solution
[here](https://github.com/bodkan/simgen/blob/main/files/repro/03_plot_metadata.R).

:::

---

**Create a standalone R script called `04_plot_ibd_sharing.R`, which will:**

- Read the table in `data/processed/ibd_merged.tsv` using `read_tsv()` function.

- Filter the table to only `length > 5`  (segments longer than 5cM) and
`time_pair == "present-day:present-day"` (only present-day humans).

- Create the following figures using _ggplot2_ package and the function
`ggsave()` you've learned about earlier, saving all of these figures
into a single file `figures/ibd_sharing.pdf`:

  - For each pair of continents `region_pair == "Europe:Europe"`,
`region_pair == "America:America"`, `region_pair == "Asia:Asia"`, and
`region_pair == "Africa:Africa"`, plot a histogram of lengths of IBD
longer than 5 cM.

  - Use `facet_wrap(~ country_pair)` to visualize each histogram for each
  country pair in its own facets.

  - Therefore, your PDF should have four plots (one plot per page), with
  each plot having a number of histograms, one histogram per facet.


::: {.callout-note collapse="true" icon=false}
#### Click to see the solution

You can find my solution
[here](https://github.com/bodkan/simgen/blob/main/files/repro/03_plot_ibd_sharing.R).

:::

---

**Create a standalone R script called `05_plot_ibd_relatedness.R`, which will:**

- Read the table in `data/processed/ibd_sum.tsv` using `read_tsv()` function.

- Recreate that beautiful figure we made, showing a scatterplot of `total_ibd`
against `n_ibd`, and highlighting which individuals are duplicated, who is
a 1st degree relative, 2nd degree relative, etc.

- Save that figure to `figures/ibd_relatedness.pdf`.

::: {.callout-note collapse="true" icon=false}
#### Click to see the solution

You can find my solution
[here](https://github.com/bodkan/simgen/blob/main/files/repro/04_plot_ibd_relatedness.R).

:::

## Exercise 4: Generating other results

Of course, the main output of our research is visualizations. But sometimes,
a good old table is also very useful! Our papers often have both plots and
tables.

Just to pracice a little bit more, **create a standalone R script called
`06_sumarize_coverage.R`, which will:**

- Read the table in `data/processed/metadata.tsv` using `read_tsv()` function.

- `filter()` the table to only `age > 0`  (only rows for ancient individuals) and
`continent == "Europe"` (i.e., only aDNA European samples).

- `summarize()` the rows to compute across grouping based on `country` and
`age_bin` (remember `group_by()`?):

  - `avg_coverage` as a `mean()` of the `coverage` column,
  - `n_samples` as the `n()` number of rows.

- The rows should be sorted in a descending order by the column `n_samples`.

- Store the result of the `filter / group_by / summarize` `%>%` pipeline
as `df_summary`, and save the table using `write_tsv()` in a file
called `results/coverage_summary.tsv`.

::: {.callout-note collapse="true" icon=false}
#### Click to see the solution

You can find my solution
[here](https://github.com/bodkan/simgen/blob/main/files/repro/06_summarize_coverage.R).

:::

## Exercise 5: Adding a README file

It's a good idea for every project to have a README file at the root of the
directory. Traditionally, this file is formatted in a
[Markdown](https://www.markdownguide.org/basic-syntax/)
format, which makes it easy to read in plain text, but is easily visualized
on the web (many websites such as GitHub and other data repositories render
Markdown text beautifully).

**Look up the basic syntax of [Markdown formatting](https://www.markdownguide.org/basic-syntax/), then
create `File` `->` `New File` -> `Markdown File` (not R Markdown file!).
Then write a short list (again, take a look at the Markdown syntax and
read how to create a list), which will describe, in one sentence:**

1. Which directories are in your project, and what they countain (briefly!)
2. Which scripts you created and what they do.
3. Which results are created (so far just figures), and which scripts
created those results.

**Hint:** Try clicking the "Visual" button on top of your editing window.
I personally don't use it because I like to write Markdown in plain text,
but it can be very helpful, because you can create lists and other formatting
by clicking with the mouse, instead of having to type Markdown syntax manually.

---

**Creating a README is an important practice of documenting computational projects. If
anyone comes back to your project later (even you!) they will have a
summary of what they project contains, what code it contains, and what
that code does.**

**This is infinitely easier than having to read code and decipher what it does!**

**Always have a README file in your project root directory, especially when
you publish it in a thesis or in a paper.**


## What we're still missing (theoratically)

Here are a couple of topics which can be extremely helpful for pushing
reproducibility even further, but that we don't have bandwidth to go through,
unfortunately. I'm mentioning them in case you'd like to do more studying
on your own, perhaps as you get further into your own research projects.

I use all of these in my own work, but I consider them much more advanced
topics. Our workshop focuses on the _fundamentals_ of reproducible research.
In this sense, you could do very much

0. Backups. Unless you use git actively (which solves the backup
problem as a side-effect---see below), you need to back up regularly. Even
right-clicking your `simgen-project` directory, archiving it in a zip file
named like `"yyyy-mm-dd-simgen-project.zip"` regularly will get you through
any trouble.

1. [_git_](https://git-scm.com) -- an incredibly powerful version control system,
which is also at the back end of the online code repository service
[GitHub](https://github.com). This is the big one and probably the most
important tool I've ever learned as a programmer. It would require an entire
workshop on its own, unfortunately.

2. [_renv_](https://rstudio.github.io/renv/articles/renv.html) --
an R package which allows you to lock the exact versions of
R packages used in your project (and have anyone else restore those exact
versions of packages at a later point).

3. [_targets_](https://books.ropensci.org/targets/) -- an R package for
building and orchestrating truly complex computational pipelines. For instance,
it can track whether or not a particular function or piece of data has changed
and, therefore, whether other components of a pipeline downstream from it need
to be re-run (unlike our `01_download_data.R` or `02_process_data.R` scripts
which need to be re-run in their entirety after each modification).


## Closing remarks

Our workshop focuses primarily on the R programming language and various
R packages useful for data analysis and plotting. Of course, real-world
computational projects often rely on other programs and scripts written in
traditional shell (like BASH) or Python. However, the same ideas apply,
regardless of the exact language. In fact, in the sessions of population
genomics and inference, we will see how everything fits very naturally in
the overall structure we've established here.