# R bootcamp
In this first chapter, you will be exploring the fundamental, more technical
aspects of the R programming language.
We will focus on topics which are normally taken for granted and never
explained in basic data science courses, which generally immediately jump to
data manipulation and plotting.
I strongly believe that **getting familiar with the fundamentals of
R as a complete programming language from a "lower-level" perspective,
although it might seem a little overwhelming at the beginning, will pay
dividends over and over throughout your scientific career**.
**The alternative is relying on magical black box thinking, which might work
when everything works smoothly... except things rarely work smoothly in
anything related to computing. Bugs appear, programs crash, incorrect results
are produced---only by understanding the fundamentals can you troubleshoot
problems.**
**We call this chapter a "bootcamp" on purpose -- we only have a limited
amount of time to go through all of these topics, and we have to rush
things through a bit.** After all, the primary reason for the existence of this workshop
is to make you competent researchers in computational population genomics, so
the emphasis will still be on practical applications and solving concrete
data science issues.
**Still, when we get to data science work in the following chapters, you will
see that many things which otherwise remain quite obscure and magical boil down
to a set of very simple principles introduced here. The purpose of this
chapter is to show you these fundamentals.**
**This knowledge will make you much more confident in the results of your work,
and make it much easier to debug issues and problems in your own projects, and
also to track down problems in other people's code. The latter happens much more
often than you might think!**
## Getting help
**Before we even get started, there's one thing you should remember: R (and
R packages) have an absolutely stellar documentation and help system.** What's
more, this documentation is standardized, always has the same format, and is
accessible in the same way. The primary way of interacting with it from inside
R (and RStudio) is the `?` operator. **For instance, to get help about the `hist()`
function (histograms), you can type `?hist` in the R console.** The documentation
then appears in the "Help" pane in your RStudio window.
There are a couple of things to look for:
1. On the top of the documentation page, you will always see a brief description
of the _arguments_ of each function. This is what you'll be looking for most
of the time ("How do I specify this or that? How do I modify the behavior
of the function?").
2. On the bottom of the page are examples. These are small bits of code which
often explain the behavior of some functionality in a very helpful way.
Whenever you're lost or can't remember some detail about some piece of R
functionality, looking up `?` documentation is always very helpful.
**As a practice and to build a habit, whenever we introduce a new function
like `new_function()` in this course, use `?new_function` to open
its documentation and skim through it.** I do this many times a day to refresh
my memory on how something works.
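The following sketch collects a few related ways of reaching the same documentation from the R console. `hist` is used here only as an example function, and the chunk is marked as not evaluated because the help calls open pages interactively:

```{r}
#| eval: false
?hist                 # open the help page for hist()
help("hist")          # the same thing, written as a regular function call
example("hist")       # run the examples from the bottom of the help page
args(hist)            # print just the argument list of a function
??histogram           # search all installed documentation for a keyword
```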
---
## Exercise 0: Creating an R Script and general workflow
**Let's start easy. Open RStudio, create a new R script
(`File` `->` `New file` `->` `R Script`), save it somewhere on your computer
as `r-bootcamp.R` (`File` `->` `Save`, doesn't really matter where you save it).**
**Every time you encounter a new bit of code, looking like this (i.e.,
text shown in a grey box like this):**
```{r}
#| collapse: true
#| results: hide
# here
# is
# some
123
# R
"code"
```
**please copy it into your script. You can then put your cursor on the first
line of that code, and hit `CTRL + Enter` (on Windows/Linux) or `CMD + Enter`
(on macOS) to execute it, which will step over to the next executable line.
(Generally up to the next following `# comment block prefixed with '#'`).
Alternatively, you can also type it out to the R
Console directly, and evaluate it by hitting `Enter`.**
**It will sound very annoying, but try to limit copy-pasting code only
to very long commands. Typing things out by hand forces you to think about
every line of code and that is very important! At least at the beginning.**
## Exercise 1: Basic data types and variables
Every time you create a value in R, an object is created in computer
memory. For instance, when you type this and execute this command in the R
console by pressing `Enter`, R creates a bit of text in memory:
```{r}
#| eval: false
"this is a bit of text"
```
**You can use the assignment operator in R `<-` to store an object in a variable,
here a variable called `text_var`:**
```{r}
text_var <- "this is a bit of text"
```
**Except saying that this _"stores an object in a variable"_ is not quite correct,
even though we always use this phrase.** Instead, the `<-` operator actually
stores a _reference_ to a bit of computer memory where that value is located.
This means that even after you run the next command, `"this is a bit of text"`
can still be sitting in memory for a while, even though it appears to have been
overwritten by the number 42 (we just don't have access to that text anymore,
and R will eventually reclaim that memory automatically):
```{r}
text_var <- "this is a bit of text"
text_var <- 42
```
**Similarly, when you run this bit of code, you don't create a duplicate
of that text value, the second variable refers to the same bit of computer
memory:**
```{r}
text_var1 <- "some new text value"
text_var2 <- text_var1
```
**To summarize: values (lines of text, numbers, data frames, matrices, lists, etc.)
don't have "names". They exist "anonymously" in computer memory. Variables
are nothing but "labels" for those values.**
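A minimal sketch of what "labels" means in practice (the variable names `label_a` and `label_b` are made up for this illustration). Two labels can point at the same value, and R only makes a separate copy at the moment one of them is modified, so changing one never silently changes the other:

```{r}
label_a <- c(10, 20, 30)
label_b <- label_a        # a second label for the same values, no copy yet

label_b[1] <- 99          # modifying via label_b makes R copy the values first

label_a                   # still 10 20 30
label_b                   # 99 20 30
```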
---
It might be very strange to start with something this technical (and
almost philosophical!), but it is very much worth keeping this in mind,
especially in more complex and huge data sets which we'll get to later in
our workshop.
**We will continue saying things like _"variable `abc` contains this or that
value"_ (instead of _"contains a reference to that value in memory"_) for
convenience, but this is just a simplification.**
---
**Write the following variable definitions in your R script and then evaluate this code
in your R console by `CTRL / CMD + Enter`.** (Note that hitting this shortcut
will move the cursor to the next line, allowing you to step by step evaluate
longer bits of code.)
```{r}
w1 <- 3.14
x1 <- 42
y1 <- "hello"
z1 <- TRUE
```
**When you type the variable names in your R console, you'll get them printed
back, of course:**
```{r}
w1
x1
y1
z1
```
---
Programming involves assigning values (or, more generally, objects in computer
memory) to variables. In our code, variables change values, so we
often need to check what the type of some variable is---are we working with
a number, or some text, etc.? `typeof()` is one of the functions that are
useful for this.
**What are the data "types" you get when you apply function `typeof()` on each of
these variables, i.e. when you type and evaluate a command like `typeof(w1)`?
Compare the result to the values you saved in those variables---what do you
get from `typeof()` on each of them?**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
A "floating point" number is of a so-called type "double":
```{r}
typeof(w1)
```
Interestingly, "integer number" written like 42 is also represented as "double",
even though we don't see a decimal point:
```{r}
typeof(x1)
```
To force a value to be "really integer", you can add the suffix `L` to the
number (note that strange `L` letter below). But you can regard this as a quirk
of R. It's almost never a distinction that's necessary. For 99.99% of my coding
needs, a "number" is just a number, and "double" is OK even for integers:
```{r}
x1 <- 42L
typeof(x1)
```
Generally speaking, "text" is represented by the data type "character" and is
defined by surrounding something in quotes, usually `" double quotes "`
(single quotes work as well):
```{r}
typeof(y1)
```
And the last important data type is "logical", indicating whether something
is `TRUE` or `FALSE`:
```{r}
typeof(z1)
```
:::
---
You can test whether or not a specific variable is of a specific type using
functions such as `is.numeric()`, `is.integer()`, `is.character()`, `is.logical()`. **See what results you get when you apply these functions on these
four variables `w1`, `x1`, `y1`, `z1`. Pay close attention to the difference
(or lack thereof?) between applying `is.numeric()` and `is.integer()` on
variables containing "values which look like numbers" (42, 3.14, etc.).**
**Note:** This might seem incredibly boring and useless but trust me. **In your
real data, such as in data frames (discussed below), you will encounter
variables with thousands of rows, sometimes millions. Being able to make sure
that the values you get in your data-frame columns are of the expected type
is something you will be doing very very often, especially when troubleshooting!
So this is a good habit to get into.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
#| collapse: true
w1
is.numeric(w1)
is.integer(w1)
```
```{r}
#| collapse: true
x1
is.numeric(x1)
is.integer(x1)
```
```{r}
#| collapse: true
y1
is.character(y1)
```
```{r}
#| collapse: true
z1
is.logical(z1)
is.numeric(z1)
is.integer(z1)
```
:::
To summarize (and oversimplify a little bit) R allows variables to have several
types of data, most importantly:
- integers (such as `42L`; remember that a plain `42` is stored as a double)
- numerics (such as `42.13`)
- characters (such as `"text value"`)
- logicals (`TRUE` or `FALSE`)
---
**We will also encounter two types of "non-values". We will not be discussing
them in detail here, but they will be relevant later. For the time being, just
remember that there are also:**
- missing values represented by `NA`---you will see this very often in data!
- undefined values represented by `NULL`
---
**What do you think is the practical difference between `NULL` and `NA`? In
other words, when you encounter one or the other in the data, how would you
interpret this?**
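::: {.callout-note collapse="true" icon=false}
#### Click to see a hint

One way to build an intuition for the difference is a small experiment like the one sketched below (the variable names `with_na` and `with_null` are invented for this illustration). It shows how each "non-value" behaves when placed inside a vector:

```{r}
with_na   <- c(1, NA, 3)
with_null <- c(1, NULL, 3)

length(with_na)      # 3: NA keeps its place, the value is just unknown
length(with_null)    # 2: NULL is "nothing at all" and simply disappears

is.na(with_na)       # FALSE TRUE FALSE
is.null(NULL)        # TRUE

mean(with_na)                 # NA, because one of the values is unknown
mean(with_na, na.rm = TRUE)   # 2, after dropping the missing value
```
:::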
## Exercise 2: Vectors
Vectors are, roughly speaking, collections of values. We create a vector by
calling the `c()` function (the "c" stands for "concatenate", or "joining
together").
**Create the following variables containing these vectors. Then inspect their
data types by calling the `typeof()` function on them again, just like you did for "single-value
variables" above.** Again, copy-paste this into your script and evaluate
using `CTRL / CMD + Enter` or paste it directly into your R Console and hit
`Enter`:
```{r}
w2 <- c(1.0, 2.72, 3.14)
x2 <- c(1, 13, 42)
y2 <- c("hello", "folks", "!")
z2 <- c(TRUE, FALSE)
```
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
typeof(w2)
typeof(x2)
typeof(y2)
typeof(z2)
```
:::
---
**We can use the function `is.vector()` to test that a given object really is a
vector. Try this on your vector variables.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
is.vector(w2)
is.vector(x2)
is.vector(y2)
is.vector(z2)
```
:::
---
**What happens when you call `is.vector()` on the variables `x1`, `y1`, etc. from
the previous Exercise (i.e., those which contain single values)?**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
is.vector(42)
```
Yes, even scalars (i.e., singular values) are formally vectors!
This is why we see the `[1]` index when we type a single number:
```{r}
1
```
In fact, even when we create a vector of length 1, we still get a scalar result:
```{r}
c(1)
```
The conclusion is, R doesn't actually distinguish between scalars and vectors!
A scalar (a single value) is simply a vector of length 1. Think of it this way:
in a strange mathematically-focused way, even a single tree is a forest. 🙃
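
As a small additional check of this idea (a sketch, not part of the original exercise), a single value has a length just like any longer vector, and wrapping it in `c()` changes nothing:

```{r}
length(42)            # 1
length("hello")       # 1: one string, regardless of how many letters it has
identical(42, c(42))  # TRUE
```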
:::
---
**Do elements of vectors need to be homogeneous (i.e., of the same data type)?
Try creating a vector with values `1`, `"42"`, and `"hello"` using the `c()`
function again, maybe save it into the variable `mixed_vector`. Can you do it?
What happens when you try (and evaluate this variable in the R console)? Inspect the result in the R console (take a close look
at how the result is presented in text and the quotes that you will see), or
use the `typeof()` function again.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
mixed_vector <- c(1, "42", "hello")
mixed_vector
```
```{r}
typeof(mixed_vector)
```
Notice that your values were all converted to text!
:::
---
**You can see that if vectors are not created with values of the same type, they are converted by
a cascade of so-called "coercions".** A vector defined with a mixture of different values (i.e., the four "atomic types" we discussed in Exercise 1: doubles,
integers, characters, and logicals) will be _coerced_ to only one of
those types, according to certain rules.
**Do a little detective work and try to figure out some of these coercion rules.
Make a couple of vectors with mixed values of different types using the
function `c()`, and observe what type of vector you get in return.**
**Hint:** Try creating a vector which has integers and strings, integers and
decimal numbers, integers and logicals, decimals and logicals, decimals and strings, and
logicals and strings. Observe the format of the result that you get, and
build your intuition on the rules of coercions by calling `typeof()` on each vector object to verify this intuition.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
- Numbers mixed with characters are forced to become characters:
```{r}
v1 <- c(1, "42", "hello")
v1
typeof(v1)
```
- Integers and decimals are all stored as doubles (this is what you saw above:
for most practical purposes, R does not distinguish between the two that
much):
```{r}
v2 <- c(1, 42.13, 123)
v2
typeof(v2)
```
- A logical mixed with numbers is forced into a number! `TRUE` is 1 and `FALSE`
is 0. This is used extremely often, so remember this rule!
```{r}
v3 <- c(1, 42, TRUE, FALSE)
v3
typeof(v3)
```
```{r}
v4 <- c(1.12, 42.13, FALSE)
v4
typeof(v4)
```
- A number mixed with text again gives type character:
```{r}
v5 <- c(42.13, "hello")
v5
typeof(v5)
```
- Logical is also forced into a character:
```{r}
v6 <- c(TRUE, "hello")
v6
typeof(v6)
```
:::
---
**Out of all these data type explorations, this Exercise is probably the
most crucial for any kind of data science work. Why do you think I say this?
Think about what can happen when someone does incorrect manual data entry
in Excel.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
Imagine what kinds of trouble can happen if you load tabular data from
somewhere and the values are not properly formatted. For instance, a "numeric"
column of your table might accidentally contain some characters (which can very
easily happen when manually entering data in Excel, etc.). This will be much
clearer when we get to data frames below.
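
A small sketch of how this plays out, using a made-up column of measurements (`entered` is hypothetical data, not from the workshop). A single mistyped cell is enough to turn the entire vector into text, and converting it back exposes the bad entry as a missing value:

```{r}
entered <- c(12.5, 13.1, "14,2", 15)   # one value typed with a comma by mistake
typeof(entered)                        # "character": everything became text

# converting back to numbers; the malformed entry cannot be parsed
# (R would warn about this, the warning is silenced here for brevity)
fixed <- suppressWarnings(as.numeric(entered))
fixed                                  # 12.5 13.1 NA 15.0
which(is.na(fixed))                    # 3: position of the problematic entry
```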
:::
---
Although creating vectors manually `c("using", "an", "approach", "like", "this")`
is often helpful, particularly when testing bits of code and experimenting
in the R console, it is impossible to create dozens or more values like this
by hand.
**You can create vectors of consecutive values using several useful approaches.
Try and experiment with these options:**
1. **Create a sequence of values from `i` to `j` with a shortcut `i:j`.
Create a vector of numbers from 7 to 23 like this.**
2. **Do the same using the function `seq()`. Read `?seq` to find out what
parameters you should specify (and how) to get the same result as the `i:j`
shortcut above, i.e., a vector from 7 to 23.**
3. **Modify the arguments given to `seq()` so that you create a vector of
numbers from 20 to 1.**
4. **Use the `by =` argument of `seq()` to create a vector of only odd
values starting from 1.**
**`seq()` is one of the most useful utility functions in R, so keep it in mind!**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
# 1
7:23
# 2
seq(from = 7, to = 23)
# 3
seq(from = 20, to = 1)
# 4
seq(1, 20, by = 2)
```
This might look boring, but these functions are super useful to generate
indices for data, adding indices as columns to tabular data, etc.
:::
---
**Another very useful built-in helper function (especially when we get to
the iteration Exercise below) is `seq_along()`. What does it give you when you run
it on this vector, for instance?**
```{r}
v <- c(1, "42", "hello", 3.1416)
```
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
seq_along(v)
```
This function allows you to quickly iterate over elements of a vector (or a list)
using indices into that vector (or a list).
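
As a preview of the iteration Exercise, here is a minimal sketch of that idea (the vector `samples` is invented for this example). `seq_along()` produces the positions 1, 2, 3, and so on, which a `for` loop can then walk through:

```{r}
samples <- c("ind1", "ind2", "ind3")

for (i in seq_along(samples)) {
  # i is the position, samples[i] is the element at that position
  cat("element", i, "is", samples[i], "\n")
}
```

Unlike `1:length(x)`, `seq_along(x)` also behaves sensibly for an empty vector, where it gives an empty sequence instead of `1 0`.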
:::
---
## Exercise 3: Lists
Lists (created with the `list()` function, equivalent to the `c()` function
for vectors) are a little similar to vectors but very different in a couple of important
respects. Remember how we tested what happens when we put different types of
values in a vector (reminder: vectors must be "homogeneous" in terms of the
data types of their elements!)?
**What happens when you create lists with
different types of values using the code in the following chunk? Use `typeof()`
on the resulting list variables and compare your results to those you got on "mixed
value" vectors above.**
```{r}
w3 <- list(1.0, "2.72", 3.14)
x3 <- list(1, 13, 42, "billion")
y3 <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
z3 <- list(TRUE, FALSE, 13, "string")
```
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
When we type the list variable in the R console, we no longer see the "coercion"
we observed for vectors (numbers remain numbers even though the list contains
strings):
```{r}
y3
```
:::
**Calling `typeof()` on the list in the R console will (disappointingly) not
tell us much about the data types of each individual element. Why is that?
Think about the mixed elements possible in a list.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
typeof(y3)
```
Well, if a list can have multiple different types of elements, there's no such
thing as a "type of a list" in the way we can say that there's a "numeric vector":
```{r}
typeof(c(1, 10, 135))
```
Or a logical vector:
```{r}
typeof(c(TRUE, FALSE, FALSE, TRUE))
```
More on this below!
:::
---
**Try also a different function called `str()` ("str" standing for
"structure") and apply it on one of those lists in your R console.
Is `typeof()` or `str()` more useful to inspect what kind of data is stored in
a list? Why? (`str()` will be very useful when we get to data frames for --- spoiler
alert! --- exactly this reason.)**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
The structure function peeks into every element of the list:
```{r}
str(y3)
```
As we said above, `typeof()` can only test the variable itself, but that
variable has (potentially) multiple types of values in it:
```{r}
typeof(w3)
```
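
If you want the type of each element rather than of the list as a whole, one possible approach is to apply `typeof()` to every element in turn. This is only a sketch: it uses `sapply()`, which has not been introduced in this chapter, on a made-up list called `mixed`:

```{r}
mixed <- list(1.5, "two", TRUE, 4L)

typeof(mixed)           # "list": the type of the container
sapply(mixed, typeof)   # "double" "character" "logical" "integer"
```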
:::
---
**Apply `is.vector()` and `is.list()` on one of the lists above (like `w3`
perhaps). What result do you get? Why do you get that result? Then run
both functions on one of the vectors you created above (like `w2`).
What does this mean?**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
- **Testing both functions on a list like `w3`:**
```{r}
w3
```
Lists are vectors!
```{r}
is.vector(w3)
```
Lists are lists (obviously!):
```{r}
is.list(w3)
```
- **Testing both functions on a vector like `w2`:**
```{r}
w2
```
Vectors are not lists!
```{r}
is.list(w2)
```
**In conclusion:**
1. Every list is (formally speaking) also a vector.
2. But vectors are not lists (because lists can have values of multiple types).
:::
---
Not only can lists contain arbitrary values of mixed types (the atomic data
types from Exercise 1), they can also contain "non-atomic" data,
such as other lists! In fact, you can, in principle, create lists
of lists of lists of... lists!
**Try creating a `list()` which, in addition
to a couple of normal values (numbers, strings, doesn't matter) also contains
one or two other lists (we call these lists "nested lists" for this reason,
or also "recursive lists"). Don't think about this too much,
just create some arbitrary nested list to get a bit of practice.
Save this in a variable called `weird_list` and type it back in your R
console, just to see how R presents such data back to you. In the next
Exercise, we will learn how to explore this type of data better.**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
Here's an example of such a "nested list":
```{r}
weird_list <- list(
  1,
  "two",
  list(
    "three",
    4,
    list(5, "six", 7)
  )
)
```
When we type it out in the R console, we see that R tries to lay out the
structure of this data with numerical indices (we'll talk about indices below!)
indicating the "depth" of each nested piece of data (either a plain number or
character, or another list!):
```{r}
weird_list
```
:::
**Note:** If you are confused (or even annoyed) why we are even doing this,
in the later discussion of data frames and spatial data structures, it
will become much clearer why putting lists into other lists enables a whole
other level of data science work. Please bear with me for now! This is
just laying the groundwork for some very cool things later down the line...
and, additionally, it's intended to bend your mind a little bit and get
you comfortable with how complex data can be represented in computer memory.
## Exercise 4: Logical/boolean expressions and conditionals
**This exercise is probably the most important thing you can learn to do
complex data science work on data frames or matrices. It's not necessary
to remember all of this, just keep in mind we did this exercise so that
you can refer to this information on the following days!**
Let's recap some basic Boolean algebra in logic. The following basic rules
apply (take a look at the
[truth table](https://en.wikipedia.org/wiki/Boolean_algebra#Basic_operations)
for a bit of a high school refresher) for the "and", "or", and "negation" operations:
1. The **AND operator** (represented by `&` in R, or often `∧` in math):
_Both conditions must be `TRUE` for the expression to be `TRUE`._
- `TRUE` & `TRUE` == `TRUE`
- `TRUE` & `FALSE` == `FALSE`
- `FALSE` & `TRUE` == `FALSE`
- `FALSE` & `FALSE` == `FALSE`
2. The **OR operator** (represented by `|` in R, or often `∨` in math):
_At least one condition must be `TRUE` for the expression to be `TRUE`._
- `TRUE` | `TRUE` == `TRUE`
- `TRUE` | `FALSE` == `TRUE`
- `FALSE` | `TRUE` == `TRUE`
- `FALSE` | `FALSE` == `FALSE`
3. The **NOT operator** (represented by `!` in R, or often `¬` in math):
_The negation operator turns a logical value to its opposite._
- `!TRUE` == `FALSE`
- `!FALSE` == `TRUE`
4. **Comparison operators** `==` ("equal to"), `!=` ("not equal to"), `<` or `>`
("less / greater than"), and `<=` or `>=` ("less / greater than or equal to"):
_Comparing two things with any of these results in `TRUE` or `FALSE`._
**Note:** There are other operations and more complex rules, but we will
be using these four exclusively (plus, the more complex rules can be
derived using these basic operations anyway).
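
These rules can be checked directly in the R console. The sketch below tries a few of them on single values, and then builds an "exclusive or" from the basic operations as one example of deriving a more complex rule (the variable names `p` and `q` are arbitrary):

```{r}
TRUE & FALSE      # FALSE
TRUE | FALSE      # TRUE
!TRUE             # FALSE
3 < 5             # TRUE
"anc" == "mod"    # FALSE

# "exclusive or" (one or the other, but not both) derived from &, |, and !
p <- TRUE
q <- TRUE
(p | q) & !(p & q)   # FALSE: both are TRUE, so "one but not both" fails
```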
---
Let's practice working with logical conditions on some toy problems.
**Create two logical vectors with three elements each using the `c()` function
(pick random `TRUE` and `FALSE` values for each of them, it doesn't matter
at all), and store them in
variables named `A` and `B`. What happens when you run `A & B`, `A | B`, and `!A`
or `!B` in your R console? How do these logical operators work when you have multiple values, i.e. vectors?**
**For extra challenge, try to figure out the results of `A & B`, `A | B`, and `!A`
in your head before you run the code in your R console!**
**Hint:** Remember that a "single value" in R is a vector like any other
(specifically vector of length one).
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
It turns out that we can compare not just single values (scalars) but also
multiple values like vectors. When we do this, R performs the given operation
for every pair of elements at once!
```{r}
A <- c(TRUE, FALSE, TRUE)
B <- c(FALSE, FALSE, TRUE)
```
```{r}
A & B
A | B
!A
!B
```
:::
---
**What happens when you apply base R functions `all()` and `any()` on your
`A` and `B` (or `!A` and `!B`) vectors?**
**Note:** Remember the existence of `all()` and `any()` because they are very
useful in daily data science work!
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
These functions reduce a logical vector down to a single `TRUE` or `FALSE`
value.
```{r}
A
all(A)
any(A)
```
:::
---
**If the above all feels too technical and mathematical, you're kind of right. That
said, when you do data science, you will be using these logical expressions
literally every single day. Why? Let's return from abstract programming
concepts back to reality for a second.**
**Think about a table which has a column with some values, like sequencing
`coverage`. Every time you filter for samples
with, for instance, `coverage > 10`, you're performing exactly this kind of
logical operation. You essentially ask, for each sample (each value in the
column), which samples have `coverage > 10` (giving you `TRUE`) and which have
coverage of 10 or less (giving you `FALSE`).**
Filtering data is about applying logical operations on vectors of `TRUE` and
`FALSE` values (which boils down to "logical indexing" introduced below),
even though those logical values rarely feature as data in the tables we
generally work with. Keep this in mind!
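
A minimal sketch of this idea on invented numbers (the vector `depth` is hypothetical). The comparison yields one `TRUE` / `FALSE` per sample, and because `TRUE` counts as 1 and `FALSE` as 0, `sum()` and `mean()` turn that logical vector into a count and a proportion:

```{r}
depth <- c(3.2, 25.1, 11.7, 8.9, 40.3)

passes <- depth > 10
passes          # FALSE TRUE TRUE FALSE TRUE

sum(passes)     # 3: number of samples passing the filter
mean(passes)    # 0.6: proportion of samples passing the filter
```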
---
**Consider the following vectors of coverages and origins of some
set of example aDNA individuals (let's imagine
these are columns in a table you got from a bioinformatics lab) and copy them
into your R script:**
```{r}
coverage <- c(15.09, 48.85, 36.5, 47.5, 16.65, 0.79, 16.9, 46.09, 12.76, 11.51)
origin <- c("mod", "mod", "mod", "anc", "mod", "anc", "mod", "mod", "mod", "mod")
```
**Then create a variable `is_high` which will contain a `TRUE` / `FALSE` vector
indicating whether a given `coverage` value is higher than 10. Then create
a variable `is_ancient` which will contain another logical vector indicating
whether a given sample is `"anc"` (i.e., "ancient").**
**Hint:** Remember that evaluating `coverage > 10` gives you a logical vector
and that you can store that vector in a variable.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
is_high <- coverage > 10
is_high
```
```{r}
is_ancient <- origin == "anc"
is_ancient
```
:::
---
**Use the AND operator `&` to test if there is any high coverage sample
(`is_high`) which is also ancient (`is_ancient`).**
**Hint:** Apply the `any()` function to a logical expression you get by
comparing both variables using the `&` operation.
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
This tests whether individual high coverage samples are also ancient:
```{r}
is_high & is_ancient
```
And this tests whether _any_ high coverage samples are also ancient:
```{r}
any(is_high & is_ancient)
```
**Note:** You don't always create dedicated throwaway temporary variables
like this. You could easily do the same much more concisely (although maybe
less readably). Both approaches are useful.
```{r}
any(coverage > 10 & origin == "anc")
```
:::
---
**Now let's say that you have a third vector in this hypothetical table,
indicating whether or not a given sample is from Estonia:**
```{r}
estonian <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE)
```
**Now write a more complex conditional expression, which will test if a
given individual has (again) coverage higher than 10 and is "ancient" OR
whether it's Estonian (and so its coverage or "mod" state doesn't
matter).**
::: {.callout-note collapse="true" icon=false}
#### Click to see the solution
```{r}
(coverage > 10 & origin == "anc") | estonian
```
:::
---
This was just a simple example of how you can use logical expressions to
do filtering based on values of (potentially many) variables, all at once,
in a so-called "vectorized" way (i.e., testing on many different values at
once, getting a vector of `TRUE` / `FALSE` values as a result).
You'll have
much more opportunity to practice this in our sessions on _tidyverse_,
but let's continue with other fundamentals first, which will make your
understanding of basic data manipulation principles even more solid.
---
**R also has the closely related operators `&&` and `||`. What do they do and how are
they different from the `&` and `|` you already worked with? Pick some exercises
from above and experiment with both versions of logical AND and OR operators to figure out what they do and how they are different.**
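::: {.callout-note collapse="true" icon=false}
#### Click to see a hint

A rough sketch of the main differences (the vectors `flags_1` and `flags_2` are invented for this example). `&` and `|` work element by element on whole vectors, whereas `&&` and `||` are meant for single `TRUE` / `FALSE` values, typically inside `if ()`, and they stop evaluating as soon as the answer is known. In R 4.3 and later, giving `&&` or `||` a vector longer than one is an error, so the vector case is shown only for `&`:

```{r}
flags_1 <- c(TRUE, FALSE, TRUE)
flags_2 <- c(TRUE, TRUE, FALSE)

flags_1 & flags_2      # element-wise: TRUE FALSE FALSE

TRUE && FALSE          # single values: FALSE

# short-circuiting: the right-hand side is never evaluated here,
# so the stop() calls do not produce an error
FALSE && stop("this is never run")   # FALSE
TRUE || stop("this is never run")    # TRUE
```
:::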
## Exercise 5: Indexing into vectors and lists
Vectors and lists are sequential collections of values.
To extract specific value(s) of a vector or a list (or to assign some
value at given position(s)), we use a so-called "indexing" operation
(often also called a "subsetting" operation for reasons that will become
clear soon).
Generally speaking, we can do indexing in three ways:
1. **numerical-based indexing** (by specifying a set of integer numbers, each
representing a position in the vector/list we want to extract),
2. **logical-based indexing** (by specifying a vector of `TRUE` / `FALSE` values
of the same length as the vector/list we're indexing into, each representing
whether or not -- `TRUE` or `FALSE` -- a position should be included in the
indexing result),
3. **name-based indexing** (by specifying names of elements to index)
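
As a quick preview before the detailed practice, the three approaches could look like this on a small named vector (a sketch; the vector `ages` and its names are invented for illustration):

```{r}
ages <- c(anna = 31, bob = 45, carl = 27)

ages[c(1, 3)]                  # numerical: elements at positions 1 and 3
ages[c(TRUE, FALSE, TRUE)]     # logical: keep positions marked TRUE
ages[c("anna", "carl")]        # name-based: elements called "anna" and "carl"
```

All three expressions return the same two elements, `anna` and `carl`.
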
**Let's now practice those for vectors and lists separately. Later, when we
introduce data frames, we will return to the topic of indexing again.**