simgen/popgen-stats.qmd at main · bodkan/simgen · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
# Tree-sequence statistics

## Introduction

In this session, you are going to build on top of the results from the previous [session
on programming demographic models](https://bodkan.net/simgen/popgen-models.html).
Specifically, you will learn how to compute population genetic statistics on _slendr_-simulated
tree sequences using
[_slendr_'s interface](https://www.slendr.net/reference/index.html#tree-sequence-statistics)
to the _tskit_ Python module.

**Note:** If you ever get _totally stuck_, take a look at the provided solution. The point
of these exercises is to learn things, not fight through endless frustration.
That said, try not to look at the solution automatically, always take a stab at
solving each exercise on your own first. And when you do decide to look
at my solution, always try to understand the code on your own, line by
line, before you run it (and therefore at least mentally follow a
similar process to writing yourself).

## Setup

**First, create a new R script `popgen-stats.R` and paste in the following code.** This
is one of the possible solutions to the Exercise 1 of the session
on building demographic models, and it's easier if we all
use it to be on the same page from now on, starting from the same model and
the same simulated tree sequence:

```{r}
#| collapse: true
#| results: hide
library(slendr)
init_env(quiet = TRUE)

chimp <- population("CHIMP", time = 7e6, N = 5000)
afr <- population("AFR", parent = chimp, time = 6e6, N = 15000)
eur <- population("EUR", parent = afr, time = 60e3, N = 3000)
nea <- population("NEA", parent = afr, time = 600e3, N = 1000, remove = 40e3)

gf <- gene_flow(from = nea, to = eur, rate = 0.03, start = 55000, end = 50000)

model <- compile_model(
  populations = list(chimp, nea, afr, eur),
  gene_flow = gf,
  generation_time = 30
)

# Here we scheduled the sampling of two Neanderthals at 70kya and 40kya
nea_samples <- schedule_sampling(model, times = c(70000, 40000), list(nea, 1))

# Here we schedule one Chimpanzee sample, 5 African samples, and 10 European samples
present_samples <- schedule_sampling(model, times = 0, list(chimp, 1), list(afr, 5), list(eur, 10))

# We also schedule the recording of one European sample between 50kya and 2kya,
# every 2000 years
times <- seq(40000, 2000, by = -2000)
emh_samples <- schedule_sampling(model, times, list(eur, 1))

# Because those functions produce nothing but a data frame, we can bind
# individual sampling schedules together
schedule <- rbind(nea_samples, present_samples, emh_samples)

# Simulate a tree sequence (with mutations) -- feel free to increase the amount
# of sequence from 20Mb to something like 100Mb (100e6 bp) if you want  to reduce
# the statistical noise a bit
ts <-
  msprime(model, sequence_length = 20e6, recombination_rate = 1e-8, samples = schedule, random_seed = 1269258439) %>%
  ts_mutate(mutation_rate = 1e-8, random_seed = 1269258439)

ts
```

As a sanity check, let's use a couple of _tidyverse_ table-munging tricks
to make sure the tree sequence does contain a set of sample
which matches our intended sampling schedule (particularly the time series
of European individuals and the two Neanderthals):

```{r}
#| message: false
library(dplyr)
```

```{r}
# total number of recorded individuals in the tree sequence
ts_samples(ts) %>% nrow
# times of sampling of each recorded European individual
ts_samples(ts) %>% filter(pop == "EUR") %>% pull(time)
# times of sampling of each recorded Neanderthal
ts_samples(ts) %>% filter(pop == "NEA") %>% pull(time)
# count of individuals in each population
ts_samples(ts) %>%
  group_by(pop, present_day = time == 0) %>%
  summarize(n = n()) %>%
  select(pop, present_day, n) %>%
  arrange(present_day)
```

::: {.aside}
**Note:** These bits of _tidyverse_ code are extremely helpful when you're working
with large tree sequences with many individuals as sanity checks that your
sampling worked as intended. I'm listing them here in case you've never worked
with the _tidyverse_ family of R packages before (such as the _dplyr_ package
where `filter()`, `group_by()`, `tally()`, and `pull()` come from).
:::

Everything looks good! Having made sure that the `ts` object contains the
individuals we want, let's move to the first exercise.

## Exercise 1: Computing nucleotide diversity

The toy model of ancient human history plotted above makes a fairly clear prediction of what should be the
nucleotide diversity expected in the simulated populations.
**Compute the nucleotide diversity in all populations using the _slendr_ function
[`ts_diversity()`](https://www.slendr.net/reference/ts_diversity.html#ref-examples) in your tree sequence `ts`. Do you get numbers that (relatively between
all populations) match what would expect from the model given the $N_e$ that
you programmed for each?**

**Hint:** Nearly every _slendr_ statistic function interfacing with _tskit_
accepts a `ts` tree-sequence object as its first argument, with further arguments
being either a vector of individual names representing a group of samples to
compute a statistic on, or a (named) list of such vectors (each element of that
list for a group of samples) --- these lists are intended to be equivalent to
the `sample_sets =` argument of many _tskit_ Python methods (which you've learned
about in your activity on _tskit_), except that they allow symbolic names
of individuals, rather then integer indices of nodes in a tree sequence.

Although you can get all the above information by processing the table produced
by the `ts_samples()` function, _slendr_ provides a useful helper function
`ts_names()` which only returns the names of individuals as a vector
(or a named list of such vectors, one vector per population as shown below).

When you call it directly, you get a plain vector of individual names:

```{r}
ts_names(ts)
```

This is not super helpful, unless we want to compute some statistic for _everyone_
in the tree sequence, regardless of their population assignment. A bit
more useful is to call the function like this, because it will produce a result
which can be immediately used as the `sample_sets =` argument mentioned in the
**Hint** above:

```{r}
ts_names(ts, split = "pop")
```

As you can see, this gave us a normal R list, with each element containing
a vector of individual names in a population. Note that we can use standard R
named-list indexing to get subsets of individuals:

```{r}
names <- ts_names(ts, split = "pop")

names["NEA"]

names[c("EUR", "NEA")]
```

etc.

Many of the following exercises will use these kinds of tricks to instruct
various _slendr_ / _tskit_ functions to compute statistics on subsets of
all individuals sub-sampled in this way.

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

**Population-based nucleotide diversity:**

Let's first get a named list of individuals in each group we want to be
working with (slendr tree-sequence statistic functions generally operate
with this kind of structure):

```{r}
sample_sets <- ts_names(ts, split = "pop")
sample_sets
```

We can use such `sample_sets` object to compute nucleotide diversity (pi)\
in each population, in a bit of a similar manner to how we would do it
with the [standard _tskit_ in Python](https://tskit.dev/tskit/docs/stable/python-api.html#tskit.TreeSequence.diversity):

```{r}
pi_pop <- ts_diversity(ts, sample_sets = sample_sets)
arrange(pi_pop, diversity)
```

You can see that this simple computation fits  the extreme differences in
long-term $N_e$ encoded by your _slendr_ demografr model.

:::

---

**After you've computed nucleotide diversity per-population, compute it for
each individual separately using the same function `ts_diversity()`** (which,
in this setting, gives you effectively the heterozygosity for each individual).
**If you are familiar with plotting in R, visualize the individual-based
heterozygosities across all populations.**

**Hint:** You can do this by giving a vector of names as `sample_sets =` (so
not an R list of vectors like you did in the previous exercise).
One way to make plotting of the result easy is to first get a table of
all individuals using `pi_df <- ts_samples(ts)`. Then you can use
`pi_df$name` as the parameter to `ts_diversity(ts, sample_sets = pi_df$name)`, and assign the `diversity` column values to `pi_df` as
a new column (perhaps named `diversity` as well).

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

**Per-individual heterozygosity:**

We can do this by passing the vector of individual names directory as the `sample_sets =` argument, rather than in a list of groups as we did above.

For convenience, we first get a table of all individuals (which of course
contains also their names) and in the next step, we'll just add their
heterozygosities as a new column:

```{r}
pi_df <- ts_samples(ts)
pi_df$name

pi_df$diversity <- ts_diversity(ts, sample_sets = pi_df$name)$diversity
pi_df
```

Let's plot the results using the ggplot2 package (because a picture is worth
a thousand numbers!)

```{r}
library(ggplot2)

ggplot(pi_df, aes(pop, diversity, color = pop, group = pop)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter() +
  theme_bw()
```

:::

## Exercise 2: Computing pairwise divergence

**Use the function
[`ts_divergence()`](https://www.slendr.net/reference/ts_divergence.html#ref-examples)
to compute genetic divergence between all pairs of populations. Again, do you
get results compatible with our demographic model in terms of expectation
given the split times between populations as you programmed them for your
model?**

**Hint:** Again, you can use the same concept of `sample_sets =` we discussed
in the previous exercise. In this case, the function computes _pairwise_
divergence between each element of the list given as `sample_sets =` (i.e.,
for each vector of individual names in that list).

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

```{r}
sample_sets <- ts_names(ts, split = "pop")

div_df <- ts_divergence(ts, sample_sets)
arrange(div_df, divergence)
```

We can see that the pairwise nucleotide divergences between populations
recapitulate the known population/species relationships we would expect from
our model.
:::

## Exercise 3: Outgroup $f_3$ statistic

**Use the function `ts_f3()` to compute the outgroup $f_3$ statistic
between pairs of African-European, African-Neanderthal, and
European-Neanderthal and a Chimpanzee outgroup. For instance, the first
one can be computed as `ts_f3(ts, B = "AFR_1", C = "EUR_1", A = "CHIMP_1")`.

**Hint:** The $f_3$ statistic is traditionally expressed as $f_3(A, B; C)$, where C represents the outgroup. Unfortunately, in _tskit_ the outgroup is named A, with B and C being the pair of samples from which we trace the length of branches towards the outgroup, so the statistic is interpreted as $f_3(B, C; A).$

**Note:** As a reminder, the higher the $f_3$ value, the more evolutionary
time has been shared between samples B and C.

**How do the outgroup $f_3$ results compare to your expectation based on simple population relationships (and to the divergence computation above)?**

**Do you see any impact of introgression on the $f_3$ value when a Neanderthal is included in the computation?**

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

```{r}
# Standard formalism is this:
#    f3(A, B; C) = E[ (A - C) * (B - C) ]
# But in tskit, A is the outgroup (different from ADMIXTOOLS!), see below...

# We can compute f3 for individuals...
ts_f3(ts, B = "AFR_1", C = "EUR_1", A = "CHIMP_1")

# ... but also whole populations (or population samples)
ts_f3(ts, B = sample_sets["AFR"], C = sample_sets["EUR"], A = "CHIMP_1")

ts_f3(ts, B = sample_sets["AFR"], C = sample_sets["NEA"], A = "CHIMP_1")

ts_f3(ts, B = sample_sets["EUR"], C = sample_sets["NEA"], A = "CHIMP_1")
```
:::


## Exercise 4: `mode = "branch"` vs `mode = "site"`

**Repeat the exercise 3 by setting `mode = "site"` in the
`ts_f3()` function calls** (this is actually the default behavior of all
_slendr_ tree-sequence based _tskit_ functions). This will switch the _tskit_
computation to using mutation counts along each branch of the tree sequence,
rather than using branch length themselves. **Why might the branch-based computation
be a bit better if what we're after is investigating the expected values of statistics
under some model?**


::: {.callout-note collapse="true" icon=false}
### Click to see the solution

See [this tutorial](https://tskit.dev/tutorials/no_mutations.html#genealogy-based-measures-are-less-noisy) (and particularly the directly linked section) for explanation.
:::


## Exercise 5: Outgroup $f_3$ statistic as a linear combination of $f_2$ statistics

You might have learned that any complex $f$-statistic can be expressed as a linear combination of multiple $f_2$ statistics (which represent simple branch length separating two lineages). **Verify that this is the case by looking up equation *(20b)* in [this amazing paper](https://academic.oup.com/genetics/article/202/4/1485/5930214) and computing an $f_3$ statistic for any arbitrary trio of individuals of your choosing using such linear combination of $f_2$ statistics.**

**Hint:** Here's that equation in a schematic form

```
      f3(A, B; C) = f2(A, C) + f2(B, C) - f2(A, B) / 2
```

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

```{r}
# standard f3
ts_f3(ts, B = "AFR_1", C = "AFR_2", A = "CHIMP_1")

# a "homemade" f3 statistic as a linear combination of f2 statistics
# f3(A, B; C) = f2(A, C) + f2(B, C) - f2(A, B) / 2
homemade_f3 <- (
  ts_f2(ts, A = "AFR_1", B = "CHIMP_1")$f2 +
  ts_f2(ts, A = "AFR_2", B = "CHIMP_1")$f2 -
  ts_f2(ts, A = "AFR_1", B = "AFR_2")$f2
) / 2

homemade_f3
```
:::


## Exercise 6: Detecting Neanderthal admixture in Europeans

Let's now pretend its about 2008, we've sequenced the first Neanderthal genome,
and we are working on a project that will
[change human evolution research forever](https://www.science.org/doi/10.1126/science.1188021).
We also have the genomes of a couple of people from Africa and Europe, which we
want to use to answer the most burning question of all evolutionary
anthropology: _"Do some people living today carry Neanderthal ancestry?"_

Perhaps you've heard earlier about an $f_4$ statistic (or its equivalent $D$ statistic) can be used
as a test of "treeness". Simply speaking, for some "quartet" of individuals
or population samples (whose names can be giving as characters such as
"EUR_1" to the `ts_f4()` function, as you can see in the help
page of `?ts_f4`), they can be used as a hypothesis test of whether the
history of those samples is compatible with there not having been an introgression.

**Compute the $f_4$ test of Neanderthal introgression in EUR individuals using
the _slendr_ function `ts_f4()`.**

**When you're computing the $f_4$, make sure to set `mode = "branch"` argument
of the `ts_f4()` function.**

**Note:** By default, each _slendr_ / _tskit_ statistic function operates
on mutations, and this will switch them to use branch length (as you might
know, $f$-statistics are mathematically defined using branch lengths in trees
and `mode = "branch"` does exactly that).

**Hint:** If you haven't learned this earlier, you want
to compute (and compare) the values of these two statistics using the _slendr_
function `ts_f4()`:

1. $f_4$(\<some African\>, \<another African\>; \<Neanderthal\>, \<Chimp\>)

2. $f_4$(\<some African\>, \<a test European\>; \<Neanderthal\>, \<Chimp\>),

here `<individual>` can be the name of any individual recorded in your
tree sequence, such as names you saw as `name` column in the table returned
by `ts_samples(ts)` (i.e. `"NEA_1"` could be used as a "representative"
\<Neanderthal\> in those equations, similarly for `"CHIMP_1"` as the fourth
sample in the $f_4$ quarted representing the outgroup).

To simplify things a lot, we can understand the above equations as comparing the
counts of so-called BABA and ABBA allele patterns between the quarted of samples
specified in the statistics:

$$
f_4(AFR, X; NEA, CHIMP) = \frac{\#BABA - \#ABBA}{\#SNPs}
$$

The first $f_4$ statistic above is not expected to give values "too different"
from 0 (even in case of Neanderthal introgression into Europeans) because we
don't expect two African individuals to differ "significantly" in terms of how
much alleles they share with a Neanderthal (because their ancestors
never met Neanderthals!). The other should --- if there was
a Neanderthal introgression into Europeans some time in their history --- be
"significantly negative".

**Is the second of those two statistics "much more negative" than the first,
as expected assuming introgression from Neanderthals into Europeans?**

**Why am I putting "significantly" and "much more
negative" in quotes in the previous sentence?
What are we missing here for this being a true hypothesis
test as you might be accustomed to from computing $f$-statistics using
a tool such as ADMIXTOOLS?** (We will get to this again in a following
exercise.)

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

Compute the difference in the amount of allele sharing between two African
individuals and a Neanderthal:

```{r}
f4_null <- ts_f4(ts, W = "AFR_1", X = "AFR_2", Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
f4_null
```

Compute the difference in the amount of allele sharing between an African
individual vs European individual and a Neanderthal:

```{r}
f4_alt <- ts_f4(ts, W = "AFR_1", X = "EUR_1", Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
f4_alt
```

We can see that the second test resulted in an f4 value about ~20 times more
negative than the first test, indicating that a European in our test carries
"significantly more" Neanderthal alleles compared to the baseline expectation
of no introgression established by the comparison to an African ...

```{r}
abs(f4_alt$f4 / f4_null$f4)
```

... although this is not a real test of significance (we have no Z-score or
standard error which would give us something like a p-value for the hypothesis
test, as we get by jackknife procedure in ADMIXTOOLS)
:::


## Exercise 7: Detecting Neanderthal admixture in Europeans v2.0

The fact that we don't get something equivalent to a p-value in these kinds of
simulations is generally not a problem, because we're often interested in
establishing a trend of a statistic under various conditions, and understanding
when and how its _expected value_ behaves in a certain way.
If statistical noise is a problem, we work around this by computing a
statistic on multiple simulation replicates or even increasing the sample sizes.

::: {.aside}
**Note:** To see this in practice, you can check out a
[paper](https://www.pnas.org/doi/10.1073/pnas.1814338116#fig02) in which I used
this approach quite successfully on a related problem.
:::

On top of that, p-value of something like an $f$-statistic (whether it's
significantly different from zero) is also strongly affected by quality of
the data, sequencing errors, coverage, etc. (which can certainly be examined
using simulations!). However, these are aspects of modeling which are quite
orthogonal to the problem of investigating the expectations and trends of
statistics given some underlying evolutionary model, which is what we're after
in these exercises.

That said, even in perfect simulated data, what exactly does
_"significantly different from zero compared to some baseline expectation"_
mean can be blurred by noise with simple single-individual comparisons that we
did above. Let's increase the sample size a bit to see if a statistical
pattern expected in $f_4$ statistic from our Neanderthal introgression model
becomes more apparent.

**Compute the first $f_4$ statistic (the baseline expectation between
a pair of Africans) and the second $f_4$ statistic (comparing an African and
a European), but this time on _all recorded Africans_ and _all recorded
Europeans_, respectively. Plot the *distributions* of those two sets of
statistics. This should remove lots of the uncertainty and make a statistical
trend stand out much more clearly.**

**Hint:** Whenever you need to compute something for many things in sequence,
looping is very useful. One way to do compute, say, an $f_4$ statistic over
many individuals is by using this kind of pattern using R's looping function
`lapply()` (you can get a refresher on looping in the R bootcamp session):

```{r}
#| eval: false

# Loop over vector of individual names (variable x) and apply a given ts_f4()
# expression on each individual (note the ts_f4(..., X = x, ...) in the code)
list_f4 <- lapply(
  c("ind_1", "ind_2", ...),
  function(x) ts_f4(ts, W = "AFR_1", X = x, Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
)

# The above gives us a list of data frames, so we need to bind them all into a
# single table for easier interpretation and visualization
df_f4 <- do.call(rbind, list_f4)
```

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

This gives us list of vectors of the names of all individuals in each
population...

```{r}
sample_sets <- ts_names(ts, split = "pop")
# ... which we can then access like this
sample_sets$AFR # all Africans
sample_sets$EUR # all Europeans
```

Let's compute the f4 statistic for all Africans...

```{r}
f4_afr_list <- lapply(
  sample_sets$AFR,
  function(x) ts_f4(ts, W = "AFR_1", X = x, Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
)
# ... and Europeans
f4_eur_list <- lapply(
  sample_sets$EUR,
  function(x) ts_f4(ts, W = "AFR_1", X = x, Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
)
```

Bind each list of data frames into a single data frame:

```{r}
f4_afr <- do.call(rbind, f4_afr_list)
f4_eur <- do.call(rbind, f4_eur_list)

# add population columns to each of the two results for easier plotting
f4_afr$pop <- "AFR"
f4_eur$pop <- "EUR"

# bind both tables together
f4_results <- rbind(f4_afr, f4_eur)
```

Now we can visualize the results:

```{r}
f4_results %>%
  ggplot(aes(pop, f4, color = pop)) +
  geom_boxplot() +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = 2) +
  ggtitle("f4(AFR, EUR; NEA, CHIMP)") +
  theme_bw()
```

We can see that the $f_4$ statistic test of Neanderthal introgression in
Europeans indeed does give a much more negative distribution of values compared
to the baseline expectation which compares two Africans to each other.
:::


## Exercise 8: Trajectory of Neanderthal ancestry in Europe over time

There used to be a lot of controversy about the question of whether or not did Neanderthal ancestry proportion in Europeans decline or not over the past 40 thousand years (see Figure 1 in [this paper](https://www.pnas.org/doi/full/10.1073/pnas.1814338116) and Figure 2 in [this paper](https://www.nature.com/articles/nature17993)).

Your simulated tree sequence contains a time-series of European individuals over time. Use the _slendr_ function `ts_f4ratio()` to compute (and then plot) the proportion (commonly designated as `alpha`) of Neanderthal ancestry in Europe over time. Use $f_4$-ratio statistic of the following form:

```{r}
#| eval: false
ts_f4ratio(ts, X = <vector of ind. names>, A = "NEA_1", B = "NEA_2", C = "AFR_1", O = "CHIMP_1")
```

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

```{r}
# Extract table with names and times of sampled Europeans (ancient and present day)
eur_inds <- ts_samples(ts) %>% filter(pop == "EUR")
eur_inds

# Compute f4-ration statistic (this will take ~30s) -- note that we can provide
# a vector of names for the X sample set to the `ts_f4ratio()` function
nea_ancestry <- ts_f4ratio(ts, X = eur_inds$name, A = "NEA_1", B = "NEA_2", C = "AFR_1", O = "CHIMP_1")

# Add the age of each sample to the table of proportions
nea_ancestry$time <- eur_inds$time
nea_ancestry

nea_ancestry %>%
  ggplot(aes(time, alpha)) +
  geom_point() +
  geom_smooth(method = "lm", linetype = 2, color = "red", linewidth = 0.5) +
  xlim(40000, 0) +
  coord_cartesian(ylim = c(0, 0.1)) +
  labs(x = "time [years ago]", y = "Neanderthal ancestry proportion") +
  theme_bw() +
  ggtitle("Neanderthal ancestry proportion in Europeans over time")

# For good measure, let's test the significance of the decline using a linear model
summary(lm(alpha ~ time, data = nea_ancestry))
```
:::


## Exercise 9 (<u>hard challenge</u>): How many unique f4 quartets are there?

If you've used $f$-statistics before, you've probably learned about various symmetries in $f_4$ (but also other $f$-statistics) depending on the arrangement of the "quartet". As a trivial example, an $f_3(A; B, C)$ and $f_3(A; C, B)$ will give you exactly the same value, and the same thing applies even to more complex $f$-statistics like $f_4$.

**Use simulations to compute how many unique** $f_4$ values involving a single quartet are there.

**Hint:** Draw some trees to figure out why could that be true. Also, when computing `ts_f4()`, set `mode = "branch"` to avoid the effect of statistical noise due to mutations.

::: {.callout-note collapse="true" icon=false}
### Click to see the solution

```{r}
# # install a combinatorics R package
# install.packages("combinat")

library(combinat)

# These are the four samples we can create quartet combinations from
quartet <- c("AFR_1", "EUR_1", "NEA_1", "CHIMP_1")
quartets <- permn(quartet)
quartets

# How many permutations there are in total?
#   4! = 4 * 3 * 2 * 1 = 24
factorial(4)

# We should therefore have 24 different quartet combinations of samples
length(quartets)

# Loop across all quartets, computing the corresponding f4 statistic (we want
# to do this using branch lengths, not mutations, as the mutation-based computation
# would involve statistical noise)
all_f4s <- lapply(quartets, function(q) ts_f4(ts, q[1], q[2], q[3], q[4], mode = "branch"))

# Bind the list of f4 results into a single data frame and inspect the results
all_f4s <- bind_rows(all_f4s) %>% arrange(abs(f4))
print(all_f4s, n = Inf)

# Narrow down the results to only unique f4 values
distinct(all_f4s, f4, .keep_all = TRUE)
distinct(all_f4s, abs(f4), .keep_all = TRUE)
```
:::