---
title: "Data format"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data format}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(anovir)
```

## Introduction {#top}

This vignette describes how survival data should be formatted for use with the 
functions in this package. 

There is a general data format that works for most functions 
([here](#general)), but some functions require data to be in a specific format, 
these are;

[nll_two_inf_subpops_obs](#nll_two_inf_subpops_obs) | 
[nll_recovery](#nll_recovery) | 
[nll_recovery_II](#nll_recovery_II)

## General format required {#general}

The negative log-likelihood (_nll_) functions in this package require the 
survival data to be analysed to be in a data frame. 

The default assumption is each row contains data for an individual host. 

Data can be grouped, where a row contains data on the frequency of individuals 
from a particular treatment or population experiencing the same event, in the 
same sampling interval. 
In this case, the frequency data __must__ be in a column named, '__fq__'. 
This column will be automatically detected and _nll_ calculations adjusted 
accordingly; frequencies of zero ('0') are allowed.

By default, most _nll_ functions assume a data frame will contain three 
columns named as follows,

* __censor__, 
* __time__,
* __infection_treatment__

containing the following information;

* __censor__
    * describes whether event was death or right-censoring
    * needs a numerical value of;
        * '0' for death,
        * '1' for right-censoring.
        
* __time__
    * describes the time when the event occurred
        * needs to be a numerical value > 0.
    
* __infection_treatment__
    * identifies whether data are from an infected or uninfected treatment
    * needs to be a numerical value of;
        * '0' for an uninfected treatment,
        * '1' for an infected treatment.

These columns can be renamed when specifying parameters for the _nll_ function 
to be sent for estimation by maximum likelihood. Columns with the default 
names above to not need to be specified, but the contents of their rows must 
be specified as above, i.e., data from an infected treatment must be specified 
as '1' and not 'infected', '+ve', etc.

All _nll_ functions assume individuals in an uninfected treatment 
are uninfected.

Not all functions assume all individuals in an infected treatment are infected.

[back to top](#top)

## Specific formats

Some _nll_ functions have specific data formatting requirements.

### nll_two_inf_subpops_obs {#nll_two_inf_subpops_obs}

This function applies to cases where two distinct subpopulations of hosts have 
been identified ('observed') within an infected population or treatment. 
In addition to the columns above, this function requires the data frame to be 
analysed to have a column identifying the two infected subpopulations;

* __infsubpop__
    * identifies which subpopulation of data infected hosts belong to;
        * '1' for subpopulation '1'
        * '2' for subpopulation '2'
            * values of '1' or '2' are arbitrary and only used for 
            identifying each subpopulation
        
The column can be renamed when specifying the _nll_ function, 
but it must contain values of '1' or '2' for the two subpopulations. 

[back to top](#top)

### nll_recovery {#nll_recovery}

The data frame required by this function has a specific structure. 
In this case, whether an event was death or right-censoring is not coded 
in the rows of a data frame, but in columns.

The data frame needs six columns with the following column names and these 
columns need to be filled with binary [0/1] data as follows;

* __control.d__
    * '1' for control individuals dying during the experiment, 
    * '0' otherwise
* __control.c__
    * '1' for control individuals censored during or at the end of the experiment
    * '0' otherwise
* __infected.d__
    * '1' for infected individuals dying while still infected during of 
    the experiment
    * '0' otherwise
* __infected.c__
    * '1' for infected individuals censored during or at the end of 
    the experiment
    * '0' otherwise
* __recovered.d__
    * '1' for recovered individuals dying during the end of the experiment
    * '0' otherwise
* __recovered.c__
    * '1' for recovered individuals censored during or at the end of 
    the experiment
    * '0' otherwise

Each of these six columns needs an individual row for __every__ 
sampling interval between the first and last sampling interval, 
i.e., from time _t = 1_ to time _t = tmax_, 
where _tmax_ is the last sampling interval. 

For example, if survival data was sampled each day from 
days 1 to 20 of an experiment, the data frame will need to have; 
6 x _tmax_ = 6 x 20 = 120 rows. 

NB it is assumed sampling intervals are equally spaced throughout the experiment.

There also needs to the following columns with the following names and contents, 

* __censor__
    * '1' for censored data
    * '0' otherwise
* __t__
    * data for the time of event; needs to be numeric with _t_ > 0
* __fq__
    * data for the frequency of events occuring at time _t_; 
    values of zero (0) are allowed
    
For example, the first few lines of the data frame 
_data_recovery_ are given below;

```{r}
head(recovery_data, 3)
```

they are for the population _control.d_, 
that is control individuals dying during the experiment (_control.d = 1_), 
and show these individuals were not censored (_censor = 0_), 
and for times _1, 2, 3_, the frequency of individuals dying was
_1, 4, 11_, respectively. 

The last few lines of the same data frame are,

```{r}
tail(recovery_data, 3)
```

for the population of hosts that recovered and were right-censored, 
_recovered.c = 1, censor = 1_, and for times _18, 19, 20_, 
the frequency of individauls censored in this population was _0, 0, 41_, 
respectively. 
NB all rows between _t = 1_ and _t = tmax_ need to be included and in 
ascending order, even if the frequency of individuals involved is zero. 

[back to top](#top)

### nll_recovery_II {#nll_recovery_II}

The data for this function needs to be in the same format as for 
_nll_recovery_ and needs to include the two columns, 
_control.d, control.c_, along with the frequency of individuals dying at 
each interval (= 0), and the number censored during or at the end of the 
experiment, even though they do not contribute towards the 
calculation of the negative log-likelihood.

[back to top](#top)