---
title: "Data format"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data format}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(anovir)
```
## Introduction {#top}
This vignette describes how survival data should be formatted for use with the
functions in this package.
A general data format works for most functions ([here](#general)), but some
functions require data in a specific format. These are:
[nll_two_inf_subpops_obs](#nll_two_inf_subpops_obs) |
[nll_recovery](#nll_recovery) |
[nll_recovery_II](#nll_recovery_II)
## General format required {#general}
The negative log-likelihood (_nll_) functions in this package require the
survival data being analysed to be in a data frame.

The default assumption is that each row contains data for an individual host.

Data can also be grouped, where a row contains the frequency of individuals
from a particular treatment or population experiencing the same event in the
same sampling interval.
In this case, the frequency data __must__ be in a column named '__fq__'.
This column is detected automatically and the _nll_ calculations are adjusted
accordingly; frequencies of zero ('0') are allowed.
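
As an illustration of the grouped layout, the sketch below collapses a small set of made-up individual records into one row per combination of treatment, event type and time, with the counts stored in `fq`. The object names `individual` and `grouped` are hypothetical; the other column names are the defaults described in the remainder of this section.

```{r}
# Hypothetical individual-level records: one row per host
individual <- data.frame(
  infection_treatment = c(0, 0, 1, 1, 1, 1),
  time                = c(5, 5, 3, 3, 3, 7),
  censor              = c(0, 0, 0, 0, 0, 1)
)

# Count hosts sharing the same treatment, event type and sampling interval
individual$fq <- 1
grouped <- aggregate(fq ~ infection_treatment + time + censor,
                     data = individual, FUN = sum)
grouped
```

The sum of `fq` in the grouped data frame should equal the number of rows in the individual-level data.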
By default, most _nll_ functions assume a data frame contains three
columns with the following names:

* __censor__
* __time__
* __infection_treatment__

These columns contain the following information:

* __censor__
    * describes whether the event was death or right-censoring
    * must be a numerical value of
        * '0' for death,
        * '1' for right-censoring.
* __time__
    * describes the time when the event occurred
    * must be a numerical value > 0.
* __infection_treatment__
    * identifies whether data are from an infected or uninfected treatment
    * must be a numerical value of
        * '0' for an uninfected treatment,
        * '1' for an infected treatment.
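
A minimal sketch of a data frame in this format, using made-up values, is given below. The object name `example_data` and the checks are illustrative only and are not part of the package.

```{r}
# Hypothetical data: one row per individual host
example_data <- data.frame(
  censor              = c(0, 0, 1, 0, 0, 1),
  time                = c(4, 9, 14, 3, 6, 14),
  infection_treatment = c(0, 0, 0, 1, 1, 1)
)

# Checks mirroring the requirements listed above
stopifnot(
  is.numeric(example_data$censor),
  all(example_data$censor %in% c(0, 1)),
  is.numeric(example_data$time),
  all(example_data$time > 0),
  all(example_data$infection_treatment %in% c(0, 1))
)
```
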
These columns can be renamed when specifying parameters for the _nll_ function
to be sent for estimation by maximum likelihood. Columns with the default
names above do not need to be specified, but the contents of their rows must
be coded as above, i.e., data from an infected treatment must be coded
as '1' and not 'infected', '+ve', etc.
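
If a treatment column holds text labels, it can be recoded before analysis. The sketch below assumes a hypothetical data frame `raw` whose column `status` holds the labels 'infected' and 'uninfected'; neither name comes from the package.

```{r}
# Hypothetical raw data with text labels for the treatment
raw <- data.frame(
  status = c("uninfected", "infected", "infected"),
  time   = c(12, 5, 8),
  censor = c(1, 0, 0)
)

# Recode to the numerical values required: 1 = infected, 0 = uninfected
raw$infection_treatment <- ifelse(raw$status == "infected", 1, 0)
raw
```
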
All _nll_ functions assume individuals in an uninfected treatment
are uninfected.
Not all functions assume all individuals in an infected treatment are infected.
[back to top](#top)
## Specific formats
Some _nll_ functions have specific data formatting requirements.
### nll_two_inf_subpops_obs {#nll_two_inf_subpops_obs}
This function applies to cases where two distinct subpopulations of hosts have
been identified ('observed') within an infected population or treatment.
In addition to the columns above, this function requires the data frame being
analysed to have a column identifying the two infected subpopulations:

* __infsubpop__
    * identifies which subpopulation an infected host belongs to:
        * '1' for subpopulation '1'
        * '2' for subpopulation '2'
    * the values '1' and '2' are arbitrary labels used only to
      distinguish the two subpopulations

The column can be renamed when specifying the _nll_ function,
but it must contain values of '1' or '2' for the two subpopulations.
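
A minimal sketch of this layout with made-up values follows. How rows from an uninfected treatment should be coded in this column is not stated here, so the example is limited to infected hosts; the object name `subpop_data` is hypothetical.

```{r}
# Hypothetical infected hosts assigned to two observed subpopulations
subpop_data <- data.frame(
  censor              = c(0, 0, 1, 0),
  time                = c(3, 6, 14, 9),
  infection_treatment = c(1, 1, 1, 1),
  infsubpop           = c(1, 1, 2, 2)
)

# The subpopulation column must only contain the values 1 or 2
stopifnot(all(subpop_data$infsubpop %in% c(1, 2)))
table(subpop_data$infsubpop)
```
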
[back to top](#top)
### nll_recovery {#nll_recovery}
The data frame required by this function has a specific structure.
In this case, whether an event was death or right-censoring is coded not
in the rows of the data frame but in its columns.

The data frame needs six columns with the following names, filled with
binary [0/1] data as follows:

* __control.d__
    * '1' for control individuals dying during the experiment
    * '0' otherwise
* __control.c__
    * '1' for control individuals censored during or at the end of the experiment
    * '0' otherwise
* __infected.d__
    * '1' for infected individuals dying while still infected during
      the experiment
    * '0' otherwise
* __infected.c__
    * '1' for infected individuals censored during or at the end of
      the experiment
    * '0' otherwise
* __recovered.d__
    * '1' for recovered individuals dying during the experiment
    * '0' otherwise
* __recovered.c__
    * '1' for recovered individuals censored during or at the end of
      the experiment
    * '0' otherwise

Each of these six columns needs an individual row for __every__
sampling interval between the first and last sampling interval,
i.e., from time _t = 1_ to time _t = tmax_,
where _tmax_ is the last sampling interval.

For example, if survival data were sampled each day from
days 1 to 20 of an experiment, the data frame needs
6 x _tmax_ = 6 x 20 = 120 rows.

NB sampling intervals are assumed to be equally spaced throughout the experiment.

The data frame also needs the following columns, with these names and contents:

* __censor__
    * '1' for censored data
    * '0' otherwise
* __t__
    * time of the event; must be numeric with _t_ > 0
* __fq__
    * frequency of events occurring at time _t_;
      values of zero (0) are allowed
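
The layout described above can be sketched in base R. The helper `make_recovery_skeleton()` below is hypothetical and not part of the package; it builds the 6 x _tmax_ rows with every frequency set to zero, leaving the observed counts to be filled in afterwards. The column order is an assumption.

```{r}
# Hypothetical helper: one row per population and sampling interval
make_recovery_skeleton <- function(tmax) {
  pops <- c("control.d", "control.c", "infected.d",
            "infected.c", "recovered.d", "recovered.c")
  # Each population occupies a block of tmax consecutive rows
  pop_of_row <- rep(pops, each = tmax)
  # One binary [0/1] indicator column per population
  indicators <- lapply(pops, function(p) as.numeric(pop_of_row == p))
  skeleton <- as.data.frame(setNames(indicators, pops))
  # Populations whose name ends in ".c" hold right-censored individuals
  skeleton$censor <- as.numeric(grepl("\\.c$", pop_of_row))
  # Times run from 1 to tmax, in ascending order, within each block
  skeleton$t <- rep(seq_len(tmax), times = length(pops))
  # Frequencies start at zero and are replaced by observed counts
  skeleton$fq <- 0
  skeleton
}

skeleton <- make_recovery_skeleton(tmax = 20)
dim(skeleton)
```

Each row of the skeleton has exactly one of the six population columns set to '1'.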

For example, the first few lines of the data frame
_recovery_data_ are given below:
```{r}
head(recovery_data, 3)
```
These rows are for the population _control.d_,
that is, control individuals dying during the experiment (_control.d = 1_).
They show these individuals were not censored (_censor = 0_)
and that, for times _1, 2, 3_, the frequency of individuals dying was
_1, 4, 11_, respectively.

The last few lines of the same data frame are:
```{r}
tail(recovery_data, 3)
```
These rows are for the population of hosts that recovered and were
right-censored (_recovered.c = 1, censor = 1_). For times _18, 19, 20_,
the frequency of individuals censored in this population was _0, 0, 41_,
respectively.

NB all rows between _t = 1_ and _t = tmax_ need to be included and in
ascending order, even if the frequency of individuals involved is zero.
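
One way to check this requirement is sketched below. The function `check_intervals()` is hypothetical and not part of the package; it takes the times recorded for a single population and tests whether they run from 1 to _tmax_ in ascending order with no gaps.

```{r}
# Hypothetical check: TRUE only if the times are exactly 1, 2, ..., tmax
check_intervals <- function(t, tmax) {
  length(t) == tmax && all(t == seq_len(tmax))
}

check_intervals(t = 1:5, tmax = 5)              # complete and ascending
check_intervals(t = c(1, 2, 4, 5), tmax = 5)    # interval 3 is missing
check_intervals(t = c(2, 1, 3, 4, 5), tmax = 5) # not in ascending order
```

Only the first call returns `TRUE`; the other two fail because an interval is missing or the rows are out of order.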
[back to top](#top)
### nll_recovery_II {#nll_recovery_II}
The data for this function need to be in the same format as for
_nll_recovery_. The data frame must include the two columns
_control.d_ and _control.c_, along with the frequency of individuals dying at
each interval (= 0) and the number censored during or at the end of the
experiment, even though these control data do not contribute towards the
calculation of the negative log-likelihood.
[back to top](#top)