understand_gen_goup.Rmd
Animals with unknown parents are assigned to phantom groups. Phantom groups can be defined based on criteria such as breed and year of birth.
Possible further grouping criteria are country or region of origin.
The goal of this analysis is to understand the steps required to define genetic groups.
The input to the function that defines the genetic groups is a CSV file named `statGenGroupOutFile`. That file is produced by the Fortran 90 program `refmtped.f90`, which is used to re-format the pedigree. The first few lines of this file are shown below.
stat_gen_grp_input <- system.file("extdata", "statGenGroupOutFile", package = "qgengroup")
tbl_gen_grp <- readr::read_csv2(file = stat_gen_grp_input, skip = 2)
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#> Land = col_character(),
#> Rasse = col_character(),
#> GebJahr = col_character(),
#> SP_SB = col_double(),
#> SP_SC = col_double(),
#> SP_DB = col_double(),
#> SP_DC = col_double(),
#> SP_SX = col_logical(),
#> SP_DX = col_logical()
#> )
head(tbl_gen_grp)
#> # A tibble: 6 x 9
#> Land Rasse GebJahr SP_SB SP_SC SP_DB SP_DC SP_SX SP_DX
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 USA HO 1950 NA 1 NA 1 NA NA
#> 2 USA HO 1952 1 NA 1 NA NA NA
#> 3 USA HO 1954 1 NA 1 NA NA NA
#> 4 USA HO 1955 2 1 1 1 NA NA
#> 5 USA HO 1956 NA 1 NA 1 NA NA
#> 6 USA HO 1957 NA 1 NA 1 NA NA
Starting from the left-most column, the first three columns contain country (`Land`), breed (`Rasse`) and year of birth (`GebJahr`). Together they specify groups of animals with unknown parents. The titles of all other columns start with `SP`, which stands for selection path. These columns contain counts of animals with missing parents for the respective country-breed-birthyear class given by the first three columns and the selection path specified by the column header. The selection path abbreviations are given in the table below.
Abbreviation | Meaning |
---|---|
SP_SB | sire of bull |
SP_SC | sire of cow |
SP_DB | dam of bull |
SP_DC | dam of cow |
Hence the first row of the above input tibble means that, among animals from `USA` of breed `HO` born in `1950`, there is one cow with a missing sire (`SP_SC` = 1) and one cow with a missing dam (`SP_DC` = 1).
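To make the reading of a single row concrete, the following sketch rebuilds the six rows shown above by hand and sums the counts over the sire paths and over the dam paths. Treating `NA` as "no animal counted" is an assumption, and the column names `n_missing_sire` and `n_missing_dam` are introduced here only for illustration.

```r
# First six rows of the input, typed in by hand from the output above
tbl_head <- data.frame(
  Land    = rep("USA", 6),
  Rasse   = rep("HO", 6),
  GebJahr = c("1950", "1952", "1954", "1955", "1956", "1957"),
  SP_SB   = c(NA, 1, 1, 2, NA, NA),
  SP_SC   = c(1, NA, NA, 1, 1, 1),
  SP_DB   = c(NA, 1, 1, 1, NA, NA),
  SP_DC   = c(1, NA, NA, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Assumption: NA means that no animal was counted for that selection path
sire_cols <- c("SP_SB", "SP_SC")
dam_cols  <- c("SP_DB", "SP_DC")
tbl_head$n_missing_sire <- rowSums(tbl_head[, sire_cols], na.rm = TRUE)
tbl_head$n_missing_dam  <- rowSums(tbl_head[, dam_cols], na.rm = TRUE)

# One line per country-breed-birthyear class with the two totals
tbl_head[, c("Land", "Rasse", "GebJahr", "n_missing_sire", "n_missing_dam")]
```

For the class born in 1955, for example, the two sire paths add up to three animals with a missing sire.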
The function `create_GG` is used to create groups that are based on year of birth, breed and country. Inside that function the input is read into a dataframe. Then the function `define_GG` is used to define the genetic groups.
In a first step, different breed labels from the input file can be replaced with a common label. The mapping between the labels is specified in an input file called `psBreedFile`. A similar mapping of labels can also be done for the countries. Breeds and countries are only re-mapped if input files for the breed and country mappings are specified.
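The re-mapping of labels could look roughly like the sketch below. The layout of `psBreedFile` is not shown in this document, so the two-column table `breed_map`, the helper `remap_labels()` and the breed codes used here are all hypothetical.

```r
# Hypothetical mapping table, the real layout of psBreedFile may differ
breed_map <- data.frame(
  from = c("HOL", "RH"),
  to   = c("HO", "HO"),
  stringsAsFactors = FALSE
)

# Made-up counts with several breed labels
df_counts <- data.frame(
  Rasse = factor(c("HO", "HOL", "RH", "SI")),
  SP_SC = c(1, 2, 1, 4)
)

# Replace labels that have an entry in the mapping, keep all others unchanged
remap_labels <- function(labels, map) {
  labels <- as.character(labels)
  idx <- match(labels, map$from)
  has_match <- !is.na(idx)
  labels[has_match] <- map$to[idx[has_match]]
  labels
}

# Only re-map when a mapping is available, mirroring the optional input files
if (!is.null(breed_map)) {
  df_counts$Rasse <- factor(remap_labels(df_counts$Rasse, breed_map))
}
levels(df_counts$Rasse)
```

The same helper could be applied to the country column with a second mapping table.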
Unused factor levels are removed after the re-mapping of breeds and countries. The removal works by re-converting all columns that are of type `factor` in the original dataframe into a factor again using the function `factor()`. This causes unused factor levels to be dropped.
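The effect of re-applying `factor()` can be seen on a small made-up dataframe. The data and the object names `df_gg` and `df_sub` are illustrative only and are not taken from the package.

```r
df_gg <- data.frame(
  Land  = factor(c("USA", "CAN", "USA")),
  Rasse = factor(c("HO", "HO", "JE")),
  n     = c(3, 1, 2)
)

# Subsetting keeps the levels "CAN" and "JE" although no row uses them any more
df_sub <- df_gg[df_gg$Land == "USA" & df_gg$Rasse == "HO", ]
levels(df_sub$Land)

# Re-convert every factor column with factor() to drop the unused levels
is_fac <- vapply(df_sub, is.factor, logical(1))
df_sub[is_fac] <- lapply(df_sub[is_fac], factor)
levels(df_sub$Land)
```

Base R also offers `droplevels()`, which gives the same result for a dataframe in a single call.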
A new dataframe is defined that contains only the sire selection paths. On that reduced dataframe, the first grouping criterion is breed. Breeds whose group size is too low are separated out and put in a dummy list called `dummy_gg_def_list`. For all breeds with enough missing sires, groups are further subdivided according to country.
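A minimal sketch of this step, under stated assumptions, is given below. The counts, the lower limit `min_size` and the object `dummy_list` are made up for illustration; in particular, the actual structure of `dummy_gg_def_list` is not shown in this document.

```r
# Made-up counts per country-breed class
df_counts <- data.frame(
  Land  = c("USA", "USA", "CAN", "USA"),
  Rasse = c("HO", "HO", "HO", "JE"),
  SP_SB = c(40, 25, 30, NA),
  SP_SC = c(60, 35, 20, 5),
  SP_DB = c(10, 5, 5, 1),
  SP_DC = c(80, 40, 30, 7),
  stringsAsFactors = FALSE
)
min_size <- 50  # hypothetical lower limit on the number of missing sires

# Keep the class columns and the sire selection paths only
sire_cols <- c("SP_SB", "SP_SC")
df_sire <- df_counts[, c("Land", "Rasse", sire_cols)]
df_sire$n_sire <- rowSums(df_sire[, sire_cols], na.rm = TRUE)

# Total number of missing sires per breed
n_by_breed <- tapply(df_sire$n_sire, df_sire$Rasse, sum)

# Breeds below the limit go to a dummy list
small_breeds <- names(n_by_breed)[n_by_breed < min_size]
is_small <- df_sire$Rasse %in% small_breeds
dummy_list <- split(df_sire[is_small, ], df_sire$Rasse[is_small])

# The remaining breeds are subdivided further by country
df_large <- df_sire[!is_small, ]
n_by_breed_country <- aggregate(n_sire ~ Rasse + Land, data = df_large, FUN = sum)
n_by_breed_country
```

In this toy data only the breed `JE` falls below the limit, so it ends up in `dummy_list`, while `HO` is split into one class per country.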
The function that creates genetic groups without considering the country of origin of the animals with missing parents is called `define_GG_withoutCountry`. In this function the grouping is done over breeds and years of birth. Groups are formed in the usual way by applying specified lower limits on group size.
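The document does not spell out how the lower limits are applied. One common approach, shown here purely as an assumption and not as the logic of `define_GG_withoutCountry`, is to pool consecutive birth years within a breed until the pooled count reaches the limit. The function `pool_years()` and the value of `min_size` are hypothetical.

```r
# Pool consecutive birth years until each group reaches a minimum size
pool_years <- function(years, counts, min_size) {
  ord <- order(years)
  years <- years[ord]
  counts <- counts[ord]
  group <- integer(length(years))
  current <- 1L
  running <- 0
  for (i in seq_along(years)) {
    group[i] <- current
    running <- running + counts[i]
    if (running >= min_size) {
      # The current group is large enough, start a new one
      current <- current + 1L
      running <- 0
    }
  }
  # A trailing group that stayed below the limit is merged into the previous one
  if (running > 0 && current > 1L) {
    group[group == current] <- current - 1L
  }
  data.frame(year = years, count = counts, group = group)
}

# Total counts per birth year for one breed (made-up numbers)
res <- pool_years(
  years    = c(1950, 1952, 1954, 1955, 1956, 1957),
  counts   = c(2, 2, 2, 5, 2, 2),
  min_size = 4
)
res
```

To cover several breeds, such a function would be applied separately to the rows of each breed, for example after splitting the dataframe with `split()`.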