Understanding Creation of Genetic Groups

Background

Animals with unknown parents are assigned to phantom groups. Phantom groups can be defined based on:

selection paths (male or female),
year of birth of the offspring and
breed

Further possible grouping criteria are country or region of origin.
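The criteria above can be combined into one label per phantom group. The following is a minimal sketch, not the package's implementation; the data frame `animals`, its column names and the label format are assumptions made for illustration only.

```r
# Hypothetical animals with unknown parents; column names are illustrative only.
animals <- data.frame(
  sel_path   = c("SB", "SC", "DC"),
  birth_year = c(1950, 1952, 1952),
  breed      = c("HO", "HO", "HO"),
  stringsAsFactors = FALSE
)

# One label per combination of breed, selection path and birth year.
# The label format "breed_path_year" is an assumption.
animals$phantom_group <- paste(animals$breed, animals$sel_path,
                               animals$birth_year, sep = "_")
animals$phantom_group
```

Animals sharing the same label would fall into the same phantom group. Country or region could be added as a further component of the label in the same way.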

Step By Step Analysis

The goal of this analysis is to understand the steps required to define genetic groups.

Input

The input to the function that defines genetic groups is a CSV file named statGenGroupOutFile. This file is produced by the Fortran 90 program refmtped.f90, which re-formats the pedigree. The first few lines of this file are shown below.

stat_gen_grp_input <- system.file("extdata", "statGenGroupOutFile", package = "qgengroup")
tbl_gen_grp <- readr::read_csv2(file = stat_gen_grp_input, skip = 2)
#> Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
#> Parsed with column specification:
#> cols(
#>   Land = col_character(),
#>   Rasse = col_character(),
#>   GebJahr = col_character(),
#>   SP_SB = col_double(),
#>   SP_SC = col_double(),
#>   SP_DB = col_double(),
#>   SP_DC = col_double(),
#>   SP_SX = col_logical(),
#>   SP_DX = col_logical()
#> )
head(tbl_gen_grp)
#> # A tibble: 6 x 9
#>   Land  Rasse GebJahr SP_SB SP_SC SP_DB SP_DC SP_SX SP_DX
#>   <chr> <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 USA   HO    1950       NA     1    NA     1 NA    NA   
#> 2 USA   HO    1952        1    NA     1    NA NA    NA   
#> 3 USA   HO    1954        1    NA     1    NA NA    NA   
#> 4 USA   HO    1955        2     1     1     1 NA    NA   
#> 5 USA   HO    1956       NA     1    NA     1 NA    NA   
#> 6 USA   HO    1957       NA     1    NA     1 NA    NA

Starting from the left-most column, the first three columns contain country, breed and year of birth, which together specify groups of animals with unknown parents. The titles of all other columns start with SP, which stands for selection path. These columns contain counts of animals with missing parents for the country-breed-birth-year class given by the first three columns and the selection path specified by the column header. The selection path abbreviations are given in the table below.

Abbreviation	Meaning
SP_SB	sire of bull
SP_SC	sire of cow
SP_DB	dam of bull
SP_DC	dam of cow

Hence, the first row of the input tibble above means that, for breed HO from the USA and birth year 1950, there is one cow with a missing sire and one cow with a missing dam.
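To see how the counts can be read per selection path, the first four rows of the tibble are typed in by hand below and summed column-wise. This is only a sketch; treating NA as "no animals" is an assumption about the input, and the object names `tbl_head`, `sp_cols` and `sp_totals` are hypothetical.

```r
# First four rows of the input shown above, typed in by hand for illustration.
tbl_head <- data.frame(
  Land    = rep("USA", 4),
  Rasse   = rep("HO", 4),
  GebJahr = c("1950", "1952", "1954", "1955"),
  SP_SB   = c(NA, 1, 1, 2),
  SP_SC   = c(1, NA, NA, 1),
  SP_DB   = c(NA, 1, 1, 1),
  SP_DC   = c(1, NA, NA, 1),
  stringsAsFactors = FALSE
)

# Columns whose names start with "SP_" hold the counts per selection path.
sp_cols <- grep("^SP_", names(tbl_head), value = TRUE)

# Assumption: NA means that no animal falls into that class.
sp_totals <- colSums(tbl_head[, sp_cols], na.rm = TRUE)
sp_totals
```

For these four rows, the totals are 4 missing sires of bulls, 2 missing sires of cows, 3 missing dams of bulls and 2 missing dams of cows.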

Creating Groups With Country

The function create_GG is used to create groups that are based on year of birth, breed and country. Inside that function, the input is read into a dataframe. The function define_GG is then used to define the genetic groups.

Definition of Genetic Groups

In a first step, different breed labels from the input file can be replaced with a common label. The mapping between the labels is specified in an input file called psBreedFile. A similar mapping of labels can also be done for the countries. The re-mapping of breeds and countries is only done if input files for the breed and country mappings are specified.
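The re-mapping of labels can be sketched with a named character vector. The format of psBreedFile is not described here, so the mapping `breed_map` below, the breed codes RH, SI and BS, and the common labels are all hypothetical.

```r
# Hypothetical mapping: original breed label -> common label.
breed_map <- c(HO = "HOL", RH = "HOL", SI = "SIM")

dat <- data.frame(Rasse = factor(c("HO", "RH", "SI", "BS")))

# Replace labels that appear in the mapping; keep all others unchanged.
old_labels <- as.character(dat$Rasse)
mapped     <- breed_map[old_labels]   # NA where no mapping exists
new_labels <- ifelse(is.na(mapped), old_labels, mapped)
dat$Rasse  <- factor(unname(new_labels))
levels(dat$Rasse)
```

Labels without an entry in the mapping, such as BS in this sketch, pass through unchanged. The same pattern would apply to a country mapping.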

Unused factor levels are removed after the re-mapping of breeds and countries. The removal works by passing every column of type factor in the original dataframe through the function factor() again, which drops the unused factor levels.
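The effect of calling factor() again can be seen in a small example. The data frame below is made up for illustration; only the use of factor() to drop levels comes from the description above.

```r
# Hypothetical data with two factor columns.
dat <- data.frame(
  Land  = factor(c("USA", "CAN", "USA")),
  Rasse = factor(c("HO", "HO", "JE")),
  n     = c(3, 1, 2)
)

# After subsetting, the level "CAN" is still attached to the column.
dat <- dat[dat$Land == "USA", ]
levels_before <- levels(dat$Land)

# Re-create every factor column; unused levels are dropped.
is_fac <- vapply(dat, is.factor, logical(1))
dat[is_fac] <- lapply(dat[is_fac], factor)
levels_after <- levels(dat$Land)
```

Here `levels_before` still contains "CAN", while `levels_after` contains only "USA". Base R's droplevels() gives the same result for a data frame.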

Groups on Missing Sires

A new dataframe is defined that contains only the sire selection paths. On that reduced dataframe, the first grouping criterion is breed. Breeds whose group size is too small are separated out and put into a dummy list called dummy_gg_def_list. For all breeds with enough missing sires, the groups are further subdivided by country.
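The steps for missing sires can be sketched as follows. This is not the code of define_GG; the counts, the threshold `min_size` and the way dummy_gg_def_list is filled are assumptions chosen for illustration.

```r
# Hypothetical counts of animals with missing parents.
dat <- data.frame(
  Land  = c("USA", "CAN", "USA", "CHE"),
  Rasse = c("HO", "HO", "JE", "HO"),
  SP_SB = c(40, 10, 2, NA),
  SP_SC = c(60, 20, 1, 5),
  SP_DB = c(5, 5, 5, 5),
  stringsAsFactors = FALSE
)
min_size <- 50  # hypothetical lower limit for the group size

# Keep only the sire selection paths and count missing sires per row.
sire_cols <- c("SP_SB", "SP_SC")
sire_dat  <- dat[, c("Land", "Rasse", sire_cols)]
sire_dat$n_missing <- rowSums(sire_dat[, sire_cols], na.rm = TRUE)

# First grouping criterion: breed.
breed_totals <- tapply(sire_dat$n_missing, sire_dat$Rasse, sum)
small_breeds <- names(breed_totals)[breed_totals < min_size]
large_breeds <- names(breed_totals)[breed_totals >= min_size]

# Breeds below the limit go into the dummy list (structure assumed).
dummy_gg_def_list <- as.list(small_breeds)

# Breeds above the limit are subdivided by country.
large_dat      <- sire_dat[sire_dat$Rasse %in% large_breeds, ]
country_totals <- aggregate(n_missing ~ Rasse + Land, data = large_dat, FUN = sum)
country_totals
```

In this sketch JE ends up in the dummy list with 3 missing sires, while HO is split into three country classes with 100 (USA), 30 (CAN) and 5 (CHE) missing sires.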

Definition of Genetic Groups Without Country

The function that creates genetic groups without considering the country of origin of the animals with missing parents is called define_GG_withoutCountry. In this function, the grouping is done over breed and year of birth. Groups are formed in the usual way by applying specified lower limits on the group sizes.
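Ignoring the country amounts to summing the counts over all countries within each breed and year of birth before the size limit is applied. The sketch below illustrates this under assumed data and an assumed limit `min_size`; it does not reproduce define_GG_withoutCountry, and what happens to classes below the limit is left open.

```r
# Hypothetical counts of missing parents per country, breed and birth year.
dat <- data.frame(
  Land      = c("USA", "CAN", "USA", "CAN"),
  Rasse     = c("HO", "HO", "HO", "JE"),
  GebJahr   = c(1950, 1950, 1952, 1950),
  n_missing = c(30, 25, 10, 4),
  stringsAsFactors = FALSE
)
min_size <- 50  # hypothetical lower limit for the group size

# Sum over countries: one row per breed and year of birth.
by_breed_year <- aggregate(n_missing ~ Rasse + GebJahr, data = dat, FUN = sum)

# Flag the classes that reach the lower limit on their own.
by_breed_year$large_enough <- by_breed_year$n_missing >= min_size
by_breed_year
```

With these numbers only the class HO/1950 reaches the limit, because the counts from USA and CAN add up to 55.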