I am working with a tsv file that contains information on genetic variants. Given that this information comes from multiple databases, my file contains duplicate data. In my case the databases are "dbSNP", which contains all kinds of variants (both benign and pathogenic ones, whose names begin with "rs") and "HGMD-PUBLIC" (which in my case only contains pathogenic variants, whose names begin with "CM").
Here is some sample data:
variant_name <- c("rs1320425680", "rs1217211965", "rs1257939387", "rs74315401", "CM890102")
variant_source <- c("dbSNP", "dbSNP", "dbSNP", "dbSNP", "HGMD_PUBLIC")
chromosome <- c(20, 20, 20, 20, 20)
position_start <- c(4699513, 4699516, 4699520, 4699525, 4699525)
position_end <- c(4699513, 4699516, 4699520, 4699525, 4699525)
ancestral_allele <- c("A", "G", "C", "C", "C")
clinical_significance <- c(NA, NA, NA, "pathogenic", NA)
variant_alleles <- c("A/C", "G/A/T", "C/G", "C/T", "HGMD_MUTATION")
example <- tibble(variant_name, variant_source, chromosome, position_start, position_end, ancestral_allele, clinical_significance, variant_alleles)
Here, the 4th and 5th rows are actually duplicates of the same variant from two different databases. It is important to me to know that the variant appears in both.
What I would like to do is to spread the data into "wide" format, with two new columns: dbSNP
and HGMD_PUBLIC
(which are currently found in variant_source
). The columns would contain the entry originally found in the column variant_name
, like this:
dbSNP <- c("rs1320425680", "rs1217211965", "rs1257939387", "rs74315401")
HGMD_PUBLIC <- c(NA, NA, NA, "CM890102")
chromosome <- c(20, 20, 20, 20)
position_start <- c(4699513, 4699516, 4699520, 4699525)
position_end <- c(4699513, 4699516, 4699520, 4699525)
ancestral_allele <- c("A", "G", "C", "C")
clinical_significance <- c(NA, NA, NA, "pathogenic")
variant_alleles <- c("A/C", "G/A/T", "C/G", "C/T")
result <- tibble(dbSNP, HGMD_PUBLIC, chromosome, position_start, position_end, ancestral_allele, clinical_significance, variant_alleles)
As you can see, the 5th row is no longer present, but the variant_name
has been moved to the dbSNP
or HGMD_PUBLIC
columns.
Perhaps the above could be achieved with the spread
function, making the rearrangement conditional on the name, but I don't know how to do that.