Working with R, I'm looking for a way to weight case (i.e., upper vs. lower case) in a stringdist_left_join().
Here's a reproducible example:
library(tidyverse)
library(fuzzyjoin)
tibble1 <- tibble(words = c("Bedford", "Maidenhead", "New Forest", "Tier 3", "Citizenship", "Crown"))
tibble2 <- tibble(words = c("bedfords", "bedsford", "BEDFord", "Maidenshead", "Maidenhed", "News forest", "Tier 3", "Citisenships", "crowned", "crows"))
osa <- stringdist_left_join(tibble1, tibble2, distance_col = "distance", max_dist = 5, method = "osa", weight = c(d = 0.1, i = 0.1, s = 1, t = 1))
Above is the code to reproduce a fuzzyjoin-powered stringdist_left_join on a couple of tibbles. The output looks like this:
# A tibble: 55 x 3
words.x words.y distance
<chr> <chr> <dbl>
1 Bedford bedfords 0.3
2 Bedford bedsford 0.3
3 Bedford BEDFord 0.6
4 Bedford Maidenshead 1.4
5 Bedford Maidenhed 1.2
6 Bedford News forest 1.00
7 Bedford Tier 3 0.900
8 Bedford Citisenships 1.7
9 Bedford crowned 1.00
10 Bedford crows 1.00
# … with 45 more rows
What I'd like is some way to weight the capitalisation. For example, comparing Bedford to BEDford: I'd like that to be a worse match than Bedford to Bedford, but a better match than Bedford to Bedsford. The option ignore_case = TRUE treats BEDford as a perfect match with Bedford.
I'm liking the fuzzyjoin package, and I just discovered the custom weightings that you can pass to stringdist for each of deletion, insertion, substitution, and transposition. Which is fantastic: toys to play with, parameters to tune.
What I'd also like to be able to do is tune the case (capitalisation?) matching. I've got the option to set ignore_case = TRUE in stringdist_left_join (in effect, weighting case as 0 or 1), but being the annoying cur that I am, I'd like to play around with weightings between 0 and 1.
Does anyone know if there's an option somewhere that I'm missing?
Or is the answer: do it the hard way? I guess there might be a long way round involving comparing the distances before and after having run tolower(), or computing a weighted distance comparing ignore_case = TRUE with ignore_case = FALSE, but does anyone know of a more elegant method or package that I can use to do that?
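For concreteness, here is a rough sketch of the long way round I have in mind. It is not a fuzzyjoin feature: it calls stringdist::stringdist() directly on every pair of words, once as-is and once after tolower(), and blends the two distances. The names case_weight, edit_weights, and blended are made up for illustration, and the linear blend is just one assumption about how case could be weighted.

```r
library(dplyr)
library(tidyr)
library(stringdist)

tibble1 <- tibble(words = c("Bedford", "Crown"))
tibble2 <- tibble(words = c("Bedford", "BEDford", "Bedsford", "crows"))

# Hypothetical tuning knob: 0 behaves like ignore_case = TRUE,
# 1 behaves like ignore_case = FALSE.
case_weight <- 0.1

# Same edit weights as in the stringdist_left_join() call above.
edit_weights <- c(d = 0.1, i = 0.1, s = 1, t = 1)

blended <- expand_grid(words.x = tibble1$words, words.y = tibble2$words) %>%
  mutate(
    # Case-sensitive distance
    dist_exact = stringdist(words.x, words.y,
                            method = "osa", weight = edit_weights),
    # Case-insensitive distance
    dist_lower = stringdist(tolower(words.x), tolower(words.y),
                            method = "osa", weight = edit_weights),
    # Linear blend: only the case-related part of the distance is scaled
    distance = dist_lower + case_weight * (dist_exact - dist_lower)
  ) %>%
  filter(distance <= 5)  # plays the role of max_dist

print(blended)
```

With case_weight at 0.1, a case-only mismatch such as BEDford should score above an exact match but below a one-letter typo such as Bedsford, which is the ordering I'm after. It works, but it computes every pairwise distance twice and bypasses the join machinery, hence the question.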
Thanks