Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others


r - How to identify illogical character strings/sentences

Suppose I have a dataset in which each row contains a sentence given in response to an open-ended question in a very large survey (in German and French). Most sentences (answers) are logical, i.e. meaningful combinations of words. However, a few careless respondents simply entered illogical character strings of various kinds.

A useful first step would be to identify everything that is not a word, or to find some other way of identifying the illogical strings. Is there a package that would facilitate this? How should I approach this problem?

Example:

df <- structure(list(sentence = c("Das ist ein deutscher Satz.", "Ein kürzerer Satz", "34t34t444tt", "C'est une sentence francaise", ".-......", "---2r131r-2r2")), .Names = c("sentence"), row.names = c(NA,6L), class = "data.frame")

head(df)
                      sentence
1  Das ist ein deutscher Satz.
2            Ein kürzerer Satz
3                  34t34t444tt
4 C'est une sentence francaise
5                     .-......
6                ---2r131r-2r2



1 Answer


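You can use the share of characters that are neither letters nor spaces, together with the number of words, to tell real sentences from illogical strings. The character class [[:alpha:]] is locale-dependent, so it also matches letters such as ü, not only a-z and A-Z. A high share of non-letter characters, especially in a single-word answer, points to an illogical string.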

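# Share of characters that are neither letters nor spaces
df$nonCharShare <- nchar(gsub("[[:alpha:] ]", "", df$sentence)) / nchar(df$sentence)
# Number of space-separated words per answer
df$words <- lengths(strsplit(df$sentence, " ", fixed = TRUE))
df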
#                      sentence nonCharShare words
#1  Das ist ein deutscher Satz.   0.03703704     5
#2            Ein kürzerer Satz   0.00000000     3
#3                  34t34t444tt   0.63636364     1
#4 C'est une sentence francaise   0.03571429     4
#5                     .-......   1.00000000     1
#6                ---2r131r-2r2   0.76923077     1
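
To turn these metrics into an actual flag, one option is to put a threshold on the non-letter share and also require at least one word-like token. The following is only a sketch in base R: the helper name flag_illogical(), the 0.3 threshold, and the "word-like token" pattern are all assumptions made for illustration and would need tuning on real survey data.

```r
# Hypothetical helper: the name flag_illogical() and both thresholds are
# illustrative assumptions, not part of the original answer.
flag_illogical <- function(x, max_non_letter_share = 0.3, min_word_tokens = 1) {
  # Share of characters that are neither letters nor spaces (as in the answer)
  non_letter_share <- nchar(gsub("[[:alpha:] ]", "", x)) / nchar(x)

  # Count space-separated tokens that look like words: letters/apostrophes,
  # optionally followed by one punctuation mark
  tokens <- strsplit(x, " ", fixed = TRUE)
  word_tokens <- vapply(
    tokens,
    function(tok) sum(grepl("^[[:alpha:]']+[.,!?]?$", tok)),
    integer(1)
  )

  # Empty strings give 0/0 = NaN, which is treated as illogical here
  is.na(non_letter_share) |
    non_letter_share > max_non_letter_share |
    word_tokens < min_word_tokens
}

sentences <- c("Das ist ein deutscher Satz.", "Ein kürzerer Satz", "34t34t444tt",
               "C'est une sentence francaise", ".-......", "---2r131r-2r2")
flagged <- flag_illogical(sentences)
sentences[flagged]
```

On the six example strings this flags rows 3, 5 and 6. Note that strings made up only of letters (keyboard mashing such as random consonants) would pass this check; catching those would need something like a dictionary lookup, which is not shown here.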
