Suppose I have a dataset in which each row contains a sentence given in response to an open-ended question in a very large survey (German and French). Most sentences (answers) are logical, i.e., meaningful combinations of words. However, a few careless respondents simply typed in illogical character strings of various kinds.
A useful first step would be to identify everything that is not a word, or to find some other way to flag the illogical strings. Does a package exist that would facilitate this? How should I approach this problem?
Example:
df <- structure(list(sentence = c("Das ist ein deutscher Satz.", "Ein kürzerer Satz", "34t34 t444tt", "C'est une sentence francaise", ".-......", "---2r13 1r-2r2")), .Names = c("sentence"), row.names = c(NA,6L), class = "data.frame")
head(df)
                      sentence
1  Das ist ein deutscher Satz.
2            Ein kürzerer Satz
3                 34t34 t444tt
4 C'est une sentence francaise
5                     .-......
6               ---2r13 1r-2r2
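
To make the "not a word" idea concrete, here is a minimal base-R sketch of the kind of heuristic I have in mind. It does not check tokens against a dictionary; it only measures what share of the whitespace-separated tokens consist purely of letters. The helper name `word_share`, the 0.5 cutoff, and the assumption that the text is UTF-8 encoded are all my own illustrative choices, not an established method.

```r
# Example data from the question
df <- data.frame(
  sentence = c("Das ist ein deutscher Satz.", "Ein kürzerer Satz",
               "34t34 t444tt", "C'est une sentence francaise",
               ".-......", "---2r13 1r-2r2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each string, the share of tokens that look like
# words, i.e. letters only, optionally joined by an internal apostrophe or
# hyphen (as in "C'est"). Assumes UTF-8 text so that \p{L} matches umlauts
# and accented letters.
word_share <- function(x) {
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(tokens, function(tok) {
    # Strip leading/trailing punctuation such as the final "." of a sentence
    tok <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tok)
    tok <- tok[nzchar(tok)]
    if (length(tok) == 0) return(0)  # nothing but punctuation (or empty)
    mean(grepl("^\\p{L}+(['-]\\p{L}+)*$", tok, perl = TRUE))
  }, numeric(1))
}

df$word_share <- word_share(df$sentence)
# The 0.5 threshold is an arbitrary assumption and would need tuning
df$suspicious <- df$word_share < 0.5
```

On the six example rows this flags rows 3, 5 and 6 as suspicious. It would miss letter-only gibberish such as "asdf jkl", which is why I am asking whether a package (for instance one with German and French word lists) handles this more robustly.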