12 Text mining with sparklyr
12.0.1 Data source
This example analyzes two files: the full works of Sir Arthur Conan Doyle and the full works of Mark Twain. The files were downloaded from the Project Gutenberg site via the gutenbergr package. Intentionally, no data cleanup was done on the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.
readLines("/usr/share/bonus/arthur_doyle.txt", 30)
12.1 Data Import
- Open a Spark session
library(sparklyr)
library(dplyr)
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
- The spark_read_text() function is a new function that works like readLines() but for sparklyr. Use it to read the mark_twain.txt file into Spark.
twain_path <- "file:///usr/share/bonus/mark_twain.txt"
twain <- spark_read_text(sc, "twain", twain_path)
- Read the arthur_doyle.txt file into Spark
doyle_path <- "file:///usr/share/bonus/arthur_doyle.txt"
doyle <- spark_read_text(sc, "doyle", doyle_path)
12.2 Prepare the data
- Use sdf_bind_rows() to append the two files together
all_words <- doyle %>%
  mutate(author = "doyle") %>%
  sdf_bind_rows({
    twain %>%
      mutate(author = "twain")
  }) %>%
  filter(nchar(line) > 0)
- Use Hive’s regexp_replace to remove punctuation
all_words <- all_words %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
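To see what this character class does without a Spark connection, the same substitution can be sketched locally in base R. This is only an illustration: the sample sentence is made up, base R's gsub() stands in for Hive's regexp_replace, and the hyphen is moved to the end of the class so that it needs no escape.

```r
# Local base-R sketch of the punctuation step (not run in Spark).
# The sample line is hypothetical; gsub() stands in for regexp_replace().
sample_line <- "\"Well, Watson,\" said he; \"it's a so-called mystery!\""

# Same characters as the Spark pattern, with the hyphen placed last
punct_class <- "[_\"'():;,.!?-]"

# Each punctuation character is replaced by a single space
clean_line <- gsub(punct_class, " ", sample_line)
print(clean_line)
```

Because every match becomes a space rather than being deleted, words such as "so-called" are split into two tokens by the later tokenizing step.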
- Use ft_tokenizer() to separate each word.
all_words <- all_words %>%
  ft_tokenizer(input.col = "line",
               output.col = "word_list")
head(all_words, 4)
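The shape of the tokenizer output can be approximated locally. The sketch below is base R, not Spark, and it assumes that Spark's Tokenizer lowercases the text and splits on single whitespace characters; the sample line is hypothetical.

```r
# Local base-R sketch of what the tokenizer step produces (not run in Spark).
# Assumption: the tokenizer lowercases the text and splits on single
# whitespace characters, so a run of spaces leaves empty tokens behind.
sample_line <- "It was  a Scandal in Bohemia"  # hypothetical line with a double space

tokens <- strsplit(tolower(sample_line), "\\s")[[1]]
print(tokens)
```

Under that assumption, the double space yields an empty token, which is one reason a length filter such as nchar(word) > 2 is useful later in the pipeline.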
- Remove “stop words” with the ft_stop_words_remover() transformer
all_words <- all_words %>%
  ft_stop_words_remover(input.col = "word_list",
                        output.col = "wo_stop_words")
head(all_words, 4)
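Conceptually, stop-word removal is a filter of each token list against a fixed vocabulary. A minimal base-R sketch follows, using a short hypothetical stop-word list rather than the default list that Spark ships with.

```r
# Local base-R sketch of stop-word removal (not run in Spark).
# The stop-word list here is hypothetical and far shorter than Spark's default.
stop_words <- c("it", "was", "a", "in", "the", "of")
tokens     <- c("it", "was", "a", "scandal", "in", "bohemia")

# Keep only the tokens that are not in the stop-word list
wo_stop_words <- tokens[!tokens %in% stop_words]
print(wo_stop_words)
```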
- Un-nest the tokens with explode
all_words <- all_words %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
head(all_words, 4)
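The effect of explode is to turn one row holding a list of tokens into one row per token. The following base-R sketch mimics that reshaping on a small, made-up data frame with a list column; it is an illustration of the idea, not how Spark implements it.

```r
# Local base-R sketch of un-nesting a list column (not run in Spark).
# The two rows of sample data are hypothetical.
nested <- data.frame(author = c("doyle", "twain"), stringsAsFactors = FALSE)
nested$wo_stop_words <- list(c("scandal", "bohemia"),
                             c("tom", "sawyer", "aunt"))

# Repeat each author once per token, then flatten the token lists
unnested <- data.frame(
  word   = unlist(nested$wo_stop_words),
  author = rep(nested$author, lengths(nested$wo_stop_words)),
  stringsAsFactors = FALSE
)
print(unnested)
```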
- Cache the all_words variable using compute()
all_words <- all_words %>%
  compute("all_words")
12.3 Data Analysis
- Words used the most by author
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n))
word_count
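The group-and-tally step can be checked on a handful of rows without Spark. This base-R sketch uses aggregate() in place of group_by() and tally(); the sample words are hypothetical.

```r
# Local base-R sketch of counting words per author (not run in Spark).
# The sample rows are hypothetical.
words <- data.frame(
  author = c("doyle", "doyle", "doyle", "twain", "twain"),
  word   = c("holmes", "holmes", "watson", "river", "river"),
  stringsAsFactors = FALSE
)
words$n <- 1L

# Sum the per-row counts within each (author, word) group
word_count <- aggregate(n ~ author + word, data = words, FUN = sum)

# Most frequent words first, as arrange(desc(n)) does
word_count <- word_count[order(-word_count$n), ]
print(word_count)
```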
- Figure out which words are used by Doyle but not Twain
doyle_unique <- filter(word_count, author == "doyle") %>%
  anti_join(filter(word_count, author == "twain"), by = "word") %>%
  arrange(desc(n)) %>%
  compute("doyle_unique")
doyle_unique
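An anti-join keeps the rows of the first table that have no match in the second. As a rough local illustration in base R, with hypothetical counts for both authors:

```r
# Local base-R sketch of an anti-join on the word column (not run in Spark).
# Both tables of counts are hypothetical.
doyle_counts <- data.frame(word = c("holmes", "said", "watson"),
                           n    = c(50L, 40L, 30L),
                           stringsAsFactors = FALSE)
twain_counts <- data.frame(word = c("said", "river"),
                           n    = c(60L, 20L),
                           stringsAsFactors = FALSE)

# Keep Doyle's words that never appear in Twain's table
doyle_only <- doyle_counts[!doyle_counts$word %in% twain_counts$word, ]
print(doyle_only)
```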
- Use wordcloud to visualize the data in the previous step
doyle_unique %>%
  head(100) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
- Find out how many times Twain used the word “sherlock”
all_words %>%
  filter(author == "twain",
         word == "sherlock") %>%
  tally()
- Against the twain variable, use Hive’s instr and lower to convert every line to lowercase, and then look for “sherlock” in the line
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)
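The same lowercase-then-search logic can be tried locally. In this base-R sketch, tolower() stands in for Hive's lower and regexpr() with fixed = TRUE stands in for instr; the three sample lines are made up. Note that regexpr() returns -1 when there is no match, whereas instr returns 0, so the > 0 test behaves the same way in both.

```r
# Local base-R sketch of the case-insensitive substring search (not run in Spark).
# The sample lines are hypothetical.
sample_lines <- c("SHERLOCK HOLMES arrived at the camp.",
                  "The river was wide.",
                  "It was Sherlock, of course.")

# Lowercase first so the search ignores case
lowered <- tolower(sample_lines)

# Position of the first literal match; -1 means no match
hits <- lowered[regexpr("sherlock", lowered, fixed = TRUE) > 0]
print(hits)
```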
Most of these lines come from a short story by Mark Twain called A Double Barrelled Detective Story. According to the Wikipedia page about the story, it is Twain's satire of the mystery novel genre, published in 1902.
spark_disconnect(sc)