12 Text mining with sparklyr

12.0.1 Data source

For this example, two files will be analyzed: one containing the full works of Sir Arthur Conan Doyle and one containing the full works of Mark Twain. The files were downloaded from the Project Gutenberg site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.

readLines("/usr/share/bonus/arthur_doyle.txt", 30) 

12.1 Data Import

  1. Open a Spark session
library(sparklyr)
library(dplyr)

conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9

sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
  1. The spark_read_text() function works like readLines(), but for Spark. Use it to read the mark_twain.txt file into Spark.
twain_path <- "file:///usr/share/bonus/mark_twain.txt"
twain <- spark_read_text(sc, "twain", twain_path)
  1. Read the arthur_doyle.txt file into Spark
doyle_path <- "file:///usr/share/bonus/arthur_doyle.txt"
doyle <- spark_read_text(sc, "doyle", doyle_path)

12.2 Prepare the data

  1. Use sdf_bind_rows() to append the two files together
all_words <- doyle %>%
  mutate(author = "doyle") %>%
  sdf_bind_rows(
    twain %>%
      mutate(author = "twain")
  ) %>%
  filter(nchar(line) > 0)
  1. Use Hive’s regexp_replace to remove punctuation
all_words <- all_words %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) 
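
The character class passed to regexp_replace() can be previewed locally with base R's gsub() before sending it to Hive. The sample line below is made up for illustration; perl = TRUE is used so the escaped hyphen behaves as it does in Hive's Java-style regex.

```r
# Hypothetical sample line to preview the punctuation pattern locally
line <- "\"Elementary,\" said he; it's--quite (simple)!"

# Same character class as the Hive regexp_replace() call above
cleaned <- gsub("[_\"'():;,.!?\\-]", " ", line, perl = TRUE)
cleaned
```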
  1. Use ft_tokenizer() to separate each word.
all_words <- all_words %>%
  ft_tokenizer(input.col = "line",
               output.col = "word_list")

head(all_words, 4)
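Spark's tokenizer lower-cases each line and splits it on white space. A rough local analog in base R, using a made-up sample line, shows the shape of the result:

```r
# Hypothetical sample line; ft_tokenizer() lower-cases and splits on whitespace
line <- "It is a Capital Mistake to Theorize"
word_list <- strsplit(tolower(line), "\\s+")[[1]]
word_list
```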
  1. Remove “stop words” with the ft_stop_words_remover() transformer
all_words <- all_words %>%
  ft_stop_words_remover(input.col = "word_list",
                        output.col = "wo_stop_words")

head(all_words, 4)
  1. Un-nest the tokens with explode
all_words <- all_words %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
  
head(all_words, 4)
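Hive's explode() turns each element of an array column into its own row, repeating the other columns. The base-R sketch below mimics that step, plus the nchar() filter, on a tiny made-up data set:

```r
# Hypothetical tokenized rows: one list entry of words per line
wo_stop_words <- list(c("capital", "mistake"), c("a", "theorize"))
author <- c("doyle", "twain")

# explode(): one row per word, repeating the author column
exploded <- data.frame(
  word   = unlist(wo_stop_words),
  author = rep(author, lengths(wo_stop_words))
)

# Equivalent of filter(nchar(word) > 2)
exploded <- exploded[nchar(exploded$word) > 2, ]
```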
  1. Cache the all_words variable using compute()
all_words <- all_words %>%
  compute("all_words")

12.3 Data Analysis

  1. Words used the most by author
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n)) 
  
word_count
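The group_by() + tally() + arrange(desc(n)) pattern has a compact base-R analog using table() and sort(), shown here on a made-up word vector:

```r
# Hypothetical word vector; table() counts, sort(decreasing = TRUE) ranks
word <- c("holmes", "river", "holmes", "huck", "holmes", "river")
n <- sort(table(word), decreasing = TRUE)
n
```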
  1. Figure out which words are used by Doyle but not Twain
doyle_unique <- filter(word_count, author == "doyle") %>%
  anti_join(filter(word_count, author == "twain"), by = "word") %>%
  arrange(desc(n)) %>%
  compute("doyle_unique")

doyle_unique
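anti_join() keeps the rows of the left table that have no match in the right table on the join key. A base-R sketch of the same idea, with made-up word counts:

```r
# Hypothetical word counts per author
doyle_words <- data.frame(word = c("holmes", "watson", "river"), n = c(50, 30, 5))
twain_words <- data.frame(word = c("river", "huck"), n = c(40, 60))

# anti_join(..., by = "word"): keep Doyle words Twain never used
doyle_only <- doyle_words[!doyle_words$word %in% twain_words$word, ]
```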
  1. Use wordcloud to visualize the data in the previous step
doyle_unique %>%
  head(100) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word, 
    n,
    colors = c("#999999", "#E69F00", "#56B4E9")))
  1. Find out how many times Twain used the word “sherlock”
all_words %>%
  filter(author == "twain",
         word == "sherlock") %>%
  tally()
  1. Against the twain variable, use Hive’s instr and lower to lower-case every line, and then look for “sherlock” in the line
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)
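Hive's lower() and instr() map closely to base R's tolower() and grepl() with fixed = TRUE. A local analog on made-up lines:

```r
# Hypothetical lines; lower(line) ~ tolower(), instr(line, s) > 0 ~ grepl(s, ..., fixed = TRUE)
lines <- c("We ought to look up Sherlock Holmes.", "The river was calm.")
hits <- lines[grepl("sherlock", tolower(lines), fixed = TRUE)]
hits
```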

Most of these lines are from a short story by Mark Twain called A Double Barrelled Detective Story. According to the Wikipedia page about this story, it is a satire by Twain on the mystery novel genre, published in 1902.

spark_disconnect(sc)