12 Text mining with sparklyr

12.0.1 Data source

For this example, two files will be analyzed: one containing the full works of Sir Arthur Conan Doyle and one containing the full works of Mark Twain. The files were downloaded from the Project Gutenberg site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below for how the data was downloaded and prepared.

readLines("/usr/share/bonus/arthur_doyle.txt", 30) 

12.1 Data Import

  1. Open a Spark session
library(sparklyr)
library(dplyr)

conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9

sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
  1. The spark_read_text() function works like readLines(), but for Spark. Use it to read the mark_twain.txt file into Spark.
twain_path <- "file:///usr/share/bonus/mark_twain.txt"
twain <- spark_read_text(sc, "twain", twain_path)
  1. Read the arthur_doyle.txt file into Spark
doyle_path <- "file:///usr/share/bonus/arthur_doyle.txt"
doyle <- spark_read_text(sc, "doyle", doyle_path)

12.2 Prepare the data

  1. Use sdf_bind_rows() to append the two files together
all_words <- doyle %>%
  mutate(author = "doyle") %>%
  sdf_bind_rows(
    twain %>%
      mutate(author = "twain")
  ) %>%
  filter(nchar(line) > 0)
  1. Use Hive’s regexp_replace to remove punctuation
all_words <- all_words %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) 
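
The character class passed to regexp_replace() can be previewed locally with base R's gsub() before sending it to Hive. The sample line below is made up for illustration; perl = TRUE is used so the escaped hyphen behaves as it does in Hive's Java-style regex.

```r
# Hypothetical sample line to preview the punctuation pattern locally
line <- "\"Elementary,\" said he; it's--quite (simple)!"

# Same character class as the Hive regexp_replace() call above
cleaned <- gsub("[_\"'():;,.!?\\-]", " ", line, perl = TRUE)
cleaned
```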
  1. Use ft_tokenizer() to separate each word.
all_words <- all_words %>%
  ft_tokenizer(input.col = "line",
               output.col = "word_list")

head(all_words, 4)
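Spark's tokenizer lower-cases each line and splits it on white space. A rough local analog in base R, using a made-up sample line, shows the shape of the result:

```r
# Hypothetical sample line; ft_tokenizer() lower-cases and splits on whitespace
line <- "It is a Capital Mistake to Theorize"
word_list <- strsplit(tolower(line), "\\s+")[[1]]
word_list
```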
  1. Remove “stop words” with the ft_stop_words_remover() transformer
all_words <- all_words %>%
  ft_stop_words_remover(input.col = "word_list",
                        output.col = "wo_stop_words")

head(all_words, 4)
  1. Un-nest the tokens with explode
all_words <- all_words %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
  
head(all_words, 4)
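Hive's explode() turns each element of an array column into its own row, repeating the other columns. The base-R sketch below mimics that step, plus the nchar() filter, on a tiny made-up data set:

```r
# Hypothetical tokenized rows: one list entry of words per line
wo_stop_words <- list(c("capital", "mistake"), c("a", "theorize"))
author <- c("doyle", "twain")

# explode(): one row per word, repeating the author column
exploded <- data.frame(
  word   = unlist(wo_stop_words),
  author = rep(author, lengths(wo_stop_words))
)

# Equivalent of filter(nchar(word) > 2)
exploded <- exploded[nchar(exploded$word) > 2, ]
```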
  1. Cache the all_words variable using compute()
all_words <- all_words %>%
  compute("all_words")

12.3 Data Analysis

  1. Words used the most by author
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n)) 
  
word_count
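The group_by() + tally() + arrange(desc(n)) pattern has a compact base-R analog using table() and sort(), shown here on a made-up word vector:

```r
# Hypothetical word vector; table() counts, sort(decreasing = TRUE) ranks
word <- c("holmes", "river", "holmes", "huck", "holmes", "river")
n <- sort(table(word), decreasing = TRUE)
n
```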
  1. Figure out which words are used by Doyle but not Twain
doyle_unique <- filter(word_count, author == "doyle") %>%
  anti_join(filter(word_count, author == "twain"), by = "word") %>%
  arrange(desc(n)) %>%
  compute("doyle_unique")

doyle_unique
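anti_join() keeps the rows of the left table that have no match in the right table on the join key. A base-R sketch of the same idea, with made-up word counts:

```r
# Hypothetical word counts per author
doyle_words <- data.frame(word = c("holmes", "watson", "river"), n = c(50, 30, 5))
twain_words <- data.frame(word = c("river", "huck"), n = c(40, 60))

# anti_join(..., by = "word"): keep Doyle words Twain never used
doyle_only <- doyle_words[!doyle_words$word %in% twain_words$word, ]
```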
  1. Use wordcloud to visualize the data in the previous step
doyle_unique %>%
  head(100) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word, 
    n,
    colors = c("#999999", "#E69F00", "#56B4E9")))
  1. Find out how many times Twain used the word “sherlock”
all_words %>%
  filter(author == "twain",
         word == "sherlock") %>%
  tally()
  1. Against the twain variable, use Hive’s instr and lower to lower-case every line, and then look for “sherlock” in the line
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)
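Hive's lower() and instr() map closely to base R's tolower() and grepl() with fixed = TRUE. A local analog on made-up lines:

```r
# Hypothetical lines; lower(line) ~ tolower(), instr(line, s) > 0 ~ grepl(s, ..., fixed = TRUE)
lines <- c("We ought to look up Sherlock Holmes.", "The river was calm.")
hits <- lines[grepl("sherlock", tolower(lines), fixed = TRUE)]
hits
```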

Most of these lines are from a short story by Mark Twain called A Double Barrelled Detective Story. According to the Wikipedia page about this story, it is a satire by Twain on the mystery novel genre, published in 1902.

spark_disconnect(sc)