R Meetup Day 3


Third session of the R meetup for Digital Humanists
American University of Beirut

Today’s R meetup will begin with some short presentations from the group about what they have discovered about basic digital textual analysis since the last meeting.

Here is @rfayed1’s pre-Halloween wordcloud, which she generated with the tm and wordcloud packages. Who would have thought that “good” and “friend” would be among the most frequent words (MFWs) in Dracula?

 

Here are some more snippets we used, with the rvest package for scraping and the tm package for text mining:

 

# Use the rvest package to scrape pages x to y, appending the text of each page to a txt file, then clean out the blank lines.
# Let's say you want Flavius Philostratus's Life of Apollonius in English translation. You can find it here:
# http://www.livius.org/ap-ark/apollonius/life/va_00.html
# with book 1, part 1 at http://www.livius.org/ap-ark/apollonius/life/va_1_01.html#%A71
# and part 2 at http://www.livius.org/ap-ark/apollonius/life/va_1_01.html#%A72
# and part 3 at http://www.livius.org/ap-ark/apollonius/life/va_1_01.html#%A73    What is the pattern?
install.packages("rvest")
library("rvest")
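x <- 1   # first part to fetch
y <- 40  # last part to fetch; book 1 of Flavius runs from 1 to 40
for (i in x:y) {
  # build the address for part i by appending i to the common prefix
  url <- paste0("http://www.livius.org/ap-ark/apollonius/life/va_1_01.html#%A7", i)
  # caution: everything after "#" is a fragment that is not sent to the server,
  # so these addresses may all return the same HTML page
  flaviuspage <- read_html(url)                   # download and parse the page
  flaviushtml <- html_nodes(flaviuspage, "body")  # select the <body> element
  text <- html_text(flaviushtml)                  # drop the tags, keep the text
  write(text, file = "flavius_Apollonius", append = TRUE)  # append to the output file
}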
#Install a text editor and open the file.  Use the regular expression "^(?:[\t ]*(?:\r?\n|\r))+" (without the quotes) and replace with nothing.
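The blank-line cleanup can also be done inside R instead of a text editor. The following is a minimal sketch in base R, not the method used at the meetup: it writes a few sample lines to a temporary file (the sample text and the variable names are assumptions for illustration), drops every line that contains only whitespace, and writes the result back.

```r
# Hypothetical stand-in for the scraped file: a temporary file with blank lines in it
path <- tempfile(fileext = ".txt")
writeLines(c("Book 1", "", "   ", "\t", "Chapter 1", "", "Text of the chapter"), path)

raw <- readLines(path)
# Keep only lines that still contain something after trimming whitespace
clean <- raw[nzchar(trimws(raw))]
writeLines(clean, path)

print(readLines(path))
```

To try this on the scraped text, point `path` at the file written by the scraping loop instead of the temporary file.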
#Write a small script that will pull down several documents and make you a small corpus
#use the tm package, the code below is based on this
#first build a small corpus of 5 texts taken from Project Gutenberg, or use 5 of your own texts
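# for PC (run only the block that matches your system)
cname <- file.path("C:", "AN_DH2016", "corpus")
cname       # print the path to check it
dir(cname)  # list the files in the folder
# for Mac
cname <- file.path("~", "Desktop", "your folder name")
cname
dir(cname)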
#point RStudio to your corpus
library(tm)
docs <- Corpus(DirSource(cname))
#see the list of files
summary(docs)
#preprocessing
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
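docs <- tm_map(docs, content_transformer(tolower))  # tolower is a base R function, so wrap it for tm
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("word1", "word2"))  # replace word1, word2 with your own words to drop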
docs <- tm_map(docs, stripWhitespace)
#at any point you can inspect using
inspect(docs[3])
# remind yourself how many documents you have
length(docs)
#replace or remove some characters from documents
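# wrap gsub so that tm_map changes the text but keeps each item a tm document
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")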
#if you want to remove endings like -ing -es -s, it's called stemming
library(SnowballC)
docs <- tm_map(docs, stemDocument)
#finish the pre-processing
docs <- tm_map(docs, PlainTextDocument)
#create documenttermmatrix, termdocumentmatrix
dtm <- DocumentTermMatrix(docs)
dtm
inspect(dtm[1:3, 1:20])
dim(dtm)
tdm <- TermDocumentMatrix(docs)
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
m <- as.matrix(dtm)
dim(m)
write.csv(m, file = "ANdtm.csv")
freq2 <- rowSums(as.matrix(tdm))  # terms are rows in a term-document matrix, so sum across rows
length(freq2)
ord2 <- order(freq2)
m2 <- as.matrix(tdm)
dim(m2)
write.csv(m2, file = "ANtdm2.csv")
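The two matrices hold the same counts in different orientations: a document-term matrix has one row per document, while a term-document matrix has one row per term. A small base R sketch may make this concrete. The three toy documents and all variable names below are made up for illustration, and the matrix is built by hand rather than with tm.

```r
# Toy documents (hypothetical, for illustration only)
texts <- c(doc1 = "good friend good night",
           doc2 = "dear friend",
           doc3 = "good night")

tokens <- strsplit(texts, " ", fixed = TRUE)  # split each document into words
vocab  <- sort(unique(unlist(tokens)))        # every distinct term, alphabetically

# Count each vocabulary term in each document; vapply returns terms x documents
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = vocab))),
                 integer(length(vocab)))

toy_dtm <- t(counts)            # documents as rows, terms as columns
colnames(toy_dtm) <- vocab
toy_tdm <- t(toy_dtm)           # terms as rows, documents as columns

freq_dtm <- colSums(toy_dtm)    # term totals from the document-term matrix
freq_tdm <- rowSums(toy_tdm)    # the same totals from the term-document matrix

print(toy_dtm)
print(sort(freq_dtm, decreasing = TRUE))
```

Summing the columns of the document-term matrix and summing the rows of the term-document matrix give the same term totals, which is why the choice between `colSums` and `rowSums` depends on which matrix you start from.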
# wordcloud
install.packages("wordcloud")
library(wordcloud)
wordcloud(docs)
wordcloud(docs, random.order = FALSE, scale = c(6, 0.5))
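The snippet above computes `ord <- order(freq)` but never uses it. As a rough sketch of what that ordering is good for, the following base R example picks out the most frequent terms. The counts are invented placeholders, not results from any text discussed here.

```r
# Hypothetical term counts standing in for the freq vector computed from the dtm
freq <- c(night = 7, good = 12, count = 3, friend = 9, blood = 5)

# order() returns positions sorted from least to most frequent
ord <- order(freq)

# The last positions in ord point at the most frequent terms;
# rev() puts the most frequent term first
top <- freq[rev(tail(ord, 3))]
print(top)
```

A named vector like `top` can also be handed to `wordcloud()` directly, with `names(top)` as the `words` argument and `top` as the `freq` argument, if you want to plot precomputed counts rather than the whole corpus.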