1. Recently, I read an article on R-bloggers, titled Classifying Breast Cancer as Benign or Malignant using RTextTools, by Timothy P. Jurka, who is the author of both that article and the RTextTools package. Having successfully reproduced the results using the author's R code, I was motivated to explore the usefulness of this package.

    Also, there is an excellent book by Conway & White (2012), Machine Learning for Hackers, that shows the reader how to build a Bayesian spam classifier (let's call it the Benchmark). I was interested to find out how a spam classifier built using RTextTools would compare with the Benchmark. However, this is largely an unexplored area, as there are NOT many example models built using the RTextTools package, so I decided to explore the feasibility of building a model to classify large text, i.e. raw text without ANY features.

    (1) Obtaining the data and loading it into R

    suppressPackageStartupMessages(require(RTextTools))
    suppressPackageStartupMessages(require(tm))
    source("C:/Users/denbrige/100 FxOption/103 FxOptionVerBack/080 Fx Git/R-source/PlusReg.R", echo=FALSE)
    spam.dir <- paste0(RegGetRNonSourceDir(), "spamassassin/")
    get.msg <- function(path.dir)
    {
      con <- file(path.dir, open="rt", encoding="latin1")
      text <- readLines(con)
      msg <- text[seq(which(text=="")[1]+1,length(text),1)]
      close(con)
      return(paste(msg, collapse="\n"))
    }
    get.msg.try <- function(path.dir)
    {
      con <- file(path.dir, open="rt", encoding="latin1")
      text <- readLines(con)
      close(con)
      # Some spam files have no blank line separating header from body;
      # seq() then fails, so we return "Error" for that file instead
      msg <- tryCatch( text[seq(which(text=="")[1]+1,length(text),1)],
                          error=function(e) { "Error" } )
      if( identical(msg[1], "Error") )
      {
        return("Error")
      }
      else
      {
        return(paste(msg, collapse="\n"))
      }
    }
    get.all <- function(path.dir)
    {
      all.file <- dir(path.dir)
      all.file <- all.file[which(all.file!="cmds")]
      sapply(all.file, function(p) get.msg(paste0(path.dir,p)))
    }
    get.all.try <- function(path.dir)
    {
      all.file <- dir(path.dir)
      all.file <- all.file[which(all.file!="cmds")]
      sapply(all.file, function(p) get.msg.try(paste0(path.dir,p)))
    }
    easy_ham.all    <- get.all(paste0(spam.dir, "easy_ham/"))
    easy_ham_2.all  <- get.all(paste0(spam.dir, "easy_ham_2/"))
    hard_ham.all    <- get.all(paste0(spam.dir, "hard_ham/"))
    hard_ham_2.all  <- get.all(paste0(spam.dir, "hard_ham_2/"))
    spam.all        <- get.all.try(paste0(spam.dir, "spam/"))
    spam_2.all      <- get.all(paste0(spam.dir, "spam_2/"))


    First, we download the email data from the SpamAssassin public corpus. EACH classification has TWO (2) sub-folders, e.g. "easy_ham" and "easy_ham_2". This makes it easier, as the first set is used for training data and the second set (with the "_2" suffix) is used for testing data. Remember to change the above code to work with your own folders, i.e. "spam.dir".

    Chapter 3 of Conway & White (2012) explains what the structure of an email looks like, why we are focusing only on the email message body, and how to extract this text from the message files using the function get.msg() above. We use the get.all() function to apply the get.msg() function to ALL of the filenames, except "cmds", in EACH folder, and construct a vector of messages from the returned text.
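
    To see how get.msg() finds the message body, here is a tiny toy example (the input vector is made up); the body is everything after the FIRST blank line:

    text <- c("From: sender@example.com", "Subject: hello", "",
              "This is the body.", "Second line of the body.")
    text[seq(which(text=="")[1]+1, length(text), 1)]
    # [1] "This is the body."  "Second line of the body."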

    (2) Splitting the data into train/test sets

    easy_ham.dfr    <- as.data.frame(easy_ham.all)
    easy_ham_2.dfr  <- as.data.frame(easy_ham_2.all)
    hard_ham.dfr    <- as.data.frame(hard_ham.all)
    hard_ham_2.dfr  <- as.data.frame(hard_ham_2.all)
    spam.dfr        <- as.data.frame(spam.all)
    spam_2.dfr      <- as.data.frame(spam_2.all)
    rownames(easy_ham.dfr)    <- NULL
    rownames(easy_ham_2.dfr)  <- NULL
    rownames(hard_ham.dfr)    <- NULL
    rownames(hard_ham_2.dfr)  <- NULL
    rownames(spam.dfr)        <- NULL
    rownames(spam_2.dfr)      <- NULL
    easy_ham.dfr$outcome    <- 2
    easy_ham_2.dfr$outcome  <- 2
    hard_ham.dfr$outcome    <- 2
    hard_ham_2.dfr$outcome  <- 2
    spam.dfr$outcome        <- 4
    spam_2.dfr$outcome      <- 4
    names(easy_ham.dfr)   <- c("text", "outcome")
    names(easy_ham_2.dfr) <- c("text", "outcome")
    names(hard_ham.dfr)   <- c("text", "outcome")
    names(hard_ham_2.dfr) <- c("text", "outcome")
    names(spam.dfr)       <- c("text", "outcome")
    names(spam_2.dfr)     <- c("text", "outcome")
    train.data  <- rbind(easy_ham.dfr, hard_ham.dfr, spam.dfr)
    train.num   <- nrow(train.data)
    train.data  <- rbind(train.data, easy_ham_2.dfr, hard_ham_2.dfr, spam_2.dfr)
    names(train.data) <- c("text", "outcome")
    spam.str <- paste0(RegGetRNonSourceDir(),"Jurka_03_spam.rda")
    if( !file.exists(spam.str) )
    {
      save(train.data, train.num, file=spam.str)
    }


    Note: There is PROBABLY a more elegant way of writing ALL the above code, but it was late at night and I had a brain freeze, hence the NON-elegant code above.
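
    For what it is worth, here is a minimal sketch of a tidier alternative; the helper function make.dfr() is hypothetical and NOT part of the original script:

    make.dfr <- function(msg.all, outcome)
    {
      # unname() drops the file names that sapply() attached to the messages
      data.frame(text=unname(msg.all), outcome=outcome,
                 stringsAsFactors=FALSE)
    }
    # e.g. easy_ham.dfr <- make.dfr(easy_ham.all, 2)
    #      spam.dfr     <- make.dfr(spam.all, 4)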

    Basically, for EACH data frame, we add a new column "outcome". This contains a numeric integer that classifies both easy_ham and hard_ham as TWO (2), and spam as FOUR (4). We merge these data frames into "train.data" and rename the columns to "text" and "outcome".

    We could increase execution speed by saving "train.data" as an R native file (.rda) and loading that file, instead of re-building it from the raw data EACH time. The saved variable "train.num" contains the number of rows of training data. Again, remember to change the above code to work with your own folders, i.e. "spam.str".
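
    A sketch of the load-side counterpart: on subsequent runs, you could skip the steps above and restore BOTH saved variables from the .rda file.

    if( file.exists(spam.str) )
    {
      load(spam.str)  # restores train.data and train.num
    }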

    (3) Building the model

    set.seed(2012)
    train_out.data <- train.data$outcome
    train_txt.data <- train.data$text

    matrix <- create_matrix(train_txt.data, language="english", minWordLength=3, removeNumbers=TRUE, stemWords=FALSE, removePunctuation=TRUE, weighting=weightTfIdf)
    container <- create_container(matrix,t(train_out.data), trainSize=1:train.num, testSize=(train.num+1):nrow(train.data), virgin=FALSE)
    maxent.model    <- train_model(container, "MAXENT")
    svm.model       <- train_model(container, "SVM")


    The steps for training a model are as follows:
    1. Create a document-term matrix; 
    2. Create a container; 
    3. Create a model by feeding a container to the machine learning algorithm.
    I used the default parameters in the function create_matrix(), except for the following:
    • removeNumbers=TRUE - numbers are removed;
    • weighting=weightTfIdf - TF-IDF weighting; this function comes from the package tm;
    • stemWords=TRUE - words are stemmed, but there is a LIMIT of 255 characters on a word being stemmed. (As a result of this error, I had to set this parameter back to its default: FALSE.)

    To create a container, you need to pass it BOTH a document-term matrix AND an outcome vector, i.e. train_out.data, which is why I had to split "train.data". I used the default parameters in the function create_container(), except for the following:
    • trainSize=1:train.num - a range specifying the row numbers in the data to use for training the model;
    • testSize=(train.num+1):nrow(train.data) - a range specifying the row numbers in the data to use for cross-validation (out-of-sample testing);
    • virgin=FALSE - to specify that the testing set is classified data whose true values are known.

    To create a model, you need to pass it a container. I initially tried ALL NINE (9) algorithms, but due to the memory limitations of 32-bit R, I was forced to create models for ONLY TWO (2) algorithms (see the sketch after this list):
    • SVM - Support Vector Machines; and
    • MAXENT - Maximum Entropy.
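
    As a sketch (assuming 64-bit R with enough memory), RTextTools can train several algorithms in ONE (1) call via train_models(); GLMNET below is just an example of a third algorithm:

    print_algorithms()   # lists the NINE (9) supported algorithm names
    models  <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET"))
    results <- classify_models(container, models)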

    (4) Comparing the model to the Benchmark

    # SVM: split the document summary by true label, then compute confusion rates
    svm.result    <- classify_model(container, svm.model)
    svm.analytic  <- create_analytics(container, svm.result)
    svm.doc       <- svm.analytic@document_summary
    svm_spam.doc  <- svm.doc[svm.doc$MANUAL_CODE==4, ]
    svm_ham.doc   <- svm.doc[svm.doc$MANUAL_CODE==2, ]
    svm.true.pos  <- nrow(svm_spam.doc[svm_spam.doc$CONSENSUS_CODE==4,]) / nrow(svm_spam.doc)
    svm.false.neg <- nrow(svm_spam.doc[svm_spam.doc$CONSENSUS_CODE==2,]) / nrow(svm_spam.doc)
    svm.true.neg  <- nrow(svm_ham.doc[svm_ham.doc$CONSENSUS_CODE==2,]) / nrow(svm_ham.doc)
    svm.false.pos <- nrow(svm_ham.doc[svm_ham.doc$CONSENSUS_CODE==4,]) / nrow(svm_ham.doc)
    # MAXENT: repeat the same computation for the second model
    maxent.result   <- classify_model(container, maxent.model)
    maxent.analytic <- create_analytics(container, maxent.result)
    maxent.doc      <- maxent.analytic@document_summary
    maxent_spam.doc <- maxent.doc[maxent.doc$MANUAL_CODE==4, ]
    maxent_ham.doc  <- maxent.doc[maxent.doc$MANUAL_CODE==2, ]
    maxent.true.pos <- nrow(maxent_spam.doc[maxent_spam.doc$CONSENSUS_CODE==4,]) / nrow(maxent_spam.doc)
    maxent.false.neg<- nrow(maxent_spam.doc[maxent_spam.doc$CONSENSUS_CODE==2,]) / nrow(maxent_spam.doc)
    maxent.true.neg <- nrow(maxent_ham.doc[maxent_ham.doc$CONSENSUS_CODE==2,]) / nrow(maxent_ham.doc)
    maxent.false.pos<- nrow(maxent_ham.doc[maxent_ham.doc$CONSENSUS_CODE==4,]) / nrow(maxent_ham.doc)


    We compare the results of our models with the Benchmark, based on the FALSE-positive (Type I error) and FALSE-negative (Type II error) rates.

    For the Benchmark, we had about a 25% false-positive rate, with the classifier doing slightly better on easy_ham (22%) than on the hard stuff (27%). On the other hand, the false-negative rate is much lower, at only 15%.

    For the SVM algorithm, we beat the Benchmark on TWO (2) counts:
    1. 3.2% false-positive rate (significantly lower than the Benchmark);
    2. 13.2% false-negative rate (slightly lower than the Benchmark).
    For the MAXENT algorithm, we again beat the Benchmark on BOTH counts:
    1. 0.4% false-positive rate (the lowest of ALL three models);
    2. 14.7% false-negative rate (only marginally lower than the Benchmark).
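
    For convenience, here is a small sketch that collects the rates computed above into ONE (1) table:

    rates <- data.frame(model     = c("SVM", "MAXENT"),
                        false.pos = c(svm.false.pos, maxent.false.pos),
                        false.neg = c(svm.false.neg, maxent.false.neg))
    rates[, 2:3] <- round(rates[, 2:3] * 100, 1)   # express as percentages
    rates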

    The Benchmark

    Email Type    TRUE           FALSE
    spam          T-Pos: 85%     F-Neg: 15%
    easy_ham      T-Neg: 78%     F-Pos: 22%
    hard_ham      T-Neg: 73%     F-Pos: 27%

    Our results using the SVM algorithm

    Email Type    TRUE           FALSE
    spam          T-Pos: 86.8%   F-Neg: 13.2%
    ham           T-Neg: 96.8%   F-Pos: 3.2%

    Our results using the MAXENT algorithm

    Email Type    TRUE           FALSE
    spam          T-Pos: 85.3%   F-Neg: 14.7%
    ham           T-Neg: 99.6%   F-Pos: 0.4%

    Conclusion


    I have shown you how to build a spam classifier model using RTextTools, and explored the feasibility of building a model to classify large text. Alternatively, you could build a spam classifier model using a database that has features. You are NOT limited to these TWO (2) algorithms; however, the other algorithms require 64-bit R, or perhaps some big-data packages. Although the model is quite simple (only ONE predictor), the results did beat the Benchmark. Hence, it could be a useful tool for building classifier models to automate your work or tasks.

    The entire code can be viewed here.


  2. Many of the forecasting packages in R require a time series that is covariance stationary. For those who are not familiar with this term, there is an excellent online textbook by Hyndman and Athanasopoulos, Forecasting: Principles and Practice. Click here to go directly to the relevant chapter.

    The first step in building a predictive or analytic procedure, therefore, is to analyze a time series to see if it is already covariance stationary. More often than not, a transformation is required to convert a non-stationary time series into a stationary one.
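
    To make the idea concrete, here is a minimal sketch with made-up toy numbers: diff() removes a trend, while diff(log()) also stabilizes the variance by working with approximate percentage changes.

    x <- c(100, 105, 103, 110)
    diff(x)        # period-to-period changes: 5 -2 7
    diff(log(x))   # approximate percentage changes: 0.049 -0.019 0.066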

    Examples of time series include stock prices, raw material prices, and ANY data ordered by date and time at a given interval. But to keep this article easy to follow, I will use a time series of Powerball, which is a type of lottery game played once a week in Australia.

    Fortunately, the historical data is easy to obtain from the Powerball website as a CSV file. For this example, I will be using only ONE (1) column to keep it easy to follow. However, you can extend this example to the rest of the columns.


    (1) Load the data into a data frame

    library(forecast)
    library(tseries)
    library(zoo)

    # (1-1) Load the data into a data.frame
    hist.df <- read.csv( "powerball.csv", colClasses = "character" )

    # (1-2) Coerce the column into a numeric vector and create a zoo object
    n1.zoo <- zoo( matrix( as.numeric(hist.df$Number_1), ncol=1 ),
                   as.Date(hist.df$Draw_date, "%Y/%m/%d") )
    names(n1.zoo) <- c("n1")

    # (1-3) Note: For the zoo object, the latest result is at the LAST row
    tail(n1.zoo)


    First, we load the data into a data.frame, then we coerce ONE (1) column into a numeric vector and create a zoo object. For the zoo object, the LAST row corresponds to the latest result, the penultimate row corresponds to the previous week's result, and so on.

    (2) Test the column for normality


    # (2-1) Shapiro test (p>0.05 is normal) of diff(log(data))
    #         Requires a vector as input
    n1.vtr <- as.numeric(n1.zoo)

    p3 <- shapiro.test(diff(log(n1.vtr)))$p.value

    # (2-2) Plot histogram and density of diff(log(data))
    hist(diff(log(n1.vtr)), prob = T,
         main = c("diff(log(n1))", paste("Shapiro p=", prettyNum(p3, digits = 2))), xlab = "diff(log(n1))")
    lines(density(diff(log(n1.vtr))))

    # (2-3) QQ plot of the above

    qqnorm(diff(log(n1.zoo)))

    The above code tests the column for a normal distribution. It can be seen from the histogram (and the QQ plot) that diff(log(data)) is approximately normal. However, the Shapiro test indicates that the distribution is NOT normal (the p-value is LESS THAN 0.05).

    (3) Test the column for autocorrelation

    # (3-1) Acf plot
    acf(diff(log(n1.zoo)), main="diff(log(n1))")


    The ACF plot has two horizontal blue lines marking the significance bounds. If ANY period has a vertical line that crosses the blue lines, then there is a significant correlation between that period and period 0. It can be seen in the ACF plot that diff(log(data)) has an autocorrelation between period 1 and period 0.

    (4) Test the column for stationarity

    # (4-1) Augmented Dickey-Fuller (ADF) test
    #       The null-hypothesis for an ADF test is that the data is non-stationary
    #       Therefore, p>0.05 is non-stationary

    adf.test(diff(log(n1.zoo)), alternative="stationary")

    Using the ADF test, we show that diff(log(data)) is a stationary time series.
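
    For contrast, you could run the same test on the untransformed column (a sketch; whether the raw series rejects the null will depend on your data):

    # p>0.05 here would indicate the raw column is non-stationary
    adf.test(as.numeric(n1.zoo), alternative="stationary")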

    (5) Test the column for seasonality

    # (5-1) Seasonal differencing test
    #       If ns>0, then seasonal differencing is required

    nsdiffs(diff(log(n1.zoo)))

    Using the nsdiffs() function, we show that the column does NOT require seasonal differencing.

    (6) Auto fit the column using Arima

    # (6-1) AIC test
    auto.arima(diff(log(n1.zoo)))

    Using the auto.arima() function from the package forecast, we find that the column has the best fitted model (with the lowest AIC) when using the diff(log(data)) form. The AIC is almost THREE (3) times lower than that of the model fitted to the data without transformation.
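
    A sketch of the side-by-side comparison behind that claim (AIC() works on the fitted models returned by auto.arima()):

    fit.raw <- auto.arima(as.numeric(n1.zoo))   # fit without transformation
    fit.dl  <- auto.arima(diff(log(n1.zoo)))    # fit with diff(log(data))
    c(AIC.raw=AIC(fit.raw), AIC.diff.log=AIC(fit.dl))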

    (7) Predict ONE(1)-period lookahead

    # (7-1) Obtain ONE(1)-period lookahead forecast and confidence interval
    f3 <- forecast(auto.arima(diff(log(n1.zoo))), h=1)
    f3

    # (7-2) Transform diff(log(data)) back into data
    #       Select the 95% confidence interval
    #       Powerball numbers MUST be between ONE (1) AND FORTY-FIVE (45)
    u <- exp( log(n1.zoo[NROW(n1.zoo), 1]) + max(f3$upper) )
    p <- exp( log(n1.zoo[NROW(n1.zoo), 1]) + f3$mean )
    l <- exp( log(n1.zoo[NROW(n1.zoo), 1]) + min(f3$lower) )
    u <- min(45, trunc(u))
    p <- round(p)
    l <- max(1, round(l))
    data.frame(lower=l, predict=p, upper=u)


    Based on the above AIC, and taking into consideration ALL the previous tests, we select diff(log(data)) as our time series input, and then predict a ONE (1)-period lookahead for this time series. We also obtain a confidence interval and then transform our predicted value (with intervals) back to its original form: since diff(log(data)) forecasts the change in log(data), the predicted value is exp(log(last value) + forecast), which is exactly what the code above computes.

    Conclusion

    I have shown you how to analyze ONE (1) column of the Powerball time series to determine whether it requires transformation prior to fitting a model for forecasting. The transformations diff(data) and diff(log(data)) are commonly used for MANY time series; however, you are NOT limited to these two types. In fact, you are encouraged to explore other types of transformation that may lead you to a better forecasting model than mine.

    The entire code (and output) can be viewed here.


  3. I write R scripts on both my laptop and desktop, so the main issue I have is making sure that the R scripts are kept up to date on both devices. There are several ways to ensure this happens:
    • Use a version control system (on the cloud), e.g. GitHub
    • Write R scripts on an RStudio server, which is accessible on any web browser
    • Transfer files via FTP or using a remote desktop software, such as Teamviewer
    Which of these methods do I use? Actually, I use all THREE (3) methods to ensure my R scripts are updated.

    The first method has the advantage that you can perform a diff of the changes made to any file after it is checked in to the cloud, and these changes can be pulled (or merged) from any device. I perform only basic functions, such as check in/out, merge/pull, and sync, although it did once save me from a disaster, as I had to roll back several versions to get to a previous stable version of the code.

    The second method is used when I am currently working on a project, but the R script is not yet stable enough to be checked in. The RStudio GUI is available in a web browser, and I can continue working on ANY device. Also, an advantage of having an RStudio server is that I can run R scripts as background processes. For example, I currently have TWO (2) background scripts: (1) the first is a shiny app that I can access using a web browser; and (2) the second is an R app which periodically checks for new files to download and automatically emails the files as attachments to my email address. However, the setup of RStudio server on Ubuntu was arduously long, and I may write a how-to blog post on this for newbies.

    The third method is rather trivial, as the FTP and RDP protocols are very common. Suffice to say that I can easily transfer files between my laptop and desktop once the software has been installed.

    Although R is platform independent, the functions that access the underlying system are NOT. For example, the function setwd() requires a path as input that depends on the OS, e.g. Windows paths start with "C:/", while Linux paths start with "~/". Rather than dealing with these OS-specific details each time I access the system, I decided to create a library of functions to deal with this.

    Some query functions are: RegIsWindowsBln() and RegIsLinuxBln(), which return TRUE/FALSE depending on the OS.

    RegIsWindowsBln <- function() {
      retBln <- (Sys.info()["sysname"] == "Windows")
      names(retBln) <- NULL
      retBln
    }
    RegIsLinuxBln <- function() {
      retBln <- (Sys.info()["sysname"] == "Linux")
      names(retBln) <- NULL
      retBln
    }


    Some system functions are: RegGetHomeDir() and RegGetRDir(), which return paths that are specific to the OS, e.g. "C:/Users/myname" for Windows.

    RegGetHomeDir <- function() {
      retDir <- NULL
      if( RegIsLinuxBln() )
        retDir <- paste0("/home/",Sys.info()["user"],"/")
      if( RegIsWindowsBln() )
        retDir <- paste0("C:/Users/",Sys.info()["user"],"/")
      retDir
    }
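
    The code for RegGetRDir() is not shown above, so here is a minimal sketch, assuming a layout where the R files live in an "R" folder under the home directory (the actual layout in my library may differ):

    RegGetRDir <- function() {
      # Hypothetical layout: an "R" folder under the OS-specific home directory
      paste0(RegGetHomeDir(), "R/")
    }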


    Shell command: RegSystemNum() calls the R function system() with OS-specific parameters and returns an error number.

    RegSystemNum <- function(cmdChr) {
      if( RegIsLinuxBln() )
        errNum <- tryCatch( system(cmdChr, intern=FALSE, wait=TRUE),
                            error=function(e) { 9999 }, finally={} )
      if( RegIsWindowsBln() )
        errNum <- tryCatch( system(cmdChr, show.output.on.console=FALSE,
                                   intern=FALSE, wait=TRUE),
                            error=function(e) { 9999 }, finally={} )
      errNum
    }
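
    A quick usage sketch; "hostname" is chosen ONLY because it exists as a command on both Windows and Linux:

    errNum <- RegSystemNum("hostname")
    if( errNum != 0 ) warning("shell command failed")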


    Also, I extended this library to include some type-checking functions, e.g. RegIsEmailBln(), which returns TRUE/FALSE depending on whether an email address has a valid format.

    RegIsEmailBln <- function( emailChr ) {
      patStr <- "^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@(?:[a-zA-Z0-9-])+(\\.+[a-zA-Z]{2,4}){1,2})$"
      # perl=TRUE is needed because the pattern uses a non-capturing group (?:...)
      grepl(patStr, emailChr, perl=TRUE)
    }
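
    A quick usage sketch (the addresses below are made up):

    RegIsEmailBln("jane.doe@example.com")   # TRUE
    RegIsEmailBln("not-an-email")           # FALSE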


    I will stop here, and I hope this post has encouraged some of you to create and extend your own library of functions.
