The aim of this assessment is to design, implement and evaluate a machine learning solution for detecting fraudulent job postings.


 

CS5608 Big Data Analytics

Coursework for 2019/20

04/04/2020

The dataset contains about 18,000 job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset will be used to create a machine learning classification model that learns to identify fraudulent job descriptions.

The following table describes all the variables in the dataset.

Variable Name

Variable Description

job_id

Unique Job ID

title

The title of the job ad entry.

location

Geographical location of the job ad.

department

Corporate department (e.g. sales).

salary_range

Indicative salary range (e.g. $50,000-$60,000)

company_profile

A brief company description.

description

The detailed description of the job ad.

requirements

The requirements listed for the job opening.

benefits

The benefits offered by the employer.

telecommuting

True for telecommuting positions.

has_company_logo

True if company logo is present.

has_questions

True if screening questions are present.

employment_type

Full-time, Part-time, Contract, etc.

required_experience

Executive, Entry level, Intern, etc.

required_education

Doctorate, Master's Degree, Bachelor, etc.

industry

Automotive, IT, Health care, Real estate, etc.

function

Consulting, Engineering, Research, Sales etc.

fraudulent

Target attribute for classification (1 = fraudulent, 0 = legitimate).

The data was loaded into the R workspace with empty strings treated as NA.

A second dataset, containing 19,000 job postings posted through the Armenian human resources portal CareerCenter, was also loaded.

The common columns in the two datasets are title, location, description, requirements and required experience.

The following table shows the total number of instances of each fraudulent class.

fraudulent  count
         0  17014
         1    866

Therefore, we have 17014 instances of non-fake jobs and 866 instances of fake jobs.

The following table shows the number of missing values in each of the variables described in the table above for both of the target levels.

                     [,1] [,2]
fraudulent              0    1
job_id                  0    0
title                   0    0
location              327   19
department          11016  531
salary_range        14369  643
company_profile      2721  587
description             0    0
requirements         2541  153
benefits             6845  363
telecommuting           0    0
has_company_logo        0    0
has_questions           0    0
employment_type      3230  241
required_experience  6615  435
required_education   7654  451
industry             4628  275
function.            6118  337

We can see that a lot of values are missing in both fraudulent classes, so it would not be a good idea to drop all incomplete cases from the dataset. However, because so many of the variables are largely incomplete, we will use only the description, telecommuting, has_company_logo and has_questions variables to build the machine learning model for predicting whether a job posting is fake.

The following plot shows the distribution of the fraudulent class on whether the job is for telecommuting positions or not.

The following plot shows the distribution of the fraudulent class on whether company logo is present or not.

The following plot shows the distribution of the fraudulent class on whether screening questions are present or not.

From the three plots above, we can see that it is almost impossible to separate the two classes (fraudulent or not) by visual inspection of the three features plotted, namely telecommuting, has_company_logo and has_questions, ignoring any possible interaction between their levels. Therefore, we will build the machine learning classification model using only the description of the job.
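The visual impression from the plots can also be checked numerically with within-class proportion tables (a sketch, assuming the `job` data frame from the appendix, before the non-description columns are dropped):

```r
# Proportion of each feature level within each fraudulent class
# (margin = 2: each column, i.e. each class, sums to 1)
round(prop.table(table(job$telecommuting, job$fraudulent), margin = 2), 3)
round(prop.table(table(job$has_company_logo, job$fraudulent), margin = 2), 3)
round(prop.table(table(job$has_questions, job$fraudulent), margin = 2), 3)
```

If the within-class proportions are similar for the two classes, the variable on its own carries little information for separating them.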

The following steps were performed to build a machine learning classification model from the text in the description field of the job postings.

  • Build a Corpus

corpus <- VCorpus(VectorSource(job$description))

  • Remove punctuation

corpus <- tm_map(corpus,
  content_transformer(removePunctuation))

  • Remove numbers

corpus <- tm_map(corpus, removeNumbers)

  • Convert to lower case

corpus <- tm_map(corpus, content_transformer(tolower))

  • Remove stop words

corpus <- tm_map(corpus,
  content_transformer(removeWords),
  stopwords("english"))

  • Stemming process

corpus <- tm_map(corpus, stemDocument)

  • Strip whitespace

corpus <- tm_map(corpus, stripWhitespace)

  • Create a Document Term Matrix (DTM) and remove sparse terms

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.9)

Principal Component Analysis (PCA) was then performed on the DTM in order to reduce the dimensionality of the dataset. The following scree plot shows the percentage of variance explained by each principal component.

From the above scree plot, we selected the first 10 principal components to build the machine learning classification model.
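Since only 10 components are kept, it is worth checking how much of the total variance they retain (a sketch, assuming `res.pca` is the `prcomp` object created in the appendix):

```r
# Proportion of variance explained by each PC,
# and the cumulative share retained by PC1-PC10
ve <- res.pca$sdev^2 / sum(res.pca$sdev^2)
cumsum(ve)[10]
```

A low cumulative value here would indicate that much of the information in the DTM is discarded, which is one of the limitations noted in the conclusion.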

The data was then split into training (80%) and testing (20%) datasets.

The Bayesian Generalized Linear Model (method "bayesglm") from the caret package in R was used to build a predictive model that, once trained on the training set, can predict whether a job posting from the test set is fraudulent.

The following shows the performance of the selected Bayesian Generalized Linear classification model, first on the training (in-sample) dataset and then on the testing (out-of-sample) dataset, using only the description of the job.

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 13618   674
         1     6     6

               Accuracy : 0.9525         
                 95% CI : (0.9488, 0.9559)
    No Information Rate : 0.9525         
    P-Value [Acc > NIR] : 0.5102     

                  Kappa : 0.0157         

 Mcnemar's Test P-Value : <2e-16         

            Sensitivity : 0.999560 
            Specificity : 0.008824       
         Pos Pred Value : 0.952841       
         Neg Pred Value : 0.500000       
             Prevalence : 0.952461       
         Detection Rate : 0.952041       
   Detection Prevalence : 0.999161
      Balanced Accuracy : 0.504192       

       'Positive' Class : 0              

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 3386  184
         1    4    2

               Accuracy : 0.9474         
                 95% CI : (0.9396, 0.9545)
    No Information Rate : 0.948          
    P-Value [Acc > NIR] : 0.5789         

                  Kappa : 0.0176         

 Mcnemar's Test P-Value : <2e-16         

            Sensitivity : 0.99882        
            Specificity : 0.01075        
         Pos Pred Value : 0.94846        
         Neg Pred Value : 0.33333        
             Prevalence : 0.94799        
         Detection Rate : 0.94687        
   Detection Prevalence : 0.99832        
      Balanced Accuracy : 0.50479        

       'Positive' Class : 0              
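The headline accuracy is misleading here; the class-level metrics reported above can be reproduced by hand from the four cells of the testing confusion matrix (note the 'positive' class is 0, i.e. non-fraudulent):

```r
# Cells of the testing confusion matrix, with class 0 treated as positive
TP <- 3386; FN <- 4    # reference 0: predicted 0 / predicted 1
FP <- 184;  TN <- 2    # reference 1: predicted 0 / predicted 1

sensitivity <- TP / (TP + FN)                   # 0.99882
specificity <- TN / (TN + FP)                   # 0.01075
balanced    <- (sensitivity + specificity) / 2  # 0.50479
round(c(sensitivity, specificity, balanced), 5)
```

The balanced accuracy of about 0.505 makes explicit that the model is barely better than always predicting the majority class.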

In this report, we built a Bayesian Generalized Linear classification model for predicting whether a job posting is fraudulent. In the initial analysis, we found that many characteristics of the job listings were missing in both classes. We then plotted the distributions of the remaining complete characteristics and found that, by visual inspection (ignoring any possible interaction between their levels), the telecommuting, has_company_logo and has_questions variables could not separate the two classes. We therefore used only the description of the job listings to develop the model. This was done by first building a sparse Document Term Matrix (DTM) from the descriptions after removing punctuation, numbers, stop words and extra whitespace, and converting all text to lower case. We then performed PCA on the resulting DTM and kept only the first 10 principal components. The data was split into training (80%) and testing (20%) sets, and the model was trained on the training set and evaluated on both. The model achieved training and testing accuracies of 95.25% and 94.74% respectively, which looks very good at first sight. However, both confusion matrices show that the model predicts almost every instance as non-fraudulent and is therefore unable to detect the fake cases efficiently; the accuracies barely exceed the no-information rate of about 95%. A major cause of this problem is the imbalance between the class instances in the job listing dataset.
A further limitation is the loss of explained variance caused by keeping only a small number of principal components, a trade-off made to reduce the processing power and processing time required to build the classification model.
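One common remedy for the class imbalance (not applied in this report) is to rebalance the training data before fitting, for example by downsampling the majority class with `caret::downSample`. A minimal self-contained sketch on toy data with the same kind of imbalance:

```r
library(caret)  # provides downSample()

set.seed(123)
# Toy data: 90 non-fraudulent (0) rows vs 10 fraudulent (1) rows
toy <- data.frame(
  PC1 = rnorm(100),
  fraudulent = factor(c(rep(0, 90), rep(1, 10)))
)

# Randomly drop majority-class rows until both classes are the same size
down <- downSample(x = toy["PC1"], y = toy$fraudulent, yname = "fraudulent")
table(down$fraudulent)  # 10 rows in each class
```

Applying the same call to the `train` set from the appendix (or setting `sampling = "down"` in `caret::trainControl`) would give the Bayesian GLM an equal number of fake and genuine postings to learn from, at the cost of discarding data.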

 

Appendix

if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}

if (!require(knitr)) {
  install.packages("knitr")
  library(knitr)  # provides kable()
}

if (!require(kableExtra)) {
  install.packages("kableExtra")
  library(kableExtra)
}

if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}

if (!require(tm)) {
  install.packages("tm")
  library(tm)
}

if (!require(SnowballC)) {
  install.packages("SnowballC")
  library(SnowballC)
}

if (!require(caret)) {
  install.packages("caret")
  library(caret)
}

if (!require(factoextra)) {
  install.packages("factoextra")
  library(factoextra)
}

# Data preparation and cleaning

# Load the fake job postings dataset, treating empty strings as NA
job <- read.csv(
  "fake_job_postings.csv",
  na.strings = c("", "NA")
)

# Load the CareerCenter (Armenian HR portal) job postings dataset
data <- read.csv("data job posts.csv")

# Combine the columns common to both datasets
data <- data.frame(
  "id" = c(job$job_id, data$jobpost),
  "title" = c(job$title, data$Title),
  "location" = c(job$location, data$Location),
  "salary" = c(job$salary_range, data$Salary),
  "company" = c(job$company_profile, data$Company),
  "description" = c(job$description, data$JobDescription),
  "requirements" = c(job$requirements, data$RequiredQual)
)

# Count instances of each fraudulent class
kable(job %>%
  group_by(fraudulent) %>%
  summarise(count = n()))

# Count missing values per variable, by class
kable(t(job %>%
  group_by(fraudulent) %>%
  summarise_all(~ sum(is.na(.)))))

# Keep the selected variables and convert the categorical ones to factors
job <- job %>%
  select(
    description,
    telecommuting,
    has_company_logo,
    has_questions,
    fraudulent
  ) %>%
  mutate(
    telecommuting = as.factor(telecommuting),
    has_company_logo = as.factor(has_company_logo),
    has_questions = as.factor(has_questions),
    fraudulent = as.factor(fraudulent)
  )

# Exploratory data analysis

# Class distribution of fraudulent by each of the three binary features
job %>%
  ggplot(aes(telecommuting, ..count..)) +
  geom_bar(aes(fill = fraudulent), position = "dodge")

job %>%
  ggplot(aes(has_company_logo, ..count..)) +
  geom_bar(aes(fill = fraudulent), position = "dodge")

job %>%
  ggplot(aes(has_questions, ..count..)) +
  geom_bar(aes(fill = fraudulent), position = "dodge")

# Keep only the description text and the target
job <- job %>%
  select(description, fraudulent)

# Machine learning prediction

# Build and clean the text corpus
corpus <- VCorpus(VectorSource(job$description))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus,
  content_transformer(removeWords),
  stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

# Document Term Matrix, dropping terms absent from more than 90% of documents
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.9)
dtm.matrix <- as.matrix(dtm)
description <- as.data.frame(dtm.matrix)

rm(dtm, dtm.matrix, corpus)

# PCA on the DTM; scree plot of the first 15 components
res.pca <- prcomp(description, scale = TRUE)
fviz_eig(res.pca, ncp = 15)

# Keep the first 10 principal components and re-attach the target
description <- as.data.frame(res.pca$x[, 1:10])
description$fraudulent <- job$fraudulent

# 80/20 train/test split
set.seed(123)
smp <- floor(0.8 * nrow(description))
trainIndex <- sample(seq_len(nrow(description)), size = smp)
train <- description[trainIndex, ]
test  <- description[-trainIndex, ]

rm(description, smp, trainIndex, res.pca, job)

rm(description, smp, trainIndex, res.pca, job)

# Performance evaluation

# Train the Bayesian Generalized Linear Model
model <- train(fraudulent ~ .,
  data = train,
  method = "bayesglm")

## Training Accuracy

prediction <- predict(model,
  newdata = train,
  type = "raw")

reference <- train$fraudulent

confusionMatrix(prediction, reference)

## Testing Accuracy

prediction <- predict(model,
  newdata = test,
  type = "raw")

reference <- test$fraudulent

confusionMatrix(prediction, reference)

 

R code file attached: R_code.zip

