How to do the Titanic Kaggle competition in R – Part 1



As part of submitting to Data Science Dojo’s Kaggle competition, you need to build a model from the Titanic data set. We will show you how to do this using RStudio.

Titanic Data Set:
https://www.kaggle.com/c/titanic

Download RStudio:
https://www.rstudio.com/products/rstudio


48 replies on “How to do the Titanic Kaggle competition in R – Part 1”

At 31:45, for output.df$Survived <- Survived, I keep getting this error:
Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :
replacement has 891 rows, data has 418

It seems like the training set is used instead of the test set, but I don't know why.

At 27:34, I don't know why my R shows this:

titanic.model <- randomForest(formula = Sur.formula, date = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
Error in eval(predvars, data, env) : cannot find 'Survived'
Error in randomForest.default(m, y, …) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion

Anyone? I am sure there are no NA values in the variables selected…
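
A possible cause of the first error, judging only from the call quoted above: the argument is spelled date instead of data, so the data frame is never used when the formula is evaluated and R cannot find Survived. The sketch below reproduces the same kind of failure with base R's lm() as a stand-in for randomForest (an assumption, made so the example runs without extra packages). The data frame demo.df and its column names are hypothetical.

```r
# Hypothetical toy data; the column names are chosen so that they do not
# clash with anything already in the workspace.
demo.df <- data.frame(demo.survived = c(0, 1, 1, 0),
                      demo.fare     = c(7.25, 71.28, 53.10, 8.05))

# Misspelled argument: 'date' is swallowed by '...', so no data frame is
# used when the formula is evaluated and the column cannot be found.
typo.message <- tryCatch(
  {
    lm(demo.survived ~ demo.fare, date = demo.df)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(typo.message)

# Correct spelling: the columns are looked up inside demo.df.
fixed.model <- lm(demo.survived ~ demo.fare, data = demo.df)
print(class(fixed.model))
```

If the same applies to the randomForest call, changing date = titanic.train to data = titanic.train should make the first error go away.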

Here is my code:
survived.equation <- "Survived ~ Pclass + Oclass + Sex + Age + SibSp + Parch + Fare + Embarked"

survived.formula <- as.formula(survived.equation)

model1 <-randomForest(formula = survived.formula, data = train_data, ntree = 500, mtry = 3, nodesize = 0.1 * nrow(train_data))

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Oclass"

Survived <- predict(model1, newdata = test_data)

When I run this in Kaggle Notebooks, I just get a list of 0's and 1's, but it doesn't include the PassengerId.
What did I miss? How do I apply the feature to the predict line?
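
One way to keep the IDs, sketched with base R only: predict() returns just the predicted labels, so the PassengerId column has to be attached afterwards. Here test_data and Survived are small hypothetical stand-ins for the real test set and for the result of predict(model1, newdata = test_data), and the file name is an assumption.

```r
# Hypothetical stand-ins for the real test set and the predicted labels.
test_data <- data.frame(PassengerId = c(892L, 893L, 894L))
Survived  <- factor(c(0, 1, 0))

# Pair each prediction with the ID of the row it was predicted for.
output.df <- data.frame(PassengerId = test_data$PassengerId,
                        Survived    = Survived)

# Write a two-column CSV without R's row names.
submission.file <- file.path(tempdir(), "titanic_submission.csv")
write.csv(output.df, file = submission.file, row.names = FALSE)
print(readLines(submission.file))
```

This relies on the predictions being in the same row order as test_data, which holds when they come straight from predict() on that data frame.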

Error in randomForest.default(m, y, …) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
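
A commonly reported cause of this error (an assumption here, since the code is not shown) is that text columns such as Sex or Embarked were read in as character and never converted to factors, which is what the "NAs introduced by coercion" warnings hint at. A quick way to spot and convert such columns, sketched on a hypothetical three-row data frame:

```r
# Hypothetical data read with stringsAsFactors = FALSE.
demo.df <- data.frame(Sex      = c("male", "female", "female"),
                      Embarked = c("S", "C", "S"),
                      Fare     = c(7.25, 71.28, 7.92),
                      stringsAsFactors = FALSE)

# Which columns are still plain text?
is.text <- sapply(demo.df, is.character)
print(names(demo.df)[is.text])

# Convert only those columns to factors; numeric columns are left alone.
demo.df[is.text] <- lapply(demo.df[is.text], as.factor)
str(demo.df)
```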

At 31:45, for output.df$Survived <- Survived, I keep getting this error:
Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :
replacement has 418 rows, data has 0

Has anyone else got this, and can you help please?
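
The "data has 0" part says that output.df had no rows when the column was assigned. One way this can happen (a guess, since the code is not shown) is a misspelled column name: R is case sensitive, so titanic.test$PassengerID is NULL, and as.data.frame(NULL) is an empty data frame. The sketch uses a hypothetical three-row titanic.test.

```r
# Hypothetical stand-ins for the test set and the predictions.
titanic.test <- data.frame(PassengerId = c(892L, 893L, 894L))
Survived     <- c(0L, 1L, 0L)

# Misspelled name ("ID" instead of "Id") gives NULL, hence an empty data frame.
bad.ids  <- titanic.test$PassengerID
empty.df <- as.data.frame(bad.ids)
print(dim(empty.df))

# With the correct name, checking row counts before the assignment
# makes any mismatch visible early.
PassengerId <- titanic.test$PassengerId
output.df   <- as.data.frame(PassengerId)
stopifnot(nrow(output.df) == length(Survived))
output.df$Survived <- Survived
print(output.df)
```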

Error in na.fail.default(list(Survived = c(1L, 2L, 2L, 2L, 1L, 1L, 1L, : missing values in object
Can someone tell me what I am missing?

#import the data
titanic.train <- read.csv("Titanic/train.csv", stringsAsFactors = FALSE)
titanic.test <- read.csv("Titanic/test.csv", stringsAsFactors = FALSE)
#tail(titanic.train)
#tail(titanic.test)
#str(titanic.train)
#str(titanic.test)
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#ncol(titanic.train)
#ncol(titanic.test)
titanic.test$Survived <- NA
#names(titanic.train)
#names(titanic.test)
titanic.full <- rbind(titanic.train,titanic.test)
tail(titanic.full)
#tail(titanic.test)

table(titanic.full$IsTrainSet)
titanic.full[titanic.full$Embarked=='',"Embarked"] <- "S"

#tail(titanic.full)

fare.median <- median(titanic.full$Fare, na.rm = TRUE)

titanic.full[is.na(titanic.full$Fare), "Fare"] <- fare.median

age.median <- median(titanic.full$Age, na.rm = TRUE)

titanic.full[is.na(titanic.full$age), "Age"] <- age.median

titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)

titanic.train <- titanic.full[titanic.full$IsTrainSet==TRUE,]
titanic.test <- titanic.full[titanic.full$IsTrainSet==FALSE,]

titanic.train[!is.na(titanic.train)]

titanic.train$Survived <- as.factor(titanic.train$Survived)
table(titanic.train$Survived)
tail(titanic.train)
survived.equation = "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
survived.formula = as.formula(survived.equation)

#install.packages("randomForest")
library(randomForest)

table(titanic.train$Survived)

titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

Survived <- predict(titanic.model, newdata = titanic.test)

PassengerId <- titanic.test$PassengerId

output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
tail(output.df)
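
A likely culprit in the code above is the line that fills in missing ages: it tests titanic.full$age (lower case), a column that does not exist, so the mask is empty and the missing Age values stay in place. randomForest's default na.action is na.fail, which then stops with "missing values in object". A minimal sketch on a hypothetical four-row data frame:

```r
# Hypothetical data with two missing ages.
titanic.full <- data.frame(Age = c(22, NA, 38, NA))
age.median   <- median(titanic.full$Age, na.rm = TRUE)

# Lower-case 'age' does not exist, so this mask has length zero
# and would select no rows at all.
wrong.mask <- is.na(titanic.full$age)
print(length(wrong.mask))

# Correct column name: the mask marks the rows that need filling.
right.mask <- is.na(titanic.full$Age)
titanic.full[right.mask, "Age"] <- age.median
print(sum(is.na(titanic.full$Age)))
```

Running sum(is.na(titanic.train$Age)) just before the randomForest call is a quick way to confirm whether this is the problem.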

That's a good video. However, I did not understand why we needed to convert the variables (Pclass, Sex, Embarked) into factors with as.factor. Also, why did we convert Survived into a factor after splitting the two data sets? I would be grateful if someone could explain it in a bit more detail.
Many thanks 🙂

Hi, how does the R random forest classifier deal with data that mixes categorical and quantitative variables?
Also, any suggestions on dealing with the same problem if we want to use other classification algorithms?
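
As far as I know, randomForest accepts categorical and numeric columns side by side in the same formula, as long as the categorical ones are stored as factors. Many other classifiers expect purely numeric input; a common approach there is dummy encoding, which base R's model.matrix() can produce. A small sketch on hypothetical data:

```r
# Hypothetical mix of two categorical columns and one numeric column.
demo.df <- data.frame(
  Sex      = factor(c("male", "female", "female", "male")),
  Embarked = factor(c("S", "C", "Q", "S")),
  Fare     = c(7.25, 71.28, 7.75, 8.05)
)

# model.matrix() expands each factor into 0/1 indicator columns
# (dropping one reference level per factor) and keeps Fare as it is.
design <- model.matrix(~ Sex + Embarked + Fare, data = demo.df)
print(colnames(design))
print(design)
```

The resulting numeric matrix can then be passed to algorithms that do not understand factors directly.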

titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.test))
After this, I get " Error in eval(predvars, data, env) : object 'Sibsp' not found ". It does not work with
titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.train))
either, as someone has commented below. I get the same error. Do you have any solution for this?
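
The message says that a variable named Sibsp is not in the data. The column in the Kaggle file is spelled SibSp, and R names are case sensitive, so the formula string is the place to look. A quick check, sketched here with a hypothetical one-row data frame, is to compare the variables in the formula against the column names:

```r
# Hypothetical one-row stand-in for the training data.
titanic.train <- data.frame(Survived = 1, Pclass = 3, SibSp = 0, Parch = 0)

# Formula with the misspelled variable name.
survived.formula <- as.formula("Survived ~ Pclass + Sibsp + Parch")

# Variables used in the formula that are not columns of the data.
missing.vars <- setdiff(all.vars(survived.formula), names(titanic.train))
print(missing.vars)
```

An empty result means every variable in the formula exists in the data frame.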

Hi,

At around 20:00 he introduces the concept of "categorical casting", and only certain columns are converted to factors.

My questions are:

1. What is "categorical casting"?

2. Why is it used only for certain data and not all?

3. What will happen if we build our model without it?

Many thanks in advance.
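
Roughly, "categorical casting" here means as.factor(): telling R that a column holds category labels rather than quantities. Pclass is stored as 1, 2, 3, but those are labels, so it is converted, while Age and Fare are real measurements and stay numeric. For Survived, randomForest fits a classification model when the response is a factor and a regression model when it is numeric. A minimal sketch with hypothetical values:

```r
# Hypothetical passenger classes stored as numbers.
pclass.raw <- c(3, 1, 2, 3)

# As a number, R will happily average the classes, which means little.
print(mean(pclass.raw))

# As a factor, the values become labels with a fixed set of levels.
pclass.factor <- as.factor(pclass.raw)
print(levels(pclass.factor))
print(table(pclass.factor))

# The same cast on the 0/1 outcome turns it into a two-class label.
survived.factor <- as.factor(c(0, 1, 1, 0))
print(is.factor(survived.factor))
```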
