
# How to do the Titanic Kaggle competition in R – Part 1

To submit to Data Science Dojo’s Kaggle competition, you need to build a model from the Titanic data set. We will show you how to do this using RStudio.

Titanic Data Set:
https://www.kaggle.com/c/titanic

RStudio:
https://www.rstudio.com/products/rstudio

https://datasciencedojo.com/data-science-bootcamp/

Watch the latest video tutorials here:
https://tutorials.datasciencedojo.com/

See what our past attendees are saying here:
https://datasciencedojo.com/bootcamp/reviews/#videos

Also find us on:
Instagram: https://www.instagram.com/data_science_dojo
Vimeo: https://vimeo.com/datasciencedojo

#rtutorial #kaggle #rprogramming


## 48 replies on “How to do the Titanic Kaggle competition in R – Part 1”

Can someone tell me why he combined the two datasets? Does that mean he changed the facts?
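One plausible reason for combining the two sets, sketched here on made-up toy data rather than the real Kaggle files: cleaning and factor conversion are applied once, so train and test end up with identical columns and factor levels. The passenger records themselves are not altered, and the `IsTrainSet` flag (the name used in the code pasted later in this thread) lets the rows be split apart again.

```r
# Toy stand-ins for train.csv and test.csv (all values are invented)
train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1),
                    Embarked = c("S", "C", "S"), stringsAsFactors = FALSE)
test  <- data.frame(PassengerId = 4:5,
                    Embarked = c("Q", "S"), stringsAsFactors = FALSE)

train$IsTrainSet <- TRUE
test$IsTrainSet  <- FALSE
test$Survived    <- NA        # test has no labels; add the column so names match

full <- rbind(train, test)    # rbind on data frames matches columns by name
full$Embarked <- as.factor(full$Embarked)   # one conversion covers both sets

# Split back into the original two groups
train2 <- full[full$IsTrainSet, ]
test2  <- full[!full$IsTrainSet, ]
```

After the split, both pieces share the same `Embarked` levels even though the toy training rows never contained "Q".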

Sir, could you please explain why the passenger ID changed in the submission file?

Can anyone explain strings as factors? I’m new and don’t get it.
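A minimal sketch of what `stringsAsFactors` controls, using invented values: with `FALSE`, text columns stay as plain character strings; with `TRUE`, they are stored as factors, meaning a fixed set of categories.

```r
# Invented values; the column name mirrors the Titanic data
raw <- data.frame(Sex = c("male", "female", "female"),
                  stringsAsFactors = FALSE)
class(raw$Sex)        # "character": plain text, no fixed set of categories

as_fac <- data.frame(Sex = c("male", "female", "female"),
                     stringsAsFactors = TRUE)
class(as_fac$Sex)     # "factor"
levels(as_fac$Sex)    # "female" "male": the distinct categories
```

Since R 4.0.0 the default is `FALSE` for both `data.frame()` and `read.csv()`, so on a recent R version the argument can be left out with the same result.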

at 31:45 for output.df$Survived <- Survived I keep getting the Error:
Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :

replacement has 891 rows, data has 418. Seems like the training set is used instead of the test set, but I don't know why?

At 12:41: why fill in with the mode? Why not just put NA for not available?
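A sketch of mode imputation on invented values. One practical reason to avoid leaving NA in place is that the model call can stop on missing values, as the `na.fail` error quoted further down this thread shows. The mode is simply the most frequent category, which for `Embarked` is "S" according to the counts quoted later in the thread.

```r
# Invented values; "" marks a missing port, as in the pasted code in this thread
embarked <- c("S", "C", "", "S", "Q", "S", "")

counts     <- table(embarked[embarked != ""])     # frequency of each known port
mode_value <- names(counts)[which.max(counts)]    # most frequent non-missing value

embarked[embarked == ""] <- mode_value            # fill the gaps with the mode
```

Whether the mode is the best fill value is a modelling choice; it is used here only because it is the approach the question refers to.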

Can anyone post the code in detail, step by step? I am facing problems combining the train and test data frames.

At 27:34, I don't know why my R shows this:

titanic.model <- randomForest(formula = Sur.formula, date = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))

Error in eval(predvars, data, env) : cannot find 'Survived'

Please share the R code as well.

I get the error "Need at least two classes to do classification" while running the model. Please tell me how to resolve it.

as.formula is not working

It's not allowing me to do rbind

rf_model <- randomForest(formula = Survived.Fml, data = ttrain, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(ttest))

Error in nrow(x) : argument "x" is missing, with no default

How to resolve this?

Error in randomForest.default(m, y, …) :

NA/NaN/Inf in foreign function call (arg 1)

1: In data.matrix(x) : NAs introduced by coercion

2: In data.matrix(x) : NAs introduced by coercion

Anyone? I am sure there are no NAs values in the variables selected…
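The warning says the NAs were "introduced by coercion", which suggests they are created while converting a column to numbers rather than being present in the data. One common cause, stated as an assumption since the data is not shown, is a predictor that is still stored as character. A quick check and fix on invented rows:

```r
# Invented rows; column names mirror the Titanic data
df <- data.frame(Sex = c("male", "female"), Age = c(22, 38),
                 stringsAsFactors = FALSE)

sapply(df, class)                             # shows which columns are still character

is_chr <- sapply(df, is.character)            # TRUE for each character column
df[is_chr] <- lapply(df[is_chr], as.factor)   # convert those columns to factors
```

On the real data you would probably convert only the columns used in the formula, since free-text columns such as names make poor factors.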

Here is my code:
survived.equation <- "Survived ~ Pclass + Oclass + Sex + Age + SibSp + Parch + Fare + Embarked"

survived.formula <- as.formula(survived.equation)

model1 <-randomForest(formula = survived.formula, data = train_data, ntree = 500, mtry = 3, nodesize = 0.1 * nrow(train_data))

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Oclass"

Survived <- predict(model1, newdata = test_data)

When I run this in Kaggle Notebooks, I just get a list of 0's and 1's, but it doesn't include the PassengerId.
What did I miss? How do I apply the feature to the predict line?
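`predict()` returns only the predictions, so the IDs have to be attached afterwards; `features.equation` plays no part in that call. A minimal sketch with stand-in objects (the file name is hypothetical, and `Survived` here is hard-coded where the thread's code would use the result of `predict()`):

```r
# Stand-ins for the real objects in the question above
test_data <- data.frame(PassengerId = 892:894)
Survived  <- factor(c(0, 1, 0))       # in practice: predict(model1, newdata = test_data)

# Pair each prediction with the ID of the row it was made for
output.df <- data.frame(PassengerId = test_data$PassengerId,
                        Survived    = Survived)

out_file <- file.path(tempdir(), "submission.csv")   # hypothetical file name
write.csv(output.df, file = out_file, row.names = FALSE)
```

`row.names = FALSE` keeps R from writing an extra unnamed index column into the file.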

The github link is not working.

Does anybody have the code?

Error in randomForest.default(m, y, …) :
NA/NaN/Inf in foreign function call (arg 1)
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion

I get an error when rbinding titanic.test with train. Error in match.names (clabs, names(xi)) : Names do not match previous names? What do I do to fix this?
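`rbind()` on data frames needs both inputs to have the same column names. A likely cause here, assuming the standard Kaggle files, is that the test set has no `Survived` column (or that a helper column was added to only one of the two). A sketch on invented rows for finding and fixing the mismatch:

```r
# Invented rows standing in for the two Kaggle files
train <- data.frame(PassengerId = 1:2, Survived = c(0, 1))
test  <- data.frame(PassengerId = 3:4)

setdiff(names(train), names(test))   # columns missing from test: "Survived"

test$Survived <- NA                  # placeholder so both have the same names
full <- rbind(train, test)
```

Running `setdiff()` in both directions shows every column that exists in one data frame but not the other.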

at 31:45 for output.df$Survived <- Survived I keep getting the Error:
Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :

replacement has 418 rows, data has 0
Anyone else got this and can you help please?

Can you upload step by step project with data cleaning and missing values in r

My final Survived values are not a bunch of 0s and 1s; I am getting a bunch of probabilities instead. Any idea why?

Take a shot every time he does that annoying click noise.

Thank you so much! It's exactly what i was looking for

Error in na.fail.default(list(Survived = c(1L, 2L, 2L, 2L, 1L, 1L, 1L, : missing values in object
Can someone guide me on what I am missing?

#import the data
titanic.train <- read.csv("Titanic/train.csv", stringsAsFactors = FALSE)
titanic.test <- read.csv("Titanic/test.csv", stringsAsFactors = FALSE)
#tail(titanic.train)
#tail(titanic.test)
#str(titanic.train)
#str(titanic.test)
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#ncol(titanic.train)
#ncol(titanic.test)
titanic.test$Survived <- NA
#names(titanic.train)
#names(titanic.test)
titanic.full <- rbind(titanic.train,titanic.test)
tail(titanic.full)
#tail(titanic.test)

table(titanic.full$IsTrainSet)
titanic.full[titanic.full$Embarked=='',"Embarked"] <- "S"

#tail(titanic.full)

fare.median <- median(titanic.full$Fare, na.rm = TRUE)

titanic.full[is.na(titanic.full$Fare), "Fare"] <- fare.median

age.median <- median(titanic.full$Age, na.rm = TRUE)

titanic.full[is.na(titanic.full$age), "Age"] <- age.median

titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)

titanic.train <- titanic.full[titanic.full$IsTrainSet==TRUE,]
titanic.test <- titanic.full[titanic.full$IsTrainSet==FALSE,]

titanic.train[!is.na(titanic.train)]

titanic.train$Survived <- as.factor(titanic.train$Survived)
table(titanic.train$Survived)
tail(titanic.train)
survived.equation = "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
survived.formula = as.formula(survived.equation)

#install.packages("randomForest")
library(randomForest)

table(titanic.train$Survived)

titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

Survived <- predict(titanic.model, newdata = titanic.test)

PassengerId <- titanic.test$PassengerId

output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
tail(output.df)
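A likely cause of the `na.fail` error, offered as a reading of the pasted script rather than a confirmed diagnosis: the Age imputation line tests `titanic.full$age` (lowercase "a"), but the column is named `Age`. R names are case-sensitive, so the test selects no rows and the missing ages are never filled. The effect can be reproduced on invented values:

```r
df <- data.frame(Age = c(22, NA, 30))   # invented values with one missing age

is.null(df$age)          # TRUE: there is no column called "age"
length(is.na(df$age))    # 0: the row selector is empty, so nothing would be imputed
anyNA(df$Age)            # TRUE: the missing value is still there

# Corrected column name in the selector
df[is.na(df$Age), "Age"] <- median(df$Age, na.rm = TRUE)
anyNA(df$Age)            # FALSE
```

In the pasted script, the corresponding fix would be to write `is.na(titanic.full$Age)` in the Age imputation line.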

Why were the 2 missing Embarked cells assigned to S and not to C or Q?

am getting that error

Why is he going so fucking fast

That's a good video. However, I did not understand why we needed to convert the variables (Pclass, Sex, Embarked) into factors with as.factor. Also, why did we convert Survived into a factor only after splitting the two data sets? I would be grateful if someone could explain it in a bit more detail.
Many thanks 🙂

titanic.combine <- rbind(titanic.train, titanic.testing)

Error in match.names(clabs, names(xi)) :

names do not match previous names

im getting this error

why particularly random forest? Why not any other classification technique?

Nice Video!
But, I get an Error for nodesize = 0.01 * nrow(iris.train) stating "Error in nrow(x) : argument "x" is missing, with no default"
Can anybody help me?

very well explained! Thank you so much..

Why are we using the median here and not the mean?
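A small illustration on invented numbers of why the median is often preferred for skewed columns such as ticket prices: a few very large values pull the mean upward, while the median stays at the middle of the data.

```r
fare <- c(7, 8, 8, 10, 13, 26, 512)   # invented, right-skewed like ticket prices

mean(fare)     # pulled well above most of the values by the single large fare
median(fare)   # 10: the middle value, unaffected by the outlier
```

For a roughly symmetric column the two would be close, and either could serve as a fill value.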

Can you stop doing that annoying chup chup sound ??

Hi, why did we add 'S' only for the missing values? Also, by mistake I added a lowercase s, which now shows me C Q s S
270 123 2 914. How can I remove the lowercase s now?
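One way to merge a mistyped lowercase "s" back into "S", sketched on invented values and assuming the column has already been converted to a factor: go through character, fix the value, then rebuild the factor so the stray level disappears.

```r
embarked <- factor(c("C", "Q", "s", "S", "S"))   # invented values with a stray "s"
levels(embarked)                                  # includes both "s" and "S"

embarked <- as.character(embarked)                # work on plain text
embarked[embarked == "s"] <- "S"                  # fix the mistyped value
embarked <- factor(embarked)                      # rebuild; the unused level is gone
```

If the column is still character, only the middle assignment is needed. Re-reading the CSV and rerunning the script from the top achieves the same result.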

Hi, how does the R random forest classifier deal with data that mixes categorical and quantitative variables?
Also, any suggestions on handling the same problem if we want to use other classification algorithms?

Awesome tutorial. Thanks for sharing. 🙂

Hi… After using rbind() to create titanic.full, I'm getting this:
Error in titanic.full$IsTrainSet :
$ operator is invalid for atomic vectors
What should I do?

Does this video display as blurry for anyone else?

titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.test))
after this, " Error in eval(predvars, data, env) : object 'Sibsp' not found " is coming. Nor is it working with
titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.train))
as someone has commented below. The same error is coming. Do you have any solution for this?
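The message names `Sibsp`, while the column in the Kaggle data is spelled `SibSp` (capital S twice), as in the formulas quoted elsewhere in this thread. Changing `nodesize` would not affect that. A quick way to spot such mismatches, shown on invented rows:

```r
# Invented rows; column names mirror the Titanic data
df <- data.frame(Survived = c(0, 1), SibSp = c(1, 0), Parch = c(0, 0))

f <- as.formula("Survived ~ Sibsp + Parch")     # note the lowercase "s" in Sibsp
setdiff(all.vars(f), names(df))                 # "Sibsp": in the formula, absent from the data

f_ok <- as.formula("Survived ~ SibSp + Parch")
setdiff(all.vars(f_ok), names(df))              # character(0): every variable exists
```

`all.vars()` lists every variable a formula refers to, so an empty `setdiff()` means the formula and the data agree.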

titanic.train <- titanic.full[titanic.full$ISTrainSet==TRUE,]
titanic.test <- titanic.full[titanic.full$ISTrainSet==FALSE,]
Those 2 lines make me lose all my observations…
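Assuming the flag column was created as `IsTrainSet` (lowercase "s"), as in the code pasted earlier in this thread, the spelling `ISTrainSet` refers to a column that does not exist. The comparison then selects nothing, which would explain the empty result. Reproduced on invented rows:

```r
# Invented rows with the flag column spelled as in the earlier code
full <- data.frame(PassengerId = 1:4,
                   IsTrainSet  = c(TRUE, TRUE, FALSE, FALSE))

nrow(full[full$ISTrainSet == TRUE, ])   # 0: "ISTrainSet" does not exist, nothing is selected
nrow(full[full$IsTrainSet == TRUE, ])   # 2: correct capitalisation
```

`names(titanic.full)` shows the exact spelling of every column.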

Hi

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

I could not see features.equation being used anywhere.

Hi,

At around 20:00 he introduces the concept of "categorical casting", where you convert only certain types of columns to factors.

My doubt is

1. What is "categorical casting"?

2. Why is it used only for certain data and not all?

3. What will happen if we do our model without it?
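A rough sketch of the idea on invented values, not the presenter's own explanation: casting a column to a factor tells R that its values are category labels rather than quantities. `Pclass` is stored as 1, 2, 3, but those are codes for ticket classes, whereas columns like `Age` and `Fare` are genuine measurements and stay numeric.

```r
pclass <- c(1, 3, 2, 3)          # invented values; Pclass is a ticket class code

mean(pclass)                     # 2.25: as numbers, the codes are treated as quantities

pclass_f <- as.factor(pclass)    # cast to categorical
levels(pclass_f)                 # "1" "2" "3": three distinct categories
is.numeric(pclass_f)             # FALSE: no arithmetic is implied any more
```

The same applies to the response: `randomForest()` treats a factor response as a classification problem and a numeric one as regression, so leaving `Survived` numeric tends to produce fractional predictions instead of 0/1 classes.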