As part of submitting to Data Science Dojo’s Kaggle competition, you need to build a model from the Titanic data set. This tutorial shows how to do that in RStudio.

Titanic Data Set:

https://www.kaggle.com/c/titanic

Download RStudio:

https://www.rstudio.com/products/rstudio


## 48 replies on “How to do the Titanic Kaggle competition in R – Part 1”

Can someone tell me why he combined the two datasets together? Does that mean he changed the facts?
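
One common reason for combining the two files is so that each cleaning step is applied once to both sets; the flag column lets you split them apart again afterwards, and the labels themselves are not altered. A minimal sketch on toy data (the tiny data frames and their values are invented, not taken from the Kaggle CSVs):

```r
# Toy stand-ins for train.csv and test.csv (values are invented).
train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1), Age = c(22, NA, 35))
test  <- data.frame(PassengerId = 4:5, Age = c(NA, 40))

train$IsTrainSet <- TRUE
test$IsTrainSet  <- FALSE
test$Survived    <- NA        # test has no labels; add the column so names match

full <- rbind(train, test)    # rbind matches data frame columns by name

# Clean once, for both sets at the same time.
full$Age[is.na(full$Age)] <- median(full$Age, na.rm = TRUE)

# Split back into the original sets using the flag.
train.clean <- full[full$IsTrainSet, ]
test.clean  <- full[!full$IsTrainSet, ]
```

The Survived values of the training rows come out unchanged; only the missing Age entries are filled.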

Sir, could you please explain why the passenger ID changed in the submission file?

Can anyone explain strings as factors? I’m new and don’t get it.
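
A factor is R's type for categorical values with a fixed set of levels. With `stringsAsFactors = FALSE`, text columns stay plain character strings, so you can decide later which ones to convert. A small sketch using an inline CSV string (the names and values are made up):

```r
# Inline CSV so the example needs no file on disk.
csv <- "Name,Sex\nAnna,female\nBob,male\nCara,female"

as.text    <- read.csv(text = csv, stringsAsFactors = FALSE)
as.factors <- read.csv(text = csv, stringsAsFactors = TRUE)

class(as.text$Sex)       # "character"
class(as.factors$Sex)    # "factor"
levels(as.factors$Sex)   # "female" "male"
```

Reading as text first avoids turning free-text columns such as names into factors with one level per row.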

at 31:45 for output.df$Survived <- Survived I keep getting the Error:

Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :

replacement has 891 rows, data has 418. Seems like the training set is used instead of the test set, but I don't know why?

What is the 70/30 split? A link to the minute where he talked about this would be helpful.
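
A 70/30 split usually means holding back 30% of the labelled rows to check the model and fitting on the other 70%. Whether the video does it exactly this way is an assumption; the sketch below only shows the general idea on an invented data frame:

```r
set.seed(42)                               # make the random split repeatable
df <- data.frame(id = 1:10, y = rep(c(0, 1), 5))

# Pick 70% of the row numbers at random for training.
train.idx  <- sample(nrow(df), size = floor(0.7 * nrow(df)))
train.part <- df[train.idx, ]
valid.part <- df[-train.idx, ]             # the remaining 30%
```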

Why fill in with the mode? Why not just put NA for not available? (12:41)
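
randomForest's default `na.action` is `na.fail`, so rows left as NA stop the model from fitting. Filling a categorical column with its most frequent value is one simple workaround. A sketch on an invented vector, where `""` marks a missing entry as in the raw CSV:

```r
embarked <- c("S", "C", "S", "", "Q", "S", "")   # invented values

# Count the non-missing values and pick the most frequent one.
counts     <- table(embarked[embarked != ""])
mode.value <- names(counts)[which.max(counts)]

embarked[embarked == ""] <- mode.value           # fill the gaps with the mode
```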

Can anyone post the code in detail, step by step? I am having trouble combining the train and test data frames.

At 27:34, I don't know why my R shows:

"titanic.model <- randomForest(formula = Sur.formula, date = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))

Error in eval(predvars, data, env) : cannot find 'Survived'

Please share the R code also.

I get a "Need at least two classes to do classification" error while running the model. Please tell me how to resolve it.

as.formula is not working

It's not allowing me to do rbind

rf_model <- randomForest(formula = Survived.Fml, data = ttrain, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(ttest))

Error in nrow(x) : argument "x" is missing, with no default

How to resolve this?

Nice video

Error in randomForest.default(m, y, …) :

NA/NaN/Inf in foreign function call (arg 1)

In addition: Warning messages:

1: In data.matrix(x) : NAs introduced by coercion

2: In data.matrix(x) : NAs introduced by coercion

Anyone? I am sure there are no NAs values in the variables selected…
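
These warnings often appear when a predictor is still a character column rather than a factor, so the problem may not be missing values at all. That is a possible cause, not a confirmed diagnosis for this data. A sketch for checking and converting, on an invented data frame:

```r
df <- data.frame(Sex = c("male", "female", "male"),
                 Age = c(22, 38, 26),
                 stringsAsFactors = FALSE)

sapply(df, class)                  # Sex is "character", Age is "numeric"

# Convert every character column to a factor, leave the rest alone.
is.text <- sapply(df, is.character)
df[is.text] <- lapply(df[is.text], as.factor)
```

Running `sapply(df, class)` on the real training set before fitting shows which columns still need converting.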

Here is my code:

survived.equation <- "Survived ~ Pclass + Oclass + Sex + Age + SibSp + Parch + Fare + Embarked"

survived.formula <- as.formula(survived.equation)

model1 <-randomForest(formula = survived.formula, data = train_data, ntree = 500, mtry = 3, nodesize = 0.1 * nrow(train_data))

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Oclass"

Survived <- predict(model1, newdata = test_data)

When I run this in Kaggle Notebooks, I just get a list of 0's and 1's, but it doesn't include the PassengerId.

What did I miss? How do I apply the feature to the predict line?
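
`predict()` only returns the predictions, so the IDs have to be attached afterwards. A minimal sketch, where the `Survived` vector is a hard-coded stand-in for the result of `predict(model1, newdata = test_data)` and the output file name is hypothetical:

```r
test_data <- data.frame(PassengerId = 892:894)
Survived  <- factor(c(0, 1, 0))   # stand-in for the real predictions

# Pair each prediction with its PassengerId.
output <- data.frame(PassengerId = test_data$PassengerId,
                     Survived    = Survived)

# Write without row names so the file has exactly two columns.
out.file <- file.path(tempdir(), "submission.csv")
write.csv(output, file = out.file, row.names = FALSE)
```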

The github link is not working.

Does anybody have the code?

batman 15:21

Error in randomForest.default(m, y, …) :

NA/NaN/Inf in foreign function call (arg 1)

In addition: Warning messages:

1: In data.matrix(x) : NAs introduced by coercion

2: In data.matrix(x) : NAs introduced by coercion

I get an error when rbinding titanic.test with train. Error in match.names (clabs, names(xi)) : Names do not match previous names? What do I do to fix this?
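
`rbind()` on data frames needs the same column names on both sides, and the test file has no Survived column. A sketch for finding and fixing the mismatch, on invented data:

```r
train <- data.frame(PassengerId = 1:2, Survived = c(0, 1), Age = c(22, 38))
test  <- data.frame(PassengerId = 3:4, Age = c(30, 41))

setdiff(names(train), names(test))   # "Survived": the column test is missing

test$Survived <- NA                  # add it as a placeholder
full <- rbind(train, test)           # now the names match
```

If extra columns such as `IsTrainSet` were added to only one of the two data frames, the same `setdiff()` check will point them out.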

at 31:45 for output.df$Survived <- Survived I keep getting the Error:

Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :

replacement has 418 rows, data has 0

Anyone else got this and can you help please?

Can you upload a step-by-step project with data cleaning and missing values in R?

I am not getting the final Survived values as a bunch of 0s and 1s; I am getting a bunch of probabilities. Any idea why?
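
randomForest fits a regression when the response is numeric and a classifier when it is a factor, so fractional outputs usually mean Survived was never converted. A sketch of the check and the conversion on an invented column:

```r
train <- data.frame(Survived = c(0, 1, 1, 0))

class(train$Survived)    # "numeric": a model would treat this as regression

# Convert the label to a factor before fitting to get class predictions.
train$Survived <- as.factor(train$Survived)
levels(train$Survived)   # "0" "1"
```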

Take a shot every time he does that annoying click noise.

Thank you so much! It's exactly what i was looking for

Error in na.fail.default(list(Survived = c(1L, 2L, 2L, 2L, 1L, 1L, 1L, : missing values in object

Can someone guide me on what I am missing?

#import the data

titanic.train <- read.csv("Titanic/train.csv", stringsAsFactors = FALSE)

titanic.test <- read.csv("Titanic/test.csv", stringsAsFactors = FALSE)

#tail(titanic.train)

#tail(titanic.test)

#str(titanic.train)

#str(titanic.test)

titanic.train$IsTrainSet <- TRUE

titanic.test$IsTrainSet <- FALSE

#ncol(titanic.train)

#ncol(titanic.test)

titanic.test$Survived <- NA

#names(titanic.train)

#names(titanic.test)

titanic.full <- rbind(titanic.train,titanic.test)

tail(titanic.full)

#tail(titanic.test)

table(titanic.full$IsTrainSet)

titanic.full[titanic.full$Embarked=='',"Embarked"] <- "S"

#tail(titanic.full)

fare.median <- median(titanic.full$Fare, na.rm = TRUE)

titanic.full[is.na(titanic.full$Fare), "Fare"] <- fare.median

age.median <- median(titanic.full$Age, na.rm = TRUE)

titanic.full[is.na(titanic.full$age), "Age"] <- age.median

titanic.full$Pclass <- as.factor(titanic.full$Pclass)

titanic.full$Sex <- as.factor(titanic.full$Sex)

titanic.full$Embarked <- as.factor(titanic.full$Embarked)

titanic.train <- titanic.full[titanic.full$IsTrainSet==TRUE,]

titanic.test <- titanic.full[titanic.full$IsTrainSet==FALSE,]

titanic.train[!is.na(titanic.train)]

titanic.train$Survived <- as.factor(titanic.train$Survived)

table(titanic.train$Survived)

tail(titanic.train)

survived.equation = "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

survived.formula = as.formula(survived.equation)

#install.packages("randomForest")

library(randomForest)

table(titanic.train$Survived)

titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

Survived <- predict(titanic.model, newdata = titanic.test)

PassengerId <- titanic.test$PassengerId

output.df <- as.data.frame(PassengerId)

output.df$Survived <- Survived

tail(output.df)
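
A likely cause of the `na.fail` error in the code above is the lowercase `titanic.full$age`: R column names are case-sensitive, so that expression is `NULL`, no rows are selected, and the missing ages are never filled. A sketch of the effect on an invented data frame:

```r
df <- data.frame(Age = c(22, NA, 35))

is.null(df$age)         # TRUE: there is no column called "age"
length(is.na(df$age))   # 0: the row selector is empty, so nothing is replaced

# Using the correctly capitalised name fills the gap.
df[is.na(df$Age), "Age"] <- median(df$Age, na.rm = TRUE)
```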

Hi, please help me; I am not able to download the data set.

Why were the 2 missing Embarked cells assigned to S and not to C or Q?

Error in tail(titanic.train) : object 'titanic.train' not found.

I am getting that error.

Why is he going so fucking fast

That's a good video. However, I did not understand why we needed to convert the variables (Pclass, Sex, Embarked) into factors through as.factor. Also, why did we convert Survived into a factor after splitting the two data sets here? I would be grateful if someone could explain it in a bit more detail.

Many thanks 🙂

great!

titanic.combine <- rbind(titanic.train, titanic.testing)

Error in match.names(clabs, names(xi)) :

names do not match previous names

I'm getting this error.

Anyone, please help me out.

Why random forest in particular? Why not any other classification technique?

Nice Video!

But, I get an Error for nodesize = 0.01 * nrow(iris.train) stating "Error in nrow(x) : argument "x" is missing, with no default"

Can anybody help me?

very well explained! Thank you so much..

Why are we using the median here and not the mean?
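
One common argument for the median is that it is not pulled around by a few extreme values, which fares tend to have. A sketch on invented fares with one very expensive ticket:

```r
fares <- c(7.25, 8.05, 13, 26, 512.33)   # invented values, one outlier

mean(fares)     # roughly 113, dragged up by the single large fare
median(fares)   # 13, closer to what a typical passenger paid
```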

Can you stop doing that annoying chup chup sound ??

Hi, why did we add 'S' only for the missing values? Also, by mistake I added a lowercase s, so the table now shows:

C Q s S

270 123 2 914

How can I remove the lowercase s now?
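
If Embarked is already a factor, renaming the stray level to an existing one merges the two. If it is still a character column, a plain replacement of the values does the same job. A sketch on an invented factor:

```r
embarked <- factor(c("C", "Q", "s", "S", "S"))   # invented values

# Renaming level "s" to "S" merges it into the existing "S" level.
levels(embarked)[levels(embarked) == "s"] <- "S"

table(embarked)   # only C, Q and S remain
```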

Hi, how does the R random forest classifier deal with data that mixes categorical and quantitative variables?

Also, any suggestions on dealing with the same problem if we want to use other classification algorithms?

Awesome tutorial. Thanks for sharing. 🙂

Hi… After using rbind() to create titanic.full, I'm getting this:

Error in titanic.full$IsTrainSet :

$ operator is invalid for atomic vectors

What should I do?

Does this video display as blurry for anyone else?

titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.test))

After this, I get "Error in eval(predvars, data, env) : object 'Sibsp' not found". It does not work with the following either:

titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.train))

as someone has commented below. I get the same error. Do you have any solution for this?
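
The column in the Kaggle file is spelled `SibSp`, and R names are case-sensitive, so a formula that says `Sibsp` cannot find it. A sketch for listing formula variables that are missing from the data (the data frame and formula string are invented):

```r
df <- data.frame(Survived = c(0, 1), SibSp = c(1, 0), Parch = c(0, 0))
f  <- as.formula("Survived ~ Sibsp + Parch")   # note the lowercase "s"

# Variables used in the formula that are not columns of df.
setdiff(all.vars(f), names(df))   # "Sibsp"
```

An empty result from `setdiff()` means every name in the formula matches a column.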

titanic.train <- titanic.full[titanic.full$ISTrainSet==TRUE,]

titanic.test <- titanic.full[titanic.full$ISTrainSet==FALSE,]

Those 2 lines make me lose all my observations…
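
A probable cause is the capital `S` in `ISTrainSet`: if the column was created as `IsTrainSet`, the misspelled name is `NULL`, the comparison is empty, and the subset has zero rows. A sketch on an invented data frame:

```r
full <- data.frame(PassengerId = 1:4,
                   IsTrainSet  = c(TRUE, TRUE, FALSE, FALSE))

nrow(full[full$ISTrainSet == TRUE, ])   # 0: no such column, so nothing is selected
nrow(full[full$IsTrainSet == TRUE, ])   # 2: the correctly spelled name works
```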

Hi

features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

I could not see features.equation being used anywhere.

Hi,

At around 20:00 he introduced the concept of "Categorical Casting", where you convert only certain types of columns to factors.

My questions are:

1. What is "Categorical Casting"?

2. Why is it used only for certain data and not all?

3. What will happen if we build our model without it?

Many thanks in advance.
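
Converting a column with `as.factor()` tells R that its values are labels rather than quantities. It makes sense for columns like Pclass, where 1, 2 and 3 are classes, and not for columns like Age, where the number itself carries meaning. A sketch on invented data, using base R's `model.matrix()` only to show how a model would see each column:

```r
df <- data.frame(Pclass = c(1, 3, 2, 3), Age = c(38, 22, 27, 35))

df$Pclass <- as.factor(df$Pclass)   # treat class as a category
levels(df$Pclass)                   # "1" "2" "3"

# The factor becomes one indicator column per non-reference level;
# the numeric column stays a single column.
colnames(model.matrix(~ Pclass + Age, data = df))
# "(Intercept)" "Pclass2" "Pclass3" "Age"
```

Left as a number, Pclass would be treated as if third class were "three times" first class, which is rarely what is intended.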