To submit to Data Science Dojo’s Kaggle competition, you need to build a model from the Titanic data set. We will show you how to do this in RStudio.
Titanic Data Set:
https://www.kaggle.com/c/titanic
Download RStudio:
https://www.rstudio.com/products/rstudio
48 replies on “How to do the Titanic Kaggle competition in R – Part 1”
Can someone tell me why he combined the two datasets? Does that mean he changed the facts?
Sir, could you please explain why the passenger ID changed in the submission file?
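On the question above: combining the two files does not alter any records. It lets the cleaning steps (filling blanks, casting to factors) run once over both sets, so the factor levels end up identical, and the rows are then split apart again. A minimal sketch with made-up toy rows, not the real Kaggle data:

```r
# Toy stand-ins for train.csv and test.csv (values invented for illustration)
train <- data.frame(PassengerId = 1:3, Survived = c(0, 1, 1),
                    Embarked = c("S", "", "C"), stringsAsFactors = FALSE)
test  <- data.frame(PassengerId = 4:5,
                    Embarked = c("Q", "S"), stringsAsFactors = FALSE)

# Flag where each row came from, and give test the column it lacks
train$IsTrainSet <- TRUE
test$IsTrainSet  <- FALSE
test$Survived    <- NA

# rbind matches data frame columns by name, so their order may differ
full <- rbind(train, test)

# Clean once for both sets
full$Embarked[full$Embarked == ""] <- "S"
full$Embarked <- as.factor(full$Embarked)

# Split back into the two original groups of rows
train.clean <- full[full$IsTrainSet, ]
test.clean  <- full[!full$IsTrainSet, ]
```

Both pieces now share the same Embarked levels, which is what the model needs when it later predicts on the test rows.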
Can anyone explain strings as factors? I’m new and don’t get it.
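For the strings-as-factors question: a character column is free text, while a factor is a column limited to a fixed set of categories, stored as integer codes with labels. The `stringsAsFactors` argument decides which of the two you get when text is loaded. A small sketch on an invented column:

```r
# The same toy column built two ways (values invented for illustration)
text.df <- data.frame(Sex = c("male", "female", "male"), stringsAsFactors = FALSE)
cat.df  <- data.frame(Sex = c("male", "female", "male"), stringsAsFactors = TRUE)

class(text.df$Sex)      # "character"
class(cat.df$Sex)       # "factor"
levels(cat.df$Sex)      # "female" "male"
as.integer(cat.df$Sex)  # 2 1 2, the codes behind the labels
```

Reading with `stringsAsFactors = FALSE` and then casting selected columns with `as.factor()` keeps columns such as Name as text while still marking Sex or Embarked as categories.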
at 31:45 for output.df$Survived <- Survived I keep getting the Error:
Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :
replacement has 891 rows, data has 418. Seems like the training set is used instead of the test set, but I don't know why?
What is the 70/30 split? A link to the minute where he talked about this would be helpful.
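A 70/30 split usually means fitting the model on 70% of the labeled rows and holding back the other 30% to check how well it predicts. A minimal sketch on toy data; the seed and the data frame are assumptions for illustration only:

```r
set.seed(42)  # any seed works; it only makes the split repeatable

# Toy labeled data standing in for the Kaggle training file
labeled <- data.frame(id = 1:100, Survived = rep(c(0, 1), 50))

n.train   <- round(0.7 * nrow(labeled))
train.idx <- sample(seq_len(nrow(labeled)), size = n.train)

train.part <- labeled[train.idx, ]   # 70% used to fit the model
valid.part <- labeled[-train.idx, ]  # 30% held back to measure accuracy
```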
Why fill in with the mode? Why not just put NA for not available? (12:41)
Can anyone post the code in detail, step by step? I am having trouble combining the train and test data frames.
At 27:34, I don't know why my R shows:
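On mode versus NA: `randomForest()` uses `na.action = na.fail` by default, so it stops with an error as soon as a predictor contains NA. The blanks therefore need some value, and the most common category is one simple choice. A sketch on an invented column:

```r
# Toy Embarked column with two blanks (values invented for illustration)
embarked <- c("S", "C", "S", "", "Q", "S", "")

# Count each value, ignoring blanks, and pick the most frequent one
counts     <- table(embarked[embarked != ""])
mode.value <- names(counts)[which.max(counts)]

# Fill the blanks with that value
embarked[embarked == ""] <- mode.value
```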
"titanic.model <- randomForest(formula = Sur.formula, date = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
Error in eval(predvars, data, env) : cannot find 'Survived'
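In the call quoted above, the argument is spelled `date =` rather than `data =`. The data frame then never reaches the formula, and R looks for Survived in the workspace instead. The sketch below uses base R's `lm()` purely as a stand-in to reproduce the same kind of failure without extra packages; that `randomForest()` fails for the same reason is an assumption.

```r
# Toy data (values invented for illustration)
toy <- data.frame(Survived = c(0, 1, 1, 0), Fare = c(7.25, 71.3, 53.1, 8.05))

# Misspelled argument: the model function never receives the data frame
bad <- tryCatch(lm(Survived ~ Fare, date = toy),
                error = function(e) conditionMessage(e))
bad  # an "object not found" style message naming Survived

# Correct spelling
good <- lm(Survived ~ Fare, data = toy)
```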
Please share the R code as well.
I get a "Need at least two classes to do classification" error while running the model. Please tell me how to resolve it.
as.formula is not working
It's not allowing me to do rbind
rf_model <- randomForest(formula = Survived.Fml, data = ttrain, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(ttest))
Error in nrow(x) : argument "x" is missing, with no default
How to resolve this?
Nice video
Error in randomForest.default(m, y, …) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
Anyone? I am sure there are no NAs values in the variables selected…
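A commonly reported cause of this pair of messages is a predictor that is still plain text (character) rather than a factor or a number, so the data may contain no NA at all before the model is called. A quick check, sketched on toy data:

```r
# Toy data with two text columns (values invented for illustration)
toy <- data.frame(Sex = c("male", "female"), Age = c(22, 38),
                  Embarked = c("S", "C"), stringsAsFactors = FALSE)

# Which columns are still character?
char.cols <- names(toy)[sapply(toy, is.character)]
char.cols

# Cast them to factors before fitting the model
toy[char.cols] <- lapply(toy[char.cols], as.factor)
sapply(toy, class)
```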
Here is my code:
survived.equation <- "Survived ~ Pclass + Oclass + Sex + Age + SibSp + Parch + Fare + Embarked"
survived.formula <- as.formula(survived.equation)
model1 <-randomForest(formula = survived.formula, data = train_data, ntree = 500, mtry = 3, nodesize = 0.1 * nrow(train_data))
features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Oclass"
Survived <- predict(model1, newdata = test_data)
When I run this in Kaggle Notebooks, I just get a list of 0's and 1's, but it doesn't include the PassengerId.
What did I miss? How do I apply the feature to the predict line?
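`predict()` only returns the predicted classes, so the IDs have to be attached afterwards. The features string is not needed for that step, because the fitted model already carries its formula. A minimal sketch in which `test_data` and `Survived` are toy stand-ins for the real objects and the file name is hypothetical:

```r
# Toy stand-ins for the real test set and the predict() result
test_data <- data.frame(PassengerId = 892:894)
Survived  <- factor(c(0, 1, 0))

# Pair each prediction with its passenger, in the same row order
output.df <- data.frame(PassengerId = test_data$PassengerId,
                        Survived = Survived)

# Write a two-column CSV without row names
out.file <- file.path(tempdir(), "submission.csv")  # hypothetical file name
write.csv(output.df, file = out.file, row.names = FALSE)
```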
The github link is not working.
Does anybody have the code?
batman 15:21
Error in randomForest.default(m, y, …) :
NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
I get an error when rbinding titanic.test with train. Error in match.names (clabs, names(xi)) : Names do not match previous names? What do I do to fix this?
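`rbind()` needs both data frames to have exactly the same column names. In this tutorial that means the test set needs a Survived column, and IsTrainSet must be spelled identically in both. A sketch for finding the mismatch, using invented toy frames:

```r
# Toy frames: test lacks the Survived column (values invented for illustration)
train <- data.frame(PassengerId = 1:2, Survived = c(0, 1), Fare = c(7.25, 71.3))
test  <- data.frame(PassengerId = 3:4, Fare = c(8.05, 53.1))

# Columns present in one frame but not in the other
missing.in.test  <- setdiff(names(train), names(test))   # "Survived"
missing.in.train <- setdiff(names(test), names(train))   # character(0)

# Add the missing column, then the bind works
test$Survived <- NA
full <- rbind(train, test)
```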
at 31:45 for output.df$Survived <- Survived I keep getting the Error:
Error in `$<-.data.frame`(`*tmp*`, Survived, value = c(`892` = 1L, `893` = 1L, :
replacement has 418 rows, data has 0
Anyone else got this and can you help please?
Can you upload a step-by-step project with data cleaning and missing values in R?
I am not getting the final Survived values as a bunch of 0s and 1s; I am getting a bunch of probabilities instead. Any idea why?
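When Survived is still numeric at fitting time, `randomForest()` treats the task as regression, and `predict()` returns numbers between 0 and 1 instead of classes. Casting the label to a factor on the training rows before fitting switches it to classification. A sketch of the check on toy labels:

```r
train <- data.frame(Survived = c(0, 1, 1, 0))  # toy labels

class.before <- class(train$Survived)   # "numeric": would be fitted as regression
train$Survived <- as.factor(train$Survived)
class.after  <- class(train$Survived)   # "factor": fitted as classification

levels(train$Survived)                  # "0" "1"
```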
Take a shot every time he does that annoying click noise.
Thank you so much! It's exactly what i was looking for
Error in na.fail.default(list(Survived = c(1L, 2L, 2L, 2L, 1L, 1L, 1L, : missing values in object
Can someone tell me what I am missing?
#import the data
titanic.train <- read.csv("Titanic/train.csv", stringsAsFactors = FALSE)
titanic.test <- read.csv("Titanic/test.csv", stringsAsFactors = FALSE)
#tail(titanic.train)
#tail(titanic.test)
#str(titanic.train)
#str(titanic.test)
titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#ncol(titanic.train)
#ncol(titanic.test)
titanic.test$Survived <- NA
#names(titanic.train)
#names(titanic.test)
titanic.full <- rbind(titanic.train,titanic.test)
tail(titanic.full)
#tail(titanic.test)
table(titanic.full$IsTrainSet)
titanic.full[titanic.full$Embarked=='',"Embarked"] <- "S"
#tail(titanic.full)
fare.median <- median(titanic.full$Fare, na.rm = TRUE)
titanic.full[is.na(titanic.full$Fare), "Fare"] <- fare.median
age.median <- median(titanic.full$Age, na.rm = TRUE)
titanic.full[is.na(titanic.full$age), "Age"] <- age.median
titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)
titanic.train <- titanic.full[titanic.full$IsTrainSet==TRUE,]
titanic.test <- titanic.full[titanic.full$IsTrainSet==FALSE,]
titanic.train[!is.na(titanic.train)]
titanic.train$Survived <- as.factor(titanic.train$Survived)
table(titanic.train$Survived)
tail(titanic.train)
survived.equation = "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
survived.formula = as.formula(survived.equation)
#install.packages("randomForest")
library(randomForest)
table(titanic.train$Survived)
titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))
features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
Survived <- predict(titanic.model, newdata = titanic.test)
PassengerId <- titanic.test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived
tail(output.df)
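One likely culprit in the code above is the lowercase `age` in the Age fill. Column names are case sensitive, so `titanic.full$age` is NULL, no rows are selected, and the NAs in Age survive until randomForest rejects them. A sketch of the effect on toy values:

```r
toy <- data.frame(Age = c(22, NA, 38, NA))  # toy values

# Wrong case: toy$age is NULL, so nothing is selected and nothing is filled
toy[is.na(toy$age), "Age"] <- median(toy$Age, na.rm = TRUE)
still.missing <- sum(is.na(toy$Age))   # 2

# Correct case: both NAs are replaced by the median
toy[is.na(toy$Age), "Age"] <- median(toy$Age, na.rm = TRUE)
now.missing <- sum(is.na(toy$Age))     # 0

# Handy check before fitting: count the NAs in every column
colSums(is.na(toy))
```

Note also that the line `titanic.train[!is.na(titanic.train)]` only prints values; its result is not assigned, so it removes nothing.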
Hi, please help me. I am not able to download the data set.
Why were the 2 missing Embarked cells assigned only to S and not to C or Q?
Error in tail(titanic.train) : object 'titanic.train' not found.
I am getting that error.
Why is he going so fucking fast
That's a good video. However, I did not understand why we needed to convert the variables (Pclass, Sex, Embarked) into factors through as.factor. Also, why did we convert Survived into a factor only after splitting the two data sets? I would be grateful if someone could explain it in a bit more detail.
Many thanks 🙂
great!
titanic.combine <- rbind(titanic.train, titanic.testing)
Error in match.names(clabs, names(xi)) :
names do not match previous names
I'm getting this error.
Can anyone please help me out?
why particularly random forest? Why not any other classification technique?
Nice Video!
But, I get an Error for nodesize = 0.01 * nrow(iris.train) stating "Error in nrow(x) : argument "x" is missing, with no default"
Can anybody help me?
very well explained! Thank you so much..
Why are we using the median here and not the mean?
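The median is less sensitive to extreme values than the mean. A few very expensive tickets pull the mean fare upward, while the median stays near what a typical passenger paid. A sketch with invented fares:

```r
# Toy fares with one very expensive ticket (values invented for illustration)
fares <- c(7.25, 7.9, 8.05, 13, 26, 512.33)

fare.mean   <- mean(fares)    # pulled far upward by the single large value
fare.median <- median(fares)  # stays close to the typical fare
```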
Can you stop doing that annoying chup chup sound ??
Hi, why did we add 'S' only for the missing values? Also, by mistake I added a lowercase s, so the table now shows C Q s S
270 123 2 914. How can I remove the lowercase s now?
Hi, how does the R random forest classifier deal with data that mixes categorical and quantitative variables?
Also, any suggestions on dealing with the same problem if we want to use other classification algorithms?
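On both points: the counts quoted above show S is by far the most frequent port, which is why it is used as the fill value. The stray lowercase level can be removed by fixing the values as text and rebuilding the factor. A sketch on a toy column:

```r
# Toy column with a stray lowercase level (values invented for illustration)
embarked <- factor(c("C", "Q", "s", "S", "S"))
table(embarked)

# Convert to text, fix the value, and rebuild the factor so the old level disappears
embarked <- as.character(embarked)
embarked[embarked == "s"] <- "S"
embarked <- factor(embarked)

levels(embarked)  # "C" "Q" "S"
```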
Awesome tutorial. Thanks for sharing. 🙂
Hi… After using rbind() to create titanic.full, I'm getting this:
Error in titanic.full$IsTrainSet :
$ operator is invalid for atomic vectors
What should i do?
Does this video display as blurry for anyone else?
titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.test))
after this, " Error in eval(predvars, data, env) : object 'Sibsp' not found " is coming. Nor is it working with
titanic.formula <- randomForest(formula=survived.formula, data= titanic.train, ntree= 500, mtry= 3, nodesize= 0.01*nrow(titanic.train))
as someone has commented below. The same error is coming. Do you have any solution for this?
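R is case sensitive, and the Kaggle column is named `SibSp`, so a formula that says `Sibsp` refers to a variable that does not exist. A sketch for spotting such mismatches, with toy columns standing in for the real data:

```r
# Toy columns (values invented for illustration)
train <- data.frame(Survived = c(0, 1), SibSp = c(1, 0), Parch = c(0, 0))

# Formula with the misspelling from the error message
f <- as.formula("Survived ~ Sibsp + Parch")

# Variables named in the formula that are not columns of the data
not.found <- setdiff(all.vars(f), names(train))
not.found  # "Sibsp"
```

After correcting the string, `as.formula()` has to be run again so the model receives the fixed formula.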
titanic.train <- titanic.full[titanic.full$ISTrainSet==TRUE,]
titanic.test <- titanic.full[titanic.full$ISTrainSet==FALSE,]
Those 2 lines make me lose all my observations…
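The two lines above use `ISTrainSet` with a capital S after the I, while the column is created as `IsTrainSet` in the code posted earlier in this thread. A name that does not exist gives NULL, the comparison then selects nothing, and every row is dropped. A sketch on a toy frame:

```r
# Toy frame with the flag column spelled as in the tutorial
full <- data.frame(PassengerId = 1:4, IsTrainSet = c(TRUE, TRUE, FALSE, FALSE))

no.such.column <- is.null(full$ISTrainSet)               # TRUE
rows.wrong     <- nrow(full[full$ISTrainSet == TRUE, ])  # 0: every row is dropped
rows.right     <- nrow(full[full$IsTrainSet == TRUE, ])  # 2
```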
Hi
features.equation <- "Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
I could not see this features.equation being used anywhere.
Hi,
Around the 20:00 mark he introduced the concept of "Categorical Casting", where you convert only certain columns to factors.
My questions are:
1. What is "Categorical Casting"?
2. Why is it used only for certain columns and not all?
3. What will happen if we build our model without it?
Many thanks in advance.