UNIT-4

Statistical Testing
Statistical tests are mathematical tools for analysing quantitative data generated in a research study and making inferences. Here are the general steps involved in statistical testing:
Formulate Hypotheses:
Null Hypothesis (H0): This is a statement of no effect or no difference in the population or data samples.
Alternative Hypothesis (H1 or Ha): This is a statement that there is an effect or a difference in the population.
Select the Appropriate Test: Choose a statistical test based on the nature of your data and the type of comparison you are making (e.g., t-test, chi-square test, ANOVA, etc.).
Collect and Prepare Data: Ensure that your sample is representative and meets the assumptions of the chosen test. Clean and organize the data for analysis.
Calculate the Test Statistic: Compute the test statistic based on the formula associated with the chosen statistical test.
Determine the Critical Region: Identify the critical region or critical values for the test statistic based on the chosen significance level.
Make a Decision: Compare the calculated test statistic with the critical value(s). If the test statistic falls in the critical region, reject the null hypothesis. If it falls outside the critical region, fail to reject the null hypothesis.
Draw Conclusions: Based on your decision, draw conclusions about the null hypothesis.
Report Results: Clearly communicate the results of the statistical test, including the test statistic, p-value (if applicable), and any relevant confidence intervals.
Consider Limitations: Discuss any limitations or assumptions made during the analysis.
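These steps can be traced end to end in R. Below is a minimal sketch using simulated data and a one-sample t-test; the sample size, the hypothesized mean of 50 and the 0.05 significance level are illustrative choices, not part of any fixed recipe.
# A minimal sketch of the testing workflow (illustrative values)
set.seed(42)
scores <- rnorm(30, mean = 52, sd = 5)    # collect and prepare data

# H0: population mean = 50; H1: population mean != 50
result <- t.test(scores, mu = 50)         # calculate the test statistic

result$statistic                          # the computed t-value
result$p.value                            # probability of the data under H0

# Make a decision at the 0.05 significance level
if (result$p.value < 0.05) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}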
Statistical Modelling
Statistical modelling is a powerful technique used in data analysis to uncover patterns, relationships, and trends within datasets. By applying statistical methods and models, researchers and analysts can gain insights, make predictions, and support decision-making processes. Key steps are:
Data Collection and Preparation - Gather and preprocess the data.
Model Selection - Choose an appropriate statistical model.
Model Fitting - Use techniques like least squares estimation or Bayesian inference to estimate the parameters.
Model Evaluation - Assess the performance using metrics such as goodness-of-fit measures, prediction accuracy and diagnostic tests.
Model Interpretation - Interpret the results and draw conclusions.
Eg., linear regression, logistic regression, time series analysis.
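As a concrete illustration of these steps, here is a minimal linear regression sketch. It uses the built-in mtcars dataset (also used later in this unit); the choice of mpg and wt as variables is purely illustrative.
# A minimal sketch of the modelling steps using linear regression
data(mtcars)                      # data collection: built-in dataset

# Model selection and fitting: mpg as a linear function of weight
fit <- lm(mpg ~ wt, data = mtcars)

# Model evaluation: goodness of fit and diagnostics
summary(fit)$r.squared            # proportion of variance explained
# plot(fit)                       # residual and leverage diagnostics

# Model interpretation: estimated intercept and slope
coef(fit)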
Sampling Distributions in R
A sampling distribution is the distribution of a statistic obtained through repeated sampling from a larger population. It describes the range of possible outcomes of a statistic, such as the mean or mode of some variable, as it truly exists in the population. The majority of data analyzed by researchers are actually drawn from samples (a part of the pool of data), and not populations (the entire pool of data).
Steps to Calculate Sampling Distributions in R:
Step 1: First we define the number of samples (n = 1000).
n <- 1000
Step 2: Next we create a vector (sample_means) of length n filled with NA values. The rep() function is used to replicate values in a vector.
Syntax: rep(value_to_be_replicated, number_of_times)
Step 3: We then fill the sample_means vector with sample means from the considered population using the mean() function: each entry is the mean of 20 observations generated by rnorm() with mean 10 and standard deviation 10. rnorm() is used to generate normally distributed values.
Syntax: mean(x, trim = 0)
Syntax: rnorm(n, mean, sd)
Step 4: To check the created samples we use head(), which returns the first six elements of an object (data frame, vector, list, etc.).
Syntax: head(data_frame, no_of_rows_to_be_returned)  # By default the second argument is set to 6 in R
Step 5: To visualize the sample_means data we plot a histogram (for better visualization) using the hist() function in R.
Syntax: hist(v, main, xlab, ylab, col)
Step 6: Finally we find the proportion of generated sample means that are greater than or equal to 10.
Example:
# define number of samples
n <- 1000
# create empty vector of length n
sample_means <- rep(NA, n)
# fill the empty vector with sample means
for(i in 1:n){
  sample_means[i] <- mean(rnorm(20, mean = 10, sd = 10))
}
head(sample_means)
# create histogram to visualize
hist(sample_means, main = "Sampling Distribution",
     xlab = "Sample Means", ylab = "Frequency", col = "blue")
# To cross-check, find mean and sd of the sample means
mean(sample_means)
sd(sample_means)
# To find the probability
sum(sample_means >= 10) / length(sample_means)
Hypothesis Testing
As we might know, when we infer something from data, we make an inference based on a collection of samples rather than the true population. The main question that arises is: can we trust the result from our data to make a general assumption about the population? This is the main goal of hypothesis testing. There are several steps we should follow to properly conduct hypothesis testing. The four key steps involved are:
State the Hypotheses: form the null hypothesis and the alternative hypothesis.
Null Hypothesis (H0): a statement of no effect or no difference in the population or data samples.
Alternative Hypothesis (H1 or Ha): a statement that there is an effect or a difference in the population.
Formulate an analysis plan and set the criteria for the decision (set the significance level). The significance level varies depending on the use case, but the default value is 0.05.
Calculate the test statistic and p-value: perform a statistical test that suits the data. The resulting probability is known as the p-value.
Check the resulting p-value and make a decision. If the p-value is smaller than the significance level, we reject the null hypothesis in favour of the alternative hypothesis. If the p-value is higher than the significance level, we fail to reject the null hypothesis.
One Sample T-Testing
The one sample t-test is used to test whether the mean of a sample differs from a hypothesized population mean.
Syntax: t.test(x, mu)
Parameters:
x: represents a numeric vector of data
mu: represents the true (hypothesized) value of the mean
Example:
# Defining sample vector
x <- rnorm(100)
# One Sample T-Test
t.test(x, mu = 5)
Output:
One Sample t-test
data: x
t = -49.504, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
-0.1910645 0.2090349
sample estimates:
mean of x
0.008985172
(Since x is generated randomly without a seed, your exact numbers will differ.)
Data: the dataset x was used for the test. The determined t-value is -49.504.
Degrees of Freedom (df): the t-test has 99 degrees of freedom.
The p-value is less than 2.2e-16, which indicates that there is substantial evidence against the null hypothesis.
Alternative hypothesis: the true mean is not equal to five, according to the alternative hypothesis.
95 percent confidence interval: (-0.1910645, 0.2090349) is the confidence interval's value. This range denotes the values that, with 95% confidence, contain the true population mean.
Two Sample T-Testing
In two sample t-testing, two sample vectors are compared. If var.equal = TRUE, the test assumes that the variances of both samples are equal; otherwise the Welch approximation is used.
Syntax: t.test(x, y)
Parameters:
x and y: numeric vectors
Example:
# Defining sample vectors
x <- rnorm(100)
y <- rnorm(100)
# Two Sample T-Test
t.test(x, y)
Output:
Welch Two Sample t-test
data: x and y
t = -1.0601, df = 197.86, p-value = 0.2904
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4362140 0.1311918
sample estimates:
mean of x mean of y
-0.05075633 0.10175478
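The output above is the Welch test, which t.test() performs by default. To request the classic pooled-variance (Student) test mentioned above, pass var.equal = TRUE; a minimal sketch (the seed is added only so the run is reproducible):
# A minimal sketch: two-sample t-test assuming equal variances
set.seed(1)
x <- rnorm(100)
y <- rnorm(100)
t.test(x, y, var.equal = TRUE)   # pooled-variance (Student) t-test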
Wilcoxon Signed-Rank Test in R
This test can be divided into two parts:
One-Sample Wilcoxon Signed Rank Test
Paired Samples Wilcoxon Test
One-Sample Wilcoxon Signed Rank Test
The one-sample Wilcoxon signed-rank test is a non-parametric alternative to a one-sample t-test when the data cannot be assumed to be normally distributed. It is used to determine whether the median of the sample is equal to a known standard value, i.e. a theoretical value.
Syntax: wilcox.test(x, mu = 0, alternative = "two.sided")
Parameters:
x: a numeric vector containing your data values
mu: the theoretical mean/median value. Default is 0 but you can change it.
alternative: the alternative hypothesis. Allowed value is one of "two.sided" (default), "greater" or "less".
Example:
# R program to illustrate
# one-sample Wilcoxon signed-rank test

# The data set
set.seed(1234)
myData <- data.frame(
  name = paste0(rep("R_", 10), 1:10),
  weight = round(rnorm(10, 30, 2), 1)
)

# One-sample Wilcoxon test
result <- wilcox.test(myData$weight, mu = 25, alternative = "less")

# Printing the results
print(result)
Output:
Wilcoxon signed rank exact test
data: myData$weight
V = 55, p-value = 1
alternative hypothesis: true location is less than 25
Paired Samples Wilcoxon Test in R
The paired samples Wilcoxon test is a non-parametric alternative to the paired t-test used to compare paired data. It is used when the data are not normally distributed.
Syntax: wilcox.test(x, y, paired = TRUE, alternative = "two.sided")
Parameters:
x, y: numeric vectors
paired: a logical value specifying that we want to compute a paired Wilcoxon test
alternative: the alternative hypothesis. Allowed value is one of "two.sided" (default), "greater" or "less".
Example:
# R program to illustrate
# Paired Samples Wilcoxon Test

# The data set
# Weight of the rabbit before treatment
before <- c(190.1, 190.9, 172.7, 213, 231.4, 196.9, 172.2, 285.5, 225.2, 113.7)
# Weight of the rabbit after treatment
after <- c(392.9, 313.2, 345.1, 393, 434, 227.9, 422, 383.9, 392.3, 352.2)

# Create a data frame
myData <- data.frame(
  group = rep(c("before", "after"), each = 10),
  weight = c(before, after)
)

# Paired Samples Wilcoxon Test
result <- wilcox.test(weight ~ group, data = myData,
                      paired = TRUE, alternative = "less")

# Printing the results
print(result)
Output:
Wilcoxon signed rank test
data: weight by group
V = 55, p-value = 1
alternative hypothesis: true location shift is less than 0
Paired t-test
The paired t-test is used to check whether there is a significant difference between two population means when the data are in the form of matched pairs.
Syntax: t.test(x, y, paired = TRUE, alternative = "two.sided")
where
x, y: numeric vectors
paired: a logical value specifying that we want to compute a paired t-test
alternative: the alternative hypothesis. Allowed value is one of "two.sided" (default), "greater" or "less".
Example:
# Define the datasets
before <- c(39, 43, 41, 32, 37, 40, 42, 40, 37, 38)
after <- c(42, 45, 42, 43, 40, 44, 40, 43, 41, 40)
# Perform the paired t-test
t.test(x = before, y = after, paired = TRUE, alternative = "greater")
Output:
Paired t-test
data: before and after
t = -2.9876, df = 9, p-value = 0.9924
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-5.002085 Inf
sample estimates:
mean of the differences
-3.1
Chi-Square test
The Chi-Square test is a statistical method to determine if two categorical variables have a significant association between them. Both variables should be from the same population and they should be categorical, like Yes/No, Male/Female, Red/Green etc.
Syntax: chisq.test(data)
Parameters:
data: a table containing count values of the variables.
Example:
# Load the library.
library("MASS")
# Create a table with the needed variables from the Cars93 data set.
car.data <- table(Cars93$AirBags, Cars93$Type)
print(car.data)
# Perform the Chi-Square test.
print(chisq.test(car.data))
Output:
                   Compact Large Midsize Small Sporty Van
Driver & Passenger       2     4       7     0      3   0
Driver only              9     7      11     5      8   3
None                     5     0       4    16      3   6

Pearson's Chi-squared test
data: car.data
X-squared = 33.001, df = 10, p-value = 0.0002723
Warning message:
In chisq.test(car.data) : Chi-squared approximation may be incorrect
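The warning appears because the chi-squared approximation becomes unreliable when expected cell counts are small. The expected counts can be inspected directly; a minimal sketch (reusing the car.data table from the example above):
# A minimal sketch: inspect the expected counts behind the warning
expected <- suppressWarnings(chisq.test(car.data)$expected)
round(expected, 2)   # several cells fall below 5, hence the warning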
Advantages of Hypothesis Testing:
Objectivity: Hypothesis testing provides a structured and objective approach to decision-making in statistical analysis.
Inference: Hypothesis testing enables researchers to make inferences about population parameters based on sample data.
Standardization: The use of standardized procedures in hypothesis testing allows for consistency across different studies and ensures that statistical analyses are conducted in a systematic manner.
Decision-Making: Hypothesis testing provides a clear framework for decision-making.
Scientific Rigor: By setting up null and alternative hypotheses and applying statistical tests, hypothesis testing contributes to the scientific rigor of research.
Disadvantages of Hypothesis Testing:
Assumptions: Many hypothesis tests rely on assumptions about the data. If these assumptions are violated, the results may be unreliable.
Sensitivity to Sample Size: Small sample sizes can lead to less reliable results. The power of a test (the ability to detect a true effect) increases with larger sample sizes, and small samples may fail to detect real differences.
Risk of Errors: Hypothesis testing carries the risk of Type I and Type II errors; the balance between these errors depends on the chosen significance level and statistical power.
Limited Scope: Hypothesis testing typically focuses on specific hypotheses and may not provide a complete picture of the data.
Proportion Test
Proportion testing is commonly used to analyze categorical data, especially when working with binary outcomes or proportions.
Syntax: prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, correct = TRUE)
where
x -> a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively.
n -> a vector of counts of trials; ignored if x is a matrix or a table.
p -> a vector of probabilities of success. The length of p must be the same as the number of groups specified by x, and its elements must be greater than 0 and less than 1.
alternative -> a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter. Only used for testing the null that a single proportion equals a given value, or that two proportions are equal; ignored otherwise.
conf.level -> confidence level of the returned confidence interval. Must be a single number between 0 and 1. Only used when testing the null that a single proportion equals a given value, or that two proportions are equal; ignored otherwise.
correct -> a logical indicating whether Yates' continuity correction should be applied where possible.
Example:
smokers <- c(83, 90, 129, 70)
patients <- c(86, 93, 136, 82)
prop.test(smokers, patients)
Output:
4-sample test for equality of proportions without continuity correction
data: smokers out of patients
X-squared = 12.6, df = 3, p-value = 0.005585
alternative hypothesis: two.sided
sample estimates:
prop 1    prop 2    prop 3    prop 4
0.9651163 0.9677419 0.9485294 0.8536585
One-Proportion Z-Test in R Programming
The one-proportion z-test is used to compare an observed proportion to a theoretical one when there are only two categories. For example, we have a population of mice containing half males and half females (p = 0.5 = 50%). Some of these mice (n = 160) have developed spontaneous cancer, including 95 males and 65 females. We want to know whether cancer affects more males than females. So in this problem:
The number of successes (males with cancer) is 95
The observed proportion (po) of males is 95/160
The observed proportion of females is 1 - po
The expected proportion (pe) of males is 0.5 (50%)
The number of observations (n) is 160
The Formula for One-Proportion Z-Test
The test statistic (also known as the z-test) can be calculated as follows:
z = (po - pe) / sqrt(pe(1 - pe) / n)
where,
po: the observed proportion
pe: the expected proportion under the null hypothesis
n: the sample size
Implementation in R
In R, the functions used for performing a one-proportion z-test are binom.test() and prop.test().
Syntax:
binom.test(x, n, p = 0.5, alternative = "two.sided")
prop.test(x, n, p = NULL, alternative = "two.sided", correct = TRUE)
Parameters:
x = number of successes (or a vector giving the numbers of successes and failures) in the data set.
n = size of the data set.
p = probability of success. It must be in the range of 0 to 1.
alternative = a character string specifying the alternative hypothesis.
correct = a logical indicating whether Yates' continuity correction should be applied where possible.
Example:
# Using prop.test()
prop.test(x = 95, n = 160, p = 0.8, correct = FALSE)
Output:
1-sample proportions test without continuity correction
data: 95 out of 160, null probability 0.8
X-squared = 42.539, df = 1, p-value = 6.928e-11
alternative hypothesis: true p is not equal to 0.8
95 percent confidence interval:
0.5163169 0.6667870
sample estimates:
p
0.59375
It returns the p-value (6.928e-11), the alternative hypothesis, a 95 percent confidence interval and the estimated probability of success (0.59375).
# Using binom.test()
binom.test(x = 25, n = 100, p = 0.15)
Output:
Exact binomial test
data: 25 and 100
number of successes = 25, number of trials = 100, p-value = 0.007633
alternative hypothesis: true probability of success is not equal to 0.15
95 percent confidence interval:
0.1687797 0.3465525
sample estimates:
probability of success
0.25
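For intuition, the X-squared value reported by prop.test() above can be reproduced from the z formula given earlier, since X-squared equals z squared when no continuity correction is applied. A minimal sketch using the same numbers as the prop.test() example:
# A minimal sketch: the one-proportion z statistic by hand
po <- 95 / 160                        # observed proportion
pe <- 0.8                             # expected proportion under H0
n  <- 160
z  <- (po - pe) / sqrt(pe * (1 - pe) / n)
z^2   # 42.539, matching X-squared from prop.test(..., correct = FALSE)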
Two-Proportions Z-Test in R Programming
A two-proportion z-test allows us to compare two proportions to see if they are the same. For example, let there be two groups of individuals:
Group A, with lung cancer: n = 500
Group B, healthy individuals: n = 500
The number of smokers in each group is as follows:
Group A, with lung cancer: n = 500, 490 smokers, pA = 490/500 = 0.98
Group B, healthy individuals: n = 500, 400 smokers, pB = 400/500 = 0.80
In this setting:
The overall proportion of smokers is p = (490 + 400) / (500 + 500) = 0.89
The overall proportion of non-smokers is q = 1 - p = 0.11
So we want to know whether the proportions of smokers are the same in the two groups of individuals.
The Formula for Two-Proportion Z-Test
The test statistic (also known as the z-test) can be calculated as follows:
z = (pA - pB) / sqrt(p * q * (1/nA + 1/nB))
where,
pA: the proportion observed in group A with size nA
pB: the proportion observed in group B with size nB
p and q: the overall proportions
In R, the function used for performing the z-test is prop.test().
Syntax: prop.test(x, n, p = NULL, alternative = c("two.sided", "less", "greater"), correct = TRUE)
Parameters:
x = number of successes in each group.
n = sizes of the data sets.
p = probabilities of success. It must be in the range of 0 to 1.
alternative = a character string specifying the alternative hypothesis.
correct = a logical indicating whether Yates' continuity correction should be applied where possible.
Example:
# prop Test in R
prop.test(x = c(342, 290), n = c(400, 400))
Output:
2-sample test for equality of proportions with continuity correction
data: c(342, 290) out of c(400, 400)
X-squared = 19.598, df = 1, p-value = 9.559e-06
alternative hypothesis: two.sided
95 percent confidence interval:
0.07177443 0.18822557
sample estimates:
prop 1 prop 2
0.855 0.725
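Again, the formula can be checked by hand. Note that the output above includes Yates' continuity correction (the default), so z squared matches X-squared only when correct = FALSE is passed. A minimal sketch with the same counts:
# A minimal sketch: the pooled two-proportion z statistic by hand
xA <- 342; nA <- 400
xB <- 290; nB <- 400
pA <- xA / nA
pB <- xB / nB
p  <- (xA + xB) / (nA + nB)   # pooled proportion
q  <- 1 - p
z  <- (pA - pB) / sqrt(p * q * (1/nA + 1/nB))
z^2   # matches X-squared from prop.test(..., correct = FALSE)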
Errors in Hypothesis Testing
Errors in hypothesis testing concern the mistaken acceptance or rejection of a particular hypothesis. There are two types of errors.
Type I Error: A Type I error occurs when the null hypothesis (H0) of an experiment is true, but it is still rejected. It is stating that something is present when it is not: a false hit. A Type I error is often called a false positive (an event that indicates a given condition is present when it is absent). It is denoted by alpha (α).
Type II Error: A Type II error occurs when the null hypothesis is false but mistakenly fails to be rejected. It is failing to state what is present: a miss. A Type II error is also known as a false negative (where a real hit is observed by the test as a miss), in an experiment checking for a condition with a final outcome of true or false. A Type II error occurs when a true alternative hypothesis is not acknowledged. It is denoted by beta (β).
Type I and Type II Errors Example
Example 1: Let us consider a null hypothesis - a man is not guilty of a crime. Then in this case:
Type I error (False Positive): he is convicted of the crime even though he is not guilty and did not commit it.
Type II error (False Negative): he is found not guilty even though he actually did commit the crime, and the court lets the guilty one go free.
Example for Type I error in R
# Set the parameters
alpha <- 0.05
sample_size <- 30
num_simulations <- 10000
# Set the seed for reproducibility
set.seed(123)
# Initialize the counter for false positives
false_positives <- 0
# Perform the simulations
for (i in 1:num_simulations) {
  # Generate two samples from the same normal
  # distribution (null hypothesis is true)
  sample1 <- rnorm(sample_size, mean = 0, sd = 1)
  sample2 <- rnorm(sample_size, mean = 0, sd = 1)
  # Conduct a t-test
  test_result <- t.test(sample1, sample2)
  # Check if the p-value is less than the alpha level
  if (test_result$p.value < alpha) {
    false_positives <- false_positives + 1
  }
}
# Calculate the Type I error rate
type1_error_rate <- false_positives / num_simulations
# Print the Type I error rate
cat("Type I Error Rate:", type1_error_rate)
Output:
> # Print the Type I error rate
> cat("Type I Error Rate:", type1_error_rate)
Type I Error Rate: 0.0481
Example for Type II error in R
# Install and load required packages
if (!require(pwr)) install.packages("pwr")
library(pwr)
# Parameters
effect_size <- 0.5   # The difference between null and alternative hypotheses
sample_size <- 100   # The number of observations in each group
sd <- 15             # The standard deviation
alpha <- 0.05        # The significance level
# Calculate Type II Error
pwr_result <- pwr.t.test(n = sample_size, d = effect_size / sd,
                         sig.level = alpha, type = "two.sample",
                         alternative = "two.sided")
type_II_error <- 1 - pwr_result$power
# Print Type II Error
print(type_II_error)
Output:
> # Print Type II Error
> print(type_II_error)
[1] 0.9436737
Power of a Hypothesis Test
The probability of not committing a Type II error is called the power of a hypothesis test.
Factors That Affect Power
The power of a hypothesis test is affected by these factors:
Effect size (ES): the difference between the hypothesized value of a parameter and its true value. A larger effect size increases statistical power.
Sample size (n): other things being equal, the greater the sample size, the greater the power of the test.
Significance level (α): the lower the significance level, the lower the power of the test. If you reduce the significance level (e.g., from 0.05 to 0.01), the region of acceptance gets bigger. As a result, you are less likely to reject the null hypothesis when it is false, so you are more likely to make a Type II error. In short, the power of the test is reduced when you reduce the significance level, and vice versa.
The "true" value of the parameter being tested: the greater the difference between the "true" value of a parameter and the value specified in the null hypothesis, the greater the power of the test. That is, the greater the effect size, the greater the power of the test.
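The sample-size effect can be seen directly with base R's power.t.test(); the effect size, standard deviation and significance level below are illustrative choices:
# A minimal sketch: power grows with sample size (illustrative values)
power.t.test(n = 20,  delta = 0.5, sd = 1, sig.level = 0.05)$power   # smaller n, lower power
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)$power   # larger n, higher power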
Analysis of Variance (ANOVA)
ANOVA, also known as Analysis of Variance, is used to investigate relations between categorical variables and continuous variables in the R programming language. It is a type of hypothesis testing for population variance. An ANOVA test involves setting up:
Null Hypothesis: the default assumption, or null hypothesis, is that there is no meaningful relationship or impact between the variables. The null hypothesis is commonly written as H0.
Alternate Hypothesis: the opposite of the null hypothesis. It implies that there is a significant relationship, difference, or link among the population's variables. Alternative hypotheses are sometimes referred to as H1 or HA.
Syntax in R:
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)
Arguments:
formula - a formula specifying the model.
data - a data frame in which the variables specified in the formula will be found. If missing, the variables are searched for in the standard way.
projections - logical flag: should the projections be returned?
qr - logical flag: should the QR decomposition be returned?
contrasts - a list of contrasts to be used for some of the factors in the formula. These are not used for any Error term, and supplying contrasts for factors only in the Error term will give a warning.
... - arguments to be passed to lm, such as subset or na.action.
One-way ANOVA:
A one-way ANOVA is employed when there is a single categorical independent variable (also known as a factor) and a single continuous dependent variable. It seeks to ascertain whether there are any notable variations in the dependent variable's means across the levels of the independent variable. E.g., in a one-way ANOVA we might test the effects of 3 types of fertilizer on crop yield.
Below, a one-way ANOVA test is performed using the built-in mtcars dataset, between disp, a continuous attribute, and gear, a categorical attribute. Here are the steps.
Setup the Null Hypothesis and Alternate Hypothesis:
H0: μ1 = μ2 = μ3 (there is no difference between the average displacement for different gears)
H1: not all means are equal
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
head(mtcars)
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
Output:
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Here we print the first six records of our dataset to get an idea about the data.
                     Df Sum Sq Mean Sq F value   Pr(>F)
factor(mtcars$gear)   2 280221  140110   20.73 2.56e-06 ***
Residuals            29 195964    6757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Df: the model's degrees of freedom.
Sum Sq: the sums of squares, which represent the variability that the model is able to account for.
Mean Sq: the mean squares, i.e. the variance explained by each component.
F value: the measure used to compare the mean squares both within and between groups.
Pr(>F): the p-value of the F statistic, which denotes the factor's statistical significance.
Residuals: deviations from the group means, along with their summary statistics.
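To see what the significant F value is comparing, the group means themselves can be printed; a minimal sketch:
# A minimal sketch: the per-gear mean displacements compared by the ANOVA
aggregate(disp ~ gear, data = mtcars, FUN = mean)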
Two-way ANOVA:
When there are two categorical independent variables (factors) and one continuous dependent variable, two-way ANOVA is used as an extension of one-way ANOVA. You can evaluate both the direct impacts of each independent variable and how they interact with one another on the dependent variable. E.g., in a two-way ANOVA we add an additional independent variable, planting density: we test the effects of 3 types of fertilizer and 2 different planting densities on crop yield.
Below, a two-way ANOVA test is performed using the mtcars dataset, between disp, a continuous attribute, and two categorical attributes, gear and am.
Setup the Null Hypothesis and Alternate Hypothesis:
H0: μ1 = μ2 = μ3 (there is no difference between the average displacement for different gears)
H1: not all means are equal
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# histogram() is assumed here to come from the lattice package
library(lattice)
# Variance in mean within group and between group
histogram(mtcars$disp ~ mtcars$gear, subset = (mtcars$am == 0),
          xlab = "gear", ylab = "disp", main = "Automatic")
histogram(mtcars$disp ~ mtcars$gear, subset = (mtcars$am == 1),
          xlab = "gear", ylab = "disp", main = "Manual")
Output:
[Histograms of disp by gear, one panel for automatic cars and one for manual cars]
The histograms show the distribution of displacement for each gear value, separately for automatic and manual cars. Here the categorical variables are gear and am, on which the factor() function is used, and the continuous variable is disp.
Calculate the test statistics using the aov() function:
mtcars_aov2 <- aov(mtcars$disp ~ factor(mtcars$gear) * factor(mtcars$am))
summary(mtcars_aov2)
Output:
                     Df Sum Sq Mean Sq F value   Pr(>F)
factor(mtcars$gear)   2 280221  140110  20.695 3.03e-06 ***
factor(mtcars$am)     1   6399    6399   0.945    0.339
Residuals            28 189565    6770
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(No separate interaction row appears here: gear and am overlap so strongly in mtcars that the interaction term has no remaining degrees of freedom and is dropped from the table.)