
Data Science Tutorial | Introduction To Data Science | Data Science Training | Edureka

This Edureka Data Science tutorial will help you understand the ins and outs of Data Science with examples. It is suitable both for beginners and for professionals who want to learn or brush up on Data Science concepts. The tutorial covers the following topics:
1. Why Data Science?
2. What is Data Science?
3. Who is a Data Scientist?
4. How a Problem is Solved in Data Science?
5. Data Science Components

EdurekaIN

Presentation Transcript


  1. Edureka’s Data Science Certification Training www.edureka.co/data-science

  2. Agenda for Today’s Session
     ➢ Why Data Science?
     ➢ What is Data Science?
     ➢ Who is a Data Scientist?
     ➢ How a Problem is Solved in Data Science?
     ➢ Data Science Components
     ➢ Demo

  3. Why Data Science?

  4. Why Data Science? Data is the most abundant resource today. We have data about almost everything, and its volume multiplies every day. [Figure: the increase in data over time]

  5. What is Data Science?

  6. What is Data Science? Also called data-driven science, it is an interdisciplinary field of scientific methods, processes and systems for extracting knowledge or insights from data in various forms, either structured or unstructured. A question that data scientists are often asked is “Tell us something that we don’t know.” It involves: Programming + Statistics + Business

  7. Who is a Data Scientist?

  8. Who is a Data Scientist? [Figure: diagram placing the DATA SCIENTIST at the overlap of MATHS, BUSINESS and INFORMATION SYSTEMS]
     ➢ Maths: Statistics, Discrete Maths, Information Theory, Combinatorics, Decision Theory
     ➢ Business: Economics, Finance, Marketing, Operations, Management
     ➢ Information Systems: Computer Science, BI Developers, Data Analysis, Software Engineering, Systems Development
     ➢ Roles in the overlaps: Machine Learning, Data Viz Builders, Econometricians, Management Scientists, Statistical Programmers, Actuaries

  9. Role Of A Data Scientist The Data Scientist is responsible for designing and creating processes and layouts for complex, large-scale data sets used for modeling, data mining and research.
     Responsibilities:
     ➢ Selecting features, and building and optimizing classifiers using machine learning techniques.
     ➢ Data mining using state-of-the-art methods.
     ➢ Extending the company’s data with third-party sources of information when needed.
     ➢ Processing, cleansing and verifying the integrity of data for analysis.
     ➢ Building predictive models using Machine Learning algorithms.

  10. How a problem is solved in Data Science?

  11. Problem Solving in Data Science

  12. Problem Solving in Data Science
     Stages: Discovery, Data Preparation, Model Planning, Model Building, Operationalize, Communicate Results
     ➢ Discovery involves acquiring data from all the identified internal and external sources that can help answer the business question.
     ➢ This data could be:
        • logs from web servers
        • social media data
        • census datasets
        • data streamed from online sources via APIs

  13. Problem Solving in Data Science The doctor gets this data from the medical history of the patient.
     Attributes:
     • npreg – Number of times pregnant
     • glucose – Plasma glucose concentration
     • bp – Blood pressure
     • skin – Triceps skinfold thickness
     • bmi – Body mass index
     • ped – Diabetes pedigree function
     • age – Age
     • income – Income
     Income is an irrelevant attribute in the prediction of diabetes.

  14. Problem Solving in Data Science: Data Preparation
     ➢ The data can have many inconsistencies, such as missing values, blank columns, abrupt values and incorrect data formats, which need to be cleaned.
     ➢ The data must be explored, preprocessed and conditioned prior to modeling.
     ➢ This helps you spot outliers and establish relationships between the variables.

  15. Problem Solving in Data Science This data has a lot of anomalies and needs cleansing before further analysis can be done.

  16. Problem Solving in Data Science We clean and preprocess this data by removing the outliers, filling in the null values and normalizing the data types.
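
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the method used in the presentation: the records, the `clean` helper, the age cut-off of 120 and the choice of median filling are all assumptions made for the example.

```python
from statistics import median

# Hypothetical patient records. None marks a missing value, and one
# "bmi" value arrives as text to mimic an incorrect data format.
records = [
    {"glucose": 148, "bmi": "33.6", "age": 50},
    {"glucose": None, "bmi": 26.6, "age": 31},
    {"glucose": 183, "bmi": 23.3, "age": 32},
    {"glucose": 89, "bmi": 28.1, "age": 210},  # implausible age
]


def clean(rows, max_age=120):
    """Normalize types, drop rows with implausible ages, fill missing values."""
    # 1. Normalize the data type: every present value becomes a float.
    typed = [
        {k: (float(v) if v is not None else None) for k, v in row.items()}
        for row in rows
    ]
    # 2. Remove outliers with a simple range rule (an assumed rule).
    kept = [r for r in typed if r["age"] is None or r["age"] <= max_age]
    # 3. Fill null values with the column median of the remaining rows.
    for col in ("glucose", "bmi", "age"):
        present = [r[col] for r in kept if r[col] is not None]
        fill = median(present)
        for r in kept:
            if r[col] is None:
                r[col] = fill
    return kept


cleaned = clean(records)
for row in cleaned:
    print(row)
```

Median filling is only one option; the mean or a domain-specific default could be swapped in without changing the structure.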

  17. Problem Solving in Data Science: Model Planning
     ➢ Here, we determine the methods and techniques to draw out the relationships between variables.
     ➢ Apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.

  18. Problem Solving in Data Science Use visualization techniques such as histograms, line graphs and box plots to get a fair idea of the distribution of the data.
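
As a rough illustration of what a histogram and a box plot summarize, the sketch below computes bin counts and a five-number summary with the Python standard library. The glucose readings and both helper functions are hypothetical, and a plotting library would normally draw the charts themselves.

```python
from statistics import quantiles

# Hypothetical glucose readings used only for illustration.
glucose = [89, 78, 197, 166, 118, 103, 115, 126, 143, 125, 97, 145]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


print(five_number_summary(glucose))
for start, count in histogram(glucose, 25).items():
    # A text bar per bin gives a quick view of the distribution.
    print(f"{start:>3}-{start + 24:<3} {'#' * count}")
```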

  19. Problem Solving in Data Science: Model Building
     ➢ Develop datasets for training and testing purposes.
     ➢ Consider whether existing tools will suffice for running the models.
     ➢ Analyze various learning techniques, such as classification, association and clustering, to build the model.
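
Developing training and testing datasets usually means shuffling the records and holding some of them back. A minimal sketch, assuming a 75/25 split and a hypothetical `train_test_split` helper written for this example (not a library function):

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold back a fraction for testing."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 patient records
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

The model is fitted on `train` only, so `test` gives an estimate of how it behaves on data it has not seen.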

  20. Problem Solving in Data Science This is a decision tree based on different attributes.

  21. Problem Solving in Data Science: Operationalize
     ➢ Deliver final reports, briefings, code and technical documents.
     ➢ Implement a pilot project in a real-time production environment.
     ➢ Look for performance constraints, if any.

  22. Problem Solving in Data Science: Communicate Results
     ➢ Identify all the key findings and communicate them to the stakeholders.
     ➢ Explain the model and results to the medical authorities.
     ➢ Determine whether the results of the project are a success or a failure based on the criteria developed.

  23. Problem Solving in Data Science
     ➢ Diabetes Positive set:
        • glucose > 154
        • glucose > 127 & <= 154 + bmi > 30.9
        • glucose <= 127 + pregnant > 5
        • glucose <= 127 + pregnant <= 5 + age > 28
        • glucose <= 127 + pregnant <= 5 + age <= 28 + bmi > 30.9
     ➢ Diabetes Negative set:
        • glucose > 154
        • glucose > 127 & <= 154 + bmi <= 30.9
        • glucose <= 127 + pregnant <= 5 + age <= 28 + bmi <= 30.9
     ➢ We can use this decision tree result to know whether the patient is vulnerable to diabetes or not.
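
The positive-set rules on this slide translate directly into nested conditions. The sketch below is one possible reading: the function name is hypothetical, and because the slide lists "glucose > 154" under both sets, the code assumes the positive rules take precedence and treats every remaining case as negative.

```python
def predicts_diabetes(glucose, bmi, pregnant, age):
    """Apply the slide's positive-set rules; anything else is negative."""
    if glucose > 154:
        return True
    if glucose > 127:          # 127 < glucose <= 154
        return bmi > 30.9
    if pregnant > 5:           # glucose <= 127 from here on
        return True
    if age > 28:
        return True
    return bmi > 30.9          # pregnant <= 5 and age <= 28


# Hypothetical patients used only to exercise each branch.
print(predicts_diabetes(glucose=160, bmi=25.0, pregnant=0, age=22))
print(predicts_diabetes(glucose=140, bmi=28.0, pregnant=1, age=40))
print(predicts_diabetes(glucose=100, bmi=25.0, pregnant=2, age=25))
```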

  24. How to choose Algorithms in Data Science?

  25. Problem Solving in Data Science We take a top-down approach to answer this. These are the five questions that can be answered in data science:
     Q1. Is this A or B? Classification algorithms
     Q2. Is this weird? Anomaly detection algorithms
     Q3. How much or how many? Regression algorithms
     Q4. How is this organized? Clustering algorithms
     Q5. What should I do next? Reinforcement learning
     These algorithms fall into three categories of learning.

  26. Categories of Algorithms Types of Learning:
     ➢ Supervised Learning
     ➢ Unsupervised Learning
     ➢ Reinforcement Learning

  27. Supervised Learning Supervised learning is a type of machine learning that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can predict the response values for a new dataset. Let’s take an example. Say you are a teacher and you teach by example, i.e. for every problem the kids face, you provide the solution. This type of learning is called supervised learning: teaching by example. Let’s take the same example forward.
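
To make "input data and response values" concrete, here is a minimal sketch of a supervised learner: a one-nearest-neighbour classifier. The algorithm choice, the training points and the function name are assumptions for illustration; the slides do not prescribe a specific method.

```python
def nearest_neighbor_predict(training, new_point):
    """Return the label of the training example closest to new_point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda pair: squared_distance(pair[0], new_point))
    return label


# Hypothetical training dataset: (glucose, bmi) inputs with known responses.
training = [
    ((85, 22.0), "negative"),
    ((90, 24.5), "negative"),
    ((160, 35.0), "positive"),
    ((175, 33.1), "positive"),
]

print(nearest_neighbor_predict(training, (150, 31.0)))
print(nearest_neighbor_predict(training, (88, 23.0)))
```

The labels play the role of the teacher's worked solutions: the prediction for a new input is borrowed from the most similar example already seen.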

  28. Unsupervised Learning Unsupervised learning is a type of machine learning used to draw inferences from datasets consisting of input data without labeled responses. When your kids make decisions based on their own understanding, this type of learning is unsupervised learning: self-learning.
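
A small sketch of learning without labeled responses: a one-dimensional k-means with two clusters. The values, the fixed starting centers (minimum and maximum) and the function name are assumptions chosen to keep the example deterministic.

```python
def two_means(values, iterations=10):
    """Group 1-D values into two clusters by repeated assign-and-average."""
    centers = [min(values), max(values)]  # simple deterministic start
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centers.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # no labels are given
centers, clusters = two_means(values)
print(centers)
print(clusters)
```

No response values are supplied; the grouping emerges from the input data alone.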

  29. Reinforcement Learning Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. If a new situation comes up, the kid will act on his own, i.e. from his past experience, but as a parent you can tell him at the end of an action whether he did well or not: good or bad?
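
The "good or bad?" feedback loop can be sketched with an epsilon-greedy agent choosing among three actions. This is an assumed toy setting (a multi-armed bandit with made-up reward probabilities), not something taken from the slides.

```python
import random


def epsilon_greedy(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which action pays best from reward feedback alone."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    estimates = [0.0] * len(reward_probs)
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            arm = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the best estimate so far.
            arm = max(range(len(reward_probs)), key=lambda a: estimates[a])
        # The environment answers "good" (1) or "bad" (0).
        reward = 1 if rng.random() < reward_probs[arm] else 0
        counts[arm] += 1
        # Incremental average of the rewards seen for this action.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy([0.2, 0.5, 0.8])
print("estimated reward per action:", [round(e, 2) for e in estimates])
print("times each action was taken:", counts)
```

The agent is never told which action is correct; it only receives a reward after acting and gradually favors the action with the highest cumulative payoff.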

  30. Data Science Tools

  31. Data Science Tools A tool widely used by data analysts is R. R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis.

  32. Why R?
     ➢ Programming and Statistical Language: Apart from being used as a statistical language, it can also be used as a programming language for analytical purposes.
     ➢ Data Analysis and Visualization: Apart from being one of the most dominant analytics tools, R is also one of the most popular tools used for data visualization.

  33. Why R?
     ➢ Simple and Easy to Learn: R is simple and easy to learn, read and write.
     ➢ Free and Open Source: R is an example of FLOSS (Free/Libre and Open Source Software), which means one can freely distribute copies of this software, read its source code, modify it, etc.

  34. Datasets To do any kind of analysis, you need data. This need is fulfilled through datasets. What are datasets? A dataset is a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer. [Figure: sample dataset]

  35. Datasets But what if you have a HUGE dataset? Ever heard of Big Data?

  36. What is Big Data?

  37. What is Big Data? “Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.”
     ➢ Volume: Processing increasingly huge data sets
     ➢ Variety: Processing different types of data
     ➢ Velocity: Data is being generated at an alarming rate
     ➢ Value: Finding correct meaning out of the data
     ➢ Veracity: Uncertainty and inconsistencies in the data
     EDUREKA HADOOP CERTIFICATION TRAINING www.edureka.co/big-data-and-hadoop

  38. Big Data These problems had to be dealt with. Hence, Hadoop came into the picture.

  39. What is Hadoop?

  40. What is Hadoop? Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
     ➢ Storage: Distributed File System
     ➢ Processing: Allows parallel & distributed processing
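
Hadoop's classic processing model is MapReduce. The sketch below imitates its map, shuffle and reduce phases in plain Python on two in-memory partitions. It is only a single-machine illustration of the idea: the function names and records are hypothetical, and nothing here calls Hadoop itself.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (key, 1) pair per record; each partition can be mapped independently."""
    return [(record["weather"], 1) for record in partition]


def shuffle(mapped_partitions):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for pairs in mapped_partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical data already split into two partitions (blocks).
partitions = [
    [{"weather": "rain"}, {"weather": "fine"}],
    [{"weather": "fine"}, {"weather": "snow"}, {"weather": "fine"}],
]

counts = reduce_phase(shuffle(map_phase(p) for p in partitions))
print(counts)
```

In a real cluster the partitions live on different machines in the distributed file system, and the map calls run in parallel next to the data.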


  43. What is Hadoop? Now you need a data analytics tool that can handle this much processing and data. For that we use SparkR.

  44. What is SparkR? SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark 2.1.1, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets.

  45. Demo

  46. Demo This dataset provides detailed road safety data about the circumstances of personal injury road accidents from 1979 to 2013. Our aim is to find the number of accidents that happened:
     ✓ In various weather conditions
     ✓ In various light conditions
     ✓ In various road surface conditions
     ✓ With make information of the accident vehicles
     ✓ During various days of the week
     ✓ On various road types
     We also want to find:
     ✓ Number of casualties per accident per year
     ✓ Number of accidents happening at various speed limits
     We have to find the results of the queries in Hadoop.
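
The queries listed above are group-and-count aggregations. A minimal sketch on a few made-up rows follows. The field names (`year`, `weather`, `speed_limit`, `casualties`) and both helpers are assumptions for illustration and do not reflect the real dataset's schema or the demo's actual code.

```python
from collections import Counter, defaultdict

# Hypothetical accident records; the real dataset's columns may differ.
accidents = [
    {"year": 2012, "weather": "Fine", "speed_limit": 30, "casualties": 1},
    {"year": 2012, "weather": "Raining", "speed_limit": 60, "casualties": 3},
    {"year": 2013, "weather": "Fine", "speed_limit": 30, "casualties": 2},
    {"year": 2013, "weather": "Fine", "speed_limit": 70, "casualties": 1},
]


def count_by(rows, field):
    """Number of accidents for each distinct value of a field."""
    return Counter(row[field] for row in rows)


def casualties_per_accident_by_year(rows):
    """Average casualties per accident, grouped by year."""
    totals = defaultdict(lambda: [0, 0])  # year -> [casualties, accidents]
    for row in rows:
        totals[row["year"]][0] += row["casualties"]
        totals[row["year"]][1] += 1
    return {year: c / n for year, (c, n) in totals.items()}


print(count_by(accidents, "weather"))
print(count_by(accidents, "speed_limit"))
print(casualties_per_accident_by_year(accidents))
```

The same `count_by` call covers light conditions, road surface, day of week and road type once those fields are present in the records.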

  47. Demo
     1. Huge amount of accident data
     2. Data stored in HDFS
     3. Using R for analysis
     Analyze the following queries for accidents:
     ➢ in various weather conditions
     ➢ in various light conditions
     ➢ in various road surface conditions
     ➢ with make information of the accident vehicles

  48. Session In A Minute
     ➢ Why Data Science?
     ➢ What is Data Science?
     ➢ Who is a Data Scientist?
     ➢ How is a problem solved in Data Science?
     ➢ Data Science Components
     ➢ Demo

  49. Thank You … Questions/Queries/Feedback
