SRI RAMAKRISHNA ENGINEERING COLLEGE
[Educational Service: SNR Sons Charitable Trust]
[Autonomous Institution, Accredited by NAAC with 'A' Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all Eligible Programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST, COIMBATORE – 641 022.
Department of Artificial Intelligence and Data Science
20AD2E51 - PYTHON FRAMEWORKS FOR MACHINE LEARNING
Presentation by Mrs. V. GomathiSankari, Assistant Professor / AI&DS
Pandas Library
Introduction
• Pandas is a Python library used for working with data sets.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.
• Pandas allows us to analyze big data and draw conclusions based on statistical theory.
• Pandas can clean messy data sets and make them readable and relevant.
• Relevant data is very important in data science.
Import Pandas
• Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas
• Now Pandas is imported and ready to use.
Example
import pandas
mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas Series
• A Pandas Series is like a column in a table.
• It is a one-dimensional array holding data of any type.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
• If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.
• This label can be used to access a specified value.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
print(myvar["y"])
Pandas DataFrames
• A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Locate Row
• As you can see from the result above, the DataFrame is like a table with rows and columns.
• Pandas uses the loc attribute to return one or more specified row(s).
#refer to the row index:
print(df.loc[0])
• Return row 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Named Indexes
• With the index argument, you can name your own indexes.
import pandas as pd
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Locate Named Indexes
#refer to the named index:
print(df.loc["day2"])
Load Files Into a DataFrame
• If your data sets are stored in a file, Pandas can load them into a DataFrame.
• Load a comma separated file (CSV file) into a DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Analyzing DataFrames
Viewing the Data
• The head() method returns the headers and a specified number of rows, starting from the top.
• The tail() method returns the headers and a specified number of rows, starting from the bottom.
Info About the Data
• The info() method gives more information about the data set:
print(df.info())
• Empty values, or Null values, can be bad when analyzing data, and you should consider removing rows with empty values. This is a step towards what is called cleaning data.
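• A minimal sketch of these inspection methods, assuming the same data.csv file used in the earlier read_csv example:
import pandas as pd

df = pd.read_csv('data.csv')   # assumes a local data.csv as in the earlier slide

print(df.head(10))   # first 10 rows (default is 5)
print(df.tail())     # last 5 rows
print(df.info())     # column names, non-null counts and dtypes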
Cleaning Data
• Data cleaning means fixing bad data in your data set.
• Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
Cleaning Data
Empty Cells
• Empty cells can potentially give you a wrong result when you analyze data.
• One way to deal with empty cells is to remove rows that contain empty cells.
Remove Rows
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
Cleaning Data
Removing Duplicates
Discovering Duplicates
• The duplicated() method returns a Boolean value for each row:
print(df.duplicated())
Removing Duplicates
• To remove duplicates, use the drop_duplicates() method:
df.drop_duplicates(inplace = True)
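• A minimal self-contained sketch of both methods, using a small hypothetical DataFrame instead of data.csv:
import pandas as pd

# small example frame with one duplicated row (hypothetical data)
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ben", "Cara"],
    "score": [90, 75, 75, 88]
})

print(df.duplicated())           # True for the second "Ben" row
df.drop_duplicates(inplace=True)
print(df)                        # duplicate row removed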
NumPy Introduction
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy was created in 2005 by Travis Oliphant. It is an open source project and you can use it freely.
• NumPy stands for Numerical Python.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where speed and resources are very important.
Import NumPy
import numpy
arr = numpy.array([1, 2, 3, 4, 5])
print(arr)
Checking the NumPy Version
import numpy as np
print(np.__version__)
Checking the Type of the Array Object
print(type(arr))
Dimensions in Arrays
0-D Arrays
• 0-D arrays, or scalars, are the elements in an array. Each value in an array is a 0-D array.
arr = np.array(42)
1-D Arrays
• An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.
arr = np.array([1, 2, 3, 4, 5])
2-D Arrays
• An array that has 1-D arrays as its elements is called a 2-D array.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr)
Check Number of Dimensions
• NumPy arrays provide the ndim attribute, which returns an integer that tells us how many dimensions the array has.
import numpy as np
a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])
d = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(a.ndim)
print(b.ndim)
print(c.ndim)
print(d.ndim)
Shape of an Array
• The shape of an array is the number of elements in each dimension.
import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(arr.shape)
O/P: (2, 4)
• Create an array with 5 dimensions using ndmin and verify its shape:
import numpy as np
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)
print('shape of array :', arr.shape)
Reshaping Arrays
• Reshape a 1-D array with 12 elements into a 2-D array with 4 rows and 3 columns:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(4, 3)
print(newarr)
Flattening the Arrays
• Flattening an array means converting a multidimensional array into a 1-D array.
• We can use reshape(-1) to do this.
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
newarr = arr.reshape(-1)
print(newarr)
Data Types in NumPy
• Below is a list of all data types in NumPy and the characters used to represent them:
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string (bytes)
U - unicode string
V - fixed chunk of memory for other type (void)
Checking the Data Type of an Array
import numpy as np
arr = np.array([1, 2, 3, 4])
print(arr.dtype)
O/P: int64
arr = np.array(['apple', 'banana', 'cherry'])
print(arr.dtype)
O/P: <U6
Creating Arrays With a Defined Data Type
• We use the array() function to create arrays. This function can take an optional argument, dtype, that allows us to define the expected data type of the array elements.
Example 1:
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)
Example 2:
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)
Creating Arrays With a Defined Data Type
• If a type is given in which the elements cannot be cast, NumPy will raise a ValueError:
import numpy as np
arr = np.array(['a', '2', '3'], dtype='i')   # raises ValueError: 'a' cannot be converted to an integer
Converting Data Type on Existing Arrays
• The best way to change the data type of an existing array is to make a copy of the array with the astype() method.
• The astype() function creates a copy of the array, and allows you to specify the data type as a parameter.
import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr = arr.astype('i')
print(newarr)
print(newarr.dtype)

arr = np.array([1, 0, 3])
newarr = arr.astype(bool)
print(newarr)
print(newarr.dtype)
Scikit-Learn - Introduction
• Scikit-learn (sklearn) is the most useful and robust library for machine learning in Python.
• It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.
• This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
• Scikit-learn is a community effort and anyone can contribute to it. The project is hosted on https://github.com/scikit-learn/scikit-learn.
Features
• Supervised learning algorithms
• Unsupervised learning algorithms
• Clustering
• Cross-validation
• Dimensionality reduction
• Ensemble methods
• Feature extraction
• Feature selection
Scikit-Learn - Modelling Process
Dataset Loading
• A collection of data is called a dataset. A dataset has the following two components:
• Features − The variables of the data are called its features. They are also known as predictors, inputs or attributes.
• Feature matrix − The collection of features, in case there is more than one.
• Feature names − The list of all the names of the features.
• Response − The output variable that basically depends upon the feature variables. It is also known as the target, label or output.
• Response vector − Used to represent the response column. Generally, we have just one response column.
• Target names − The possible values taken by the response vector.
Example Datasets
• Scikit-learn has a few example datasets, like iris and digits for classification and the Boston house prices dataset for regression.
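• As a concrete illustration of the components above, here is a minimal sketch (assuming scikit-learn is installed) that loads the iris dataset and inspects its feature matrix, response vector, feature names and target names:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data            # feature matrix (150 x 4)
y = iris.target          # response vector (150,)

print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)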
Splitting the Dataset
• To check the accuracy of our model, we can split the dataset into two pieces: a training set and a testing set.
• Use the training set to train the model and the testing set to test the model. After that, we can evaluate how well the model did, as in the example below.
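• A minimal sketch of the split, assuming the iris data loaded on the earlier slide; the 30% test ratio and random_state value are illustrative choices:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

# hold out 30% of the 150 rows (45 rows) for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape)   # (105, 4)
print(X_test.shape)    # (45, 4)
print(y_train.shape)   # (105,)
print(y_test.shape)    # (45,)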
Cont…
• As seen in the example above, it uses the train_test_split() function of scikit-learn to split the dataset. This function has the following arguments −
• X, y − Here, X is the feature matrix and y is the response vector, which need to be split.
• test_size − This represents the ratio of test data to the total given data. As in the above example, we set test_size = 0.3 for the 150 rows of X. It will produce test data of 150 * 0.3 = 45 rows.
• random_state − It is used to guarantee that the split will always be the same. This is useful in situations where you want reproducible results.
Train the Model
• Next, we can use our dataset to train a prediction model. As discussed, scikit-learn has a wide range of Machine Learning (ML) algorithms which have a consistent interface for fitting, predicting accuracy, recall, etc.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

classifier_knn = KNeighborsClassifier(n_neighbors = 3)
classifier_knn.fit(X_train, y_train)
y_pred = classifier_knn.predict(X_test)

# Finding accuracy by comparing actual response values (y_test) with predicted response values (y_pred)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Providing sample data; the model will make predictions out of that data
sample = [[5, 5, 3, 2], [2, 4, 3, 5]]
preds = classifier_knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
Logistic Regression
• Logistic regression is a statistical analysis method used to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
• For example, logistic regression could be used to predict whether a political candidate will win or lose an election, or whether a high school student will be admitted to a particular college. These binary outcomes allow straightforward decisions between two alternatives.
• Logistic regression has become an important tool in the discipline of machine learning. It allows algorithms used in machine learning applications to classify incoming data based on historical data. As additional relevant data comes in, the algorithms get better at predicting classifications within data sets.
• Logistic regression can also play a role in data preparation activities by allowing data sets to be put into specifically predefined buckets during the extract, transform, load (ETL) process in order to stage the information for analysis.
Logistic Regression - Implementation
• Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?).
• If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled "1"); otherwise it predicts that it does not (i.e., it belongs to the negative class, labeled "0").
• This makes it a binary classifier.
Estimating Probabilities:
• A Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result:
• Logistic Regression model estimated probability (vectorized form): p̂ = hθ(x) = σ(θᵀx)
• The logistic, noted σ(·), is a sigmoid function that outputs a number between 0 and 1: σ(t) = 1 / (1 + e^(–t))
• Once the Logistic Regression model has estimated the probability p̂ = hθ(x) that an instance x belongs to the positive class, it can make its prediction ŷ easily (Equation 4-15):
ŷ = 0 if p̂ < 0.5, and ŷ = 1 if p̂ ≥ 0.5
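• A minimal sketch of this in scikit-learn, assuming the iris data from the earlier slides; using only the petal width feature to predict whether a flower is Iris virginica is an illustrative choice, not the prescribed setup:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                 # petal width (cm) only
y = (iris.target == 2).astype(int)   # 1 if Iris virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

new_flowers = np.array([[1.7], [0.5]])      # petal widths in cm
print(log_reg.predict_proba(new_flowers))   # estimated probabilities for classes 0 and 1
print(log_reg.predict(new_flowers))         # class predictions using the 50% threshold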
Training and Cost Function:
• The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This idea is captured by the following cost function (Equation 4-16) for a single training instance x:
c(θ) = –log(p̂) if y = 1, and c(θ) = –log(1 – p̂) if y = 0
• This cost function makes sense because –log(t) grows very large when t approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance, and it will also be very large if the model estimates a probability close to 1 for a negative instance. On the other hand, –log(t) is close to 0 when t is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.
• The cost function over the whole training set is the average cost over all training instances. It can be written in a single expression called the log loss (Equation 4-17):
J(θ) = –(1/m) Σ [ y(i) log(p̂(i)) + (1 – y(i)) log(1 – p̂(i)) ], summed over the m training instances
• The bad news is that there is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no equivalent of the Normal Equation). The good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough). The partial derivatives of the cost function with regard to the jth model parameter θj are given by Equation 4-18 (Refer Ex. No. 4):
∂J(θ)/∂θj = (1/m) Σ ( σ(θᵀx(i)) – y(i) ) xj(i), summed over the m training instances
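• A minimal NumPy sketch of batch Gradient Descent using this gradient (Equation 4-18); the learning rate, iteration count and toy data are illustrative assumptions, not a prescribed implementation:
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# toy training data: a bias column of ones plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])

m, n = X.shape
theta = np.zeros(n)
eta = 0.1                      # learning rate (assumption)

for _ in range(5000):          # number of iterations (assumption)
    p_hat = sigmoid(X @ theta)                  # estimated probabilities
    gradient = (1.0 / m) * X.T @ (p_hat - y)    # Equation 4-18 for all theta_j at once
    theta -= eta * gradient

print("theta:", theta)
print("predictions:", (sigmoid(X @ theta) >= 0.5).astype(int))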
Sklearn Decision Tree Classifier
Classifiers
• A classifier algorithm can be used to anticipate and understand what qualities are connected with a given class or target, by mapping input data to a target variable using decision rules.
• In this supervised machine learning technique, we already have the final labels and are only interested in how they might be predicted.
• Based on variables such as sepal width, petal length, sepal length, and petal width, we may use the Decision Tree Classifier to estimate the sort of iris flower we have.
Cont…
Decision Tree
• A decision tree is a model of decisions and all of their possible outcomes, including utility, outcomes and input costs, represented using a flowchart-like tree structure.
• The decision tree algorithm is classified as a supervised learning algorithm. It can be used with both continuous and categorical output variables.
• Results are represented by the branches/edges, and the nodes contain either of the following:
• Conditions (decision nodes)
• Results (end nodes)
Step-by-Step Implementation of Sklearn Decision Trees
Importing the Dataset
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# convert to a dataframe
df = pd.DataFrame(data.data, columns = data.feature_names)

# create the species column
df['Species'] = data.target

# replace the numeric codes with the actual names
target = np.unique(data.target)
target_names = np.unique(data.target_names)
targets = dict(zip(target, target_names))
df['Species'] = df['Species'].replace(targets)
Cont…
Extracting Datasets
x = df.drop(columns = "Species")
y = df["Species"]
feature_names = x.columns
labels = y.unique()

# split the dataset
from sklearn.model_selection import train_test_split
X_train, test_x, y_train, test_lab = train_test_split(x, y, test_size = 0.4, random_state = 42)
Cont…
Importing the Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
Fitting the Algorithm to the Training Data
clf = DecisionTreeClassifier(max_depth = 3, random_state = 42)
clf.fit(X_train, y_train)
Checking the Algorithm
• As a tree diagram
• As a text-based diagram
Cont…
1. As a Tree Diagram
from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize = (30, 10), facecolor = 'k')
a = tree.plot_tree(clf,
                   feature_names = feature_names,
                   class_names = labels,
                   rounded = True,
                   filled = True,
                   fontsize = 14)
plt.show()
Cont…
2. As a Text-Based Diagram
from sklearn.tree import export_text

tree_rules = export_text(clf, feature_names = list(feature_names))
print(tree_rules)
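• As a quick check of the fitted tree, a minimal sketch that reuses the test_x and test_lab split from the earlier slide, mirroring the accuracy step shown for the KNN model:
from sklearn import metrics

test_pred = clf.predict(test_x)
print("Accuracy:", metrics.accuracy_score(test_lab, test_pred))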
Random Forest Classifier
• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML.
• It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
• A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
Working of the Random Forest Algorithm
Why Use Random Forest?
• It takes less training time compared to other algorithms.
• It predicts output with high accuracy; even for a large dataset it runs efficiently.
• It can also maintain accuracy when a large proportion of the data is missing.
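• A minimal sketch of a Random Forest classifier in scikit-learn, assuming the iris training/test split from the decision tree slides (X_train, y_train, test_x, test_lab); the number of trees is an illustrative assumption:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# ensemble of 100 decision trees, each grown on a bootstrap sample of the training data
rf_clf = RandomForestClassifier(n_estimators = 100, random_state = 42)
rf_clf.fit(X_train, y_train)

rf_pred = rf_clf.predict(test_x)
print("Accuracy:", metrics.accuracy_score(test_lab, rf_pred))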