STATS 2DA3 Introduction to Data Science Methods Lecture 2: Datasets
Introduction
• To work with data, it must be stored in an appropriate format.
• With R you:
• select a data structure to store your data.
• enter the data into this structure.
• There are many data structures available for use with R, which we will discuss today.
Data Structures
• The types of data structures available in R include
• scalars
• vectors
• matrices
• arrays
• data frames
• lists
• A data frame holds data: the columns are variables and the rows are observations. A benefit is that variables of different types can sit in the same data frame, e.g. numeric and categorical data.
Data Structures
[Figure: R data structures; image from R in Action by Robert Kabacoff]
Objects
• An object in R is anything that can be assigned to a variable; every object has a class.
• Examples include all of the data structures listed on the previous slide, as well as constants, functions, graphs, and more.
• Basically, everything in R is an object.
• An object has a mode (which describes how the object is stored) and a class (which tells R functions how to handle the data).
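The difference between mode and class can be checked directly with the base functions mode() and class(). The following is a minimal sketch; the object names x, df and f are made up for illustration.

```r
# Three different kinds of object (names are illustrative only)
x  <- c(1.5, 2, 3)           # a numeric vector
df <- data.frame(a = 1:3)    # a data frame
f  <- function(z) z + 1      # functions are objects too

mode(x);  class(x)    # "numeric"  and "numeric"
mode(df); class(df)   # "list"     and "data.frame"
mode(f);  class(f)    # "function" and "function"
```

Note that a data frame is stored as a list (its mode) but is handled by R functions as a data.frame (its class).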
Vectors
• Vectors are 1-dimensional arrays that hold numeric, character or logical data.
• A vector is formed using the combine function c().
• Data must be of the same type, i.e. you cannot mix modes in a vector; if you pass mixed types to c(), R coerces them all to one common type.
• Scalars are vectors that contain 1 element. They are used to hold constants, e.g. f <- 5, h <- FALSE.
• You can refer to elements of a vector by referencing their position within the vector.
• The colon operator : generates a sequence of numbers in steps of 1 between two bounds, e.g. 2:6 gives 2, 3, 4, 5, 6.
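The points above can be tried out in a short session. This is a sketch with made-up values; the variable names are illustrative only.

```r
# Creating vectors with c(); each vector holds a single type
a <- c(1, 2, 5, 3, 6, -2, 4)      # numeric
b <- c("one", "two", "three")     # character
d <- c(TRUE, FALSE, TRUE)         # logical
f <- 5                            # a scalar is a vector of length 1

# Referring to elements by position
a[3]             # 5
a[c(1, 3, 5)]    # 1 5 6
a[2:4]           # 2 5 3

# The colon operator builds a sequence between two bounds
s <- 2:6         # 2 3 4 5 6

# Mixing types does not give a mixed vector: everything is coerced
mixed <- c(1, "two", TRUE)
class(mixed)     # "character"
```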
Matrices and Arrays
• A matrix is a 2-dimensional array.
• Data must be of the same type (numeric, character, or logical), i.e. you cannot mix modes in a matrix.
• A matrix is formed using the matrix() function.
• Arrays are similar to matrices but can have more than two dimensions.
• An array is formed using the array() function.
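A minimal sketch of matrix() and array() follows. The dimensions and values are arbitrary choices for illustration.

```r
# A 2 x 3 matrix; by default it is filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
m[1, 2]    # row 1, column 2: 3
m[2, ]     # all of row 2: 2 4 6

# byrow = TRUE fills row by row instead
mr <- matrix(1:6, nrow = 2, byrow = TRUE)
mr[1, ]    # 1 2 3

# An array with three dimensions: 2 rows, 3 columns, 4 layers
z <- array(1:24, dim = c(2, 3, 4))
dim(z)        # 2 3 4
z[1, 2, 3]    # 15
```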
Data Frames
• Data frames are the most popular type of data structure in R.
• Data can be of mixed type (numeric, character, or logical), i.e. you CAN mix modes in a data frame.
• A data frame is formed using the data.frame() function.
• Each column can only have one mode.
• Columns usually represent variables.
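As a sketch of how a data frame combines columns of different modes, the example below builds one from four vectors. The patient data are invented purely for illustration. stringsAsFactors = FALSE is passed explicitly so that the character columns stay as character in every version of R.

```r
# Each vector becomes one column (variable); each position is one row (observation)
patientID <- c(1, 2, 3, 4)
age       <- c(25, 34, 28, 52)
diabetes  <- c("Type1", "Type2", "Type1", "Type1")
status    <- c("poor", "improved", "excellent", "poor")

patients <- data.frame(patientID, age, diabetes, status,
                       stringsAsFactors = FALSE)

str(patients)              # compact summary of the structure
patients$age               # one column, selected by name
patients[2, ]              # one row (the second observation)
sapply(patients, class)    # each column has a single class
```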
Factors
• Variables can be nominal, ordinal, or continuous.
• Nominal variables are categorical, without an implied order, e.g. Diabetes (Type1, Type2).
• Ordinal variables imply order but not amount, e.g. Status (poor, improved, excellent).
• Continuous variables can take any value within a range, and both order and amount are implied, e.g. Age in years.
Factors
• Categorical variables, both nominal and ordinal, are called factors in R.
• The function factor() stores a variable as categorical data.
• For ordinal variables, add the parameter ordered=TRUE to the factor() function.
• Factor levels for character vectors are created in alphabetical order by default.
• You can change the level order by passing the levels argument to factor().
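The sketch below shows a nominal factor, an ordinal factor with the default alphabetical levels, and the same ordinal factor with the level order set explicitly. The values reuse the Diabetes and Status examples from the earlier slide; the variable names are illustrative.

```r
# Nominal variable: no implied order
diabetes   <- c("Type1", "Type2", "Type1", "Type1")
diabetes_f <- factor(diabetes)
levels(diabetes_f)        # "Type1" "Type2"

# Ordinal variable with default levels: alphabetical, which is not the intended order
status         <- c("poor", "improved", "excellent", "poor")
status_default <- factor(status, ordered = TRUE)
levels(status_default)    # "excellent" "improved" "poor"

# Ordinal variable with the level order given explicitly
status_f <- factor(status, ordered = TRUE,
                   levels = c("poor", "improved", "excellent"))
levels(status_f)             # "poor" "improved" "excellent"
status_f[1] < status_f[3]    # TRUE: poor comes before excellent
as.integer(status_f)         # underlying codes: 1 2 3 1
```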
Lists
• Lists are the most complex data structures in R.
• A list is an ordered collection of objects.
• A list can contain vectors, data frames, other lists, and so on, all combined under one name.
• A list is formed using the list() function.
• You can specify elements of a list by indicating a component within double brackets [[ ]].
• In R, many functions return lists. You then need to select the components of interest to you.
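A short sketch of building a list and pulling out its components follows. The object names are made up. The call to lm() on the built-in cars dataset is just one example of a function whose result is a list.

```r
# Objects of different kinds combined under one name
g <- "My first list"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
mylist <- list(title = g, ages = h, j)   # the third component is unnamed

mylist[[2]]         # second component: 25 26 18 39
mylist[["ages"]]    # the same component, selected by name
mylist$title        # "My first list"
class(mylist[2])    # "list": single brackets return a sub-list

# Many functions return lists, e.g. a linear model fit
fit <- lm(dist ~ speed, data = cars)
is.list(fit)        # TRUE
names(fit)          # the components available in the result
fit$coefficients    # select the component of interest
```

Double brackets return the component itself, while single brackets return a list containing it, which is why [[ ]] is usually what you want when extracting results.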
2023-01-18