Introduction to Data Science Methods

Lecture 2: Datasets

STATS 2DA3

Introduction

• To work with data, it must be stored in an appropriate format.

• With R, you:

• select a data structure to store your data.

• enter the data into this structure.

• There are many data structures available for use with R, which we will discuss today.

Data Structures

• The types of data structures available in R include

• scalars

• vectors

• matrices

• arrays

• data frames

• lists

• A data frame holds data: the columns are variables and the rows are observations. A benefit is that variables of different types, e.g. numeric and categorical, can be stored in the same data frame.

Data Structures

 

[Figure: image from R in Action by Robert Kabacoff]

Objects

• An object in R is a data object which has a class attribute.

• Examples include all of the data structures listed on the previous slide, as well as constants, functions, graphs, and so on.

•  Basically, everything in R is an object.

• An object has a mode (which describes how the object is stored) and a class (which tells R functions how to handle the data).
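As a rough illustration of the difference between mode and class, the sketch below inspects a few objects with mode() and class(). The object names and values are made up for this example.

```r
# A few objects of different kinds (names and values are illustrative only)
x  <- c(1.5, 2.5, 3.5)        # numeric vector
m  <- matrix(1:6, nrow = 2)   # matrix of integers
df <- data.frame(id = 1:3)    # data frame
fn <- function(a) a + 1       # function

mode(x)    # "numeric"
class(x)   # "numeric"

mode(m)    # "numeric": how the values are stored
class(m)   # includes "matrix": how R functions treat the object

mode(df)   # "list": a data frame is stored as a list of columns
class(df)  # "data.frame"

mode(fn)   # "function"
class(fn)  # "function"
```

Note that the data frame is stored as a list, but its class tells functions such as print() and summary() to treat it as a table.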

Vectors

• Vectors are 1-dimensional arrays that hold numeric, character or logical data.

• A vector is formed using the combine function c()

•  Data must be of the same type, i.e. you cannot mix modes in a vector.

• Scalars are vectors that contain 1 element. They are used to hold constants, e.g. f <- 5, h <- FALSE.

• You can refer to elements of a vector by referencing their position within the vector.

• The colon operator : generates a sequence of numbers between two bounds, e.g. 2:6 gives 2, 3, 4, 5, 6.
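The points above can be sketched as follows. The vector names and values are chosen only for illustration.

```r
# Vectors are created with c(); each vector holds a single mode
a <- c(1, 2, 5, 3, 6, -2, 4)     # numeric
b <- c("one", "two", "three")    # character
d <- c(TRUE, FALSE, TRUE)        # logical

# Scalars are simply one-element vectors
f <- 5
h <- FALSE

# Refer to elements by position (positions start at 1)
a[3]            # 5
a[c(1, 3, 5)]   # 1 5 6
a[2:6]          # 2 5 3 6 -2

# The colon operator generates a sequence between two bounds
2:6             # 2 3 4 5 6

# Mixing modes does not give a mixed vector: R coerces to one mode
mixed <- c(1, "two", TRUE)
class(mixed)    # "character"
```

The last lines show what "you cannot mix modes" means in practice: R does not raise an error, it silently converts every element to a common mode.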

Matrices and Arrays

• A matrix is a 2-dimensional array.

• Data must be of the same type (numeric, character, or logical), i.e. you cannot mix modes in a matrix.

• A matrix is formed using the matrix() function.

• Arrays are similar to matrices but can have more than two dimensions.

• An array is formed using the array() function.
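A minimal sketch of matrix() and array() follows. The dimensions, values, and dimension names are hypothetical and chosen only for illustration.

```r
# 2 x 3 matrix; values are filled column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)

# Same values filled row by row, with illustrative row and column names
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                   dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

m[2, 3]         # element in row 2, column 3: 6
m[, 2]          # second column: 3 4
m_by_row[1, ]   # first row: 1 2 3

# An array with three dimensions (2 x 3 x 4)
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)        # 2 3 4
arr[1, 2, 3]    # 15
```

Elements are selected with one index per dimension; leaving an index blank selects everything along that dimension.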

Data Frames

•  Data frames are the most popular type of data structure in R.

• Data can be of mixed type (numeric, character, or logical), i.e. you CAN mix modes in a data frame.

• A data frame is formed using the data.frame() function.

•  Each column can only have one mode.

• Columns usually represent variables.
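As a sketch, a small data frame can be built from vectors of different modes. The data below are invented for illustration, and stringsAsFactors = FALSE is set explicitly because the default differs between R versions.

```r
# Illustrative data: each column is a variable, each row an observation
patients <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("poor", "improved", "excellent", "poor"),
  stringsAsFactors = FALSE   # keep text columns as character
)

str(patients)                        # compact summary of the structure

patients$age                         # one column: 25 34 28 52
patients[2, ]                        # one row (the second observation)
patients[, c("patientID", "age")]    # a subset of columns

# Each column has exactly one mode, but the modes differ across columns
sapply(patients, mode)
```

The final call should report "numeric" for the first two columns and "character" for the last two.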

Factors

• Variables can be nominal, ordinal, or continuous.

•  Nominal variables are categorical, without an implied order, e.g. Diabetes (Type1, Type2).

• Ordinal variables imply order but not amount, e.g. Status (poor, improved, excellent).

• Continuous variables can take any value within a range, and both order and amount are implied, e.g. Age in years.

Factors

• Categorical variables, both nominal and ordinal, are called factors in R.

• The function factor() stores a variable as categorical data.

•  For ordinal variables, add the parameter ordered=TRUE to the factor() function.

•  Factor levels for character vectors are created in alphabetical order by default.

• You can change the level order using the levels argument of factor().
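The following sketch shows factor() applied to a nominal and an ordinal variable, reusing the Diabetes and Status examples from the earlier slide. The data values themselves are invented.

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")
status   <- c("poor", "improved", "excellent", "poor")

# Nominal variable: no implied order
diabetes_f <- factor(diabetes)
levels(diabetes_f)     # "Type1" "Type2"

# Ordinal variable with default (alphabetical) levels:
# excellent < improved < poor, which is not the intended order
status_alpha <- factor(status, ordered = TRUE)
levels(status_alpha)   # "excellent" "improved" "poor"

# Specify the level order explicitly with the levels argument
status_f <- factor(status, ordered = TRUE,
                   levels = c("poor", "improved", "excellent"))
levels(status_f)       # "poor" "improved" "excellent"

status_f[1] < status_f[2]   # TRUE: poor < improved
as.integer(status_f)        # underlying integer codes: 1 2 3 1
```

Comparisons such as < are only meaningful for ordered factors, which is why the level order matters for ordinal variables.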

Lists

•  Lists are the most complex data structures in R.

• A list is an ordered collection of objects.

• A list can contain vectors, data frames, other lists, etc., all combined under one name.

• A list is formed using the list() function.

• You can specify elements of a list by indicating a component within double brackets [[ ]].

•  In R, many functions return lists. You then need to select the components of interest to you.
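A minimal sketch of creating a list and selecting its components is given below. The object names and values are illustrative, and lm() with the built-in cars dataset is used only as one convenient example of a function that returns a list.

```r
g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
k <- c("one", "two", "three")

# Components may be named (title, ages) or unnamed (j, k)
mylist <- list(title = g, ages = h, j, k)

mylist[[2]]          # second component: 25 26 18 39
mylist[["ages"]]     # the same component, selected by name
mylist$title         # "My First List"

class(mylist[2])     # "list": single brackets return a sub-list
class(mylist[[2]])   # "numeric": double brackets return the component

# Many functions return lists; a fitted linear model is one example
fit <- lm(dist ~ speed, data = cars)
is.list(fit)         # TRUE
names(fit)           # names of the available components
fit$coefficients     # select the component of interest
```

Calling names() on a returned object is a quick way to see which components are available before selecting one.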