Introduction to Data Science Methods

STATS 2DA3

Introduction

• Take a look at the course outline.

•  Notes will be posted online ahead of classes on Avenue to Learn. Please check Avenue regularly for course announcements and assignments.

• Assignments will be administered using Avenue to Learn.

• Office hours.

• We have 2 lectures and 1 lab per week. Please check Mosaic for information on labs.

• There is no mandatory textbook; however, many of my notes are taken from R in Action by Robert Kabacoff. There is a list of Suggested Reading on Avenue to Learn.

Data Analytics

•  Data analysis can involve some or all of the following:

• transforming the data.

• imputation of missing values.

• variable selection.

• statistical modelling.
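
As a rough illustration of two of these steps, here is a minimal sketch of imputation followed by a transformation. The data frame `survey` and its values are made up for this example, and mean imputation is just one simple choice among many.

```r
# Hypothetical toy data: one income value is missing (NA)
survey <- data.frame(
  id     = 1:5,
  income = c(42000, 58000, NA, 61000, 39000)
)

# Imputation: replace the missing value with the mean of the observed values
observed_mean <- mean(survey$income, na.rm = TRUE)
survey$income_imputed <- ifelse(is.na(survey$income), observed_mean, survey$income)

# Transformation: put the imputed incomes on the log scale
survey$log_income <- log(survey$income_imputed)

print(survey)
```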

Data Analytics II

•  Modern Data Analytics also includes:

• pulling data from a variety of sources, such as database management systems, text files, spreadsheets, a variety of different statistical packages, and web pages.

• merging of data obtained from different sources.

• data cleaning.

• analysis with modern techniques, such as machine learning.

• creating graphical displays of results.
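
To make merging and cleaning a little more concrete, here is a small sketch using base R. The two data frames and the key column `id` are invented for illustration; real sources would normally be read from files or databases first.

```r
# Hypothetical data from two sources that share the key column `id`
grades <- data.frame(id = c(1, 2, 3), grade = c(78, 85, 91))
labs   <- data.frame(id = c(2, 3, 4), lab_section = c("L01", "L02", "L01"))

# Merge: keep only the ids that appear in both sources
combined <- merge(grades, labs, by = "id")

# Merge: keep every id from either source; unmatched cells become NA
combined_all <- merge(grades, labs, by = "id", all = TRUE)

# A basic cleaning step: drop rows that contain any missing values
cleaned <- na.omit(combined_all)

print(combined_all)
print(cleaned)
```

Dropping incomplete rows is only one possible cleaning rule; whether it is appropriate depends on the analysis.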

Data Analysis Flowchart

 

Image from R in Action by Robert Kabacoff.

About R

• R is an environment and programming language used for statistical computing.

•  It is open-source (and hence free!).

• There are many powerful graphics packages, such as ggplot2.

•  Results from any step in an analysis can be saved, manipulated, and used as a new input.

• R functionality can be integrated with other languages, e.g. C++, Python, and SAS.

• R runs on virtually any platform, e.g. macOS, Windows, and Unix.
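
The point about saving and reusing results can be sketched as follows. This example assumes the `mtcars` data set that ships with R; the choice of model is arbitrary and only serves to show a saved result feeding into later steps.

```r
# Step 1: fit a simple linear model and save the result as an object
fit <- lm(mpg ~ wt, data = mtcars)

# Step 2: manipulate the saved result, here by extracting the coefficients
coefs <- coef(fit)
slope <- coefs["wt"]

# Step 3: use the saved model as input to another function
new_cars  <- data.frame(wt = c(2.5, 3.5))   # hypothetical car weights
predicted <- predict(fit, newdata = new_cars)

print(slope)
print(predicted)
```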

Installing R

•  R is open source (and free) and can be downloaded from the Comprehensive R Archive Network (CRAN).

• Go to http://cran.r-project.org and download the version appropriate for your operating system (probably Windows or Mac).

• We will download additional libraries, such as ggplot2, later.

R Basics

• R is:

• case sensitive.

• an interpreted language (more on this later).

• You can enter commands at the prompt (>) and they will be executed one at a time; however, I recommend running your commands from a source file.

• R uses lots of data types, e.g. vectors, data frames, matrices, and lists (more on this later).

• There are lots of built-in functions, and users can create their own.

• Statements consist of functions and assignments.
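
A short sketch of the points above: case sensitivity, a few common data types, and a built-in versus a user-defined function. All object names here (`total`, `range_width`, and so on) are made up for illustration.

```r
# R is case sensitive: `total` and `Total` are two different objects
total <- 10
Total <- 20

# Some common data types
v  <- c(1.5, 2.5, 4)                               # vector
m  <- matrix(1:6, nrow = 2)                        # matrix with 2 rows, 3 columns
df <- data.frame(x = 1:3, y = c("a", "b", "c"))    # data frame
l  <- list(numbers = v, table = df)                # list holding mixed objects

# A built-in function ...
mean(v)

# ... and a user-defined function (hypothetical name)
range_width <- function(x) max(x) - min(x)
range_width(v)
```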

R Basics II

• Objects can be created and manipulated. An object is anything that can be assigned a value, e.g. data, results...

• An object must have a class attribute, which tells R how to handle it correctly.

• <- is used for assignment (in preference to =).

Note: x <- 5 is treated the same as 5 -> x, but don't use ->; it's not standard convention.

• To comment out text, use #. R will ignore anything that comes after #.
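
The following sketch pulls these points together: assignment in both directions, comments, and the class attribute as reported by `class()`. The variable names are arbitrary.

```r
x <- 5                 # preferred: leftward assignment
5 -> y                 # rightward assignment gives the same result, but avoid it

# This whole line is a comment, so R ignores it
z <- x + y             # R also ignores everything after the # on this line

# class() reports the class R uses to decide how to handle an object
class(x)                       # "numeric"
class("some text")             # "character"
class(data.frame(a = 1:2))     # "data.frame"
class(mean)                    # "function"
```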

Language Type

• Any program is basically a set of instructions.

•  Both compiled and interpreted languages take human-readable code and convert it into machine code, which can be read by a computer.

• With compiled languages, a compiler translates the whole program into machine code for the target machine before the program is run.

• With interpreted languages, an interpreter program reads and executes the code line by line.

Compiled Languages

• Compiled Languages:

• are directly converted into machine code, which the target machine executes.

• are fast.

• allow control of aspects such as memory and CPU use.

• have to be manually compiled before execution.

•  If a change to the program is desired, once the change is made the whole program needs to be recompiled.

•  Examples include C and C++.

Interpreted Language

•  Interpreted Languages:

• use an interpreter, which executes the program line by line.

• are usually slower to execute, relative to compiled languages.

• The main advantage of interpreted languages, besides the ease of editing the code, is that the interpreter executes the source code directly. Hence the code is platform independent.

• Examples include R and Python.
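
One way to see interpreted execution in R is that a mistake is only discovered when the offending line is actually run. The sketch below assumes no object called `undefined_value` exists in your session; the function name `risky` is made up.

```r
# Defining the function succeeds even though `undefined_value` does not exist:
# R does not evaluate the body until the function is called.
risky <- function() undefined_value + 1

steps <- character(0)
steps <- c(steps, "before")        # this line runs normally

outcome <- tryCatch(
  risky(),                         # the problem surfaces only here, at run time
  error = function(e) "error caught at run time"
)

steps <- c(steps, "after")         # execution carries on with the next line

print(outcome)
print(steps)
```

In a compiled language such as C, a reference to an undeclared variable would typically be rejected at compile time, before any of the program runs.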

RStudio

• The R interface is very simple (and I like it).

•  However, most people use RStudio.

•  RStudio Desktop is available from http://www.rstudio.com, specifically https://www.rstudio.com/products/rstudio/#rstudio-desktop. The last time I checked, you could get a free version there.

•  RStudio uses multiple windows and has tools for importing data, visualizing output, and writing reports (R Markdown).

•  RStudio is just an interface. Make sure you install R before installing RStudio!

The Workspace

• Your workspace includes all user-defined objects, such as vectors, matrices, functions, data frames and lists.

• Your working directory is where R reads files from, and where it saves files by default unless told otherwise.

• The function getwd() tells you your current working directory.

• The function setwd() allows you to change the working directory.

• You can call in a file that is not in the current working directory by using the full path name.

•  Use quotation marks ("") around file and directory names.
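
Here is a minimal sketch of working with the working directory and a full path. It uses R's temporary directory so that nothing is written to your own folders; the file name `example.csv` is made up, and `ls()` (not mentioned above) is the base R function that lists the objects in your workspace.

```r
original_dir <- getwd()             # where R currently reads and writes files

setwd(tempdir())                    # switch to a temporary directory
example_vector <- c(1, 2, 3)
"example_vector" %in% ls()          # the new object is now in the workspace

# Write to the working directory using just a (quoted) file name ...
write.csv(data.frame(x = example_vector), "example.csv", row.names = FALSE)

# ... and read the file back using its full path
full_path <- file.path(tempdir(), "example.csv")
from_file <- read.csv(full_path)

setwd(original_dir)                 # restore the original working directory
```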

Packages

•  Packages are collections of R functions, data, and code.

• R comes with many built-in packages, and you can download and install other packages that are of interest to you.

• You must load a package into your coding session to be able to access it.

•  Packages are stored in the library directory.

• The function .libPaths() tells you where your library is located.

• The function library() shows you what packages are in your library.

• The function search() displays what packages are currently loaded.
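
The functions above can be tried out as in the sketch below. It loads `tools`, a package that ships with R, so no download is needed. `installed.packages()` is used here as a programmatic alternative to calling `library()` with no arguments; the `ggplot2` lines are commented out because they need an internet connection.

```r
.libPaths()                                    # directories where packages are installed

installed <- rownames(installed.packages())    # names of the installed packages
"stats" %in% installed                         # stats ships with R

library(tools)                                 # load a package into the session
"package:tools" %in% search()                  # TRUE once the package is loaded

# For a package that is not yet installed:
# install.packages("ggplot2")
# library(ggplot2)
```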

Examples

•  Let’s now look at some basic examples in R.

•  Please download R and make sure you can run and understand the examples before the next class.