Machine Learning for Airline Ticket Price Prediction

Objective: The goal of this project is to apply machine learning techniques to predict airline ticket prices using a provided dataset. You will engage in the end-to-end process of predictive modeling, from data preprocessing to model training and evaluation.

Dataset: The dataset provided contains airline ticket prices along with various attributes that may influence the price, such as departure and arrival locations, dates and times, airline, flight duration, number of stops, etc. There are 140000 observations in the training set.

The datasets you need are attached below:

· final_train.csv

· final_test.csv

The training set (final_train.csv) is the primary resource for building your machine learning models. In this set, each record includes the actual price (variable name: 'price') of the airline ticket (i.e., the "ground truth"), which serves as the "label" in your predictive model. The dataset features various attributes that might influence ticket prices, such as flight dates, times, airlines, routes, and other relevant factors. You are encouraged to leverage your domain knowledge or data analysis skills to create new features that could enhance your model's predictive accuracy.

The testing set is used to evaluate how well your model performs on unseen data. This set does not include information on the ticket prices. Your task is to apply the model you developed to predict the market value of each airline ticket in the test set (in USD). There are no restrictions on the types of models or algorithms you can use. You have the freedom to experiment with various methodologies, including but not limited to linear models, non-linear models, logistic regression, tree-based models, and more.
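Because the test set has no prices, one common way to estimate performance before submitting is to hold out part of the training set. The sketch below illustrates that idea in plain Python with a one-feature least-squares line; the helper names, the 20% validation fraction, and the toy (distance, price) pairs are illustrative assumptions, not part of the assignment.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split off a validation portion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Toy stand-in for (totalTravelDistance, price) pairs; real rows come from final_train.csv.
rows = [(d, 50 + d // 10) for d in range(100, 2100, 100)]
train, validation = train_validation_split(rows)

intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
validation_predictions = [intercept + slope * x for x, _ in validation]
```

In practice you would replace the single-feature line with whatever model you choose and score the validation predictions with the same metric used for grading.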

==========================================================================================================

2. Submission Files Format

You should upload two files using the file upload link on Canvas by Dec 19 (Tuesday), 11:59 pm. No extensions will be granted.

The two files are:

· Your python script

· The prediction outcome

For the Python script, you should submit a Python file named exactly "final.py".

For the prediction outcome:

· You should submit a CSV file named exactly "prediction.csv".

· The submitted CSV file should have exactly 60000 entries plus a header row.

· The file should have exactly 2 columns:

o id (sorted in any order)

o price (the predicted price)

I posted an example_sumbission.csv here as an example of what a submission file should look like. This example submission predicts that every price is simply the mean price of the training set. (This is by no means a good prediction; I believe you can absolutely do better than this example.)
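A mean-price baseline like the one in the example submission could be produced along these lines. This is only a sketch using the standard library: the function name write_mean_baseline, the two-decimal rounding, and the tiny in-memory files are assumptions made for illustration.

```python
import csv
import io
from statistics import mean

def write_mean_baseline(train_file, test_file, out_file):
    """Predict the training-set mean price for every id in the test set."""
    train_prices = [float(row["price"]) for row in csv.DictReader(train_file)]
    baseline = mean(train_prices)
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["id", "price"])  # required header row
    n_rows = 0
    for row in csv.DictReader(test_file):
        writer.writerow([row["id"], f"{baseline:.2f}"])
        n_rows += 1
    return baseline, n_rows

# Tiny in-memory stand-ins for final_train.csv and final_test.csv.
toy_train = io.StringIO("id,price\na1,100.0\na2,300.0\n")
toy_test = io.StringIO("id\nb1\nb2\nb3\n")
toy_out = io.StringIO()
baseline, n_rows = write_mean_baseline(toy_train, toy_test, toy_out)
```

With the real data you would pass files opened with open(..., newline="") for final_train.csv, final_test.csv, and prediction.csv, and then check that n_rows equals 60000 before uploading.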

==========================================================================================================

3. Evaluation

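Your goal is to predict the market price for each ticket in the test set. Try your best to maximize the coefficient of determination (i.e., R-squared) of your model. R^2 measures how well the model replicates the observed outcomes, based on the proportion of the total variation in outcomes that the model explains. Specifically, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared differences between the actual and predicted prices and SS_tot is the sum of squared differences between the actual prices and their mean.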
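As a hedged illustration of this metric, the sketch below computes R^2 in plain Python from the standard definition (one minus the residual sum of squares divided by the total sum of squares). The function name r_squared and the toy prices are made up for the example. Predicting the mean of the actual values scores exactly 0, which is why a mean-only prediction is a floor rather than a target.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy prices, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
mean_prediction = [250.0] * 4                    # always predict the mean
close_prediction = [110.0, 190.0, 310.0, 390.0]  # off by 10 each time

r2_mean = r_squared(actual, mean_prediction)
r2_close = r_squared(actual, close_prediction)
```

Since test prices are hidden, you can only compute this yourself on data where the price is known, such as a held-out part of the training set. scikit-learn's sklearn.metrics.r2_score computes the same quantity.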

Your grade depends on the following criteria:

Category | Achievement | Credit | Explanation
Submission | Both files are submitted by the deadline | 10% | Failing this step will result in a zero grade for the second part of the final exam.
Submission | The submitted files are in the required format | 10% | See Section 2 for details.
Completion | The submitted script runs without runtime errors | 10% | If any runtime error occurs, you will receive only partial credit for the submission (30% at most).
Completion | The submitted script produces the prediction.csv file you uploaded | 10% | If the file produced by your script is inconsistent with the CSV file you uploaded, you will be graded on the uploaded CSV file only, and you will receive at most 50% of the total grade.
Completion | The submitted CSV file contains all the cases in the test set, i.e., 60000 rows | 10% | You will lose 5x% of the grade if you drop cases from the test set, where x is the percentage of cases you dropped.
Modeling and performance | Your modeling process and model performance will be evaluated on: (1) whether the data is well cleaned and prepared for machine learning models; (2) whether feature engineering is done properly; (3) whether the trained model generates reasonable results on the test set | 50% | Just try your best. The best prediction (i.e., highest R^2 or lowest mean squared error) will receive a $25 Amazon gift card as a reward.


· All students are required to work on this project independently and submit their own individual work.

· Discussions shall be limited to general modeling ideas.

· You can use coding mate or any other generative AI tool.

· Sharing code is NOT ALLOWED.

· Plagiarism in coding is easy to detect and will not be tolerated. Both parties will receive 0.

==========================================================================================================

4. Variable Dictionary

· id: An identifier for the flight.

· searchDate: The date (YYYY-MM-DD) on which this entry was taken from Expedia.

· flightDate: The date (YYYY-MM-DD) of the flight.

· startingAirport: Three-character IATA airport code for the initial location.

· destinationAirport: Three-character IATA airport code for the arrival location.

· fareBasisCode: The fare basis code.

· travelDuration: The travel duration in hours and minutes.

· elapsedDays: The number of elapsed days (usually 0).

· price: The price of the ticket (in USD). This is the label.

· seatsRemaining: Integer for the number of seats remaining.

· totalTravelDistance: The total travel distance in miles. This data is sometimes missing.

· segmentsDepartureTimeEpochSeconds: String containing the departure time (Unix time) for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsDepartureTimeRaw: String containing the departure time (ISO 8601 format: YYYY-MM-DDThh:mm:ss.000±[hh]:00) for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsArrivalTimeEpochSeconds: String containing the arrival time (Unix time) for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsArrivalTimeRaw: String containing the arrival time (ISO 8601 format: YYYY-MM-DDThh:mm:ss.000±[hh]:00) for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsArrivalAirportCode: String containing the IATA airport code for the arrival location for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsDepartureAirportCode: String containing the IATA airport code for the departure location for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsAirlineName: String containing the name of the airline that services each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsAirlineCode: String containing the two-letter airline code that services each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsEquipmentDescription: String containing the type of airplane used for each leg of the trip (e.g. "Airbus A321" or "Boeing 737-800"). The entries for each of the legs are separated by '||'.

· segmentsDurationInSeconds: String containing the duration of the flight (in seconds) for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsDistance: String containing the distance traveled (in miles) for each leg of the trip. The entries for each of the legs are separated by '||'.

· segmentsCabinCode: String containing the cabin for each leg of the trip (e.g. "coach"). The entries for each of the legs are separated by '||'.
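Several of the variables above are strings that pack one value per leg, separated by '||', and the two date fields can be combined into a booking lead time. The sketch below shows one possible way to turn a record into a few numeric features. The helper names and the derived feature names (daysUntilFlight, numStops, and so on) are hypothetical, the toy row is invented, and skipping non-numeric duration entries is an assumption about how missing legs might appear.

```python
from datetime import date

def split_legs(value):
    """Split a '||'-separated string into one entry per leg of the trip."""
    return value.split("||") if value else []

def basic_features(row):
    """Derive a few numeric features from one record (a dict of strings)."""
    search = date.fromisoformat(row["searchDate"])
    flight = date.fromisoformat(row["flightDate"])
    airlines = split_legs(row["segmentsAirlineCode"])
    # Assumption: a missing per-leg duration shows up as a non-numeric entry; skip it.
    seconds = [int(s) for s in split_legs(row["segmentsDurationInSeconds"])
               if s.isdigit()]
    return {
        "daysUntilFlight": (flight - search).days,
        "flightWeekday": flight.weekday(),      # Monday = 0, Sunday = 6
        "numStops": max(len(airlines) - 1, 0),  # number of legs minus one
        "multiAirline": len(set(airlines)) > 1,
        "inAirSeconds": sum(seconds),
    }

# Invented record, for illustration only.
toy_row = {
    "searchDate": "2022-04-16",
    "flightDate": "2022-05-01",
    "segmentsAirlineCode": "DL||AA",
    "segmentsDurationInSeconds": "8940||5400",
}
features = basic_features(toy_row)
```

The same splitting helper applies to the other '||'-separated columns, for example to take the first departure time or the set of cabin codes.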

=======================================================================

5. Tips

§ Please allow yourselves AT LEAST 8 hours to work on this project. Do not wait until the last minute.

§ Be careful with missing values. Depending on your modeling strategy, you may drop observations with missing data in the training set. You can also use imputation algorithms to fill in the missing values before training your model.

§ Think carefully about the best ways to deal with strings. Can you simply drop some of them? For those you need to keep, do you need categorical variables? Dummies? Or numbers?

§ Do not underestimate the importance of exploratory data analysis. Effective EDA can help you understand the dataset and make better modeling decisions.

§ Have fun!
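To make the tips on missing values and string columns concrete, here is a minimal sketch of median imputation and dummy (one-hot) encoding in plain Python. The function names and toy values are illustrative assumptions. The key idea is that the fill value and the category list are learned from the training data and then reused on the test data, so no test rows need to be dropped.

```python
from statistics import median

def impute_median(values, fill=None):
    """Replace None entries with a fill value (default: median of observed values)."""
    if fill is None:
        fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values], fill

def one_hot(values, categories=None):
    """Encode strings as 0/1 dummy columns, one column per known category."""
    if categories is None:
        categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# Toy training columns, for illustration only.
train_distance = [947.0, None, 1462.0, 228.0]
train_airport = ["ATL", "BOS", "ATL", "LAX"]

filled_train, fill_value = impute_median(train_distance)
train_dummies, categories = one_hot(train_airport)

# Reuse the training statistics on the test columns.
filled_test, _ = impute_median([None, 500.0], fill=fill_value)
test_dummies, _ = one_hot(["BOS", "JFK"], categories)
```

An airport that never appears in training (JFK here) becomes an all-zero row rather than an error. Libraries such as pandas and scikit-learn provide equivalents, for example pandas.get_dummies and sklearn.impute.SimpleImputer.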