Performance improvement and reporting techniques using SparklyR and

Pilli, Happy Justin
MSc in Data Analytics
Dublin Business School
The aim of the current study was to analyse ways of reducing cost and improving performance for machine learning by integrating driverless AI such as H2O with Spark in R, and to generate reports on the results. The research compares the regression models LM, GBM, XGBoost and Random Forest against one another and focuses on identifying the best-performing model in terms of RMSE, execution time and hardware cost. The datasets contained 29 variables and more than 65,000 observations, of which Origin, Dest, UniqueCarrier, FlightNum, Month, DayOfWeek, DayofMonth, Distance, DepDelay, ArrDelay, AirTime, Cancelled, hour and gain were considered. The analyses showed that GBM was the best-performing model with optimal cost, followed by XGBoost, Random Forest and LM. In conclusion, the study found that integrating H2O with Spark in R makes machine learning cost-effective, and that professional reports can be generated with feedback and test results from Shiny.
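As an illustration of the comparison criteria named in the abstract, the following is a minimal sketch of how candidate models might be ranked by RMSE and execution time. It is not the study's own code, which used R with SparklyR and H2O. The function names `rmse` and `evaluate`, the two toy predictors and the toy data are all hypothetical and stand in for real fitted models and the flight dataset.

```python
import math
import time


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def evaluate(name, predict_fn, features, actual):
    """Time a prediction function and score its output with RMSE."""
    start = time.perf_counter()
    predicted = [predict_fn(x) for x in features]
    elapsed = time.perf_counter() - start
    return {"model": name, "rmse": rmse(actual, predicted), "seconds": elapsed}


# Toy data (hypothetical): the target is exactly twice the feature.
features = [1.0, 2.0, 3.0, 4.0]
actual = [2.0, 4.0, 6.0, 8.0]

# Hypothetical stand-ins for fitted models such as LM or GBM.
candidates = {
    "doubling": lambda x: 2.0 * x,
    "constant_mean": lambda x: 5.0,
}

results = [evaluate(name, fn, features, actual) for name, fn in candidates.items()]

# Rank by RMSE first, then by execution time as a tie-breaker.
ranked = sorted(results, key=lambda r: (r["rmse"], r["seconds"]))

for row in ranked:
    print(f'{row["model"]}: RMSE={row["rmse"]:.3f}, time={row["seconds"]:.6f}s')
```

Hardware cost, the third criterion in the study, is not modelled here; it would need to be recorded separately for each run.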