Abstract
The aim of the current study was to analyse ways of reducing cost and improving performance in machine learning by integrating driverless AI such as H2O with Spark in R, and to generate reports from the results. The research pits regression models, namely linear model (LM), gradient boosting machine (GBM), XGBoost and Random Forest, against one another and focuses on identifying the best-performing model in terms of RMSE, execution time and hardware cost. The datasets contained 29 variables and more than 65,000 observations; of these variables, Origin, Dest, UniqueCarrier, FlightNum, Month, DayOfWeek, DayofMonth, Distance, DepDelay, ArrDelay, AirTime, Cancelled, hour and gain were considered. The analyses showed that GBM was the best-performing model at optimal cost, followed by XGBoost, Random Forest and LM. In conclusion, the results indicate that machine learning can be made cost-effective by integrating H2O with Spark in R, and that professional reports can be generated with feedback and test results from Shiny.