The popularization of user-friendly machine learning (ML) tools, such as R and Weka, has made it possible to apply ML in many knowledge areas. New approaches can be tried on old problems, and datasets can be explored through a new lens. After all, unlike multiple regression, ML techniques can handle non-linear responses, threshold-based responses, different types of variables, correlated variables and interactions between variables. However, to really gain from ML, these easy-to-use tools must be coupled with good practices: there are several measures to take when building a model. It is not enough to click the Random Forest button instead of the multiple regression one. In our most recent paper, we show these effects for sugarcane yield modeling (how many tons per hectare will be produced) using data from a sugarcane mill's own database.
- ML models are strongly influenced by hyperparameter tuning. Think of it like the gears on a bicycle. For each terrain and desired speed, it is not enough to own a 21-gear bike; you also need to shift to the right gear. Using an ML algorithm without tuning its hyperparameters is like riding a 21-gear bike in whatever gear it left the store in (see the tuning sketch after this list).
- Besides that, using your domain knowledge to “chew” the data can make the algorithm’s job easier. Take fertilization, an example from the paper. Fertilizers appeared in the dataset as their commercial formula, something like NN-PP-KK, where NN is the percentage of nitrogen, PP of phosphorus and KK of potassium. Combined with the amount applied, this tells you how much of each nutrient went into the field. The transformation turns two features into three, but it also makes the information more transparent to the model (see the feature-engineering sketch after this list).
- Finally, even though these techniques can handle a large number of attributes, selecting good features also improves model performance. It is always important to remember that, given the learning ability of these algorithms, this selection should never be evaluated on the dataset used to build the model (see the cross-validation sketch after this list).
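To make the tuning point concrete, here is a minimal sketch of what “shifting gears” looks like in code. It uses scikit-learn’s GridSearchCV with a Random Forest on a synthetic dataset standing in for the mill data; the grid values are illustrative, not the ones from our paper.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a yield dataset: features X, yield y
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Factory" configuration: the algorithm with its default hyperparameters
default_rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Tuned configuration: search a small, illustrative grid with cross-validation
param_grid = {
    "n_estimators": [100, 500],
    "max_features": [0.3, 0.6, 1.0],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

print("default MAE:", mean_absolute_error(y_test, default_rf.predict(X_test)))
print("tuned   MAE:", mean_absolute_error(y_test, search.predict(X_test)))
print("best params:", search.best_params_)
```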
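The fertilizer example can also be sketched in a few lines. The column names and values below are made up; the point is only the transformation from (formula, amount) into explicit nutrient quantities.

```python
import pandas as pd

# Hypothetical slice of a mill database: commercial formula (N-P-K percentages)
# and total amount applied, in kg per hectare
df = pd.DataFrame({
    "formula": ["10-20-20", "04-14-08", "20-00-20"],
    "amount_kg_ha": [400.0, 500.0, 300.0],
})

# Split the commercial formula into its nutrient percentages
pct = df["formula"].str.split("-", expand=True).astype(float)
pct.columns = ["pct_N", "pct_P", "pct_K"]

# Two original features (formula, amount) become three explicit ones:
# kilograms of each nutrient actually applied per hectare
df["N_kg_ha"] = df["amount_kg_ha"] * pct["pct_N"] / 100
df["P_kg_ha"] = df["amount_kg_ha"] * pct["pct_P"] / 100
df["K_kg_ha"] = df["amount_kg_ha"] * pct["pct_K"] / 100

print(df[["N_kg_ha", "P_kg_ha", "K_kg_ha"]])
```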
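Finally, a sketch of keeping feature selection out of the evaluation data. This is not the selection method from our paper; it just shows one common way to do it, by putting the selector inside a pipeline so it is re-fitted on each training fold of the cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)

# The selector inside the pipeline only ever sees the training fold,
# so the held-out fold never influences which features are kept
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=8)),
    ("model", RandomForestRegressor(random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_absolute_error")
print("cross-validated MAE:", -scores.mean())
```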
In our work, we show how these activities impact sugarcane yield modeling and its evaluation, but these procedures could be adapted for many different crops.
To learn more about our work, you can find it at this link.
Model evaluation was performed using the regression error characteristic (REC) curve and is discussed by Thiago, an undergraduate student in our research group, here (in Portuguese). The direct link to the evaluation app is here.
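For readers unfamiliar with it, the REC curve plots, for each error tolerance, the fraction of predictions whose absolute error falls within that tolerance. Below is a minimal sketch of how it can be computed, with synthetic numbers in place of real yields and predictions; the app linked above is where the actual evaluation lives.

```python
import numpy as np
import matplotlib.pyplot as plt

def rec_curve(y_true, y_pred, n_points=100):
    """Regression Error Characteristic curve: for each error tolerance,
    the fraction of predictions whose absolute error is within it."""
    errors = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    tolerances = np.linspace(0, errors.max(), n_points)
    accuracy = [(errors <= t).mean() for t in tolerances]
    return tolerances, accuracy

# Illustrative data: observed yields (t/ha) and predictions from some model
rng = np.random.default_rng(0)
y_true = rng.normal(80, 15, size=200)
y_pred = y_true + rng.normal(0, 8, size=200)

tol, acc = rec_curve(y_true, y_pred)
plt.plot(tol, acc)
plt.xlabel("Absolute error tolerance (t/ha)")
plt.ylabel("Fraction of predictions within tolerance")
plt.title("REC curve")
plt.show()
```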