Data transformation and pre-processing
As part of the corrosion study assessment, Integrity Operating Windows (IOWs) are defined for key parameters that affect the probability and progression rate of a particular damage mechanism. IOWs generally fall into two categories, chemical parameters and physical parameters, and may be obtained from online analysers, local indicators, or process sampling. Examples of chemical parameters include pH and corrodent concentration, whereas physical parameters are those that are not chemical in nature, such as operating pressure and temperature. An excursion beyond any IOW limit requires a timely response to bring the parameter back within the acceptable range.
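The IOW excursion check described above can be sketched in a few lines of pandas. The parameter names and limit values here are purely illustrative assumptions; actual limits come from the corrosion study.

```python
import pandas as pd

# Hypothetical IOW limits for illustration: (low, high); None means no limit on that side.
iow_limits = {
    "pH": (4.5, 7.0),               # chemical parameter
    "temperature_C": (None, 120.0), # physical parameter, upper limit only
}

def find_excursions(df: pd.DataFrame, limits: dict) -> pd.DataFrame:
    """Return the rows where any monitored parameter lies outside its IOW limits."""
    mask = pd.Series(False, index=df.index)
    for col, (low, high) in limits.items():
        if low is not None:
            mask |= df[col] < low
        if high is not None:
            mask |= df[col] > high
    return df[mask]

readings = pd.DataFrame({
    "pH": [5.2, 4.1, 6.8],
    "temperature_C": [95.0, 110.0, 130.0],
})
excursions = find_excursions(readings, iow_limits)
print(excursions)  # row 1 breaches the pH low limit, row 2 the temperature high limit
```

In practice each flagged row would trigger the timely response mentioned above, e.g. an alert to operations.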
Data preprocessing is the step of transforming raw data to resolve issues arising from incompleteness, inconsistency, and/or inadequate representation of trends, so as to arrive at a data set in an understandable format6. The goal of data preprocessing is to produce a clean data set that simplifies the subsequent feature engineering and model training stages. Methods that can be applied during data preprocessing include:
- Pivot – converts data extracted from the database in “row format” into “columnar format” to prepare it for the next step.
- Trim – removes observations that lack a target “y” value, in this case the corrosion rate.
- Reindex – fixes gaps that arise when a discontinuity exists in the time range of any of the data sets.
- Bad Tags Removal – removes noisy data, i.e., meaningless readings that cannot be interpreted by machines, which can be generated by faulty data collection, data entry errors, etc.
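The four preprocessing steps above can be sketched with pandas. The tag names, dates, and the `-999.0` bad-tag sentinel are illustrative assumptions, not the study's actual data.

```python
import numpy as np
import pandas as pd

# Raw readings in "row format": one row per (timestamp, tag, value).
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-01-01", "2023-01-01", "2023-01-03", "2023-01-03"]),
    "tag": ["pH", "corrosion_rate", "pH", "corrosion_rate"],
    "value": [5.6, 0.12, -999.0, 0.15],  # -999.0 is an assumed bad-tag sentinel
})

# Pivot: row format -> columnar format, one column per tag.
df = raw.pivot(index="timestamp", columns="tag", values="value")

# Bad tags removal: replace sentinel/implausible readings with NaN.
df = df.replace(-999.0, np.nan)

# Reindex: enforce a continuous daily time range, exposing the Jan 2 gap as NaN.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(full_range)

# Trim: drop observations without the target "y" value (corrosion rate).
df = df.dropna(subset=["corrosion_rate"])

print(df)
```

After these steps the gap day is dropped (no corrosion rate) and the bad pH reading is left as NaN for the imputation stage.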
When a corrosion group contains many parameters that contribute to corrosion, the optimum subset of features and data characteristics must be identified. There is no single solution to this problem; hence, feature engineering is considered an “art”, in which parameter correlations are studied to obtain the most suitable combination. Under feature engineering, feature selection is carried out before feeding the data to a predictive model, to remove unnecessary features that would otherwise lead to undesirably long model processing times.
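One common correlation-based selection approach, consistent with the parameter-correlation study mentioned above, is to drop one feature from each highly correlated pair. This is a minimal sketch on synthetic data; the column names and the 0.95 threshold are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic feature matrix: wall temperature is deliberately near-duplicate of dewpoint.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "dewpoint_temp_C": base,
    "wall_temp_C": base + rng.normal(scale=0.05, size=100),  # highly correlated copy
    "chloride_ppm": rng.normal(size=100),
})

# Absolute correlation matrix; keep only the upper triangle to avoid double-counting pairs.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair with |r| > 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_selected = df.drop(columns=to_drop)
print(to_drop)
```

Dropping near-duplicate features like this shortens model processing time with little loss of information.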
One important step in feature engineering is handling missing values. Any real-world data set contains some null values, and no model can handle NULL or NaN values on its own. There is often no standard approach to this problem, because the right approach depends heavily on the context and nature of the data. The simplest option is to ignore rows that lack measurements by deleting them from the analysis; however, this method may not be effective owing to the information loss. Instead, the missing values can be filled in using methods that preserve the accuracy of the corrosion rate prediction model, such as data imputation via mean/mode/median approximation, or regression and forecasting algorithms such as K-Nearest Neighbour and Multiple Imputation by Chained Equations (MICE). The corrosion data set encountered here contains labelled input features originating from Integrity Operating Window parameters, such as dewpoint temperature and chloride ions, and an output, the corrosion rate; therefore, supervised learning algorithms are applicable.
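The three imputation families named above can be compared side by side with scikit-learn, whose `IterativeImputer` is a MICE-style implementation. The feature columns and values here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Illustrative IOW feature matrix with gaps; column names are assumptions.
X = pd.DataFrame({
    "dewpoint_temp_C": [60.0, np.nan, 58.5, 61.2],
    "chloride_ppm": [12.0, 14.5, np.nan, 13.1],
})

# Mean imputation: fast baseline that fills each gap with the column mean.
X_mean = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)

# K-Nearest Neighbour: fills each gap from the most similar complete observations.
X_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(X), columns=X.columns)

# MICE-style: iteratively regresses each feature on the others until estimates converge.
X_mice = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(X), columns=X.columns)

print(X_mean.round(2))
```

The best method is chosen by cross-validating the downstream corrosion-rate prediction accuracy, per the context-dependence noted above.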
Common supervised learning algorithms include logistic regression, naive Bayes, support vector machines, artificial neural networks, and random forests. The final step of the machine learning modelling is model validation, in which the predicted corrosion rates are verified against actual thickness measurements taken at site.
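As a minimal end-to-end sketch, a random forest (one of the algorithms listed above) can be trained on a feature matrix and validated against held-out measurements. The synthetic data and coefficients below are assumptions standing in for the IOW features and field thickness measurements.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for IOW features (e.g. dewpoint temperature, chloride, pressure)
# and the measured corrosion rate they drive.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.02, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Model validation: compare predictions with held-out "measured" rates, analogous to
# verifying predicted corrosion rates against site thickness measurements.
print(f"MAE: {mean_absolute_error(y_test, pred):.4f}")
```

In the actual workflow the held-out set would be replaced by corrosion rates derived from on-site wall thickness surveys.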