Machine Learning
Last updated
Machine learning algorithms are highly useful for data categorization. A classification algorithm achieves exactly what MinerGate Protocol needs: lower computational overhead, which in turn means higher data throughput and greater specificity and accuracy of detection. Here, pattern recognition is crucial for identifying an address's transactional history, the ramifications of its transaction paths, and commonalities in behavior. These behavioral similarities are precisely what we are interested in, since they could enable MinerGate Protocol to operate at near-instantaneous speeds, leaving entities using MinerGate even more time to act on illicit activity.
A complication arises when machine learning is needed but there is insufficient data on fraudulent transactions. For that reason, the first iteration of the machine learning implementation will use a binary classification algorithm with only two possible labels, 0 or 1: 1 identifies the transaction in question as fraudulent or posing a potential risk, while 0 denotes a safe risk score.
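As a minimal sketch of this binary labeling, the following trains a scikit-learn logistic regression on synthetic feature vectors; the features, labeling rule, and data are hypothetical stand-ins, not the protocol's actual model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for transaction features (hypothetical values).
X = rng.normal(size=(200, 2))
# Label 1 = fraudulent / potential risk, 0 = safe, per the scheme above.
# The rule used to label this toy data is arbitrary.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
risk_flags = clf.predict(X)  # every prediction is either 0 or 1
```

With more than two label values the same estimator would silently switch to multiclass mode, so the 0/1 scheme keeps the first iteration deliberately simple.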
The core of the model is built from customary variables. For the sake of simplicity and visualization, we will use eight common variables: block_timestamp, block_n_txs, n_inputs, input_sum, output_sum, n_outputs, output_seq, and input_seq. These form the base of the model. However, to refine the baseline and increase its accuracy, we will employ feature engineering, a technique that both adds granularity to the classification and lays the foundation for model building.
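To illustrate what feature engineering on top of these base variables might look like, here is a sketch using pandas; the derived feature names (`fee`, `fan_out`, `log_value`) and the sample values are hypothetical, not part of the protocol's specification:

```python
import numpy as np
import pandas as pd

# Toy rows using a subset of the base variables named above (values hypothetical).
txs = pd.DataFrame({
    "block_timestamp": [1_700_000_000, 1_700_000_600],
    "block_n_txs": [2100, 1850],
    "n_inputs": [3, 1],
    "input_sum": [5.0, 0.8],
    "n_outputs": [2, 5],
    "output_sum": [4.9, 0.75],
})

# Hypothetical engineered features that refine the baseline:
txs["fee"] = txs["input_sum"] - txs["output_sum"]    # implied transaction fee
txs["fan_out"] = txs["n_outputs"] / txs["n_inputs"]  # how widely funds disperse
txs["log_value"] = np.log1p(txs["input_sum"])        # damp very large values
```

Ratios and log transforms like these are standard ways to expose behavioral patterns (e.g. many-output "fan-out" transactions) that the raw counts and sums hide.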
While a supervised machine learning model is well-suited to our needs given our specific dataset, it presents a significant challenge in our context: the risk of overfitting. Overfitting occurs when the model fits its training data too closely and fails to generalize to unseen transactions. To address overfitting effectively, we employ a cross-validation workflow.
In essence, cross-validation lets us evaluate the model on data it has not seen during training while holding out a final test set, giving us an effective data-partitioning strategy. The data is divided as follows: 80% for training and 20% for testing.
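The 80/20 split described above can be sketched with a shuffled index partition; the sample count here is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of labeled transactions

# Shuffle indices, then cut at the 80% mark: training set vs. held-out test set.
indices = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = indices[:split], indices[split:]
```

Shuffling before splitting matters for blockchain data, since consecutive transactions from the same block or address would otherwise cluster in one partition.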
We implement a 10-Fold Cross-Validation approach, dividing the data into ten distinct parts for rigorous evaluation.
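A 10-fold scheme can be sketched as follows: the indices are partitioned into ten disjoint folds, and each round holds one fold out for validation while the other nine are used for training. The sample count and helper name are illustrative only:

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Partition shuffled sample indices into k disjoint folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    return np.array_split(idx, k)

folds = k_fold_indices(100, k=10)

# Each round: train on nine folds, validate on the held-out tenth.
round_sizes = []
for i, val_fold in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    round_sizes.append((len(train_idx), len(val_fold)))
```

Averaging the validation score over all ten rounds gives a far more stable estimate of generalization than a single split, which is why we accept the tenfold increase in training cost.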