Abstract: The integration of computer technology with forestry has produced a large volume of forestry texts awaiting analysis, and related research has two main shortcomings. First, the classification labels in existing classification systems are not set scientifically, so classification models are poorly suited to classifying texts found on the internet. Second, classification algorithms are mostly trained in a single-machine environment without considering parallelism, so they cannot handle practical large-scale classification problems. It is therefore both practical and urgent to design more scientific classification labels and to classify forestry texts on the Spark framework. Forestry-related texts were collected with a new crawler technology, and the labels were reconstructed with reference to an existing forestry information retrieval system to improve the adaptability of the classification models. A parallel implementation of XGBoost was then realized on Spark, with training and prediction carried out in the RDD programming model. Under cross-validation, the accuracy of the parallel XGBoost algorithm reached 0.9234; the lowest F1-measure was 0.8604 and the highest was 0.9984. Training on data sets of size 21 thousand, 42 thousand and 84 thousand gave speedup ratios of 2.13, 3.47 and 3.82, respectively. The results showed that the new classification labels were more scientific and that the system adapted better to the forestry-related texts currently on the internet. The precision and recall of the XGBoost algorithm were significantly better than those of four other Spark-based parallel algorithms, namely naive Bayes (NB), gradient boosting decision tree, back propagation neural network and extreme learning machine, and the parallel algorithm ran more efficiently than the stand-alone version.
Moreover, the speedup ratio improved as the amount of data increased, which indicates that the approach is useful for the real-time, accurate classification of massive forestry texts.
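
The abstract reports F1-measure and speedup ratio without stating how they are computed. The following is a minimal sketch assuming the standard definitions: F1 as the harmonic mean of precision and recall, and speedup as single-machine run time divided by parallel run time. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


def speedup_ratio(single_machine_seconds: float, parallel_seconds: float) -> float:
    """Single-machine run time divided by parallel run time (assumed definition)."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return single_machine_seconds / parallel_seconds


if __name__ == "__main__":
    # Hypothetical values for illustration only, not results from the paper.
    print(f1_measure(0.5, 1.0))
    print(speedup_ratio(100.0, 25.0))
```

Under these definitions, a speedup ratio that grows with data size, as reported above, means the parallel run time grows more slowly than the single-machine run time.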