Abstract: To address the low storage efficiency and poor retrieval performance of the large volumes of unstructured data at forest ecological stations, such as images, videos, GIS data and ecological indicators, a big data storage framework for forest ecological stations was proposed based on Hadoop and HBase. Based on the proposed framework, the business process of forest ecological data storage was given and the core technologies involved in the forest ecological big data platform were optimized. A pre-partitioning algorithm was designed to ensure that the data were evenly distributed across the cluster. The RowKey was designed according to the characteristics of ecological data to achieve rapid retrieval. Because native HBase does not support multi-condition queries, an ElasticSearch index shard placement strategy was designed based on index data and server performance evaluation, and multi-condition search over the HBase ecological database was optimized using ElasticSearch as a secondary (non-primary-key) index. To address the difficulty of storing large numbers of small pictures at the ecological station, a packaging and merging strategy was proposed based on data site and time correlation. GIS data were analyzed for efficient storage. The proposed methods were verified through experiments. The results showed that the ElasticSearch index shard placement strategy reduced the query time by an average of 20 ms compared with the default shard strategy. Compared with an approach based on changing the ElasticSearch scoring policy, the average query time was also reduced by 20 ms. 
When the structured data size was 1×10⁸ records, the retrieval time of the system was 1.045 s, which was 3.99 times faster than native HBase retrieval. When the unstructured data reached 1×10⁷ items, the efficiency of the small-picture packaging strategy based on data site and time correlation was 1.15 times that of SequenceFile-based merging and 1.79 times that of native HBase. With 1×10⁴ concurrent users, the number of queries per second after optimization was 1.88 times that before optimization, the throughput per second was 1.74 times that before optimization, and the system response time was 69.5% lower. These results show that the proposed solution significantly improved cluster load balancing, the retrieval efficiency of massive structured and unstructured data, and system throughput, providing the necessary theoretical foundation and technical realization for the storage and management of forest ecological data.