In recent years, big data has swept into every walk of life and brought about radical change. It is increasingly recognized that specialized processing of meaningful data matters more than simply holding vast amounts of it. If big data is likened to an industry, the key to making that industry profitable is to increase the "processing capability" applied to the data and to "add value" to the data through "processing"; this is where the key technologies of big data come into play. Following the big data processing workflow, and viewed from the aspects of data storage, processing and application, the key technologies can be divided into big data collection, big data preprocessing, big data storage and management, big data analysis and mining, and other stages. This article sorts out these key technologies for readers.

One article to sort out the fifteen key technologies in four major areas of big data

Part1. Big Data Collection

Data acquisition is the first step in the big data life cycle. It gathers massive amounts of structured, semi-structured and unstructured data from sources such as RFID data, sensor data, social network data and mobile Internet data. Because tens of thousands of users may be accessing and operating on the system concurrently, collection methods designed specifically for big data are needed. There are three main types:

A. Database Acquisition

Some businesses store their data in traditional relational databases such as MySQL and Oracle. The tools most often mentioned here are Sqoop and ETL tools for structured databases. The open-source tools Kettle and Talend also include big data integration features and can synchronize and integrate data with HDFS, HBase and mainstream NoSQL databases.

B. 
Network Data Acquisition

Network data acquisition is the process of obtaining data from websites, mainly by means of web crawlers or a site's public APIs. In this way, unstructured and semi-structured data can be extracted from web pages and stored in a structured form as unified local data files.

C. Document Collection

For file collection, Flume is the tool most often mentioned for real-time file collection and processing. ELK (the combination of Elasticsearch, Logstash and Kibana) is aimed at log processing, but it also supports full and incremental real-time file collection based on template configuration files. If the task is only log collection and analysis, the ELK solution is entirely sufficient.

Part2. Big data preprocessing

Real-world data is large and complex, and some of it will be incomplete, false or outdated. To get high-quality analysis and mining results, the quality of the data must be improved during the data preparation phase. Big data preprocessing cleans, fills, smooths, merges, normalizes and checks the collected raw data, converting disorganized data into a relatively simple, easy-to-handle form that lays the foundation for later analysis. Data preprocessing consists of four main parts: data cleaning, data integration, data transformation and data reduction.

A. Data cleaning

Data cleaning mainly covers missing-value handling (attributes of interest are absent), noise handling (the data contains errors or values that deviate from what is expected) and inconsistent-data handling. The main cleaning tools are ETL (Extraction / Transformation / Loading) tools and Potter's Wheel.
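Two of the cleaning tasks named above can be sketched in a few lines. This is a minimal illustration, not a production routine: it assumes missing values are represented as None, fills them with the attribute mean, and smooths noise by equal-size binning with bin means. The function names and sample values are made up for the example.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean: every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size items,
    and replace every value with the mean of its bin.

    Note: the result is returned in sorted order, not input order.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical sample data for illustration only.
ages = [23, None, 31, 27, None, 45]
filled = fill_missing_with_mean(ages)

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
```

Mean filling keeps the attribute's average unchanged but reduces its variance, so it suits cases where only a small share of values is missing.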
Missing data can be handled by filling in a global constant, the attribute mean or the most probable value, or by simply ignoring the record. Noisy data can be smoothed by binning (the raw data is grouped and the values within each group are smoothed), clustering, combined computer and manual inspection, or regression. Inconsistent data can be corrected manually.

B. Data Integration

Data integration combines data stored in multiple data sources into one consistent data store. The process focuses on three problems: schema matching, data redundancy, and the detection and handling of data value conflicts.

Data from multiple datasets may use different names for the same entity because of differing naming conventions. Metadata is usually used to identify entities and to distinguish those that come from different sources. Data redundancy can arise from inconsistent attribute naming. For numeric attributes, it can be measured with the Pearson product-moment correlation coefficient R(A, B): the larger its absolute value, the stronger the correlation between the two attributes. Data value conflicts arise mainly when the same entity has different data values in different sources.

C. Data Transformation

Data transformation is the process of resolving inconsistencies in the extracted data. It generally includes two types of work. The first is unifying data names and formats: data granularity conversion, business rule calculation, and unified naming, data formats, units of measurement and so on. The second covers data that the data warehouse needs but that may not exist in the source database, which is obtained by combining, splitting or calculating fields. In practice, data transformation also includes data cleaning: abnormal data must be cleaned according to business rules to ensure the accuracy of subsequent analysis results.

D. 
Data Reduction

Data reduction minimizes the amount of data as far as possible while preserving the original character of the data as much as possible. It mainly includes data aggregation, dimensionality reduction, data compression, numerosity reduction and concept hierarchies. Data reduction techniques produce a reduced representation of the dataset that is much smaller yet stays close to preserving the integrity of the original data. In other words, mining on the reduced dataset still yields nearly the same results as mining on the original dataset.

Part3. Big Data Storage

Big data storage and management uses storage to hold the collected data and builds corresponding databases to manage and retrieve it. There are three typical types of big data storage technologies:

A. The new database cluster MPP architecture

New database clusters based on the MPP architecture target industry big data. They adopt a Shared Nothing architecture and combine big data processing techniques such as columnar storage and coarse-grained indexes with the efficient distributed computing model of MPP to support analytical applications. They mostly run on low-cost PC servers, offer high performance and high scalability, and are very widely used in enterprise analytics. MPP products of this kind can effectively support PB-level structured data analysis, which is beyond the reach of traditional database technologies. For a new generation of enterprise data warehouses and structured data analysis, the MPP database is the best choice.

B. 
Hadoop-based technology expansion and packaging

Technologies that extend and encapsulate Hadoop have given rise to related big data technologies for data and scenarios that traditional relational databases find hard to handle, such as the storage and computation of unstructured data. These technologies take full advantage of Hadoop's open-source nature, and as they continue to mature, their application scenarios will gradually expand. At present, the most typical scenario is extending and encapsulating Hadoop to support the storage and analysis of big data on the Internet. Dozens of NoSQL technologies are involved, and they are being further subdivided. The Hadoop platform excels at unstructured and semi-structured data processing, complex ETL processes, and complex data mining and computational models.
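To give a feel for the kind of computation over unstructured text that the Hadoop ecosystem handles, here is a single-process sketch in the map/shuffle/reduce style associated with Hadoop MapReduce. It is plain Python and does not use any Hadoop API; the function names and sample lines are hypothetical, and a real job would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of raw text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


# Hypothetical unstructured input for illustration only.
lines = ["big data storage", "big data analysis", "data mining"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because each map call looks at one line and each reduce call looks at one key, both phases can run in parallel on separate machines, which is what makes the pattern suitable for very large inputs.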