Understand Big Data Acquisition Techniques in One Article

Big data has opened an era of mass production, sharing and application of data, and it has brought tremendous changes to technology and commerce. McKinsey research shows that in the medical, retail and manufacturing sectors, big data can increase labor productivity by 0.5 to 1 percentage point a year. The penetration of big data into core sectors is plain to see, yet surveys show that up to 99.4% of information goes unused, in large part because high-value information cannot be captured. In the era of big data, collecting useful information from the mass of data has therefore become one of the key factors in the development of big data. So what is big data acquisition technology? This article introduces it so that you can easily understand big data collection.

What is data collection?

Data acquisition (DAQ), also known as data collection, refers to the process of automatically acquiring information from analog and digital units under test, such as sensors and other equipment under test.

Data classification

In new-generation data systems, the data sources that traditional data systems did not consider fall into two categories: online behavior data and content data.

Online behavior data: page data, interactive data, form data, session data.
Content data: application logs, electronic documents, machine data, voice data, social media data, etc.

Major sources of big data:

1) Business data
2) Internet data
3) Sensor data

Differences between traditional data collection and big data collection

Shortcomings of traditional data collection

Traditional data collection draws on a single source, and the amount of data stored, managed and analyzed is relatively small, so it is mostly handled by relational databases and parallel data warehouses.
When it comes to using parallel computing to speed up data processing, traditional parallel database technology pursues a high degree of consistency and fault tolerance. According to the CAP theorem, this makes it difficult to also guarantee availability and scalability.

New methods of big data collection

System log collection

Many Internet companies have their own tools for collecting massive amounts of data, mostly used for system log acquisition, such as Hadoop's Chukwa, Cloudera's Flume and Facebook's Scribe. These tools use a distributed architecture and can meet log collection and transmission needs of hundreds of megabytes per second.

Network data collection

Network data collection means obtaining data from websites through web crawlers, public website APIs and similar means. This approach extracts unstructured data from web pages and stores it in a structured form as unified local data files. It also supports collecting files or attachments such as images, audio and video, and attachments can be automatically associated with the body text. In addition to web content, network traffic can be collected using bandwidth management techniques such as DPI or DFI.

Other data collection methods

Data that requires high confidentiality, such as enterprise production and operation data or scientific research data, may be collected in cooperation with the company or research institute, using a dedicated system interface or similar means.

Big data acquisition platforms

Finally, here are several widely used big data acquisition platforms for reference.

1) Apache Flume
Flume is Apache's open source data acquisition system. It is highly reliable, highly scalable, easy to manage, and supports custom extensions. Flume is written in Java, so it depends on the Java runtime environment.

2) Fluentd
Fluentd is another open source data collection framework. It is developed in C and Ruby and uses JSON to unify log data.
Its pluggable architecture supports many different types and formats of data sources and data outputs. It also provides high reliability and good scalability. Treasure Data, Inc. provides support and maintenance for this product.

3) Logstash
Logstash is the "L" in the well-known open source data stack ELK (ElasticSearch, Logstash, Kibana). Logstash is developed in JRuby, so it depends on the JVM at runtime.

4) Splunk Forwarder
Splunk is a distributed machine data platform with three main roles. The Search Head is responsible for searching and processing data and for extracting information at search time. The Indexer is responsible for storing and indexing data. The Forwarder is responsible for collecting, cleaning and transforming data and sending it to the Indexer.
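
To make the division of labor among the three roles just described more concrete, here is a minimal Python sketch of a forwarder, an indexer and a search head working together in memory. It is only an illustration under stated assumptions: the class names, method names and the "timestamp LEVEL message" log format are hypothetical and do not correspond to Splunk's actual API or data formats.

```python
import json
import re
from collections import defaultdict

# Assumed log format for this sketch only: "<timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^(\S+)\s+(\w+)\s+(.*)$")


class Indexer:
    """Stores events and keeps an inverted index from word to event ids."""

    def __init__(self):
        self.events = []
        self.index = defaultdict(set)

    def receive(self, payload):
        event = json.loads(payload)
        event_id = len(self.events)
        self.events.append(event)
        for word in event["message"].lower().split():
            self.index[word].add(event_id)


class Forwarder:
    """Collects raw lines, cleans and transforms them, then sends them on."""

    def __init__(self, indexer):
        self.indexer = indexer

    def forward(self, raw_lines):
        sent = 0
        for line in raw_lines:
            line = line.strip()                # cleaning: trim whitespace
            match = LINE_PATTERN.match(line)
            if not match:                      # cleaning: drop blank or malformed lines
                continue
            timestamp, level, message = match.groups()
            # transformation: normalize fields into one unified record
            event = {"timestamp": timestamp, "level": level.upper(), "message": message}
            self.indexer.receive(json.dumps(event))  # send to the indexer as JSON
            sent += 1
        return sent


class SearchHead:
    """Answers word queries by looking them up in the indexer's index."""

    def __init__(self, indexer):
        self.indexer = indexer

    def search(self, word):
        ids = sorted(self.indexer.index.get(word.lower(), ()))
        return [self.indexer.events[i] for i in ids]
```

In this sketch, `Forwarder(indexer).forward(lines)` returns the number of lines that survived cleaning, and `SearchHead(indexer).search("disk")` returns the stored events whose message contains that word. A real deployment would send events over the network and persist the index; both are left out here to keep the roles easy to see.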
