How to build Hadoop, a big data system
First, big data construction ideas
1) Data acquisition
The root cause of big data is the widespread use of sensing systems. As technology has advanced, it has become possible to manufacture extremely small sensors with built-in processing and to deploy them in every corner of society, where they monitor how society operates and continuously generate new data automatically. In data collection, therefore, we should attach time and location stamps to data coming from networks (including the Internet of Things, social networks and institutional information systems), discard what is false and retain what is true, collect as much heterogeneous data as possible, even data of differing kinds, and compare it with historical data when necessary to verify its comprehensiveness and credibility from multiple angles.
2) Data collection and storage
Data has vitality only when it flows continuously and is fully shared. On the basis of thematic databases, data integration enables data exchange and sharing among information systems of all types and at all levels. To achieve low cost, low energy consumption and high reliability, data storage usually relies on redundant configuration, distribution and cloud computing technology. When storing data, we should classify it according to defined rules, reduce the storage volume through filtering and de-duplication, and add labels that make later retrieval easier.
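The classify, filter, de-duplicate and label steps described above can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the record layout (`source`, `payload`), the classification rule and the label format are all assumptions made for the example.

```python
import hashlib

def content_key(record):
    """Stable fingerprint of a record's payload, used to detect duplicates."""
    return hashlib.sha256(record["payload"].encode("utf-8")).hexdigest()

def classify(record):
    """Hypothetical classification rule: group records by their source system."""
    return record.get("source", "unknown")

def prepare_for_storage(records):
    seen = set()    # fingerprints of payloads already kept
    stored = {}     # category -> list of labelled records
    for record in records:
        if not record.get("payload"):   # filter: drop empty records
            continue
        key = content_key(record)
        if key in seen:                 # de-duplicate: skip repeated payloads
            continue
        seen.add(key)
        category = classify(record)
        # Labels (category plus a short fingerprint) make later retrieval easier.
        labelled = {**record, "labels": [category, key[:8]]}
        stored.setdefault(category, []).append(labelled)
    return stored

raw = [
    {"source": "iot", "payload": "temp=21"},
    {"source": "iot", "payload": "temp=21"},   # duplicate
    {"source": "social", "payload": "hello"},
    {"source": "iot", "payload": ""},          # empty, filtered out
]
stored = prepare_for_storage(raw)
```

Of the four sample records, only two are kept, one under each category.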
3) Data management
Big data management technologies are emerging one after another. Six of them attract general attention: distributed storage and computing, in-memory database technology, columnar database technology, cloud database technology, non-relational (NoSQL) database technology and mobile database technology. Of these, distributed storage and computing receives the most attention. The picture above is a book data management system.
4) Data analysis
Data analysis and processing: In some industries the data involves hundreds of parameters, and its complexity lies not only in the data samples themselves but also in the dynamic interaction of multiple sources, heterogeneous entities and multiple spaces. Such data is difficult to describe and measure with traditional methods, and the processing complexity is very high. Multimedia data such as high-dimensional images must be measured and processed after dimensionality reduction, semantic analysis must make use of contextual relevance, and information must be synthesized from large volumes of dynamic and possibly ambiguous data to produce understandable output. Big data processing comes in many types, but the main processing modes fall into two groups: stream processing and batch processing. Batch processing stores the data first and processes it afterwards, while stream processing processes the data directly as it arrives. The main mining tasks are correlation analysis, cluster analysis, classification, prediction, time-series pattern analysis and deviation analysis.
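The difference between the two processing modes can be illustrated with a small sketch. The function names and the sample readings are made up for the example: the batch function needs the complete stored data set before it starts, whereas the stream function emits an updated result for every value as it arrives.

```python
def batch_average(stored_values):
    """Batch: all data is stored first, then processed in a single pass."""
    return sum(stored_values) / len(stored_values)

def stream_averages(values):
    """Stream: each value is processed on arrival and a running average is emitted."""
    count, total = 0, 0.0
    for value in values:
        count += 1
        total += value
        yield total / count

readings = [10.0, 20.0, 30.0, 40.0]

batch_result = batch_average(readings)                  # one result at the end
stream_results = list(stream_averages(iter(readings)))  # one result per reading
```

Both modes end at the same average, but only the stream version has intermediate results available while data is still arriving.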
5) Value of big data: decision support system
The appeal of big data is that, by analyzing past and present data, it can support accurate predictions about the future; by integrating data from inside and outside the organization, it gives insight into the correlations between things; and by mining massive data, it can stand in for human judgment and take on responsibilities of enterprise and social management.
6) Use of data
Big data has three connotations: first, data sets that are huge in volume and diverse in source and type; second, new data processing and analysis technology; third, the use of data analysis to create value. Big data is having a revolutionary impact on scientific research, economic construction, social development and cultural life. The key and necessary condition for big data applications is the integration of "IT" and "operations". Operations here can be understood very broadly, from running a retail store to running a city.
Second, the basic architecture of big data
Given the above characteristics of big data, storing and processing it with traditional IT technology is very costly. To develop big data applications vigorously, an enterprise first needs to solve two problems: first, to extract and store massive, multi-category data quickly and at low cost; second, to use new technology to analyze and mine the data and create value for the enterprise. The storage and processing of big data therefore cannot be separated from cloud computing technology. Under current technical conditions, distributed systems built on cheap hardware (such as Hadoop) are considered the most suitable technical platform for handling big data.
Hadoop is a distributed infrastructure that allows users to use computing resources and process massive data conveniently and efficiently. Hadoop is already widely used in many large Internet companies, such as Amazon, Facebook and Yahoo. It is an open architecture whose set of components is constantly being expanded and improved. Generally, the architecture is shown in Figure 2:
Hadoop architecture
(1) The bottom layer of Hadoop is HDFS (the Hadoop Distributed File System). Files stored in HDFS are first divided into blocks, and these blocks are then replicated to multiple DataNodes.
(2) The core of Hadoop is the MapReduce engine. Map refers to decomposing a single task into multiple tasks, and Reduce refers to summarizing the results of those tasks. The engine consists of a JobTracker (job tracking, which corresponds to the NameNode) and TaskTrackers (task tracking, which correspond to the DataNodes). When handling big data queries, MapReduce distributes the decomposed tasks across multiple nodes, which improves data processing efficiency and avoids the performance bottleneck of a single machine.
(3) Hive is the data warehouse in the Hadoop architecture, used mainly for statically structured data and for jobs that require frequent analysis. HBase is a column-oriented database that runs mainly on HDFS and can store petabyte-scale data. HBase uses MapReduce to process its massive internal data and can locate and access the required data within it.
(4) Sqoop is designed for data interoperability. It can import data from relational databases directly into HDFS or Hive.
(5) ZooKeeper is responsible for coordinating the applications in the Hadoop architecture and keeping the Hadoop cluster synchronized.
(6) Thrift is a software framework for developing scalable, cross-language services. Originally developed by Facebook, it builds seamless and efficient services across programming languages.
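To make the Map and Reduce steps in item (2) concrete, here is a minimal single-process sketch of the programming model, using the classic word-count example. It is plain Python and does not use the Hadoop API; the function names and the two input splits are assumptions for illustration. In a real cluster the map and reduce calls would run as tasks on different nodes.

```python
from collections import defaultdict

def map_phase(split):
    """Map: turn one input split into (key, value) pairs."""
    for word in split.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: summarize the values collected for one key."""
    return key, sum(values)

def run_job(splits):
    # Each split could be mapped on a different node; here they run in sequence.
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

splits = ["big data big", "data platform"]
counts = run_job(splits)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is where the efficiency gain described above comes from.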
Hadoop core design
HBase-based distributed data storage system
Client: communicates with HMaster and HRegionServer using the HBase RPC mechanism.
ZooKeeper: collaborative service management. Through ZooKeeper, HMaster can sense the health status of each HRegionServer at any time.
HMaster: manages users' operations for adding, deleting, modifying and querying tables.
HRegionServer: the core module of HBase, mainly responsible for responding to user I/O requests and for reading and writing data in the HDFS file system.
HRegion: the smallest unit of distributed storage in HBase, which can be understood as a horizontal slice (a range of rows) of a table.
HStore: the core of HBase storage, composed of a MemStore and StoreFiles.
HLog: every time a user writes to the MemStore, a copy of the data is also written to the HLog file.
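The relationship between HLog, MemStore and StoreFile can be sketched with a toy in-memory model. This is not HBase code: the class name, the tiny flush threshold and the list-based "files" are simplifications assumed for the example. It only illustrates the idea that each write is logged first, buffered in memory, and later flushed to an immutable sorted file.

```python
class ToyStore:
    """Toy model of an HStore write path: HLog + MemStore + StoreFiles."""

    def __init__(self, flush_threshold=2):
        self.hlog = []         # append-only log of every write
        self.memstore = {}     # in-memory buffer of recent writes
        self.storefiles = []   # flushed, immutable, sorted (key, value) lists
        self.flush_threshold = flush_threshold  # hypothetical, tiny for the demo

    def put(self, row_key, value):
        self.hlog.append((row_key, value))   # log first, so the write can be replayed
        self.memstore[row_key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffered rows out as one sorted file and clear the buffer.
        self.storefiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row_key):
        # Newest data wins: check the MemStore, then files from newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for storefile in reversed(self.storefiles):
            for key, value in storefile:
                if key == row_key:
                    return value
        return None

store = ToyStore(flush_threshold=2)
store.put("row1", "a")
store.put("row2", "b")   # second buffered row triggers a flush
store.put("row1", "c")   # newer value for row1 stays in the MemStore
```

After these three writes, the log holds all three entries, one file has been flushed, and a read of `row1` returns the newer value from memory.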
Combining the above Hadoop architecture functions, the system functions of the big data platform are suggested as follows:
Application system: For most enterprises, applications in the operations field are the core applications of big data. In the past, enterprises mainly used report data from production and operations, but with the arrival of the big data era, massive data from the Internet, the Internet of Things and all kinds of sensors has come to the fore. As a result, some enterprises have begun to mine and use this data to improve operational efficiency.
Data platform: With the help of a big data platform, the future Internet will enable businesses to better understand consumers' usage habits. Analysis based on big data can then improve the user experience in a more targeted way and uncover new business opportunities at the same time.
Data source: A data source is a database or database server used by a database application. Rich data sources are the precondition for the development of the big data industry, and data sources keep expanding and diversifying. For example, smart cars can turn the dynamic driving process into data, and Internet of Things devices embedded in production equipment can turn the dynamics of production processes and equipment into data. The continuous expansion of data sources not only drives the development of acquisition equipment but also allows the value of data to be better captured by controlling new data sources. However, the total amount of digital data resources in China is far lower than in the United States and Europe, and the limited data resources that exist still suffer from low standardization, low accuracy, low integrity and low utilization value, which reduces the value of the data.
Third, the target effect of big data
Through the introduction and deployment of big data, the following effects can be achieved:
1) Data integration
Unified data model: it carries the enterprise data model and promotes the unification of data logic models in various fields of enterprises;
Unified data standard: unified establishment of standard data coding catalogue to realize standardization and unified storage of enterprise data;
Unified data view: to realize a unified data view, so that enterprises can obtain consistent information from the perspectives of customers, products and resources.
2) Data quality control
Data quality check: check stored data against defined rules to ensure its consistency, integrity and accuracy;
Data quality control: by establishing enterprise data quality standards, data control organizations and data control processes, data quality is controlled uniformly and thereby gradually improved.
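A rule-based quality check of the kind described above might look like the following sketch. The field names (`customer_id`, `region`, `amount`), the catalogue of valid region codes and the three rules are all hypothetical; a real deployment would take its rules from the enterprise data quality standards.

```python
def check_record(record, valid_regions):
    """Return a list of rule violations for one stored record (empty if clean)."""
    problems = []

    # Integrity: every required field must be present and non-empty.
    for field in ("customer_id", "region", "amount"):
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")

    # Consistency: the region code must come from the standard coding catalogue.
    region = record.get("region")
    if region not in (None, "") and region not in valid_regions:
        problems.append("unknown region code")

    # Accuracy: an amount, when numeric, must not be negative.
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("negative amount")

    return problems

valid_regions = {"north", "south"}
clean = check_record({"customer_id": "c1", "region": "north", "amount": 12.5}, valid_regions)
dirty = check_record({"customer_id": "", "region": "mars", "amount": -1}, valid_regions)
```

Returning a list of violations rather than a single pass/fail flag lets a data control process report which rule failed and track quality over time.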
3) Data sharing
Eliminate the mesh of point-to-point interfaces and establish a big data sharing center that provides shared data to the various business systems, reducing interface complexity and improving the efficiency and quality of interfaces between systems;
Provide integrated or calculated data to external systems in real-time or quasi-real-time manner.
4) Data application
Query application: the platform provides on-demand queries whose conditions are not fixed or predictable and whose format is flexible;
Fixed report application: presents analysis results based on fixed statistical dimensions and indicators, and analyzes and generates business report data according to the needs of business systems;
Dynamic analysis application: performs thematic analysis of data by the dimensions and indicators of interest, which, unlike in fixed reports, are not fixed in advance.
Fourth, summary
The big data platform based on distributed technology can effectively reduce the data storage cost, improve the efficiency of data analysis and processing, support massive data and high concurrent scenarios, greatly shorten the response time of data query, and meet the data requirements of all upper-level applications of enterprises.