Summary of Work in Big Data Industry in Recent Two Years

Today I'm mainly reviewing my past two years of big data front-end development at big data industry companies. I recently changed jobs, so I'm sharing my experience here. If you have any suggestions, please leave a message in the comments section. Thank you.

Today's theme runs from the perspective of big data development, to the necessity of big data governance, to ideas about graphical modeling, then to data quality control, and finally to big data visualization applications. This is my summary of two years of experience, and I wonder whether there are any gaps in my understanding; I hope you can point them out.

Big data development

Big data development has several stages:

1. Data acquisition: collect raw data.

2. Data aggregation: clean and merge it into usable data.

3. Transformation and mapping: organize it into classified thematic data.

4. Data application: provide APIs, intelligent systems, application systems, and so on. (A sketch of how the four stages chain together follows this list.)
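
To make the four stages concrete, here is a minimal, hypothetical sketch in Python. The function bodies are placeholders I made up to show the shape of the pipeline, not a real implementation.

```python
# A toy pipeline: each function stands in for one of the four stages.

def acquire():
    """Stage 1: collect raw records from crawlers, files, or databases."""
    return [{"name": " Alice ", "age": "31"}, {"name": "Bob", "age": "27"}]

def aggregate(raw):
    """Stage 2: clean and merge raw records into usable rows."""
    return [{"name": r["name"].strip(), "age": int(r["age"])} for r in raw]

def transform(rows):
    """Stage 3: map cleaned rows into a thematic shape for one use case."""
    return {r["name"]: r["age"] for r in rows}

def serve(theme):
    """Stage 4: expose thematic data to applications (here, just print)."""
    print(theme)

serve(transform(aggregate(acquire())))
```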

Data acquisition

There are two ways to collect data: online and offline. Online data is generally collected by crawlers or pulled from existing application systems. At this stage we can build a big data collection platform that relies on automated crawlers (written in Python or Node.js), ETL tools, or a custom extraction-and-transformation engine to grab data from files, databases, and web pages. If an automated system handles this step it becomes very convenient, and the target data sources can be managed more easily.
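
As an illustration, here is a minimal crawler sketch using requests and BeautifulSoup. The target URL and the `h2` selector are hypothetical placeholders, not part of any real platform.

```python
# Minimal crawler sketch: fetch one page and pull out heading text.
import requests
from bs4 import BeautifulSoup

def fetch_titles(url):
    """Download a page and extract text from its <h2> elements."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.select("h2")]

if __name__ == "__main__":
    # Hypothetical target; replace with a page you are allowed to crawl.
    print(fetch_titles("https://example.com/news"))
```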

The difficulty of data collection lies in the variety of data sources: MySQL, PostgreSQL, SQL Server, MongoDB, SQLite, and so on, plus local files, Excel spreadsheets, and even doc files. Bringing all of these into our big data pipeline regularly and systematically is an indispensable part of the work.
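
One way to picture the multi-source problem is a sketch with pandas and SQLAlchemy. Every connection string, table name, and file path below is a made-up example, and each source would need its own driver installed.

```python
# Sketch: pull heterogeneous sources into a dict of DataFrames.
import pandas as pd
from sqlalchemy import create_engine

def ingest_all():
    frames = {}
    # Relational sources: one engine per database type (URLs are examples).
    mysql = create_engine("mysql+pymysql://user:pass@host/db")
    pg = create_engine("postgresql+psycopg2://user:pass@host/db")
    frames["orders"] = pd.read_sql("SELECT * FROM orders", mysql)
    frames["users"] = pd.read_sql("SELECT * FROM users", pg)
    # Flat files: Excel spreadsheets and CSV exports (paths are examples).
    frames["stats"] = pd.read_excel("monthly_stats.xlsx")
    frames["log"] = pd.read_csv("export.csv")
    return frames
```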

Data aggregation

Data aggregation is the most critical step in the big data process. Here you can standardize the data, clean and merge it, and archive it, sorting and classifying the confirmed usable data through a monitorable process. Everything produced here is a data asset of the whole company, and once it reaches a certain scale it effectively becomes a fixed asset.

The difficulty of data aggregation lies in how to standardize data: table naming conventions, table label classification, table usage, data volume, whether there is incremental data, and whether the data is usable at all. This takes serious effort on the business side, and where necessary intelligent processing should be introduced, such as automatic labeling based on models trained on the content and automatic recommendation of table and field names. There is also the question of how to import data from the raw layer. A small standardization sketch follows.
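
Here is a sketch of two simple standardization rules: normalizing column names to snake_case and attaching basic metadata. The rules and metadata fields are my own examples, not a standard.

```python
# Sketch: normalize column names and record table metadata at ingest.
import re
import pandas as pd

def to_snake_case(name):
    """'OrderDate' or 'order date' -> 'order_date'."""
    name = re.sub(r"[\s\-]+", "_", name.strip())
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

def standardize(df, table_name, label):
    df = df.rename(columns={c: to_snake_case(c) for c in df.columns})
    meta = {
        "table_name": to_snake_case(table_name),
        "label": label,          # business classification
        "row_count": len(df),    # data volume at ingest time
        "has_increment": False,  # set True if the source appends rows
    }
    return df, meta

df, meta = standardize(
    pd.DataFrame({"OrderDate": ["2020-01-01"], "Total Amount": [9.9]}),
    "RawOrders", "sales",
)
print(df.columns.tolist(), meta)
```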

Data transformation and mapping

How do we provide the aggregated data assets to specific users? This step is mainly about considering how the data will be applied, and how to convert two or three data tables into data that can actually provide a service, then update it incrementally on a regular schedule.

After the previous steps, this step is not particularly difficult. Converting data is much like cleaning and standardizing it: merge the values of two fields into one field, or compute a chart's data from several usable tables, and so on, as in the sketch below.
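
Here are both transformations just mentioned in a small pandas sketch; the table and field names are made up for illustration.

```python
# Sketch: merge two fields into one, then aggregate into chart data.
import pandas as pd

users = pd.DataFrame({
    "first_name": ["Ada", "Alan"],
    "last_name": ["Lovelace", "Turing"],
    "city": ["London", "London"],
})

# Merge two fields into one service-facing field.
users["full_name"] = users["first_name"] + " " + users["last_name"]

# Aggregate into chart data: user count per city.
chart = users.groupby("city").size().reset_index(name="user_count")
print(chart)
```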

Data application

There are many ways to apply data, both externally and internally. Suppose you have accumulated a lot of data assets in the earlier stages: do you provide them to users through a RESTful API? Or through a streaming engine such as Kafka for producers and consumers? Or do you synthesize thematic data directly for your own application's queries? The requirements on the data assets here are relatively high, so if the preliminary work has been done well, the degree of freedom at this stage is very high.
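
For the RESTful API option, a minimal sketch with Flask looks like the following; the endpoint path, data, and port are placeholders of mine.

```python
# Sketch: serve one thematic table over a RESTful API.
from flask import Flask, jsonify

app = Flask(__name__)

# In practice this would be read from the thematic data store.
USER_COUNTS = [{"city": "London", "user_count": 2}]

@app.route("/api/v1/user-counts")
def user_counts():
    return jsonify(USER_COUNTS)

if __name__ == "__main__":
    app.run(port=5000)
```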

Summary: difficulties in big data development

The difficulty in big data development is mainly oversight. How do you organize developers' work? If developers casually collect a pile of junk data and connect it straight into the database, the problems are small in the short term and can be corrected. But as the volume of assets grows, it becomes a time bomb that can detonate at any moment and set off a chain of impacts on the data assets: for example, data confusion that lowers the value of the assets and erodes customer trust.

How do you monitor the developers' development process?

The answer can only be an automation platform. Only an automation platform can keep developers comfortable, get them to accept new workflows, and leave the manual era behind.

This is where front-end engineers have an advantage in the big data industry. How do you build an interactive visual interface? How do you turn the existing workflow and work requirements into a visual operating interface? Can intelligence replace some of the mindless manual operations?

In a sense, I personally think front-end engineers occupy a more important position in big data development, second only to big data development engineers, with back-end and systems development third. Good interaction is crucial. For converting and extracting data, predecessors have already stepped on most of the pitfalls: Kettle, Kafka, pipelines, and many other solutions exist. The key is the interaction: how do you turn all of it into a visual interface? That is an important subject.

Some friends place the emphasis differently and think the front-end role is dispensable. I think that's wrong. The back end really is important, but the back end has many ready-made solutions, while the front end in this position matters more and has basically no open-source solutions. If we don't pay enough attention to front-end development, we face poor interaction, poor interfaces, and a poor experience, which drives developers away. On top of that, visualization involves a lot of specialized knowledge, which demands a lot from developers.

Big data governance

Big data governance should run through the entire big data development process, where it plays an important role. Here are the key points:

Data lineage

Data quality review

All-platform monitoring

Data lineage

Data lineage should be the entry point of big data governance. Through a single table we should be able to see clearly where it came from and where it goes: how its fields were split, the cleaning process, the table's update cycle, and the change in its data volume. All of this starts from data lineage. I personally think the overall goal of big data governance is this lineage, from which we can monitor the whole picture.

Data lineage is built on, and revolves around, the whole big data development process. The history of each development step and of each data import should be recorded accordingly. Once data assets reach a certain scale, data lineage is essential, as in the sketch below.
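
One way to record lineage is as edges in a directed graph, so a table's origins can be traced back. Here is a sketch with networkx; the table names and operations are my own examples.

```python
# Sketch: record lineage edges, then trace a table's ancestry.
import networkx as nx

lineage = nx.DiGraph()

def record_step(sources, target, operation):
    """Record that `target` was derived from `sources` by `operation`."""
    for src in sources:
        lineage.add_edge(src, target, operation=operation)

record_step(["raw.orders"], "ods.orders_clean", "clean")
record_step(["ods.orders_clean", "ods.users"], "dm.user_spend",
            "join+aggregate")

# Everything dm.user_spend ultimately depends on.
print(nx.ancestors(lineage, "dm.user_spend"))
```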

Data quality review

In data development, after each model (table) is created, there must be a data quality audit. In a large-scale system environment, approvals should also be added at key steps, such as transformation and mapping steps that feed customer-facing data. A sound data quality audit system helps the enterprise find data problems immediately, and when problems do occur, see them immediately and fix them at the root. A sketch of such checks follows.
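
Here is a sketch of automated quality checks that could run after a table is created. The rules and the 5% null-rate threshold are illustrative choices of mine, not a standard.

```python
# Sketch: basic quality audit of a newly created table.
import pandas as pd

def audit(df, required_columns, max_null_rate=0.05):
    problems = []
    for col in required_columns:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isna().mean() > max_null_rate:
            problems.append(f"null rate too high in: {col}")
    if len(df) == 0:
        problems.append("table is empty")
    return problems

df = pd.DataFrame({"id": [1, 2, None], "name": ["a", "b", "c"]})
print(audit(df, ["id", "name", "created_at"]))
```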

All-platform monitoring

Monitoring actually covers many things: application monitoring, data monitoring, an early warning system, a work-order system, and so on. Every data source and data table we take over needs to be monitored in real time, and if a feed goes down or data stops flowing, the specific person in charge should be notified by phone or SMS as soon as possible. We can borrow from the experience of automated operation-and-maintenance platforms here; monitoring is roughly an ops discipline, and the protection good monitoring gives data assets is also very important. A freshness-check sketch follows.
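
As one small piece of such monitoring, here is a sketch of a freshness check that pages the owner when a table stops updating. The `notify` function is a stub standing in for a real SMS or phone integration, and the 24-hour threshold is an example.

```python
# Sketch: alert the owner when a table has gone stale.
from datetime import datetime, timedelta

def notify(owner, message):
    # Stand-in for a real SMS/phone/work-order integration.
    print(f"ALERT to {owner}: {message}")

def check_freshness(table, last_updated, owner, max_age_hours=24):
    if datetime.now() - last_updated > timedelta(hours=max_age_hours):
        notify(owner, f"{table} has not updated in over {max_age_hours}h")

check_freshness("ods.orders_clean",
                datetime.now() - timedelta(hours=30),
                "data-oncall")
```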

Big data visualization

Big data visualization is not just chart display. Big data visualization is not just chart display. Big data visualization is not just chart display. Important things get said three times. In the classification of big data development, visualization work falls under both the application class and the development class.

On the development side, big data visualization plays the role of visual operation. How do you build a model through a visual interface? How do you make data quality operable by dragging, or even by three-dimensional manipulation? Drawing two tables and adding a few buttons is not a realistic way to implement a complex operational flow.

On the application side, there are many ways to convert and display data, and charts are just one of them. More often the work is data analysis: how do you express the data more intuitively? That requires a deep understanding of both the data and the business in order to build a suitable visualization application.

Intelligent visualization platform

Visualization itself can be meta-visualized. Superset, for example, lets you build charts by writing SQL, and some products can even intelligently classify data, recommend chart types, and develop visualizations in real time. This is where visualization is heading. We need a lot of visual content to produce output for the company. Take the clothing industry: the sales department cares about the impact of imports and exports, the influence of color matching on users, and the effect of season on selection; the production department cares about fabric price trends and statistics on productivity and efficiency. By analogy, every department can have its own big data dashboard and plan its own big screen freely through the platform. Everyone can follow the dynamics of their own domain every day; that is the concrete significance of big data visualization applications.
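
On the "recommend chart types" idea, here is a toy heuristic over column types. It only illustrates the direction; the rules and the 20-category cutoff are my own made-up examples, nothing like what Superset or any real product does.

```python
# Sketch: crude chart-type recommendation from column types.
import pandas as pd

def recommend_chart(df, x, y):
    if pd.api.types.is_datetime64_any_dtype(df[x]):
        return "line"      # time on the x axis -> trend line
    if df[x].nunique() <= 20:
        return "bar"       # few categories -> bar chart
    return "scatter"       # otherwise treat both axes as continuous

df = pd.DataFrame({"month": pd.to_datetime(["2020-01-01", "2020-02-01"]),
                   "sales": [10, 12]})
print(recommend_chart(df, "month", "sales"))  # -> "line"
```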

A few words at the end

I've written a lot, summing up what I've seen, heard, and learned in the past two years. Some readers will ask: isn't this supposed to be technical? Why is there no code? What I'd say is that this blog is mainly about learning and reflection, not about the job itself. Code is my personal skill, an important skill for realizing my own ideas, but code by itself has little to do with the business. At work, people who understand the business write better code, because they know what the company actually wants. And if your business sense is weak, that's fine too, as long as your code is good and you execute well on other people's instructions. Technology and business complement each other; I'll write up code-level improvements separately later.

After writing this I felt a bit uneasy that my code is still not standardized enough. My current technology stack: JS, Java, Node.js, Python.

My main language is JS, at maybe 80% proficiency. I'm working through Ruan Yifeng's ES6 tutorial (nearly finished) and the Vue.js source code (somewhat stalled). My Vue.js is intermediate; CSS and layout are okay; d3.js and go.js are both in active use and get the job done. Node.js, Express, and Koa are all fine; I've read some of the Express source code and written two pieces of middleware.

Java and Python are good enough for project work. For now I don't want to spend too much energy on them, just keep them at a usable level.

In the next few years I'll work hard and learn more about artificial intelligence and big data development; they should stay hot for a while.

Finally, a word of encouragement to everyone, and I hope you can give me some planning suggestions. As the saying goes, when three people walk together, one of them can be my teacher; let's learn from each other.