What is the experience of working in the big data industry for two years?

Foreword

After the early-summer rainstorms in Guangzhou this year, everything felt fresh again: a new job, new people and things. Laziness makes me anxious, and anxiety pushes me to improve. I suspect every programmer shares this anxiety: the times move fast, and in this environment software development will inevitably eliminate those who don't keep learning. I hope this post encourages you.

In this article I look back on two years of big data front-end development at companies in the big data industry. I recently changed jobs and want to share the experience.

It covers the development process of big data, the necessity of big data governance, ideas about graphical modeling, control of data quality, and applications of big data visualization, summarizing two years of experience and learning. If I have misunderstood anything, I hope you will point it out.

Big data development

There are several stages in big data development:

1. Data collection (raw data)

2. Data aggregation (cleaning and merging available data)

3. Data transformation and mapping (classifying and extracting subject-specific data)

4. Data application (providing APIs, intelligent systems, application systems, etc.)
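The four stages above can be sketched as a tiny pipeline. Everything here is illustrative: the stage functions, the hard-coded records, and the list-of-dicts data shape are assumptions for the sketch, not a real framework.

```python
# Illustrative four-stage pipeline; each function mirrors one stage above.
# Records and shapes are made up for the example.

def collect():
    # Stage 1: gather raw records from sources (hard-coded here).
    return [{"name": " Alice ", "age": "30"}, {"name": "", "age": "x"}]

def aggregate(raw):
    # Stage 2: clean and merge -- drop unusable rows, normalize fields.
    cleaned = []
    for rec in raw:
        name = rec["name"].strip()
        if name and rec["age"].isdigit():
            cleaned.append({"name": name, "age": int(rec["age"])})
    return cleaned

def transform(rows):
    # Stage 3: map cleaned rows into a subject-specific view.
    return {row["name"]: row["age"] for row in rows}

def serve(view, name):
    # Stage 4: expose the view to applications (here: a lookup API).
    return view.get(name)

view = transform(aggregate(collect()))
print(serve(view, "Alice"))  # -> 30
```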

Data collection

Data can be collected online or offline. Online data is generally gathered by crawlers or pulled from existing application systems.

At this stage we can build a big data collection platform that relies on automated crawlers (written in Python or Node.js), ETL tools, or a custom extraction and conversion engine to pull data from files, databases, and web pages. If this step is handled by an automated system, all the raw data becomes easy to manage, developers' work is standardized, and the target data sources are easier to administer.
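The crawler half of this can be sketched with nothing but the standard library. This is a minimal link extractor of the kind a crawler runs on each fetched page; the HTML snippet is inlined so the sketch needs no network access, where a real crawler would fetch pages with urllib or the requests library.

```python
from html.parser import HTMLParser

# Collect every href from <a> tags on a page -- the core of a crawler's
# "discover more URLs" step.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href":
                    self.links.append(value)

page = '<html><body><a href="/news/1">one</a><a href="/news/2">two</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # -> ['/news/1', '/news/2']
```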

The difficulty of data collection lies in the variety of data sources: MySQL, PostgreSQL, SQL Server, MongoDB, SQLite, and so on, plus local files, Excel spreadsheets, and even doc files. Bringing them all into our big data process regularly and systematically is an indispensable part of the work.
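One common answer to the many-sources problem is a per-source extractor behind a uniform row format. The sketch below is an assumption about how such a layer might look, implementing only SQLite and CSV because both ship with the standard library; MySQL, Excel, and the rest would slot in as further extractors.

```python
import csv
import io
import sqlite3

# Each extractor returns rows as a list of dicts, so downstream steps
# never care which source type the data came from.

def extract_sqlite(conn, table):
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

def extract_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

# Demo: an in-memory SQLite table and an in-memory CSV "file".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")

rows_db = extract_sqlite(conn, "users")
rows_csv = extract_csv("id,name\n2,Bob\n")
print(rows_db + rows_csv)
```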

Data aggregation

Data aggregation is the most critical step in the big data process. Here you can standardize data, clean it, merge it, and archive it, sorting and classifying the confirmed usable data through a monitorable process. Everything produced here is a data asset of the whole company, and once it reaches a certain scale it effectively becomes a fixed asset.

The difficulty of data aggregation lies in standardization: table naming conventions, table label classification, table purpose, data volume, whether there is incremental data, and whether the data is usable at all.

All of this demands serious business effort, and where necessary intelligent processing should be introduced: automatic labeling based on models trained on the content, automatic suggestion of table and field names, and automated import of the raw data.
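Even without the intelligent parts, table-name standardization can be enforced mechanically. The convention below (a layer prefix such as ods/dwd/dws followed by lowercase snake-case segments) is a hypothetical one chosen for illustration; the point is that an aggregation platform can reject non-conforming names automatically.

```python
import re

# Assumed convention: <layer>_<domain>_<detail...>, lowercase snake case,
# where layer is one of ods / dwd / dws.
TABLE_NAME = re.compile(r"^(ods|dwd|dws)_[a-z]+(_[a-z0-9]+)+$")

def check_table_name(name):
    return bool(TABLE_NAME.match(name))

print(check_table_name("dwd_sales_order_detail"))  # -> True
print(check_table_name("MyTable2"))                # -> False
```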

Data transformation and mapping

After aggregation, how do we provide data assets to specific users? This step is mainly about how the data will be applied: how to convert two or three data tables into one table that can serve an application, and then update it incrementally on a schedule.

After the previous steps, this step is not very difficult. Converting data is much like cleaning and standardizing it: merge the values of two fields into one, compute a chart's dataset from several usable tables, and so on.
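A concrete instance of "two or three tables into one serving table": join orders to customers and pre-compute a display field. The table contents and field names are invented for the example.

```python
# Two cleaned source tables (made up for the sketch).
customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [
    {"customer_id": 1, "amount": 120},
    {"customer_id": 1, "amount": 80},
    {"customer_id": 2, "amount": 50},
]

def build_serving_table(customers, orders):
    # Aggregate order amounts per customer.
    totals = {}
    for order in orders:
        totals[order["customer_id"]] = totals.get(order["customer_id"], 0) + order["amount"]
    # One denormalized row per customer, ready for an API or a chart.
    return [
        {"name": c["name"], "total_spent": totals.get(c["id"], 0)}
        for c in customers
    ]

print(build_serving_table(customers, orders))
# -> [{'name': 'Alice', 'total_spent': 200}, {'name': 'Bob', 'total_spent': 50}]
```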

Data application

There are many ways to apply data, both externally and internally. If a large pool of data assets was built up earlier, do we expose them to users through a RESTful API? Provide a streaming engine such as Kafka for producers and consumers? Or directly assemble thematic data for our own applications to query? The demands on the data assets here are relatively high, so if the earlier work was done well, the degree of freedom at this stage is very high.
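The RESTful-API option can be reduced to its essence: map a path to a thematic dataset and return JSON. The route shape and dataset below are invented; a real service would sit behind a web framework, but the handler logic is the same.

```python
import json

# A hypothetical thematic dataset keyed by name.
DATASET = {"daily_sales": [{"day": "2019-06-01", "total": 270}]}

def handle(path):
    # e.g. GET /api/data/daily_sales -> (status, JSON body)
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[:2] == ["api", "data"] and parts[2] in DATASET:
        return 200, json.dumps(DATASET[parts[2]])
    return 404, json.dumps({"error": "not found"})

status, body = handle("/api/data/daily_sales")
print(status, body)  # -> 200 [{"day": "2019-06-01", "total": 270}]
```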

Difficulties in the development of big data

The main difficulties in big data development are monitoring and how to organize developers' work. Suppose developers casually collect a pile of junk data and wire it straight into the database. In the short term these problems are small and correctable, but as the volume of assets grows they become a time bomb that can go off at any moment and set off a chain reaction across the data assets: data chaos, falling asset value, declining customer trust.

How do we monitor the developers' work?

The answer can only be an automation platform. Only an automation platform lets developers work comfortably, embrace new practices, and leave the manual era behind.

This is where front-end engineers have an advantage in the big data industry. How do we build an interactive visual interface? How do we turn existing workflows and requirements into a visual operating interface? Can intelligence replace some of the mindless manual operations?

In this sense, I personally think front-end engineers occupy an important position in big data development, second only to big data engineers, with backend and systems development third.

Good interaction matters greatly. For converting and extracting data there are, to some extent, well-trodden paths: Kettle, Kafka, pipelines, and many other solutions. The key is the interaction: how do we turn it all into a visual interface? That is an important subject.

Some colleagues place the emphasis differently and consider the front-end role dispensable. I think that is wrong. The backend really is important, but the backend has many ready-made solutions, while the front end, which in practice matters more here, has essentially no open-source ones. If we do not take front-end development seriously, we end up with poor interaction, poor interfaces, and a poor experience that developers reject. The visualization work also involves many specialized topics, which demands a lot from developers.

Big data governance

Big data governance should run through the whole big data development process; it plays an important role throughout. A few key points:

Data lineage

Data quality review

Platform-wide monitoring

Data lineage

Data lineage should be the entry point of big data governance. Through lineage we can clearly see, for any table, its origins: how fields were split, the cleaning process, the table's refresh cycle, and changes in data volume. I personally think data lineage is the anchor of the whole governance effort, the place from which we can monitor the overall picture.

Data lineage is built on, and wraps around, the whole big data development process. The history of every development step and every data import should be recorded. Once data assets reach a certain scale, lineage is essential.
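At its core, lineage is a directed graph from each table to the upstream tables it was derived from, and "where did this table come from?" is a graph traversal. The table names below are invented for the sketch.

```python
# Hypothetical lineage: table -> the tables it was derived from.
LINEAGE = {
    "dws_sales_daily": ["dwd_orders", "dwd_customers"],
    "dwd_orders": ["ods_orders_raw"],
    "dwd_customers": ["ods_crm_raw"],
}

def upstream(table):
    """All ancestors of a table -- its full provenance."""
    seen = []
    stack = list(LINEAGE.get(table, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.append(t)
            stack.extend(LINEAGE.get(t, []))
    return seen

print(sorted(upstream("dws_sales_daily")))
# -> ['dwd_customers', 'dwd_orders', 'ods_crm_raw', 'ods_orders_raw']
```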

Data quality review

In data development, every model (table) should go through a data quality audit at the end of its creation, and in a large system the key steps should additionally require approval. For example, the transformation-and-mapping step, which involves data delivered to customers, needs a solid data quality audit system so that the enterprise discovers problems in the data immediately, sees them immediately when they occur, and fixes them at the root, instead of blindly re-running SQL queries against the database.
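A quality audit of this kind is usually a set of mechanical rules run against a table. The rules and thresholds below (required columns, a maximum null rate) are assumptions for illustration.

```python
# Rule-based audit: return a list of findings; empty list means pass.
def audit(rows, required=("id", "name"), max_null_rate=0.1):
    findings = []
    if not rows:
        return ["table is empty"]
    for col in required:
        nulls = sum(1 for r in rows if r.get(col) in (None, ""))
        rate = nulls / len(rows)
        if rate > max_null_rate:
            findings.append(f"column '{col}' null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return findings

good = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
bad = [{"id": 1, "name": ""}, {"id": 2, "name": None}]
print(audit(good))  # -> []
print(audit(bad))   # -> ["column 'name' null rate 100% exceeds 10%"]
```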

Platform-wide monitoring

Monitoring actually covers many things: application monitoring, data monitoring, early-warning systems, ticketing systems, and so on. We need to monitor every data source and every table we take over, in real time; in an emergency or outage, the responsible person should be notified by phone or SMS immediately. We can borrow here from the experience of automated operations platforms: monitoring is close to operations, and the protection good monitoring gives data assets is just as important.
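The simplest form of source monitoring is a heartbeat check: each source reports a last-seen timestamp, and anything silent past a timeout triggers an alert. The source names, timeout, and stub notification channel below are made up for the sketch; a real platform would escalate by SMS or phone as described above.

```python
import time

def overdue_sources(heartbeats, now, timeout_seconds=300):
    # Sources whose last heartbeat is older than the timeout.
    return [name for name, last in heartbeats.items()
            if now - last > timeout_seconds]

def notify(names):
    # Stand-in for SMS/phone escalation to the person on call.
    return [f"ALERT: no heartbeat from {n}" for n in names]

now = time.time()
beats = {"mysql_orders": now - 30, "crawler_news": now - 900}
alerts = notify(overdue_sources(beats, now))
print(alerts)  # -> ['ALERT: no heartbeat from crawler_news']
```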

Big data visualization

Big data visualization is not just chart display. Not just chart display. Not just chart display.

(Important things are worth saying three times.) In big data development, visualization work falls into two classes: application and development.

On the development side, big data visualization serves as visual operation. How do we build a model through a visual interface? How do we make data quality work operable by dragging, or even by three-dimensional manipulation? Drawing two tables and adding a few buttons is not a realistic way to express a complex operational flow.

On the application side, there are many ways to transform and display data, and charts are only one of them. The work here is usually closer to data analysis: how do we express the data more intuitively? That requires a deep understanding of both the data and the business to build a fitting visualization application.

Intelligent visualization platform

Visualization tooling can itself be visualized. Superset, for example, lets you build charts by writing SQL, and some products can even classify data intelligently, recommend chart types, and develop visualizations in real time. This is where visualization is heading. A company needs a lot of visual output. In the clothing industry, say, the sales department cares about import/export effects, how color matching influences users, and how season affects selection; the production department cares about fabric price trends and statistics on productivity and efficiency. By analogy, every department can have its own big data dashboard, planned freely on the platform, and everyone can follow the dynamics of their own domain every day. That is the concrete significance of big data visualization applications.
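The "recommend a chart type" idea can be illustrated with a toy heuristic over column types. These rules are invented for the sketch; real products use far richer signals than a pair of column kinds.

```python
def recommend_chart(columns):
    """columns: mapping of column name -> 'number' | 'category' | 'date'."""
    kinds = sorted(columns.values())
    if kinds == ["date", "number"]:
        return "line"       # a measure over time
    if kinds == ["category", "number"]:
        return "bar"        # a measure per category
    if kinds == ["number", "number"]:
        return "scatter"    # two measures against each other
    return "table"          # fallback when no heuristic fits

print(recommend_chart({"day": "date", "sales": "number"}))        # -> line
print(recommend_chart({"region": "category", "sales": "number"})) # -> bar
```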

Postscript

I have written a lot here, summing up what I have seen, heard, learned, and thought over the past two years.

Some readers will ask: isn't this a technical post? Why is there no code? I'd say that code is something to learn and to write, but it is somewhat independent of the job. Code is my personal skill, an important tool for realizing my ideas, yet it has little to do with the business itself. At work, people who understand the business write better code, because they know what the company wants. And if your business sense is weak, that's fine too, as long as your code is good and you execute well on other people's designs. Technology and business complement each other; I will write about improving code in a later post.

Having written all this, I still feel a lot of anxiety, and my code is not standardized enough. My current technology stack: JS, Java, Node.js, Python.

My main language is JS, at maybe 80% proficiency. I am studying Ruan Yifeng's ES6 book (almost done) and the Vue.js source code (somewhat stalled). My Vue.js is intermediate; CSS and layout are OK; d3.js and go.js are both in working use. Node.js, Express, and Koa are all fine; I have read some of the Express source code and written two middleware.

I can do projects in Java and Python, but I don't want to spend too much energy on them for now; staying at a useful level is enough.

In the next few years I plan to dig deeper into artificial intelligence and big data development, which should stay hot for some time.

Finally, a word of encouragement: "When three people walk together, one of them can surely be my teacher."