
What does a big data network engineer mainly do?

What a big data engineer actually does depends on which part of the data pipeline you work on.

From upstream to downstream, the pipeline can be roughly divided into:

Data collection -> data cleaning -> data storage -> data analysis and statistics -> data visualization, among other stages.

In practice, the work is to implement these stages using tooling components (Spark, Flume, Kafka, etc.) or code (Java, Scala, etc.).

Let's go through each stage in a bit more detail.

Data collection:

The tracking (instrumentation) code embedded in business systems constantly produces scattered raw logs. Flume can monitor these distributed log files and aggregate them into one place; that aggregation is what we call collection.
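As a concrete illustration, here is a minimal Flume agent configuration sketch for tailing application logs and forwarding them to Kafka. The agent name, file paths, and the topic name `raw_logs` are made-up placeholders; a real deployment would also tune channel sizing, batching and failover.

```properties
# Hypothetical Flume agent "a1": tail local app logs and forward them to Kafka
a1.sources  = tail-src
a1.channels = mem-ch
a1.sinks    = kafka-sink

# TAILDIR source watches the log directory and remembers its read position
a1.sources.tail-src.type = TAILDIR
a1.sources.tail-src.filegroups = f1
a1.sources.tail-src.filegroups.f1 = /var/log/app/.*\.log
a1.sources.tail-src.positionFile = /var/flume/taildir_position.json
a1.sources.tail-src.channels = mem-ch

# In-memory channel buffers events between source and sink
a1.channels.mem-ch.type = memory
a1.channels.mem-ch.capacity = 10000

# Kafka sink publishes the collected events to a topic for downstream consumers
a1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka-sink.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
a1.sinks.kafka-sink.kafka.topic = raw_logs
a1.sinks.kafka-sink.channel = mem-ch
```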

Data cleaning:

Raw log data comes in all shapes and sizes.

Some fields may contain abnormal values, i.e. dirty data. To make sure the downstream "data analysis and statistics" stage gets reasonably high-quality data, such records need to be filtered out or the affected fields backfilled.

Some log fields may be redundant: downstream analysis never uses them, so to save storage they should be dropped.

Some fields may contain sensitive user information and need to be masked (desensitized), e.g. keep only the surname of a user's name and replace the given name with '*' characters.
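To make these three cleaning tasks concrete, here is a minimal Spark (Scala) sketch. The input/output paths and the column names (`age`, `debug_info`, `user_name`) are assumptions for illustration, not a real schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LogCleaning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("log-cleaning").getOrCreate()
    import spark.implicits._

    // Hypothetical raw logs in JSON, one record per line
    val raw = spark.read.json("/data/raw/app_logs/2024-01-01")

    val cleaned = raw
      // 1. Filter out dirty records with abnormal field values
      .filter($"age".isNotNull && $"age".between(0, 150))
      // 2. Drop redundant fields that downstream analysis never uses
      .drop("debug_info")
      // 3. Desensitize: keep only the surname (first character), mask the given name
      .withColumn("user_name", concat(substring($"user_name", 1, 1), lit("**")))

    cleaned.write.mode("overwrite").parquet("/data/cleaned/app_logs/2024-01-01")
    spark.stop()
  }
}
```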

Data storage:

The cleaned data can be written into the data warehouse (Hive) for downstream offline analysis. If the downstream "data analysis and statistics" stage has stricter real-time requirements, the logs can instead be written into Kafka.
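A rough sketch of both storage paths in Spark (Scala), assuming Hive support is enabled and using a hypothetical table `dw.ods_app_log` and Kafka topic `cleaned_logs`:

```scala
import org.apache.spark.sql.SparkSession

object LogStorage {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark write into the Hive metastore/warehouse
    val spark = SparkSession.builder()
      .appName("log-storage")
      .enableHiveSupport()
      .getOrCreate()

    val cleaned = spark.read.parquet("/data/cleaned/app_logs/2024-01-01")

    // Offline path: land the cleaned logs in a Hive table for batch analysis
    cleaned.write.mode("append").saveAsTable("dw.ods_app_log")

    // Near-real-time path: publish the same records to Kafka for stream consumers
    cleaned.selectExpr("to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("topic", "cleaned_logs")
      .save()

    spark.stop()
  }
}
```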

Data analysis and statistics:

Data analysis sits at the downstream end of the pipeline and consumes the data produced upstream. Essentially, it means computing all kinds of report metrics from the log records. Simple reports can be calculated with SQL in Kylin or Hive; complex ones require code-level statistical processing with Spark or Storm. Some companies even have a dedicated BI role for this area.
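For instance, a simple daily PV/UV report could be computed with Spark SQL over the Hive table; the table and column names below are made up for the sketch.

```scala
import org.apache.spark.sql.SparkSession

object DailyReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-report")
      .enableHiveSupport()
      .getOrCreate()

    // Simple report: page views (PV) and unique visitors (UV) per day and page
    val report = spark.sql(
      """
        |SELECT dt,
        |       page,
        |       COUNT(*)                AS pv,
        |       COUNT(DISTINCT user_id) AS uv
        |FROM dw.ods_app_log
        |GROUP BY dt, page
      """.stripMargin)

    // Persist the report table for the visualization layer to pick up
    report.write.mode("overwrite").saveAsTable("dw.ads_daily_page_report")
    spark.stop()
  }
}
```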

Data visualization:

Present the results of the upstream "data analysis and statistics" stage in intuitive forms such as tables and charts. Many of a company's decisions are made with reference to these charts~

Of course, building and maintaining big data platforms (such as CDH, FusionInsight, etc.) may also be part of a big data engineer's job~

Hope it helps! ~