Introduction to Python big data mining: a basic knowledge series (introductory tutorial with source code)

Python has been very popular in the big data industry for the past two years. As a Pythonista, I had to get involved in big data analysis. Let's go through the basics below.

Overview of Python data analysis and mining technology

Data analysis means analyzing known data to extract valuable information, such as statistical averages and standard deviations; the amount of data involved may not be very large. Data mining means analyzing and digging through a large amount of data to obtain previously unknown, valuable information, for example mining website user and user behavior data for users' potential needs in order to improve the website.

Data analysis and data mining are inseparable: data mining builds on data analysis. Data mining technology can help us better discover the patterns between things, for example discovering users' potential needs, delivering personalized information, or finding the relationships between diseases and symptoms, or even between diseases and drugs.

To do a good job, you must first sharpen your tools

Let's first go over the modules used for data analysis: numpy, pandas, scipy, and matplotlib.

Below we cover the basic use of these modules.

Installation and use of numpy module

Installation:

The download address is: http://www.lfd.uci.edu/~gohlke/pythonlibs/

The package I downloaded here is version 1.11.3, and the address is: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl

After downloading, run pip install "numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl"

The numpy package you install should be the build that includes mkl, which gives numpy better support.

Simple use of numpy

Generate random numbers

This mainly uses the random module under numpy.
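A minimal sketch of generating random numbers with numpy.random. The seed, ranges, and array sizes are arbitrary values chosen for illustration:

```python
import numpy as np

# Seed the generator so the sketch is repeatable (the seed value is arbitrary).
np.random.seed(0)

# Three random integers in [0, 10); the upper bound is exclusive.
ints = np.random.randint(0, 10, size=3)

# A 2x3 array of floats drawn uniformly from [0, 1).
uniform = np.random.random((2, 3))

# Five samples from a normal distribution with mean 5 and standard deviation 2.
normal = np.random.normal(5, 2, 5)

print(ints)
print(uniform)
print(normal)
```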

pandas

Install it with pip install pandas.

Let's go straight to the code:

Let's take a look at pandas output. The numbers in the top row are the column labels, and the numbers in the first column are the row labels. A value is located by its row and its column:
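A small sketch of the two basic pandas objects and of locating one value by row and column. The sample numbers are made up for illustration:

```python
import pandas as pd

# A Series is a one-dimensional labelled array; pandas adds the index 0..3.
s = pd.Series([8, 9, 2, 1])
print(s)

# A DataFrame is a two-dimensional table; rows and columns are numbered
# from 0 when no labels are given.
df = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]])
print(df)

# Locate a single value by row position 1 and column position 2.
value = df.iloc[1, 2]
print(value)
```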

The common methods are as follows:
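A few DataFrame methods that are commonly used, shown on made-up data. The column names a to d are placeholders, and this selection is only an assumption about which methods are meant here:

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
    columns=["a", "b", "c", "d"],
)

# First two rows and last row of the table.
first_two = df.head(2)
last_one = df.tail(1)

# Rows sorted in ascending order of column "a".
by_a = df.sort_values(by="a")

print(first_two)
print(last_one)
print(by_a)
```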

Next, let's look at the statistics pandas computes for the data, and go over what each row of the output means:
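A minimal sketch using describe(), which summarizes each numeric column. The column names and values are made up:

```python
import pandas as pd

df = pd.DataFrame({"reads": [10, 20, 30, 40], "comments": [1, 2, 3, 6]})

# describe() returns one row per statistic:
#   count - number of non-missing values
#   mean  - average
#   std   - standard deviation
#   min   - smallest value
#   25%, 50%, 75% - quartiles (50% is the median)
#   max   - largest value
stats = df.describe()
print(stats)
```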

Transpose: rows become columns and columns become rows, as shown below:
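A short sketch of transposing with the T attribute, on a made-up table of 2 rows and 3 columns:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# T swaps rows and columns: the 2x3 table becomes a 3x2 table.
transposed = df.T

print(df.shape)
print(transposed)
```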

Import data through pandas

pandas supports a variety of input formats. Here I will simply list the ones most commonly used day to day. For more input methods, check the source code or the official pandas website.

CSV file

When you print the data after importing a CSV file, it is displayed with the rows and columns of the file as they are. It outputs as many columns as the file has: for example, if the file has five columns of data, print displays five columns.
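A minimal sketch of read_csv. To keep it self-contained, the CSV content is an in-memory string with made-up columns; with a real file you would pass its path instead:

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the five column names are hypothetical.
csv_text = (
    "id,url,reads,comments,author\n"
    "1,http://example.com/a,100,3,tom\n"
    "2,http://example.com/b,250,7,ann\n"
)

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data)        # five columns, plus the row index pandas adds on the left
print(data.shape)  # (rows, columns)
```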

Excel file

This depends on the xlrd module, so install it first.

As before, the output shows the original contents of the Excel file, except that a row index is added at the beginning of each row.
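A sketch of reading an Excel file with read_excel. The helper name load_excel and the file name are hypothetical. Note that recent pandas versions use openpyxl rather than xlrd for .xlsx files, so which module you need depends on your pandas version and file type:

```python
import pandas as pd


def load_excel(path, sheet_name=0):
    """Read one worksheet into a DataFrame; sheet_name=0 means the first sheet."""
    return pd.read_excel(path, sheet_name=sheet_name)


# Hypothetical usage; "articles.xls" is a placeholder file name.
# data = load_excel("articles.xls")
# print(data)
```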

Read SQL

This depends on PyMySQL, so it needs to be installed. When pandas takes SQL as input, two parameters are required: the first is the SQL statement, and the second is the database connection instance.
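A sketch of the two-parameter call. So that it runs without a MySQL server, it uses the standard library sqlite3 module as a stand-in; with PyMySQL the connection object would come from pymysql.connect(...) instead. The table and column names are made up:

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite table instead of a MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, reads INTEGER, comments INTEGER)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, 100, 3), (2, 250, 7)],
)
conn.commit()

# First argument: the SQL statement. Second argument: the connection instance.
data = pd.read_sql("SELECT * FROM articles", conn)
print(data)

conn.close()
```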

Reading HTML

This depends on the lxml module, so install it first.

For HTTPS web pages, it depends on the BeautifulSoup4 and html5lib modules.

Reading HTML only reads the tables in the page, that is, only the content of <table> tags.

The result is returned as a Python list, with row and column labels added.
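A sketch of reading the tables of a page with read_html. The helper name read_tables and the URL are hypothetical, and the call needs a parser module such as lxml to be installed:

```python
import pandas as pd


def read_tables(source):
    """Return every table found in the page as a list of DataFrames."""
    return pd.read_html(source)


# Hypothetical usage; the URL is a placeholder.
# tables = read_tables("http://example.com/page-with-tables.html")
# print(len(tables))  # number of tables found
# print(tables[0])    # the first table as a DataFrame
```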

Read txt file

When the output is displayed, row and column labels are added as well.
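A minimal sketch of reading a text file with read_table, which splits on tabs by default. The content is an in-memory string with made-up columns, assuming a tab-separated file:

```python
import io

import pandas as pd

# Stand-in for a tab-separated .txt file on disk.
txt = "name\treads\tcomments\nfirst\t100\t3\nsecond\t250\t7\n"

# read_table accepts a path or a file-like object; the separator defaults to a tab.
data = pd.read_table(io.StringIO(txt))
print(data)
```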

scipy

To install it, first download the whl file, then install it with pip install "package name". The download address of the whl package is: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/scipy-0.18.1-cp35-cp35m-win_amd64.whl

matplotlib data visualization analysis

We can install this module directly using pip install. There is no need to download whl in advance and install it through pip install.

Please look at the code below:
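A minimal line plot sketch with made-up data. It uses matplotlib.pyplot, which offers the same plotting functions as pylab, and selects the Agg backend so that it runs without a display; the output file name is a placeholder:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

plt.figure()
# plot(x, y) draws a line through the points by default.
lines = plt.plot(x, y)

# Write the figure to a file; in an interactive session plt.show() displays it.
plt.savefig("line_plot.png")
```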

Let’s talk about modifying the style of the graph

Regarding the graph types, there are the following types:

Regarding colors, there are the following types:

Regarding shapes, there are the following types:
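As a sketch, the style codes that plot() accepts in its format string can be summarized as below. This is a subset taken from matplotlib's format-string notation, not necessarily the exact lists meant here, and make_format is a hypothetical helper:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Line styles (graph types).
LINE_STYLES = {"-": "solid line", "--": "dashed line", "-.": "dash-dot line", ":": "dotted line"}

# Single-letter color codes.
COLORS = {"b": "blue", "g": "green", "r": "red", "c": "cyan",
          "m": "magenta", "y": "yellow", "k": "black", "w": "white"}

# Marker shapes.
MARKERS = {"o": "circle", "s": "square", "*": "star", "x": "x",
           "h": "hexagon", "d": "thin diamond", "p": "pentagon", "+": "plus"}


def make_format(color, marker="", line=""):
    """Combine one code of each kind into a plot() format string."""
    return color + marker + line


plt.figure()
# Green squares joined by a dashed line.
(line,) = plt.plot([1, 2, 3], [2, 4, 1], make_format("g", "s", "--"))
```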

We can also tweak the plot and add some styling. Next, change the scatter plot so that the points are red; the code is as follows:
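A minimal sketch of a scatter plot with red points, using made-up data. In the format string "or", "o" selects circle markers and "r" selects red:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

plt.figure()
# "o" draws points instead of a line; "r" colors them red.
(points,) = plt.plot(x, y, "or")
plt.savefig("red_points.png")
```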

We can also draw a dotted line graph, the code is as follows:
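A minimal sketch with made-up data, assuming the dashed style "--" is what is meant here (":" would give a line made of dots instead):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

plt.figure()
# "--" selects a dashed line style.
(dashed,) = plt.plot(x, y, "--")
plt.savefig("dashed_line.png")
```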

You can also add a title and x-axis and y-axis labels to the graph; the code is as follows:
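A minimal sketch of adding a title and axis labels. The data and the label texts are placeholders:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

plt.figure()
plt.plot(x, y)
plt.title("show")   # title above the plot
plt.xlabel("ages")  # label under the x-axis
plt.ylabel("temp")  # label beside the y-axis

ax = plt.gca()      # the current axes, kept for inspection
plt.savefig("labelled_plot.png")
```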

Histogram

A histogram shows clearly how much data falls into each value range. Let's make a histogram using random numbers.

The Y-axis is the number of occurrences, and the X-axis is the value (or range) of this number.
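A minimal sketch of a histogram of random numbers. The distribution parameters, sample size, and number of bins are arbitrary:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
# 1000 samples from a normal distribution with mean 5 and standard deviation 2.
data = np.random.normal(5.0, 2.0, 1000)

plt.figure()
# counts: occurrences per bin (Y-axis); edges: boundaries of the bins (X-axis).
counts, edges, patches = plt.hist(data, bins=10)
plt.savefig("histogram.png")
```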

You can also specify the histogram type through the histtype parameter:

The differences between the types are hard to describe in words, so feel free to try them out yourself.

For example:
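A sketch using histtype="stepfilled", one of the values matplotlib accepts (the others are "bar", "barstacked", and "step"). The data is random and the choice of type here is only an illustration:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
data = np.random.normal(5.0, 2.0, 1000)

plt.figure()
# "stepfilled" draws one filled outline instead of separate bars.
counts, edges, patches = plt.hist(data, bins=10, histtype="stepfilled")
plt.savefig("histogram_stepfilled.png")
```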

Subplots

What are subplots? Subplots are multiple small plots displayed within one large figure, and each small plot is a subplot of that figure.

We know that the plot function is used to generate a graph; for subplots, the function is subplot. The code is as follows:
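A minimal sketch of subplot(rows, columns, index) with made-up data. The layout, two small plots on top and one wide plot below, is an arbitrary choice:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure()

# Grid of 2 rows x 2 columns, cell 1 (top left).
plt.subplot(2, 2, 1)
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

# Same grid, cell 2 (top right).
plt.subplot(2, 2, 2)
plt.plot([1, 2, 3, 4], [16, 9, 4, 1])

# Grid of 2 rows x 1 column, cell 2: the whole bottom half.
plt.subplot(2, 1, 2)
plt.plot([1, 2, 3, 4], [2, 2, 3, 3])

plt.savefig("subplots.png")
```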

We can now plot a batch of data, and anomalies are easy to spot on the graph. Let's practice with a CSV file that contains the number of reads and comments for the articles on a certain website.

Let's first go over the structure of this CSV file. The first column is the serial number, the second column is the URL of each article, the third column is the number of reads of each article, and the fourth column is the number of comments per article.

Our requirement is to use the number of comments as the Y-axis and the number of reads as the X-axis, so we need to obtain the data in the third and fourth columns.

We know that one row of data can be obtained through the values attribute of pandas. From that row we can take the third value (number of reads) and the fourth value (number of comments). However, that is only one row, and we need the reads and comments of every article in the CSV file. What should we do? You might say: create two lists, loop over the CSV file, and append each article's reads and comments to the corresponding list. That works, but there is a faster method: use the T transpose. After transposing, each row of values holds one whole column of the file, so the reads and comments can be taken directly. Then hand them to the pylab module in matplotlib to draw the plot, and you are done. Once you understand the idea, start writing.

Let’s take a look at the code:
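A sketch of the approach described above, assuming the four-column layout (serial number, URL, reads, comments). The CSV content is a made-up in-memory stand-in for the real file, and matplotlib.pyplot is used in place of pylab, which offers the same plot function:

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real CSV file; the rows are invented for illustration.
csv_text = (
    "id,url,reads,comments\n"
    "1,http://example.com/a,120,4\n"
    "2,http://example.com/b,300,9\n"
    "3,http://example.com/c,45,1\n"
    "4,http://example.com/d,5000,2\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# After transposing, each row of the array is one column of the file.
columns = data.values.T

# Zero-based positions 2 and 3 are the third and fourth columns.
reads = columns[2].astype(int)
comments = columns[3].astype(int)

plt.figure()
# Reads on the X-axis, comments on the Y-axis, drawn as points.
plt.plot(reads, comments, "o")
plt.xlabel("reads")
plt.ylabel("comments")
plt.savefig("reads_vs_comments.png")
```

In this made-up data, the article with 5000 reads but only 2 comments stands out on the plot as a point far to the right, which is the kind of anomaly the graph makes easy to find.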