Python big data mining basics: a series overview (introductory tutorial with source code)
Python has been very popular in the big data industry for the past two years. As a Pythonista, I naturally have to get involved in big data analysis, so let's go through the basics below.
Overview of Python data analysis and mining technology
Data analysis means analyzing known data to extract valuable information, such as averages and standard deviations; the amount of data involved need not be large. Data mining means analyzing and digging through large amounts of data to uncover previously unknown and valuable information, for example mining a website's user and user-behavior data for users' potential needs in order to improve the site.
Data analysis and data mining are inseparable; data mining builds on data analysis. Data mining techniques help us discover patterns between things: for example, discovering users' potential needs, delivering personalized information, or finding relationships between diseases and symptoms, and even between diseases and drugs.
To do a good job, you must first sharpen your tools
Let's first talk about the data analysis modules. This article covers numpy, pandas, scipy and matplotlib; their basic use is described below.
Installation and use of numpy module
Installation:
The download address is: http://www.lfd.uci.edu/~gohlke/pythonlibs/
The package I downloaded is version 1.11.3; the address is: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl
After downloading, run pip install "numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl"
Be sure to install the numpy build that includes MKL (Intel's Math Kernel Library), which backs numpy's numerical routines.
Simple use of numpy
Generate random numbers
This mainly uses the random module of numpy.
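A minimal sketch of random number generation with numpy.random follows. The seed, ranges and sample sizes are illustrative assumptions, not values from the original article.

```python
import numpy as np

np.random.seed(0)  # fixed seed (an assumption) so repeated runs give the same numbers

# 10 integers drawn uniformly from the half-open range [0, 20)
ints = np.random.randint(0, 20, size=10)

# 1000 samples from a normal distribution with mean 5 and standard deviation 2
normals = np.random.normal(5, 2, size=1000)

print(ints)
print(normals.mean(), normals.std())
```

randint draws integers and normal draws from a Gaussian; the sample mean and standard deviation printed at the end should land close to 5 and 2.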
pandas
Install it with pip install pandas.
Straight to the code:
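As a sketch of the two basic pandas structures, the example below builds a Series and a DataFrame. The numbers and column names are made up for illustration.

```python
import pandas as pd

# A Series is a labelled one-dimensional array
s = pd.Series([8, 9, 2, 1])

# A DataFrame is a table; the column names here are illustrative
df = pd.DataFrame(
    [[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
    columns=["one", "two", "three", "four"],
)

print(s)
print(df)
```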
Let's take a look at the pandas output. The number in the first column is the row index, and the header shows the column labels. A single value is located by its row and its column:
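One way to locate a single value by row and column is with loc (labels) or iloc (positions). This is a sketch on made-up data, not necessarily the call the original article used.

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
    columns=["one", "two", "three", "four"],
)

by_label = df.loc[1, "three"]   # row label 1, column label "three"
by_position = df.iloc[1, 2]     # second row, third column (zero-based positions)

print(by_label, by_position)
```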
The common methods are as follows:
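The article's own list of methods is not shown here, so the sketch below uses three commonly used DataFrame methods as an assumption: head, tail and sort_values.

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
    columns=["one", "two", "three", "four"],
)

first_rows = df.head(2)                # first two rows
last_row = df.tail(1)                  # last row
by_one = df.sort_values(by="one")      # rows ordered by the values in column "one"

print(first_rows)
print(last_row)
print(by_one)
```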
Let's take a look at the statistics pandas computes for the data, and what each row of the result means.
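A likely candidate for these statistics is DataFrame.describe, which is assumed in the sketch below. Each row of its result is one statistic, computed per column.

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
    columns=["one", "two", "three", "four"],
)

stats = df.describe()
# Rows of the result, in order:
#   count  number of non-missing values
#   mean   average
#   std    standard deviation
#   min    smallest value
#   25%, 50%, 75%  quartiles
#   max    largest value
print(stats)
```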
Transpose: the T attribute turns rows into columns and columns into rows.
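A short sketch of the transpose on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    [[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
    columns=["one", "two", "three", "four"],
)

transposed = df.T  # 3 rows x 4 columns becomes 4 rows x 3 columns

print(df.shape, transposed.shape)
print(transposed)
```

The old column labels become the row labels of the transposed table.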
Import data through pandas
pandas supports a variety of input formats. Here I simply list the ones most commonly used day to day. For more input methods, check the source code or the official website.
CSV file
When a CSV file is imported and printed, the output follows the rows and columns of the file itself: as many columns are shown as the file has. For example, if the file has five columns of data, print displays five columns.
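A minimal sketch with pandas.read_csv follows. To keep it self-contained, an in-memory string stands in for the file; the five columns and their contents are invented for illustration.

```python
import io
import pandas as pd

# Stand-in for a file on disk; with a real file, pass its path to read_csv instead
csv_text = (
    "id,name,reads,comments,score\n"
    "1,a,120,4,3.5\n"
    "2,b,300,9,4.0\n"
)

data = pd.read_csv(io.StringIO(csv_text))
print(data)  # five columns in the file, five columns in the output
```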
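Excel table
Reading Excel depends on the xlrd module, so install it first. (Recent pandas versions read .xlsx files through openpyxl; xlrd now handles only legacy .xls files.)
As before, the output shows the original contents of the Excel sheet, with a row number added at the start of each row.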
Read SQL
This depends on PyMySQL, so it needs to be installed. When pandas takes SQL as input, two arguments are required: the first is the SQL statement, and the second is the database connection object.
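The two-argument call can be sketched with pandas.read_sql. So that the example runs without a MySQL server, it swaps in the standard library's sqlite3 for PyMySQL; with PyMySQL the connection object would come from pymysql.connect(...) instead. The table and its contents are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a MySQL server (an assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, reads INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [(1, 120), (2, 300)])
conn.commit()

sql = "SELECT id, reads FROM articles"
rows = pd.read_sql(sql, conn)  # first argument: SQL statement, second: connection
conn.close()

print(rows)
```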
Reading HTML
This depends on the lxml module, so install it first.
For HTTPS pages, it depends on the BeautifulSoup4 and html5lib modules.
Reading HTML only reads the tables in the page, that is, only the content of <table> tags.
The result is returned as a Python list, with row and column labels added.
Read txt file
When the result is printed, row and column labels are added as well.
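One way to read a text file is pandas.read_table, which splits on tabs by default. The sketch below assumes a tab-separated file without a header row and uses an in-memory string in its place.

```python
import io
import pandas as pd

# Stand-in for a tab-separated .txt file; the contents are illustrative
txt = "alpha\t1\nbeta\t2\n"

# header=None tells pandas the file has no header row,
# so columns are labelled 0, 1, ... automatically
table = pd.read_table(io.StringIO(txt), header=None)
print(table)
```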
scipy
To install it, first download the whl file, then install it with pip install "package name". The download address of the whl package is: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/scipy-0.18.1-cp35-cp35m-win_amd64.whl
matplotlib data visualization analysis
This module can be installed directly with pip install matplotlib; there is no need to download a whl file first.
Please look at the code below:
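A minimal line plot might look like the sketch below. It uses matplotlib.pyplot, which offers the same plot functions as pylab; the Agg backend and the output file name are assumptions so that it runs without a display (in an interactive session, call plt.show() instead of savefig).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]   # illustrative data
y = [5, 7, 2, 1, 5]

line, = plt.plot(x, y)        # default style: a solid line through the points
plt.savefig("basic_line.png")
plt.close()
```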
Next, let's talk about modifying the style of the graph.
The following line types are available:
The following colors are available:
The following marker shapes are available:
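The style is passed to plot as a short format string that combines codes for color, marker and line type. Some standard matplotlib codes are: line types "-" (solid), "--" (dashed), "-." (dash-dot) and ":" (dotted); colors "b", "g", "r", "c", "m", "y", "k" (black) and "w" (white); markers "o" (circle), "s" (square), "*" (star), "x", "+", "d" (diamond), "p" (pentagon) and "h" (hexagon). The sketch below combines a few of them on made-up data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

solid_cyan, = plt.plot(x, y, "c-")                         # cyan solid line
dashed_black, = plt.plot(x, [v + 1 for v in y], "k--")     # black dashed line
green_stars, = plt.plot(x, [v + 2 for v in y], "g*")       # green stars, no line

plt.savefig("styles.png")
plt.close()
```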
We can also tweak the plot and add some styling. Next, change the scatter plot to use red points; the code is as follows:
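A sketch of red points, using the format string "or" (circle marker "o", color "r") on made-up data; the backend and file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

points, = plt.plot(x, y, "or")   # "o" draws circles only, "r" colors them red
plt.savefig("red_points.png")
plt.close()
```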
We can also draw a dotted line graph, the code is as follows:
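In matplotlib, ":" gives a dotted line and "--" a dashed one. The sketch below draws both on made-up data, since it is not clear which of the two the original code used.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

dotted, = plt.plot(x, y, ":")                      # dotted line
dashed, = plt.plot(x, [v + 1 for v in y], "--")    # dashed line, shifted up by 1

plt.savefig("dotted_dashed.png")
plt.close()
```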
You can also add a title and x- and y-axis labels to the graph; the code is as follows:
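A sketch using title, xlabel and ylabel; the label texts are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

plt.plot(x, y, "or")
plt.title("Sample points")   # text above the plot
plt.xlabel("x value")        # label under the horizontal axis
plt.ylabel("y value")        # label beside the vertical axis
ax = plt.gca()               # current axes, kept so the labels can be inspected

plt.savefig("labelled.png")
plt.close()
```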
Histogram
A histogram shows clearly how much of the data falls into each interval. Let's make a histogram from random numbers.
The Y-axis is the number of occurrences, and the X-axis is the value (or range of values).
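A sketch of a histogram built from normally distributed random numbers. The mean, standard deviation, sample size and seed are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)                              # repeatable sample
data = np.random.normal(10.0, 1.0, 10000)      # mean 10, standard deviation 1

# hist returns the count per bin, the bin edges and the drawn bars
counts, bin_edges, patches = plt.hist(data)    # 10 equal-width bins by default

plt.savefig("histogram.png")
plt.close()
```

Each bar's height is the number of samples whose value falls between two neighboring bin edges.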
You can also specify the histogram type through the histtype parameter:
The visual differences are hard to describe in words, so feel free to try the options out yourself.
For example:
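The histtype parameter accepts "bar" (the default), "barstacked", "step" and "stepfilled". The sketch below picks "stepfilled" and 20 bins as an arbitrary example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
data = np.random.normal(10.0, 1.0, 10000)

# "stepfilled" draws one filled outline instead of separate bars
counts, bin_edges, patches = plt.hist(data, bins=20, histtype="stepfilled")

plt.savefig("histogram_stepfilled.png")
plt.close()
```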
Subplot function
What is the subplot function? Subplots are multiple small plots displayed within one large canvas; each small plot is a subplot of that canvas.
The plot function generates a graph, and subplots are created with subplot. The code is as follows:
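A sketch of subplot follows. Its three arguments are the number of rows, the number of columns and the index of the cell to draw in. The layout chosen here (two plots on top, one wide plot below) and the data are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

fig = plt.figure()

plt.subplot(2, 2, 1)          # 2 x 2 grid, first cell (top left)
plt.plot([1, 2, 3], [1, 4, 9])

plt.subplot(2, 2, 2)          # 2 x 2 grid, second cell (top right)
plt.plot([1, 2, 3], [9, 4, 1])

plt.subplot(2, 1, 2)          # 2 x 1 grid, second cell: the whole bottom row
plt.plot([1, 2, 3], [2, 2, 2])

fig.savefig("subplots.png")
plt.close(fig)
```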
Now we can plot a set of data and easily spot anomalies from the graph. Let's practice with a CSV file that records the number of reads and comments for articles on a certain website.
First, the structure of this CSV file: the first column is the serial number, the second column is the URL of each article, the third column is the number of reads of each article, and the fourth column is the number of comments per article.
Our requirement is to use the number of comments as the Y-axis and the number of reads as the X-axis, so we need the data in the third and fourth columns.
We know that the values attribute of pandas gives the values of a given row. Slicing that row gives the number of reads and the number of comments (subscripts 2 and 3 with zero-based indexing). However, that is only the value of one row, and we need the reads and comments for the whole CSV file. What to do? You might say: define two lists, traverse the CSV file, and append each row's reads and comments to the corresponding list. That works, but there is a faster method: transpose with T. After transposing, values gives all the reads and all the comments directly. Then draw the picture with pylab from matplotlib and it's done. Once the idea is clear, start writing.
Let’s take a look at the code:
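The idea can be sketched as follows. The original CSV file is not available here, so a small in-memory file with the same four-column layout and invented values stands in for it; the backend and output file name are also assumptions.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the article CSV: serial number, URL, reads, comments
csv_text = (
    "id,url,reads,comments\n"
    "1,http://example.com/a,120,4\n"
    "2,http://example.com/b,300,9\n"
    "3,http://example.com/c,45,1\n"
    "4,http://example.com/d,980,200\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# After transposing, each row of the array holds one whole CSV column.
# Indexing is zero-based, so the third and fourth columns are rows 2 and 3.
columns = data.values.T
reads = columns[2].astype(float)
comments = columns[3].astype(float)

plt.plot(reads, comments, "o")   # reads on the X-axis, comments on the Y-axis
plt.xlabel("reads")
plt.ylabel("comments")
plt.savefig("reads_vs_comments.png")
plt.close()
```

In the resulting scatter plot, a point far from the rest (here the last article) stands out as a possible anomaly.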