Joke Collection Website - News headlines - The Basis and Content of Normalization Theory of Relational Database

The Basis and Content of Normalization Theory of Relational Database

The relational database schema is composed of a set of relational schemas, which are composed of a set of attribute names. The design of relational database is how to group a given set of interrelated attribute names and form a relationship between each group of attribute names. However, the grouping of attributes is not unique. Different groups correspond to different database application systems, and the efficiency is often very different.

In order to make the database design reasonable, reliable, simple and practical, the relational database design theory-standardization theory has been formed for a long time.

6. 1 The role of relationship standardization

Standardization is a process of replacing the original relational model with a simpler form and a more standardized structure.

If the data of two or more entities are stored in one table, the following three problems will occur:

High data redundancy

Insertion exception

Delete exception

The so-called data redundancy is the phenomenon of repeatedly storing the same data in the database. Data redundancy will not only waste storage space, but also cause data inconsistency.

Insertion exception refers to the situation that when inserting data into irregular data table, useful data cannot be inserted due to the restriction that the main code cannot be empty due to entity integrity constraint.

Deleting an exception means that when a tuple in an irregular data table contains some useful data, it is difficult to delete it.

(Take P98 payroll as an example)

The solution to the above three problems is to decompose non-standard relationships into multiple relationships, so that each relationship only contains the data of one entity.

(Give an example solution)

Of course, there is another problem with the improved relational model. When querying employees' salaries, two relationships need to be connected to query, and the cost of relationship connection is also very high.

So, what kind of relationship needs to be broken down? What is the theoretical basis of the decomposition relationship model? Can the above three problems be completely eliminated after decomposition? Answering these questions requires theoretical guidance. Next, we will discuss:

6.2 Functional Dependence

6.2. 1 Attribute Relationship

There are two kinds of connections between entities: one is the connection between entities; The other is the relationship between the internal attributes of the entity. In the chapter of database modeling, the former category is discussed, and here we will learn the second category.

Similar to the first category, the relationships between attributes in an entity can be divided into three categories: 1: 1, 1:n and m:n:

Example: employee (employee number, name, ID number, position and department)

1, one-to-one relationship (1: 1)

Let X and Y be two attributes (sets) of the relation R. If there is at most one value in Y corresponding to any specific value in X, and vice versa, there is at most one value in X, it is said that there is a one-to-one relationship between X and Y. ..

For example, in this case, there is a one-to-one relationship between the employee number and the ID number.

2. One-to-many relationship (1:n)

Let X and Y be two attributes (sets) of the relation R. If for any specific value in X, multiple values can be found in Y, and for any specific value in Y, at most only one value in X corresponds to it, then the attribute X and Y are said to be a one-to-many relationship. ..

For example, there is a one-to-many relationship between employee number and job title in employee relations.

3. Many-to-many relationship (many-to-many)

Let X and Y be two attributes (sets) of the relation R. If there are n values in Y corresponding to any specific value in X and m values in X corresponding to any specific value in Y, the attributes X and Y are said to have a one-to-many (m:n) relationship. ..

For example, in employee relations, professional titles and departments are many-to-many.

The three relationships between the above attributes are actually the reflection of interdependence and mutual restriction between attribute values, so they are called data dependence between attributes.

There are three data dependencies * * *:

Functional dependence

Multivalued dependency (MVD)

Connection dependency (JD)

Among them, functional dependence and multivalued dependence are the most important.

Functional dependence

Functional dependency is the connection between attributes. In the relation R, X and Y are two attributes or attribute groups of R.. If all relationships r exist and y has only one specific value for each specific value of x, the attribute y function is said to depend on the attribute x ... In other words, the attribute x function determines the attribute y, which is expressed as x → y, where x is called the determining factor and y is called the determined factor.

The above definition can be summarized as follows: if the value of attribute X determines the value of attribute Y, then the function of attribute Y depends on attribute X ... In other words, if the value of X is known, the value of Y can be obtained, so it can be said that X determines Y.

If the y function does not depend on x, it is recorded as x → y.

X Y

If X→Y, Y→X, note:

Among the three relationships between attributes learned earlier, each one has no functional dependency.

U if the relationship between x and y is 1: 1, there is a functional dependency relationship x ←→ y.

U if the relationship between x and y is 1:n, there is a functional dependency: X→Y or Y→X (multiple factors are the decisive factors).

U if the relationship between x and y is m:n, there is no functional dependency.

Note that the functional dependency between attributes does not mean that one or some subsets of R meet the above restrictions, but that all subsets of R must meet the restrictions in the definition. As long as there is a specific relationship R (a subset of R) that does not meet the conditions in the definition, it will destroy the functional dependence and make it untenable.

The relation subset here refers to the collection of some tuples of R. For example, the student relationship in Geological Exploration College only contains the data of students in Geological Exploration College, so it is a subset of the student relationship in Chang 'an University.

Definition of code

In front of us, we defined the code intuitively. Now we use the concept of functional dependence to make a more accurate formal definition of the code:

Let k be an attribute or group of attributes in the relational pattern R(U, f), and k' be an arbitrary subset of k. If K→U, but there is no k ′→ u, then k is a candidate key of R.

If there are multiple candidate codes, select one of them as the primary key;

Attributes contained in any candidate code are called primary attributes;

Attributes that are not included in any code are called non-primary attributes or non-key attributes.

In the relational model, the simplest case is that a single attribute is a code called a single key. The most extreme case is that the entire property group is a code called All-Key.

The situation of single code has been encountered many times before. The following is an example of the complete code:

Sign a contract (actor's name, production company, film name)

Foreign code: There are two relationships between R and S, where X is the attribute or attribute group of R, X is not the code of R, but X is the code of S (or has the same meaning as the code of S), then X is called the foreign key of R, which is simply called foreign code or foreign key.

Such as: employee (employee number, name, gender, professional title, department number)

Department (department number, department name, telephone number, person in charge)

The "Department Number" in employee relations is the external code of employee relations.

It should be noted here that saying that X is not a code of R in the definition does not mean that X is not the main attribute of R, and X is not a code, but it can be the constituent attribute of the code or the main attribute of any candidate code.

Such as: students (student number, name, gender, age ...)

Courses (course number, course name, teacher ...)

Course selection (student number, course number, grade)

In the course selection relationship, (student number, course number) is the code of the relationship, and the student number and course number are the attributes of the main code respectively (but not separate codes). They are the main codes of student relationship and curriculum relationship respectively, so they are two external codes of course selection relationship.

The relationship between relationships can be established by the values of main code and external code that exist in two or more relationships at the same time. If you want to query an employee's department, you only need to query the record with the same department number as the employee in the department table. Therefore, the main code and external code provide a way to express the connection between relationships.

6.2.4 Functional Dependence and Uniqueness of Code

From the formal definition of the above code, we can say that a code is composed of one or more attributes, which can uniquely identify the smallest attribute group of a tuple.

Codes are always unique in relationships, that is, a code function uniquely determines a line. If the value of the code is repeated, the whole tuple will be repeated. Otherwise, it violates the entity integrity Rules. The repetition of tuples means that there are two identical entities, which is obviously impossible, so the code is not allowed to take values repeatedly.

Therefore, only when an attribute or attribute group can determine every other attribute in the relationship through a function, but any proper subset of the attribute group can't do this, the attribute or attribute group is the code of the relationship.

Functional dependency is a concept of data-related rules. If the function of attribute b depends on attribute a, then if the value of a is known, then the value of b can be found completely. This does not mean that the value of b can be calculated from the value of a, but that there can only be one value of b logically.

6.3 Normalization of relational model

First, the non-standardized relationship

When a table has data items that can be subdivided, the table will be denormalized. There are two situations in the denormalization table:

There are combined data items in the table (P 102 Table 6-4).

There are multivalued data items in the table (P 103 Table 6-5).

Example:

Employee number

(full name)

salary

basic wage

Pay by post

Seniority salary

1002

Zhang San

1000

Eight hundred

200

Employee number

(full name)

professional title

name of a department

Department address

degree

Year of graduation

00 1

Zhang San

professor

computer

1305

university

postgraduate

1963

1982

So what is a standardized relationship?

When all components in a relationship are inseparable data items, the relationship is normalized. That is to say, when there are no combined data items and multi-valued data items in the table, only inseparable data items exist, the table is standardized.

Two-dimensional tables can be divided into five paradigms according to their normalization degree from low to high, which are called 1NF, 2NF, 3NF(BCNF), 4NF and 5NF respectively. The higher the degree of standardization, the lower the subset, namely:

1NF 2NF 3NF BCNF 4NF 5NF

Second, first normal form (1NF)

Definition: 1: If the relational schema R does not contain multi-valued attributes, then R satisfies first normal form, and it is recorded as:

R∈ 1NF

1NF is the minimum requirement for the relationship, and the relationship that does not meet 1NF is nonstandard.

The method of transforming non-normalized relation into normalized relation 1NF is simple, as long as the above table is expanded horizontally and vertically respectively. Table below:

Employee number

(full name)

basic wage

Pay by post

Seniority salary

1002

Zhang San

1000

Eight hundred

200

1005

Lisi

1200

900

150

Employee number

(full name)

professional title

name of a department

Department address

degree

Year of graduation

1002

Zhang San

professor

computer

1305

university

1963

1002

Zhang San

professor

computer

1305

postgraduate

1982

1005

Lisi

lecturer

New store

2206

university

1989

Although the above table conforms to 1NF, there is still a problematic relationship. There are a lot of data redundancy and potential data update anomalies in the table. The reason is that (employee number, education) is the code in the right table, but the name, title, department name and department office address have nothing to do with education, but are only part of the code. Therefore, the above table needs to be further standardized.

Third, the second paradigm (2NF)

Definition 1: Let X and Y be two different attributes or attribute groups of the relation R, X'→Y Y. If X has a proper subset X', so that X ′→ Y holds, it is said that the Y-part function depends on X, and it is recorded as: X P→ Y (part). On the other hand, it is said that the complete function of y depends on x, which is recorded as: X F→ Y (full).

Definition 2: if a relation R∈ 1NF and all its non-principal complete functions depend on any candidate code of r, then r belongs to the second normal form and is recorded as R∈2NF.

Note: The so-called candidate code in the above definition also includes the main code, because the code should be a candidate code to be designated as a code.

Such as relational schema:

Employee (employee number, name, job title, project number, project name, project role)

(employee number, project number) is the code of this relationship, employee number → name, employee number → title, project number → project name …

So (employee number, project number) P→ title, (employee number, project number) P→ project name.

Therefore, the above employee relations do not meet the requirements of the second paradigm. It has three problems: inserting exceptions, deleting exceptions and modifying exceptions.

Among them, the modification exception is like this. When the project name changes in employee relations, because there are many people involved in the project, everyone has a record. If you want to modify the project information, you must modify the information of everyone involved in the project, which increases the workload and may lead to omissions, which may lead to the destruction of data consistency.

The above employee relations can be broken down into the following three types:

Employees (employee number, name and title)

Participate in the project (employee number, project number, project role)

Project (project number, project name)

The above three relationships all meet the requirements of definition 2, so they all meet 2NF.

Inference: If the relational pattern R∈ 1NF, and each of its candidate codes is a single code, then R∈2NF.

Problems such as data redundancy and abnormal update may still exist in the relational model that conforms to the second normal form. Ruyu

Employee information (employee number, name, title, department name, department office address)

Although it also conforms to 2NF, when a department has 100 employees, the department address in the tuple will be repeated 100 times, and there is high data redundancy. The reason is that in the relationship, the office address does not directly depend on the employee number, but because the employee number function determines the department name, and the department name function determines the office address, which makes the office address function depend on the employee number. This dependence is a process of transferring dependence.

Therefore, the relational model of the above employee information needs to be further standardized.

Fourth, the third paradigm (3NF)

Definition 1: In the relation R, X, Y and Z are three different attributes or attribute groups of R. If X→Y, Y→Z, but Y→X, and Y is not a subset of X, the Z transfer function is said to depend on X. ..

Definition 2: If the relational schema R∈2NF, and none of its non-main attributes is transitively dependent on any candidate code, then R is called the third normal form, and is denoted as R∈3NF.

Inference 1: If the relational schema R∈ 1NF, and each of its non-primary attributes is neither partially dependent nor transitively dependent on any candidate code, then R∈3NF.

Inference 2: The relational pattern without non-main attributes must be 3NF.

Verb (abbreviation of verb) Improved 3NF-BCNF (Boyer -CODD paradigm)

Definition: Let the relational model R(U, F)∈ 1NF. If any function of f depends on X→Y(Y X), and x contains a code of r, it is called R∈BCNF.

In other words, in relational schema R, if each determinant on which a function depends contains a code, then R∈BCNF

Inference: If R∈BCNF, then:

All the non-main attributes in R are completely dependent on each code;

All the main properties in R are completely dependent on each code that does not contain it.

There are no attributes in R, and a complete function depends on any set of non-code attributes.

Theorem: If R∈BCNF, then R∈3NF must hold.

Proof: (Combined with the definition of transitive dependency, using reduction to absurdity)

Note: When R∈3NF, R may not belong to BCNF. Because compared with BCNF, 3NF relaxes the restrictions, and it allows the decisive factor to contain no code. For example:

Communication (city name, street name, postal code):

F = {(city name, street name) → postal code, postal code → city name}

The complete function of the non-first-class postal code depends on the code, and there is no transmission dependence, so it belongs to 3NF, but the postal code is also the decisive factor, and it does not contain the code, so the relationship does not belong to BCNF.

Another example is:

Teaching (students, teachers, courses) is abbreviated as teaching (S, T, C).

Regulation: One teacher can only teach one class, and each class can be taught by more than one teacher; Once a student chooses a course, the teacher is fixed accordingly.

F={T→C,(S,C)→T,(S,T) →C}

The candidate codes of this relationship are (s, c) and (s, t), so these three attributes are all primary attributes. Since there is no non-main attribute, then this relationship must be 3NF. However, because the determinant T contains no code, it is not BCNF.

There is still the problem of data redundancy in relational mode teaching, because the main attribute depends on some functions of the code.

Exactly: f = {t → c, (s, C)P→T, (s, t) p → c}

Therefore, the teaching relationship can be decomposed into the following two BCNF relationship models:

Teachers (teachers, courses) students (students, teachers)

The "incompleteness" of 3NF lies in the possible partial and transitive dependence of the main attribute on the code.

If a relational schema reaches BCNF, it has been completely separated within the scope of functional dependence, eliminating data redundancy, insertion and deletion anomalies.

6.4 Multivalued Dependence and the Fourth Normal Form

First, the multi-value dependence (multimedial dependence)

Course c

Teacher t

Reference book b

physics

Li Yong

general physics

physics

Li Yong

Optical principle

physics

Li Yong

Physical problem set

physics

Wang Jun

general physics

physics

Wang Jun

Optical principle

physics

Wang Jun

Physical problem set

mathematics

Li Yong

mathematical analysis

mathematics

Li Yong

differential equation

mathematics

Li Yong

advanced algebra

mathematics

Zhang ping

mathematical analysis

mathematics

Zhang ping

differential equation

mathematics

Zhang ping

advanced algebra

Computational mathematics

Zhang ping

mathematical analysis

Computational mathematics

Zhang ping

Computational mathematics

Computational mathematics

Zhou Feng

mathematical analysis

Computational mathematics

Zhou Feng

Computational mathematics

Course c

Teacher t

Reference book b

physics

Li Yong

Wang Jun

general physics

Optical principle

Physical problem set

mathematics

Li Yong

Zhang ping

mathematical analysis

differential equation

advanced algebra

Computational mathematics

Zhang ping

Zhou Feng

mathematical analysis

Computational mathematics

For example, a course in school is taught by many teachers, who use the same set of reference books. Each teacher can teach multiple courses and each reference book can be used by multiple courses. The following is a non-standardized table to show the relationship among teacher T, course C and reference book B.

Turn the above table into a standardized two-dimensional table teaching, such as the right table.

The code of relational pattern teaching (c, t, b) is (c, t, b), that is, all keys. So teach ∈BCNF. According to the above semantic provisions, if a lecturer is added to a course, tuples equal to the number of corresponding reference books will be added to the teaching table. Similarly, when reference books are to be deleted from a course, a corresponding number of tuples must be deleted.

It is inconvenient to add, delete and modify data, and the redundancy of data is also very obvious. If we examine this relational model carefully, we will find that it has a data dependence called multivalued dependence.

Definition: Let R(U) be a relation pattern on attribute set U, X, Y and Z are subsets of U, and Z = U-X-Y X-Y. If any relation R of R(U) is given a pair of (x, z) values, there is a set of Y values corresponding to it, and this set of Y values only depends on the X value and has nothing to do with the Z value. Then it is said that the multivalued value of Y depends on X, or that the multivalued value of X determines Y, which is recorded as: X→→Y y. ――

For example, in relation model teaching, for the value of a (c, b) (physics, general physics), there is a set of t values {Li Yong, Wang Jun}, which only depends on the value in course C (physics), that is to say, for another (physics and optical principles), its corresponding t value is still {Li Yong, Wang Jun}, so the value of t has nothing to do with the value of B.

Another equivalent formal definition of multivalued dependency is:

Let the relational pattern R(U) be a subset of U, where Z = U-X-Y, R is any relation of R, t 1, t2 is any two tuples of R, if t 1[X]=t2[X], and there are two tuples t3 and t4 in R.

t3[X]=t4[X]=t 1[X]

t3[Y]=t 1[Y],t3[Z]=t2[Z],

t4[Y]=t2[Y],t4[Z]=t 1[Z]

If it holds, then x →→→ y y.

In other words: if X→→Y Y y holds in R(U), as long as there are two tuples t 1 and t2 in any relation R of R, then the two new tuples t3 and t4 obtained by exchanging these two tuples in Y (or Z) must also be tuples in relation R. ..

In the definition, if z =ф Ф (empty set), X→→Y Y y is called trivial multivalued dependency, otherwise it is nontrivial multivalued dependency.

Multi-valued dependencies have the following properties:

1. symmetry: if x→→→ y, then x→→→→ z, where z = u-x-y x-y.

2. transitivity: if x →→→→→→ y, Y→→Z Z z, then x →→→→→→→→ z-y.

3. If X→→→→→ Y and X→→→→→ Z, then X→→→→→→→→ YZ.

4. If X→→→→ Y and X→→→→ Y ∩ Z, then X→→→→→→→ Y ∩ Z.

5. if x →→→ z-y, X→→Z-Y-z, then x→→→→→→ y-z, x→→→→→→→→ z-y.

Compared with functional dependencies, multivalued dependencies have the following two basic differences:

The validity of (1) multivalued dependency is related to the range of attribute set.

If x →→→→→ y stays on u, it must stay on v (xy v u); On the other hand, it is not, that is, X→→Y Y y holds true on V(V U), but not necessarily on u, because the definition of multivalued dependency involves not only attribute groups x and y, but also other attributes Z (z = u-x-y) in u.

Generally speaking, if X→→→ Y on R(U) holds true for V (u), it is said that X→→→ Y is an embedded multivalued dependence of R(U).

However, in the relational model R(U), the effectiveness of functional dependence X→Y depends only on the values of two attribute sets X and Y. As long as the values of tuples on X and Y make X→Y hold in any relation R of R(U), then X→Y also holds in any attribute set V(XY V U).

(2) If the functional dependence X→→Y holds for R(U), then X→Y' holds for any Y' Y. However, if the multivalued dependence X→→→ Y holds for R(U), it cannot be asserted that X→→→ Y' holds for any Y'Y'. ..

Constraint rule of multivalued dependency: in the relationship with multivalued dependency, deleting a tuple casually will destroy its symmetry. Therefore, in order to maintain the multivalued dependency, other related tuples must be deleted to maintain their symmetry. This is the constraint rule of multivalued dependency. At present, RDBMS does not have the ability to maintain this constraint, which needs to be realized by programmers in programming.

Functional dependency can be regarded as a special case of multivalued dependency, that is, functional dependency must be multivalued dependency. However, multivalued dependencies do not necessarily have functional dependencies.

Second, the fourth paradigm (4NF)

Definition: If the relational pattern R∈ 1NF contains a code for each nontrivial multivalued dependency of r →→→→→ y (y x), then R is called the fourth normal form, that is, R∈4NF.

Course c

Teacher t

Reference book b

physics

Li Yong

general physics

physics

Li Yong

Optical principle

physics

Li Yong

Physical problem set

physics

Wang Jun

general physics

physics

Wang Jun

Optical principle

physics

Wang Jun

Physical problem set

mathematics

Li Yong

mathematical analysis

mathematics

Li Yong

differential equation

mathematics

Li Yong

advanced algebra

mathematics

Zhang ping

mathematical analysis

mathematics

Zhang ping

differential equation

mathematics

Zhang ping

advanced algebra

Computational mathematics

Zhang ping

mathematical analysis

Computational mathematics

Zhang ping

Computational mathematics

Computational mathematics

Zhou Feng

mathematical analysis

Computational mathematics

Zhou Feng

Computational mathematics

Teaching relationship

When the relational schema R∈4NF, all nontrivial multivalued dependencies in R are actually functional dependencies. Because each determinant contains a code, R must belong to BCNF.

In fact, 4NF limits the existence of nontrivial and nonfunctional multivalued dependencies between attributes of relational schema. On the contrary, the nontrivial multivalued dependency allowed by 4NF is actually a functional dependency.

The teaching relationship in the example belongs to BCNF, but not to 4NF. Because its code is (C, T, B), there are nontrivial multi-valued dependencies C →→→ T and C →→→→→ B in the relationship, but C does not contain the code, but is only a part of the code.

Course c

Reference book b

physics

general physics

physics

Optical principle

physics

Physical problem set

mathematics

mathematical analysis

mathematics

differential equation

mathematics

advanced algebra

Computational mathematics

mathematical analysis

Computational mathematics

Computational mathematics

CB relation

Course c

Teacher t

physics

Li Yong

physics

Wang Jun

mathematics

Li Yong

mathematics

Zhang ping

Computational mathematics

Zhang ping

Computational mathematics

Zhou Feng

CT relation

In order to make the teaching relationship conform to 4NF, it must be decomposed into two relationship models: CT(C, t) and CB(C, b). As shown in the right table:

It can be clearly seen from the table that there is still data redundancy in the relationship teaching that conforms to BCNF, while the decomposed relationship CT and CB only have trivial multi-value dependencies, so they conform to 4NF, and they have eliminated data redundancy. It can be said that BCNF is the most standardized paradigm among the relational patterns with only functional dependence, while 4NF is the most standardized paradigm among the relational patterns with multi-valued dependence.

If there are connection dependencies in the relational schema, even if it conforms to 4NF, it may still encounter problems such as data redundancy and abnormal update. Therefore, for the relational model reaching 4NF, it is necessary to eliminate the possible connection dependencies in order to further reach the relational model of 5NF.

The content about connection dependency and 5NF has exceeded the requirements of the syllabus of this course, so I won't introduce it here.