
Risk Control in Data Analysis

In the early 1990s, American credit card companies, led by American Express, began using data modeling to strengthen risk control and support precision marketing. Discover and Capital One followed closely.

In 1995, AMEX's risk control model began trial operation, and the risk control system officially went live in 1997. In the following years, AMEX maintained rapid growth while reducing its non-performing loans to the lowest level in the industry.

In 2008, Discover moved its global data analysis center to Shanghai. Risk control talent flowing out of this center has since filled China's major internet finance companies.

Business types: secured loans (mortgages, car loans), credit loans (such as Yirendai), consumer installment loans (mobile phones, home appliances, etc.), small cash loans (e.g. 500/1000/1500 yuan), and so on.

Businesses involved in risk control: 1) Data collection: including credit bureau data, operator data, crawled data, website tracking, historical loan data, blacklists, third-party data, etc.

2) Anti-fraud engine: mainly includes anti-fraud rules and anti-fraud models.

3) Rule engine: often referred to as strategy. It mainly uses data analysis to compare bad-debt rates across different fields and value intervals, and then selects customers with better credit for lending.

4) Risk control models & scorecards: the underlying algorithms do not differ much; the models are instead distinguished by the stage at which they are applied (before loan / during loan / after loan), that is, by how the target variable is generated. In the credit field the target is usually defined by days past due: the A card can use the maximum number of overdue days in the customer's history, the B card can use the worst overdue record among multiple installments, and C cards are built in different ways depending on their purpose.

5) Collection: the last resort of risk control. This stage generates a lot of data that is useful to the models, such as text descriptions of collection records, contact (reach) rates, fraud labels, etc.

1) Crawlers can collect mobile app information. We can divide mobile apps into four categories: tools, social networking, entertainment, and finance, and count the number of apps in each category, giving four features (a counting sketch follows after item 4).

2) From operator data we can learn how many calls the customer made, how many text messages they sent, how much data they used, and whether they have any unpaid fees, among other information.

3) A credit report often reduces to a simple credit score; generally, the higher the score, the better the customer quality.

4) From basic information we can build a user profile, such as age, gender, and household registration taken from the ID card.
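
As a rough illustration of point 1), the sketch below counts a user's installed apps per category to produce the four features; the app names and the lookup table are hypothetical placeholders, not real data.

```python
# Minimal sketch of turning a user's installed-app list into the four count
# features described above. The category lookup and app names are illustrative.
from collections import Counter

APP_CATEGORIES = {          # hypothetical lookup: app name -> category
    "calculator": "tool",
    "wechat": "social",
    "tiktok": "entertainment",
    "loan_helper": "finance",
}

def app_count_features(installed_apps):
    """Count how many of the user's apps fall into each of the 4 categories."""
    counts = Counter(APP_CATEGORIES.get(app, "other") for app in installed_apps)
    return {
        "n_tool_apps": counts["tool"],
        "n_social_apps": counts["social"],
        "n_entertainment_apps": counts["entertainment"],
        "n_finance_apps": counts["finance"],
    }

print(app_count_features(["wechat", "tiktok", "loan_helper", "loan_helper"]))
```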

The upgraded version of the blacklist is the rules engine, which is built from experience. For example, an insurance company may refuse to sell return-shipping insurance to people who have returned goods five times in a row or whose return rate reaches 80%. Rules require a lot of effort to maintain and must be constantly updated and revised, otherwise they cause a large number of misjudgments. If the amount or number of suspected cash-out transactions exceeds a threshold, it is recommended to deny access or flag the customer for special attention. If the number of loan applications within XX days exceeds a threshold, it is recommended to reject the application.

For example, we can set an access rule based on occupation, such as civil servant, doctor, or lawyer.

We can also set a direct-approval rule, for example, approving if the Sesame Credit score is greater than 750 points.
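
A minimal sketch of how rules like these might be expressed in code; the field names, thresholds, and occupation whitelist are illustrative assumptions, not a production rule set.

```python
# Illustrative sketch of a tiny rule engine expressing the rules mentioned above
# (access by occupation, direct approval on a high Sesame Credit score, rejection
# on too many recent loan applications). Field names and thresholds are assumed.
WHITELISTED_OCCUPATIONS = {"civil servant", "doctor", "lawyer"}

def apply_rules(applicant):
    # Rejection rule: too many loan applications in the last XX days.
    if applicant["loan_apps_recent"] > 10:
        return "reject"
    # Direct-approval rule: very high Sesame Credit score.
    if applicant["sesame_score"] > 750:
        return "approve"
    # Access rule: low-risk occupations pass on to the next stage (model scoring).
    if applicant["occupation"] in WHITELISTED_OCCUPATIONS:
        return "pass_to_model"
    return "manual_review"

print(apply_rules({"loan_apps_recent": 2, "sesame_score": 720, "occupation": "doctor"}))
```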

How to determine the target variable: take the A card as an example; this is mainly done through roll-rate and vintage analysis. For example, we can define customers who are more than 60 days past due within 8 months as bad customers and customers who are never overdue within 8 months as good customers. Customers who are 1-60 days past due within 8 months are treated as indeterminate and excluded from the sample.
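
A minimal labelling sketch under those definitions, assuming a DataFrame with a hypothetical column `max_dpd_8m` holding the maximum days past due observed in the 8-month performance window:

```python
# Sketch of the good/bad/indeterminate labelling described above. The column
# name `max_dpd_8m` is an assumption for illustration.
import pandas as pd

def label_target(df, dpd_col="max_dpd_8m"):
    def to_label(dpd):
        if dpd > 60:
            return 1          # bad customer
        if dpd == 0:
            return 0          # good customer
        return None           # 1-60 days past due: indeterminate
    out = df.copy()
    out["target"] = out[dpd_col].apply(to_label)
    return out.dropna(subset=["target"])  # exclude indeterminate customers

sample = pd.DataFrame({"max_dpd_8m": [0, 15, 75, 0, 120]})
print(label_target(sample))
```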

1) Preliminary preparation: different models target different business scenarios. Before starting a modeling project, you need a clear understanding of the business logic and requirements.

2) Model design: including model selection (scorecard or ensemble model), single model or segmented models, whether reject inference is needed, how to define the observation period, performance period, and good/bad users, and determining the data sources.

3) Data pulling and cleaning: pull data from the data pool according to the definitions of the observation period and performance period, then perform data cleaning and stability checks. Data cleaning covers outliers, missing values, and duplicates.

Stability verification mainly examines the stability of variables over the time series; the indicators include PSI, IV, mean/variance, etc.

4) Feature engineering: mainly feature preprocessing and selection. For a scorecard, features are mainly screened by IV. In addition, features are constructed based on business understanding, including feature crossing (multiplication/division/Cartesian product of two or more features), feature transformation, etc.

5) Model building and evaluation: a scorecard can use logistic regression; if you only need a binary prediction, you can choose XGBoost. After the model is built, evaluate it by calculating AUC and KS, and run cross-validation to assess its generalization ability.
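
A hedged sketch of this step using scikit-learn with synthetic data in place of real WOE-encoded features: fit a logistic regression, compute AUC and KS on a holdout set, and run cross-validation.

```python
# Sketch of step 5: fit a logistic regression, compute AUC and KS, and run
# cross-validation. Synthetic data stands in for real modeling features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=5000, n_features=10, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, prob)
fpr, tpr, _ = roc_curve(y_te, prob)
ks = np.max(tpr - fpr)                       # KS = max gap between TPR and FPR
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print(f"AUC={auc:.3f}  KS={ks:.3f}  CV AUC={cv_auc.mean():.3f}")
```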

6) Model deployment: configure the model rules in the risk control back end. For more complex models such as XGBoost, the model file is generally converted into PMML format and encapsulated; the file and configuration parameters are then uploaded in the back end.

7) Model monitoring: in the early stage the main purpose is to monitor the stability of the overall model and of its variables. The main metric is PSI (population stability index). PSI is computed by splitting scores into intervals and comparing the actual and expected proportion of observations in each interval. If PSI is below 0.1 (10%), the model does not need updating; between 0.1 and 0.25 it needs close attention; above 0.25 the model needs to be updated. PSI is generally computed with equal-frequency binning into 10 bins.
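
A sketch of the PSI calculation as described: 10 equal-frequency bins derived from the expected (development-time) score distribution, compared against the actual scores. The data here is simulated.

```python
# Sketch of PSI with 10 equal-frequency bins taken from the expected scores.
import numpy as np

def psi(expected_scores, actual_scores, n_bins=10):
    # Equal-frequency bin edges from the expected (development) distribution.
    edges = np.percentile(expected_scores, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range scores
    exp_pct = np.histogram(expected_scores, bins=edges)[0] / len(expected_scores)
    act_pct = np.histogram(actual_scores, bins=edges)[0] / len(actual_scores)
    # Floor the proportions to avoid division by zero / log(0).
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

rng = np.random.default_rng(0)
print(psi(rng.normal(600, 50, 10000), rng.normal(590, 55, 10000)))
```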

1. What are the meanings and differences of A card, B card and C card?

A card (application scorecard): used during application processing to predict the probability that the customer will default within a certain period after opening the account, effectively screening out applications from customers with bad credit and from non-target customers. It also supports risk pricing, i.e. determining the credit limit and interest rate. The data used is mainly the user's past credit history, multi-platform borrowing, consumption records, and similar information.

B card (behavior scorecard): used during account management to predict the account's future credit performance from the behavioral characteristics shown in its history. Its first purpose is to prevent and control loan risk, and its second is to adjust the user's credit limit. The data used is mainly the user's login, browsing, and consumption behavior on the platform, plus loan performance data such as repayments and overdue records.

C card (collection scorecard): predicts the probability that an overdue account will respond to a collection strategy, so that appropriate collection measures can be taken.

The differences between the three cards:

Different data requirements: the A card can generally perform credit analysis on loans 0-1 years old. The B card is built on the richer data available after the applicant has exhibited some behavior. The C card has the largest data requirements and needs to include attributes such as the customer's response after collection.

Different features: the A card mostly uses the applicant's background information, such as the basic information filled in by the customer and third-party data, so this model is generally more conservative. The B card makes heavy use of transaction-based features.

2. Why choose the logistic regression model in the field of risk control, and what are its limitations

1) First, logistic regression is less sensitive to shifts in the customer population than more complex models, so it is more robust.

2) The model is intuitive, the coefficients are easy to interpret, and the results are easy to understand.

The disadvantages are that it underfits easily and its accuracy is not very high. In addition, it places high demands on the data: it is sensitive to missing values, outliers, and collinear features.

3. Why use IV instead of WOE to filter features

Because IV takes the proportion of samples in each group into account. Even if a group's WOE is high, if the group contains only a small share of the samples, the feature's overall predictive power may still be very small.
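
A small sketch of WOE/IV for an already-binned feature, assuming a DataFrame with a bin column and a binary target where 1 means bad; the sign convention of WOE varies by team, and IV is unaffected by it.

```python
# Sketch of WOE/IV for a binned feature; column names are illustrative.
import numpy as np
import pandas as pd

def woe_iv(df, bin_col, target_col):
    grouped = df.groupby(bin_col)[target_col].agg(bad="sum", total="count")
    grouped["good"] = grouped["total"] - grouped["bad"]
    bad_dist = grouped["bad"] / grouped["bad"].sum()     # share of all bads in the bin
    good_dist = grouped["good"] / grouped["good"].sum()  # share of all goods in the bin
    woe = np.log((bad_dist + 1e-6) / (good_dist + 1e-6))
    iv = ((bad_dist - good_dist) * woe).sum()
    return woe, iv

df = pd.DataFrame({"age_bin": ["<25", "<25", "25-40", "25-40", "40+", "40+"],
                   "target":  [1, 0, 0, 0, 1, 1]})
woe, iv = woe_iv(df, "age_bin", "target")
print(woe, iv)
```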

4. ROC and KS indicators (a KS of 0.2-0.75 and an AUC of 0.5-0.9 are generally considered reasonable)

The ROC curve plots the false positive rate (FPR) on the abscissa and the true positive rate (TPR) on the ordinate, while the KS curve plots both TPR and FPR as ordinates against the threshold on the abscissa. KS identifies the threshold at which the model separates the two groups most strongly; a KS greater than 0.2 is generally considered acceptable discrimination.

ROC reflects the model's overall discriminatory power.
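
To make the "threshold on the abscissa" idea concrete, the sketch below builds a decile-based KS table on simulated scores: the cumulative share of bads and goods captured below each cutoff, with KS as the maximum gap.

```python
# Decile-based KS table: sweep the score cutoff by decile and track the
# cumulative capture of bads vs. goods. Data and column names are illustrative.
import numpy as np
import pandas as pd

def ks_table(scores, target, n_bins=10):
    df = pd.DataFrame({"score": scores, "bad": target}).sort_values("score")
    df["bucket"] = pd.qcut(df["score"], n_bins, duplicates="drop")
    grp = df.groupby("bucket", observed=True)["bad"].agg(bads="sum", total="count")
    grp["goods"] = grp["total"] - grp["bads"]
    grp["cum_bad_rate"] = grp["bads"].cumsum() / grp["bads"].sum()
    grp["cum_good_rate"] = grp["goods"].cumsum() / grp["goods"].sum()
    grp["ks"] = (grp["cum_bad_rate"] - grp["cum_good_rate"]).abs()
    return grp, grp["ks"].max()

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(550, 60, 900), rng.normal(480, 60, 100)])
target = np.concatenate([np.zeros(900), np.ones(100)])
table, ks = ks_table(scores, target)
print(f"KS = {ks:.3f}")
```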

5. Binning method and badrate monotonicity

Currently in the industry many people use greedy algorithms for binning, such as best-KS and chi-square binning. Bad-rate monotonicity is only considered when binning continuous numerical variables and ordinal discrete variables (such as education level or company size). The reason for requiring bad-rate monotonicity is mainly business understanding: for example, the more overdue history a customer has, the higher the bad rate should be.

6. Why different risk control models generally do not use the same features

People are rejected because certain features perform poorly. If the same features are used for screening repeatedly, then over time these people will no longer appear in the samples used for future modeling, and the feature distribution of the sample changes.

7. What are the unsupervised algorithms used in risk control

Clustering algorithms, graph-based outlier detection, LOF (local outlier factor), isolation forest, etc.

8. Chi-square binning

Chi-square binning is a bottom-up, merge-based discretization method. The basic idea is that adjacent intervals with similar class distributions should be merged, and the chi-square value measures the similarity of two adjacent intervals: the lower the chi-square value, the more similar they are. Of course, merging cannot continue indefinitely, so a threshold is set, derived from the degrees of freedom and the confidence level. For example, if the number of classes is N, the degrees of freedom are N-1; the confidence level represents the probability of occurrence and is usually taken as 90%.
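
A simplified bottom-up merging sketch on pre-computed bins, each represented as a [good_count, bad_count] pair; a real implementation would also stop at the chi-square threshold implied by the chosen confidence level rather than only at a target bin count.

```python
# Simplified chi-square merging of adjacent bins (each bin = [good, bad] counts).
import numpy as np

def chi2_pair(a, b):
    """Chi-square statistic of the 2x2 table formed by two adjacent bins."""
    table = np.array([a, b], dtype=float)
    row, col, total = table.sum(1, keepdims=True), table.sum(0, keepdims=True), table.sum()
    expected = row * col / total
    cells = np.where(expected > 0, (table - expected) ** 2 / np.where(expected > 0, expected, 1), 0.0)
    return cells.sum()

def chi_merge(bins, max_bins=5):
    bins = [list(b) for b in bins]
    while len(bins) > max_bins:
        chis = [chi2_pair(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(chis))                      # most similar adjacent pair
        bins[i] = [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]
        del bins[i + 1]
    return bins

# 8 initial bins of made-up [good, bad] counts, merged down to 5.
print(chi_merge([[50, 2], [48, 3], [45, 5], [40, 8], [35, 10], [30, 15], [20, 20], [10, 25]]))
```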

9. best-ks binning

In contrast to chi-square binning, best-KS binning is a top-down splitting process. Sort the feature values from small to large; the value with the largest KS becomes the cut point, splitting the data into two parts. Repeat this process on each part until the number of bins reaches the preset limit.
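
A simplified splitting sketch under those rules, on simulated data; real tooling adds more guards (minimum bin share, monotonicity checks) than this version.

```python
# Simplified best-KS splitting: cut each segment where the gap between the
# cumulative bad share and cumulative good share (the KS) is largest.
import numpy as np

def best_ks_cut(x, y):
    """Return the cut value with the largest KS within this segment."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    cum_bad = np.cumsum(y) / max(y.sum(), 1)
    cum_good = np.cumsum(1 - y) / max((1 - y).sum(), 1)
    return x[np.argmax(np.abs(cum_bad - cum_good))]

def best_ks_bins(x, y, max_bins=4, min_size=50):
    cuts, segments = [], [(x, y)]
    while len(segments) < max_bins:
        # Split the largest remaining segment (a simplification of real tooling).
        segments.sort(key=lambda s: len(s[0]), reverse=True)
        seg_x, seg_y = segments.pop(0)
        if len(seg_x) < 2 * min_size:
            segments.append((seg_x, seg_y))
            break
        cut = best_ks_cut(seg_x, seg_y)
        left, right = seg_x <= cut, seg_x > cut
        if left.all() or right.all():
            segments.append((seg_x, seg_y))
            break
        cuts.append(cut)
        segments += [(seg_x[left], seg_y[left]), (seg_x[right], seg_y[right])]
    return sorted(cuts)

rng = np.random.default_rng(2)
x = rng.uniform(300, 850, 2000)                               # simulated scores
y = (rng.uniform(size=2000) < np.clip((700 - x) / 400, 0.02, 0.6)).astype(int)
print(best_ks_bins(x, y))
```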

10. Reject inference

The application scorecard is built from the historical data of approved customers, but such a model ignores the influence of the customer segments that were originally rejected. Reject inference is used to correct the model and make it more accurate and stable. It also covers cases where changes in the company's rules mean customers who would have been rejected in the past can now pass. It is suitable for scenarios with medium or low approval rates.

Commonly used methods: Hard cutoff method: first use the initial model to score the rejected users and set a threshold; users scoring above it are labeled good and those below it are labeled bad, then the labeled rejected users are added to the sample and the model is retrained. Parceling (allocation) method: this method is suitable for scorecards. The approved samples are divided into groups by score and the default rate of each group is calculated; the rejected users are then scored and assigned to groups in the same way, and within each group defaulting users are randomly drawn according to that group's default rate and labeled bad, the rest labeled good, after which the labeled rejected users are added to the sample for retraining.
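
A sketch of the hard cutoff method only, with synthetic stand-ins for the approved and rejected populations; the 0.5 cutoff on the predicted probability of bad is an assumption.

```python
# Hard cutoff reject inference: score rejects with the initial model, label
# them by a threshold, and refit on the combined sample. Data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_approved, y_approved = make_classification(n_samples=3000, n_features=8, random_state=0)
X_rejected, _ = make_classification(n_samples=800, n_features=8, random_state=1)

# 1) Initial model on approved (known-performance) customers only.
initial = LogisticRegression(max_iter=1000).fit(X_approved, y_approved)

# 2) Score rejected applicants; here the model outputs probability of bad,
#    so a high probability maps to the "bad" label.
reject_prob = initial.predict_proba(X_rejected)[:, 1]
threshold = 0.5                                        # assumed cutoff
y_rejected = (reject_prob >= threshold).astype(int)    # 1 = inferred bad

# 3) Retrain on the combined sample.
X_all = np.vstack([X_approved, X_rejected])
y_all = np.concatenate([y_approved, y_rejected])
final = LogisticRegression(max_iter=1000).fit(X_all, y_all)
print(final.coef_.shape)
```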

11. How to ensure the stability of the model during the modeling process

1) In the data preprocessing stage, variables can be checked for stability on the time series. Methods include computing month-over-month differences in IV, observing changes in variable coverage, and computing the PSI between two time points. For example, take the data from January to October, borrow the idea of K-fold validation to obtain 10 sets of validation results, and observe whether the model shows a large trend change as the months go by.

2) In the variable selection stage, eliminate variables that contradict business understanding. For a scorecard, you can also eliminate variables whose discrimination is too strong; otherwise the model will be dominated by that single variable and its stability will decrease.

3) Do cross-validation: one form is ordinary K-fold cross-validation, the other is cross-validation on the time series (out-of-time validation).

4) Choose a model with good stability, such as XGBoost or random forest.

12. How to deal with high-dimensional sparse features and weak features

For high-dimensional sparse features, logistic regression works better than GBDT. The latter's complexity penalty comes mainly from tree depth and the number of leaves, which does not constrain sparse data strongly and easily overfits. With a logistic regression scorecard, the features can be discretized into zero versus non-zero and then WOE-encoded.

If scorecard modeling is used, weak features are generally discarded; the number of features in a scorecard should not be too large, generally fewer than 15.

XGBoost, on the other hand, does not place high demands on the data and achieves good accuracy, and cross-combinations of some weak features may have unexpected effects.

13. After the model is put online, it is found that the stability is not good, or the online differentiation effect is not good. How to adjust it?

If the model's stability is poor, first check whether feature stability was considered during modeling. If variables with poor stability are found early in the model's life, consider discarding them or replacing them with other variables. In addition, analyze the distribution difference between online users and the users used for modeling, and consider adding a reject inference step during modeling so that the distribution of the modeling sample is closer to the actual applicant population.

If the online performance is poor, analyze it from the variable perspective: eliminate variables with poor performance and explore new variables to add to the model. If the model has been online for a long time and the user population has gradually drifted, re-pull the data and rebuild the model.

14. How to cold start the risk control model

When a product is first launched there is no accumulated user data, or users have not yet shown good or bad performance. In this case you can consider: 1) Build rules instead of models. Using business experience, set some hard rules, such as user access thresholds, checks on the user's credit history and multi-platform borrowing risk, and integration with third-party anti-fraud services and data-product rules; this can be combined with manual review of the user's application materials. 2) Use data from a similar product or customer group to build the model.

15. Sample imbalance problem

Besides adjusting class weights, this is mainly solved with sampling methods. Common ones include naive random oversampling, SMOTE, and ADASYN (adaptive synthetic sampling).
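
A minimal SMOTE sketch using the imbalanced-learn package on synthetic data:

```python
# Oversampling with imbalanced-learn; SMOTE synthesizes minority-class samples
# by interpolating between neighbours.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
print("before:", Counter(y))
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```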

16. Operator data processing

Based on the call date, call records can be divided into time windows such as the last 7 days, the last half month, the last month, the last three months, and the last six months. They can also be split into working days and holidays based on the specific date. Based on the call time, the day can be divided into early morning, morning, afternoon, and evening. For phone numbers, one idea is to divide them by province and city of origin; another is to label the numbers: based on tags from Phone State, Baidu Mobile Guard, and Sogou Number Pass, we can distinguish express delivery and takeaway, harassing calls, financial institutions, intermediaries, and so on. With accumulated business data, numbers can even be labeled as belonging to blacklisted users, applicants, or rejected users. A user's calls with differently tagged numbers reflect the user's calling habits and lifestyle.
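
A sketch of the time-window part of this feature engineering, assuming a call-record DataFrame with hypothetical `call_time` and `duration` columns and a given observation date:

```python
# Window-based call features from raw operator records; column names assumed.
import pandas as pd

def call_window_features(calls, obs_date, windows=(7, 30, 90, 180)):
    feats = {}
    for days in windows:
        recent = calls[calls["call_time"] >= obs_date - pd.Timedelta(days=days)]
        feats[f"call_cnt_{days}d"] = len(recent)
        feats[f"call_dur_{days}d"] = recent["duration"].sum()
    # Time-of-day split: share of calls made late at night (00:00-06:00).
    night = calls["call_time"].dt.hour < 6
    feats["night_call_ratio"] = night.mean() if len(calls) else 0.0
    return feats

calls = pd.DataFrame({
    "call_time": pd.to_datetime(["2023-06-01 23:10", "2023-06-20 02:15", "2023-03-02 14:00"]),
    "duration": [120, 35, 600],
})
print(call_window_features(calls, pd.Timestamp("2023-06-30")))
```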

17. Stepwise regression

When the relationships among the independent variables are complex and it is difficult to decide which variables to keep, stepwise regression can be used to screen variables. The basic idea of stepwise regression is to introduce variables into the model one at a time. Each time a variable is introduced, an F test is performed, and t tests are performed on the variables already selected. When a previously introduced variable is no longer significant after later variables are added, it is removed, ensuring that only significant variables are included in the regression equation before each new variable enters.
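
A simplified forward-selection sketch using p-values with statsmodels on synthetic data; a full stepwise procedure would also re-test and drop previously selected variables that lose significance, which is omitted here.

```python
# Forward selection by p-value as a simplified stand-in for stepwise regression.
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=8, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(8)])

def forward_select(X, y, alpha=0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            model = sm.Logit(y, sm.add_constant(X[selected + [col]])).fit(disp=0)
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:       # no remaining variable is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(X, y))
```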

18. In logistic regression, why is feature combination (feature crossing) often done?

Logistic regression is a generalized linear model; feature crossing introduces nonlinear features and improves the expressive power of the model.
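
A small sketch of adding pairwise crossed (interaction) features before logistic regression, using scikit-learn's PolynomialFeatures on synthetic data:

```python
# Pairwise interaction features feeding a logistic regression.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_classification(n_samples=3000, n_features=6, random_state=0)
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print(model.score(X, y))
```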
