Landslide susceptibility assessment using statistical models: A case study in Badulla district, Sri Lanka

. Landslides are the most recurrent and prominent natural hazard in many areas of the world which cause significant loss of life and damage to properties. By generating landslide susceptibility maps, the hazard zones can be identified in order to produce an early warning system to reduce the damage. In this study, the predictive abilities of two statistical models, Logistic regression (LR) model and Geographically Weighted logistic regression (GWLR) model, were compared. As a case study, a data set collected for nine relevant causative factors over the period from 1986 to 2014 was taken from Badulla district, Sri Lanka, which is highly affected by landslides. The performance of each model was tested by using the Area under the curve (AUC) value of Receiver operating characteristic curve (ROC), and the GWLR model was selected as the best fitted model. The probabilities obtained for each pixel in the study area using the selected model were classified into three classes (Low, Medium and High) based on standard deviation method in GIS software. Finally, a landslide susceptibility map was generated related to the above three classes, from which high risk areas can be identified to take necessary actions.


INTRODUCTION
Landslide is defined as a "movement of mass of soil or rock down a slope" (Courture, 2011), and it is a common natural hazard in many areas of the world.Annually landslides cause many fatalities and property losses each year.In addition, the damages can also caused on the environment due to loss of wildlife and forest cover.However, the damages to ecosystems due to landslides cannot be easily estimated.Therefore, landslide susceptibility analysis and mapping are important to identify the landslide hazard zones, and risk management can be practiced to reduce the destruction to some extent.
When identifying the factors which are responsible for landslides, slope aspect, slope angle and elevation of a location can be considered as some key natural factors.Slope aspect is the direction of a slope which identifies the steepest downslope direction at a location.Several researchers have shown the relationship between aspect and landslide occurrence (DeGraff andRomesburg, 1980, Marston et al., 1998).Also, according to the knowledge of experts, the probability of occurrence of landslides is high when the slope angle is in between 15 and 45 and it is relatively low when it is more than 45 .Moreover, landslides may occur in flat areas as well due to excessive pressure placed on the respective area.The elevation also affects the landslide occurrences, since it determines the severity of climate condition in mountain regions.
Slope curvature is another triggering factor, which measures the rate of change of slope.Using profile and plan curvature, the slope curvature can be identified.The profile curvature is the rate of change of slope parallel to slope gradient.A negative value of profile curvature implies that the surface is upwardly convex, a positive profile conveys that the surface is upwardly concave, and a value of zero indicates that the surface is flat.Plan curvature is the rate of change of slope perpendicular to slope gradient which influences convergence and divergence of flow.In this case, a positive value implies that the surface is upwardly convex whereas a negative value indicates the surface is upwardly concave.Both profile and plan curvature are two key factors responsible for landslides since these factors affect the Ceylon Journal of Science 46(4) 2017: 27-41 DOI: http://doi.org/10.4038/cjs.v46i4.7466acceleration and deceleration of flow which influences erosion.
Another key factor of landslide occurrence is the landuse type due to human activity.The data related to landuse can be extracted from satellite images, and interpreted visually, supported by supervised classification.Theoretically, barren land and cultivations are more prone to landslides.Further, the forest areas tend to decrease landslide occurrences due to the natural protection provided by the thick vegetation cover.
Landslides may also occur on the slopes intersected by roads.The cutting toe of the steep slope, filling along the road and frequent vibrations caused by vehicles are some human activities in mountain areas that raise the landslide susceptibility.According to Lee et al. (2005), the potential of landslides also increases with a decrease in distance to rivers.Therefore, the distance to streams is another factor to be considered in the susceptible landslide analyses.
Many qualitative and quantitative methods are available to analyze data related to landslides.The different approaches considered in the literature can be identified as a heuristic, physically-based modeling, probabilistic and statistical.The heuristic approach is based on the opinion of geomorphologic experts.Experts define the weighting value for each causal factor in this method.Determining exact weighting values may cause incorrectness of results since the method involves subject-oriented process.In physically based modeling approach, slope stability model is obtained by calculating safety factor, and it is used to determine the landslide hazard assessment.The probabilistic approach is conducted by considering remote sensing data, field observations, data collected from an interview, and results based on historical analysis.An obstacle to this method is that the results do not present estimation of temporal changes in landslide distribution.In the statistical approach, the combinations of causal factors are statistically determined and then forecasts future landslide areas based on past landslides.Weights of evidence method, Information value method, Frequency ratio method, Logistic regression analysis and discriminant analysis are some of the methods that have been used under this approach.
In recent years, there have been many studies of landslide hazard mapping using Geographic Information System (GIS), and most of these studies have applied probabilistic methods.Lee et al. (2005) have carried out research on landslide susceptibility mapping in Baguio City, Philippines using the frequency ratio method, logistic regression and GIS.The study was aimed to determine a relationship between landslide occurrences and conditioning factors, rating the predictors, and then to generate the landslide susceptibility map using the selected model and GIS.They identified landslide locations using aerial photographs and field surveys.The factors that they have considered were the slope, aspect, curvature, the distance from drainage, lithology, the distance from fault and land cover.The results were finally validated using known landslide locations, and it was found that the logistic regression model had higher predictive value (80 %) than that of the frequency ratio method (78.4%).
Modeling of road related and non-road related landslide hazard for a large geographical area was carried out by Gorsevski et al. (2006) by using logistic regression and receiver operating characteristic (ROC) analysis.The variables selected for this study were elevation, slope, aspect, profile curvature, plan curvature, tangent curve, flow path length, and upslope area.The landslide data set was subdivided to road related, and non-road related landslides and two separate logistic regression models were obtained and compared.It was noted that spatial prediction and predictor variables for those models were different.Using ROC method, the performance of the predictive rule of two models was measured.Dieu et al. (2012) have conducted research on landslide susceptibility assessment in Vietnam by using three methods; data mining approaches, the support vector machines (SVM), and Decision Tree (DT) and Naïve Bayes (NB) models.They have used a landslide inventory having 118 landslides records, and then partitioned it randomly into 70% as training dataset and remaining data to validate results.Landslide susceptibility index was calculated by considering slope angle, slope aspect, relief amplitude, lithology, soil type, land use, distance to roads, distance to rivers, distance to faults and rainfall as landslide causal factors.
Finally, based on validation results and landslide susceptibility maps, it was shown that SVM is a powerful tool for landslide susceptibility assessment with high accuracy with compared to other two methods.
The main objective of this study is to assess and evaluate the landslide susceptibility using two types of statistical models; logistic regression (LR) model, which is mostly used in literature, and Geographical Weighted logistic regression (GWLR) model to address the spatial variability, respectively.These two methods were applied to a dataset gathered from Badulla district, Sri Lanka.

Study Area
Sri Lanka is situated in Asia-Pacific region, and is located at the northern latitude of 5 0 55" and 9 0 51" and eastern longitude of 79 0 41" and 81 0 53", having an area of 65,610 km 2 .There are 25 districts organized into 9 provinces in the country.According to the statistics, Asia-Pacific region is considered as one of the most disasterprone regions in the world (Report on disaster mitigation in Asia and the Pacific, 1991), since landslides in Sri Lanka spread over about 30% of land area and into several districts namely Badulla, Nuwareliya, Matara, and Kandy etc.
In 1986, 51 lives were lost and approximately 10,000 families were displaced in Sri Lanka as a result of landslides.In a most recent incident occurred in 2014, 13 died, and 1,875 were affected in Meeriyabedda in Badulla district.These natural disasters cause major threats to the country"s economy through significant losses of life, infrastructure and crops, etc. in every year.The National Building Research Organization (NBRO) in Sri Lanka is currently preparing maps which illustrate the distribution of landslides prone districts via Geographic Information System (GIS).To draw these maps they use qualitative methods which assign weights for causative factors on expert"s knowledge.Figure 1 shows the spatial distribution of landslides in Sri Lanka from 1974 to 2008.According to this figure, it is clear that Badulla district is more prone for landslides compared to other districts in Sri Lanka.Therefore, Badulla District was selected as the study area which is situated in 6°59"05"N and 81°03"23"E, and encompasses approximately about 2861km 2 area.Fifteen district secretariat (DS) divisions (Bandarawela, Uvaparanagama, Ella, Welimada, Hali-Ela, Passara, Haldummulla, Lunugala, Kandaketiya, Haputhale, Soranathota, Badulla, Meegahakivula, Ridimaliyadda, Mahiyangana) are in this district, and cultivation is the predominant occupation of the people in the area.

Data Collection
The landslide location map [Figure 2  Based on the classification of NRMC, five types of landuse, namely cultivated areas, builtup areas, bare areas, rock lands and other types, and 11 soil types were identified in the study area (Table 1).
Using 1:50000 contour map of the study area, the 10 m DEM (Digital Elevation Model) was created first using the Arc GIS software.Secondly, with the help of the DEM, the elevation, slope degree, curvature and slope aspect maps were obtained using GIS tools.Figure 3 showed three of these factor maps (elevation, slope, aspect).Distances to roads and streams were obtained using road network map, stream map and buffer tool in Arc GIS.Then, 1:50000 landuse map and 1:50000 soil map were rasterized.Finally, the landslide occurrence data set was created by overlaying all factor maps with location map using Arc GIS software, and added a new variable to that data set as landslide occurrence which has value 1 for the landslide starting points.
As a control, a similar number of locations were selected randomly as landslide nonoccurrence points from each DS divisional areas where no landslides have been rcorded in the past by using random point generation GIS tools.Landslide occurrence value is set to 0 for those points to indicate non-occurrence.
The final data (504 observations) set was obtained by combining the above two data sets in arc GIS platform, and the variable landslide occurrence was taken as the response variable which is a binary variable, and the causative factors (variables); slope (degree), aspect (degree), elevation (degree), profile curvature (degree), plan curvature (degree), distance to roads (meters), distance to streams (meters), landuse types, and soil types were taken as predictors variables.Among the above variables landslide occurrence, soil types and landuse types are categorical and the rest of the variables are continuous.For model building process 75% of the data were chosen, and remaining data were used for model validation.

Statistical Methods
First, the preliminary analysis was carried out to understand patterns and trends of the data.Graphical presentations were used to identify the distribution of the occurrence of landslides among each factor.The correlations among continuous causative factors were obtained by using Pearson Correlation test.All continuous factors were converted to categorical factors and then the relationship between landslide occurrence and each causative factor was assessed using Pearson Chi-square test.The Moran's I test is used to identify the spatial correlation of a location with respect to the other locations comes under a continuous variable.For a given variable , the Moran"s I is defined as (1) where is an observation for a variable for DS division , is the mean of the variable for all DS divisions, n is the total number of DS divisions and is an element of a matrix of spatial weights among th and th locations.Moran"s I is ranged from -1 to +1 (Saefuddin et al., 2012) hypothesis of this Moran"s I test is that there is no spatial correlation.In this study, two types of statistical models were fitted and their performances were compared.The first model is the Logistic Regression (LR) model for which the predictor variables are assumed to be uncorrelated.R software was used for this analysis.If a spatial correlation exists among predictor variables the second model, the Geographical Weighted logistic regression (GWLR) (Brunsdon et al., 1996) can be used.In both models the response variable is a binary variable which represents the occurrence or non-occurrence of landslides.
The LR model is given by where is the probability of occurrence of landslides, is the th predictor variable, is the intercept, and is the coefficient of the th predictor variable.The maximum likelihood (ML) method was used to estimate unknown coefficients.The ML estimator for coefficients can be obtained by maximizing the likelihood function.The significance of each estimated coefficient was assessed using Wald test at a given significance level.To select the best fitted model the backward elimination method was applied by following Gorsevski et al. (2006).
The Geographical Weighted logistic regression (GWLR) model given below is used to incorporate the spatial correlation.
(3) where and are local model parameters specific to a location at ( ) coordinate.According to this method, separate models are fitted for each selected location, and model parameters are defined as specific to the location.When estimating parameters at a particular location, a weight is given to each data point in the sample such that the observations near to that location are given greater weight than observations further away.
Several methods are available to determine weighting function of GWLR for estimating parameters (Saefuddin et al., 2012).The weighting function used in this study is called the spatially adaptive weighting scheme (Fotheringham et al., 2002), and it is defined as follows: (4) where is the distance between locations and , and is the bandwidth.To select the optimum bandwidth, the Corrected Akaike Information Criterion (AICc) is adapted.
According to Fotheringham et al. (2002), the local t test shown below assesses the spatial variability of the predictor variables.The test statistics of the local t test is (5) where is the standard error of for the j th parameter estimates.According to this test the local parameter estimates of GWLR model has a significant spatial variation when > 2 and/or <-2.The performance of LR and GWLR models were assessed and compared using Area Under the Curve (AUC) value of Receiver Operating Characteristic Curve (ROC).The ROC curve was plotted for the probability of false (correctly predicted event) versus the probability of a false positive (falsely predicted), and it is used to visualize the performance of a binary classifier which summarizes its performance to a single measure.AUC value is ranged from 0 to 1.When AUC value is close to1 the model under the study can be interpreted as a suitable model for predictive purposes.Figure 4    Based on AUC of ROC curve the best model was selected, and then using GIS, for each pixel in the study area the probability of landslide occurrence was calculated.The Standard deviation classification method (Esri, 2015) in Arc GIS was used to obtain susceptibility classes to draw maps.In this method first the mean () and standard deviation () of predicted probabilities, related to the events of landslide occurrences, are taken to obtain three susceptibility classes (low, medium, high).Class breaks are created with equal ranges.According Esri (2015), these ranges are usually based on intervals of 1, ½, ⅓, or ¼ standard deviations using mean values and the standard deviations from the mean.Here it was considered ½ standard deviation.Finally, the landslide susceptibility map was drawn for the study area using these classified probabilities.

Preliminary Analysis
Landslide frequencies of the corresponding locations in each DS division from 1986 to 2014 were obtained (Table 2).According to the results, the highest frequency was recorded in Bandarawela DS division, followed by Uva Paranagama DS division, whereas the Mahiyanganaya DS division was free from landslide incidences during this period.
According to the mean and standard deviation of each continuous causative factor of landslide occurred locations (Table 3), it is clear that the Landslide occurrence was widely spread in elevations ranged between 1000m and 1250m, and slopes in between 10 and 15 in the study area.Further, landslides were highly occurred in areas with 0.0 profile and plan curvature, and 50 slope aspect.Moreover, it can be identified that locations which are < 200m away from roads and <1500m away from streams recorded high frequency.
Also, cultivated areas (52.17%) and builtup areas (30.44%) showed high percentage frequency (Table 4) of landslide occurrences.Since there is no deep root to hold the soil, soil mass may move in cultivated lands, and it may cause landslides.
When considering the distribution of soil types in landslide occurred areas (Table 5), it can be noted that Soil type 7 (Red-Yellow Podzolic soils in steeply dissected hilly and the rolling terrain) and Soil type 9 (Red-Yellow Podzolic soils & Mountain Regosols in mountainous terrain) showed the highest (52.17%) and the second highest (38.7%) frequency of landslide occurrences, respectively.
The correlation test for continuous variables (Table 6) has indicated that Slope and Elevation, Elevation and Distance to streams, Slope and Distance to streams, Elevation and Distance to roads, and Profile curvature and Plan curvature are significantly (p-value < 0.05) correlated.Further, the variance inflation factors (VIF) of the continuous causative variables were obtained to identify whether there is any effect of multicollinearity.Since all VIF values are small (< 2) a remedial measure for multicollinearity is not necessary.The Chi-squared test was performed to identify the relationship between the landslide occurrence with the related factors, and it was found that significant (p-value < 0.05) relationships exists between landslide occurrence with Soil type, Elevation, and Slope.Then Moran"s I test was performed (Table 7) to identify a spatial correlation among the continuous factors of data, and the results showed that the relevant variables have significant (p-value< 0.05) spatial correlation.

Model fitting
First, the LR model was fitted using backward elimination (Table 8), and it was noted that the parameters of elevation, slope and distance to roads are significant (p < 0.05).The fitted model with estimated parameters is given below. (6) where is a binary variable such that for non-occurrence of landslides and for occurrence of landslides.Note that when elevation and slope increase, the probability of landslide occurrence also increases since elevation and slope are positively related to the response variable .According to the odds ratio, it was noted that when one unit of the elevation increases, odds of a landslide occurrence increases by 64.7%.Moreover, it can be identified that odds of landslide occurrence increases by 29.6% when one unit of the slope increases.Further, it was shown that the variable distance to roads negatively affects, and as increasing of one unit of the distance to roads, odds of landslide occurrence decreases by 80.9%.
Since the causative factors have significant (p-value < 0.05) spatial correlation according to the Moran"s I test, the GWLR model was fitted to incorporate this spatial relationship by using GWR4 software.Using this method a local model for each location in the sample was fitted by considering the neighbourhood area, and the minimum AICc criterion was used to obtain the optimal bandwidth size which is 102.0 m for the selected model.
The final GWLR model obtained by taking the average of coefficients with significant parameters (Table 9) is given below: (7) In the above equation, elevation, slope and distance to roads have higher coefficients with compared to the other variables which indicate that these three variables have high impact with landslide occurrences.To assess whether the spatial variation in the measured relationship is substantial, the geographically variability t-test (Table 10) was applied to each estimate of parameters of the selected model.
Note that the values of the test statistics from elevation up to Plan curvature are less than -2 and from Profile curvature to Distance to streams are greater than 2. This implies that there exists a significant spatial variation in each coefficient.Now, to assess the predictive ability of each model using test data set, the AUC value of ROC was obtained by plotting false positive rate (fpr) versus true positive rate (tpr) for the LR model (Figure 5  Based on the AUC value of ROC the landslide occurrence and non-occurrence can be classified only 64.2% with the logistic regression model whereas it is 73.5% for the GWLR model. According to the literature (Gorsevski, 2006), the AUC value of ROC is considered as a good classifier if the value is closer to 1. Therefore, GWLR model is selected as the best classifier among the two models.

Generating landslide susceptibility map for the study area
The probabilities of landslide occurrences (Y=1) were obtained by using GWLR model for each pixel in the study area, and these values were then exported to the GIS software.Using the standard deviation method (see statistical methods section) available in the GIS, the probabilities were used to classify the susceptibility classes (low, medium, high).The range of probabilities for each susceptibility class is given in Table 11.
Finally, the landslide susceptibility map (Figure 6) for Badulla district was drawn using the susceptibility class of each pixel.According to this map, it can be identified that certain areas in Badulla, Uva Paranagama, Lunugala and Hali Ela DS divisions are highly susceptible for landslides.

LIMITATIONS
In this study, only the terrain variables were considered as factors of land slides.However, there are other factors such as precipitation, earthquakes and geological structure etc. which could be considered.Due to the cost and lack of available data, those factors were not considered in the present study.

CONCLUSION
This study reveals a significant positive relationship between elevation and slope, and a significant negative relationship among the variables distance to roads with landslide occurrences.Two statistical models were fitted to predict the probability of landslide occurrence for each location in the study area.When comparing the performance of resulted models using AUC of ROC value, it was identified that GWLR model is a good classifier than the LR model.According to the standard deviation and range of each estimated parameter in the GWLR model, it can be concluded that the effect of the causative factors varies with each location.Generated landslide susceptibility map is useful to identify landslide-prone areas, and necessary actions can be taken to reduce the harm before this natural disaster occurs.Also, the fitted GWLR model can be used to identify the probability of landslide occurrences of a certain location in the study area if the measurements of predictor variables in the model are available.Further, when considering risk management and other policies the identified significant factors of the model are useful to reduce the destruction to some extent.

Figure 1 :
Figure 1: Spatial distribution of landslides in Sri Lanka from 1974 to 2008, Source: Sri Lanka National Report on Disaster Risk, Poverty and Human Relationships.
(a)], the soil map, contour map, landuse map, stream map and road network map [Figure 2 (b)] were collected form the NBRO, Survey Department and Natural Resource Management Centre (NRMC).The starting points of landslides based on the sample location map in the study area from 1986 to 2014 were used for the analysis.The maps including soil, contour, landuse, stream and road network maps were as of 2014.By following Dieu et al. (2012), the values for the predictor variables for each of the landslide starting points have been obtained based on the current environment.

Figure 2 :
Figure 2: Landslide location map and road network map for Badulla district.
. Negative values of Moran"s I indicate negative spatial autocorrelation and positive values indicate positive spatial autocorrelation.Further, a zero value indicates a random spatial pattern, which indicates the spatial independence of the corresponding variable.The null Aspect map Plan curvature map Slope map shown below shows an example of AUC of ROC curve.

Figure 4
Figure 4(a) depicts the ROC curve of an almost perfect classifier where the performance curve almost touches the "perfect performance" point in the top left corner.The performance of model related to Figure 4(b) is higher than that represents in Figure 4(c).

Figure 5 :
Figure 5: ROC plots for fitted LR and GWLR models.

Table 1 :
Soil types identified in the study area.

Table 2 :
Percentage frequency of landslides occurred locations in each DS division from 1986 to 2014.

Table 3 :
Means and standard deviations of continuous variables of landslide occurred locations in each DS division.

Table 4 :
Frequency of Land use types of landslide occurred locations in each DS divisions.

Table 5 :
Number of landslide locations related to each Soil type for DS divisions

Table 7 :
Moran"s I test statistics for continuous variables.

Table 8 :
Significance and Odds ratios of coefficients in LR model obtained by backward elimination method.

Table 9 :
Summary statistics for coefficients of GWLR model.

Table 10 :
Test statistics for measuring Geographically variability.