Part 2 in a series on understanding how to evaluate IAM solutions – let’s dive into contexts and data science!
In my last blog we covered the basic rules that can be used to evaluate IAM solutions that leverage User Behavior Analytics (UBA) and Machine Learning to adapt to risk. As a refresher, here are the four rules of thumb, which we briefly covered in that blog post:
- The more contexts the tool can collect data from, the higher its fidelity and reliability. If you have garbage going in, you will have garbage coming out.
- Automation is key: how efficiently and seamlessly the tool stitches together data and processes from the various sources and contexts.
- Automation goes hand in hand with orchestration: how well the tool interacts with other tools and workflows.
- UBA-based continuous authentication must lead to a better end-user experience!
In this blog we’ll dig deeper into the first rule, which is all about collecting data from the right sources and then leveraging data science to pick the right features and transformation techniques. The ultimate goal is to design a solution that leverages UBA to detect risk and further bolster access security. As we discuss these topics, you will quickly realize that the entire process is as much an art as it is a science, and that it varies from domain to domain and dataset to dataset. So, let’s dive in!
From thirty thousand feet, the world of access may seem simple and straightforward: an entity tries to access a resource that may be on-prem or in the cloud. The other components at play only become apparent once you start zooming in and analyzing how each one affects the riskiness of a given access session for a particular user<->resource combination. Let’s zoom in and identify some of those other pieces of the puzzle, which help stitch together the many moving parts into a model for collecting an optimized set of data and establishing users’ behavior profiles. These profiles can then be augmented with additional intelligence sources, and subsequent user actions and behaviors can be compared against them to flag anomalous activities and determine risk. The figure below captures an over-simplified version of the whole process:
Getting the right data drives the various contexts needed to piece together the behavior profiles of the subjects. Let’s start with the obvious question: how do we define a user? This is where asking the right questions starts to matter, and where art and science begin to converge.
- Is this a B2E (aka workforce), B2C, or B2B use case? For this blog, let’s assume the workforce use case.
- What’s the nature of the workforce in terms of geographical distribution, remote vs office employees?
- What kinds of resources are being accessed and protected by the access management (AM) solution?
- What kind of policy does the enterprise have for mobile devices and end points in general? (For example, BYOD or not?)
- How does the enterprise grant network access to on/off-prem apps? At what layers are various network and app perimeters protected?
- Is the enterprise employing the services of a CASB to monitor fine-grained aspects of access to apps in the cloud?
- Does the enterprise have DLP solutions in place to detect and prevent data leaks by malicious or compromised users, whether intentional or not?
And so on and so forth. You get the drift.
More importantly, why are we asking these questions? In data science terms, the answer is that they help us extract the right features from a given dataset and derive its various characteristics. So, let’s slice and dice those questions.
Question one helps gauge the scale of a typical dataset, which goes into determining the size of the training set and extracting mathematical distributions for the various access events, such as login activity, app-launch activity, common factors, etc. It is always best practice to gather a healthy dataset and then analyze it for distributions and run statistical tests, such as a normality test.
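To make the normality-test idea concrete, here is a minimal sketch using synthetic login counts. The data, the feature (daily logins per user), and the tolerance thresholds are all illustrative assumptions; a production pipeline would run a formal test such as Shapiro-Wilk on real access events.

```python
import numpy as np

# Synthetic stand-in for a real behavioral feature, e.g. daily login counts.
rng = np.random.default_rng(42)
daily_logins = rng.normal(loc=8, scale=2, size=500)

def rough_normality_check(x, skew_tol=0.5, kurt_tol=1.0):
    """Rough normality check via sample skewness and excess kurtosis.
    (Thresholds are illustrative; a real pipeline would use a formal test.)"""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    skew = np.mean(z ** 3)           # ~0 for normal data
    excess_kurt = np.mean(z ** 4) - 3.0  # ~0 for normal data
    return abs(skew) < skew_tol and abs(excess_kurt) < kurt_tol

print(rough_normality_check(daily_logins))
```

Knowing whether a feature is roughly normal (or heavily skewed, like inter-login gaps tend to be) guides which transformations and models are appropriate downstream.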
Question two determines the importance and variance of location, time, and context, ultimately determining the weight of geolocation in defining a user’s behavior. Below is a geo-distribution for an example dataset.
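A per-user geo-distribution like the one above can be summarized in a few lines of pandas. This is a sketch on synthetic data; the column names (`user`, `country`) are assumptions, not a real event schema.

```python
import pandas as pd

# Synthetic login events: which country each login came from.
events = pd.DataFrame({
    "user":    ["alice"] * 4 + ["bob"] * 4,
    "country": ["US", "US", "US", "IN", "DE", "DE", "DE", "DE"],
})

# Per-user distribution of login locations, normalized to proportions.
geo_profile = (events.groupby("user")["country"]
                     .value_counts(normalize=True)
                     .rename("share"))
print(geo_profile)
```

A user whose logins are 100% from one country makes geolocation a strong behavioral signal for them; a road warrior spread across many countries dilutes its weight.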
Question three determines the nature of the work that the users perform and how that can then be used to detect any anomalous behavior.
Question four establishes the role of device fingerprinting in defining users’ behavior; based on enterprise policy, we can determine how basic or rich the information we can gather about the various endpoints will be.
Question five guides the role of network fingerprinting in discovering and profiling users’ behavior and preferences, and will also help us determine the sources we could go to in order to collect that information.
Questions six and seven determine the role of gathering external threat intel from sources that compute risk based on complementary factors, such as application sessions and activities related to data access. For example, CASBs can give us richer visibility into application sessions and the risk associated with them.
After doing the due diligence of slicing and dicing, the result is a data frame with multiple instances, capable of defining all the key aspects potentially required to identify the behavior of each member of the workforce accessing resources.
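A hedged sketch of what such a data frame might look like: one row per access event, one column per context. All column names and values are illustrative assumptions, not an actual product schema.

```python
import pandas as pd

# Each row is one access event; each column is one context gathered above
# (identity, resource, geolocation, device, network, time, auth factor).
access_events = pd.DataFrame([
    {"user": "alice", "app": "crm", "geo": "US", "device": "managed-laptop",
     "network": "corp-vpn", "hour": 9,  "mfa_factor": "push"},
    {"user": "alice", "app": "crm", "geo": "US", "device": "managed-laptop",
     "network": "corp-vpn", "hour": 10, "mfa_factor": "push"},
    {"user": "alice", "app": "hr",  "geo": "RU", "device": "unknown",
     "network": "public-wifi", "hour": 3, "mfa_factor": "password"},
])

# Together, a user's rows form the raw material for their behavior profile;
# the third row already looks anomalous relative to the first two.
print(access_events.dtypes)
```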
Now let’s look at how to go about getting these contexts and from which potential sources.
From the Right Sources
Identifying the right sources for these contexts is key since that will determine such things as:
- The attributes associated with the contexts and how coarse- or fine-grained they are.
- The frequency at which you are able to collect this data.
- The scale and performance of the collection system, which depend on whether you generate this data yourself without any external/3rd-party dependency or integrate with an external system (such as a partner or a syslog server, for example).
The figure below captures some of the potential sources from which a UBA-based AM solution may gather the relevant data to extract the contexts mentioned above.
Cleaning the Data:
This is by far the most crucial step in the entire data-processing workflow and will decide the reliability and efficiency of the model. It is a multi-step process that requires you to become completely familiar with your data and know such things as:
- The type of each variable (continuous or categorical). Check out this article for more information.
- If categorical, you will have to encode it with one of the many techniques out there. This is a critical step in data processing, and careful consideration must be given to the pros and cons of the various encoding techniques. Check out this article to whet your appetite on this topic.
Below is an illustration of a subset of data that Idaptive collects, which requires encoding.
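As a minimal encoding sketch, here is one-hot encoding of two categorical context columns with pandas. The column names and values are illustrative assumptions; as noted above, a real pipeline must also weigh alternatives (ordinal, target, hashing encoders) and their trade-offs, such as dimensionality blow-up for high-cardinality columns.

```python
import pandas as pd

# Two categorical context columns from hypothetical access events.
events = pd.DataFrame({
    "device_os": ["Windows", "macOS", "Windows"],
    "network":   ["corp-vpn", "public-wifi", "corp-vpn"],
})

# One-hot encoding: each categorical column becomes one 0/1 indicator
# column per observed category.
encoded = pd.get_dummies(events, columns=["device_os", "network"])
print(encoded.columns.tolist())
```

Note that one-hot encoding grows the column count with the number of categories, which is exactly the kind of dimensionality pressure the next section starts to address.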
As you can see, what started as a seemingly simple high-level architecture quickly became more complex as we added more dimensions to the problem at hand. Now we can start talking about reducing the complexity introduced by all those dimensions. In data science parlance, that’s called dimensionality reduction, and it is one of the most artistic and scientific steps all at once. We’ll look into that in the next blog of this series, where we’ll have fun with feature selection and dimensionality reduction, leveraging techniques such as Principal Component Analysis (PCA).