Jupyter Notebooks #
Jupyter Notebooks are an interactive way to run Python code alongside rich text documentation. This means we can methodically execute code, track the actions taken, and recreate events. While they were originally aimed mainly at data scientists, as you can imagine, this is pretty appealing to security analysts.
It doesn’t matter whether you’re a threat hunter or a forensic analyst - collecting artefacts from multiple sources and correlating them is a pretty common process. Jupyter Notebooks allow us to do this, record the results and potentially manipulate our data to extrapolate additional information.
There are additional benefits, such as being able to repeat the steps against other datasets, visualise our data, and use the notebook as a detailed log of the actions taken.
Getting Started #
I won’t go into too much detail about this; there are plenty of guides out there to get people started, whether that’s getting access to a JupyterHub instance, spinning up a Docker container, or running locally. At its most basic, a local instance only needs a few Python packages installed and the following commands:
$ python3 -m venv venv
$ source venv/bin/activate
(venv)$ python3 -m pip install jupyter jupyterlab pandas numpy
(venv)$ jupyter lab
Data Processing Pipeline #
Once we have data we need to organise it so we can query it. This is typically called the Data Processing Pipeline.
Acquisition #
We can acquire data from many sources: maybe we’re scraping a web API, maybe we’re querying a SIEM, or maybe we’re simply ingesting a CSV of data. For this scenario I’m using a dataset generated by the Open Threat Research community, which you can also download from the Security-Datasets GitHub page.
I’ve created a Demo Notebook that can be found here so that you can follow along and see what I mean in action.
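As a minimal sketch, assuming you’ve downloaded and extracted one of the Security-Datasets archives to a local JSON file (the filename below is purely illustrative), loading it into a pandas DataFrame can look like this:
import pandas as pd

# The extracted Security-Datasets files are newline-delimited JSON;
# the path here is a placeholder for whichever dataset you grabbed
df = pd.read_json('psexec_dataset.json', lines=True)
df.head()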
Cleansing #
Unfortunately our data may have extra fields, or may not be in a form that’s useful to us. This means we initially need to import and clean up our data to get it into a nice pandas DataFrame, or change data types so that we can better leverage the built-in functions. A great example is a timeline: if we’re dealing with logs we may want to filter on specific periods of time. A pandas DataFrame may have a column for the timestamp, but this will typically be a string field rather than a native datetime type.
After importing, if we inspect the TimeCreated field we can see the following:
>>> df['TimeCreated'].head(5)
0 2020-10-19 03:30:41.104
1 2020-10-19 03:30:45.323
2 2020-10-19 03:30:45.324
3 2020-10-19 03:30:46.251
4 2020-10-19 03:30:46.330
Name: TimeCreated, dtype: object
>>> df.TimeCreated.dtype
dtype('O')
We can filter down using strings, and that will work:
>>> df[(df['TimeCreated'] > '2020-10-19 03:30:40') & (df['TimeCreated'] <= '2020-10-19 03:30:42')]
>>> df[df['TimeCreated'] == '2020-10-19 03:30:41.104']
But if we convert the field to a datetime object and set it as our index, we now have our events sorted by time, and filtering can be done slightly more easily (but, more importantly, programmatically).
>>> df.index = pd.to_datetime(df['TimeCreated'],format="%Y-%m-%d %H:%M:%S.%f").dt.tz_localize('UTC')
>>> df.index.dtype
datetime64[ns, UTC]
>>> from datetime import time
>>> df.loc[time(3,30,40):time(3,30,42)]
Transformation #
There’s a huge advantage to using a pandas DataFrame, and that is the ability to transform it in multiple ways, including mathematically using NumPy, without risk of altering the raw data underneath. This comes at the cost of memory for large datasets unfortunately, but we can apply functions to carve out more specific subsets of the data and create new, easier-to-manage dataframes.
We can then overlay these new smaller dataframes with more detailed information from other sources to help augment our data.
Taking our previous example, using our newly created time-based index, we can take a slice of the period of time we’re interested in:
from datetime import date, time, datetime, timezone

day = date(2020,10,19)
starting_time = time(3,30,40)
ending_time = time(3,30,47)
start_time = datetime.combine(day, starting_time).replace(tzinfo=timezone.utc)
end_time = datetime.combine(day, ending_time).replace(tzinfo=timezone.utc)
df_interesting_events = (df.index > start_time) & (df.index <= end_time)
df.loc[df_interesting_events,'interesting'] = True
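To illustrate the overlay idea mentioned above, here’s a hedged sketch that joins the flagged events against a small, made-up lookup of host details. The df_hosts values and the Hostname column name are assumptions for illustration, not part of the original dataset walkthrough:
# Hypothetical host-ownership lookup; the values and the 'Hostname' column
# name are invented for this example
df_hosts = pd.DataFrame({
    'Hostname': ['HOST01.example.local'],
    'Owner': ['IT Operations'],
    'Criticality': ['High'],
})

# Left-join the flagged events against the lookup to augment them with context
df_augmented = df[df['interesting'] == True].merge(df_hosts, on='Hostname', how='left')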
Analysis - Masks #
We have used this functionality a little and hinted at it already, but a powerful feature of dataframes is the ability to filter. We’re not limited to filtering based on a range of data. We can programmatically create a pandas Series that reflects the rows of a DataFrame we’re interested in. Think of it as a list of True and False values saying whether we want to see each row or not. Because this is just a list of the values we’re interested in, we can use Boolean logic to create more elaborate filters, or join them together in complex ways.
We can use a few different methods to specify the mask of interest. Below we have created three masks: one that identifies PsExec in the ImagePath field, one that looks for EventID 7, and one that looks for AccountName set to LocalSystem.
Then we can specify that we want to see rows where PsExec appears or where the account is LocalSystem, but not where the EventID is set to 7:
mask_psexec = df['ImagePath'].str.contains('PSEXESVC', na=False, case=False)
mask_event_id_7 = df['EventID'].astype(str) == '7'
mask_interesting = df['AccountName'] == 'LocalSystem'
df[(mask_psexec | mask_interesting) & ~mask_event_id_7]
Analysis - Grouping #
While we have a timeline of events, we may want to group those events together into buckets. This can be really useful for things like password spraying attacks, where we want to highlight growth above the average for a period.
Below we’ve grouped activity into 3 second buckets, then counted the different EventIDs seen in each. We can see that around the 03:30:45 mark there’s a large number of EventID 7 events that might warrant further investigation.
>>> df['EventID'].groupby(pd.Grouper(freq='3S')).value_counts()
timestamp EventID
2020-10-19 03:30:39+00:00 104 1
1102 1
2020-10-19 03:30:42+00:00 10 14
13 5
11 1
12 1
2020-10-19 03:30:45+00:00 7 130
12 13
13 12
10 10
4658 8
17 7
18 7
11 6
1 4
5 4
4656 4
4663 4
4688 4
4689 4
4690 4
4703 4
5156 4
9 3
...
2020-10-19 03:30:48+00:00 10 14
13 4
22 2
2020-10-19 03:30:51+00:00 13 2
Name: EventID, dtype: int64
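To sketch the password-spraying idea from earlier: bucket failed logons per minute and flag buckets that sit well above the average. This assumes the data contains EventID 4625 (failed logon) events, which this particular demo dataset may not, and the two-standard-deviation threshold is an arbitrary choice:
# Failed logons (EventID 4625 is assumed to be present) counted per minute
failed = df[df['EventID'].astype(str) == '4625']
per_minute = failed.resample('60S').size()

# Flag any minute well above the average rate (the threshold choice is arbitrary)
threshold = per_minute.mean() + 2 * per_minute.std()
print(per_minute[per_minute > threshold])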
Further Analysis #
Then it’s up to you how you want to proceed. You could decide to visualise the data with Matplotlib (or libraries that wrap around it like Plotly and Seaborn) to glean new insights, or group and sort it. You can score it, average it, quantise it and more. There’s no limit to what you can do with a pandas DataFrame.
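As a small example of the visualisation route, and assuming Matplotlib is installed alongside pandas, the 3 second buckets from the grouping step can be plotted directly from the DataFrame (the styling choices here are just illustrative):
import matplotlib.pyplot as plt

# Count events per 3 second bucket using the time-based index and plot them
counts = df.resample('3S').size()
ax = counts.plot(kind='bar', figsize=(10, 4), title='Events per 3 second bucket')
ax.set_xlabel('Time')
ax.set_ylabel('Event count')
plt.tight_layout()
plt.show()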
Storage #
After acquiring, cleaning and then transforming your data, you don’t necessarily want to redo that process on the same data. If, for instance, you had queried multiple APIs, that could take quite a lot of time to execute again. So the sane solution is to store that data for easier retrieval in the future.
The simplest solution is pickle, Python’s built-in serialiser for objects, which can be written to disk.
>>> df.to_pickle('df.pk')
>>> df = pd.read_pickle('df.pk')
>>> print(f"Dataframe consists of {len(df.index)} rows, consuming {df.memory_usage(deep=True,index=True).sum()} bytes of memory")
Dataframe consists of 286 rows, consuming 1416261 bytes of memory
There are other places we can store data, including SQLite databases or more advanced databases, but for basic data a pickle file will do the job.
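If you do want something queryable rather than a pickle, a rough sketch of the SQLite option using pandas’ built-in SQL helpers looks like the following. The file and table names are arbitrary, and this assumes the columns are simple scalar types (nested JSON fields would need flattening or dropping first):
import sqlite3

# Write the dataframe to a local SQLite database (timestamps are stored as text,
# since SQLite has no native datetime type); drop the tz-aware index to keep it simple
conn = sqlite3.connect('events.db')
df.reset_index(drop=True).to_sql('events', conn, if_exists='replace', index=False)

# Read it back later with a normal SQL query
df_restored = pd.read_sql('SELECT * FROM events', conn)
conn.close()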
Advanced #
We’ve only examined the basics to get someone started with a Jupyter Notebook. There’s a great deal more depth you can go into from here, as any data scientist could tell you.
Instead of storing the raw data, we may want to form more complex mappings, such as connecting data through relationships. Graph databases like Neo4j let you store data as nodes and connect them with relationships. You can then query this data using Cypher, much like you would query a traditional SQL database.
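As a hedged sketch of that idea, assuming a local Neo4j instance and the official neo4j Python driver, and reusing the AccountName and ImagePath columns from earlier (the URI, credentials and graph model below are all illustrative):
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create an (Account)-[:EXECUTED]->(Process) relationship per event;
    # this simple graph model is just for illustration
    for _, row in df.dropna(subset=['ImagePath', 'AccountName']).iterrows():
        session.run(
            "MERGE (a:Account {name: $account}) "
            "MERGE (p:Process {path: $image}) "
            "MERGE (a)-[:EXECUTED]->(p)",
            account=row['AccountName'], image=row['ImagePath'],
        )

driver.close()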
We can take this even further by introducing modelling, for example generating probabilities with a state machine to predict the next state or potential outcome, which in turn leads down the pathway of machine learning using libraries like scikit-learn.
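As one small, hedged example of that pathway (using an unsupervised anomaly detector rather than a full state machine), scikit-learn’s IsolationForest can flag time buckets with unusual event volumes; the contamination value here is an arbitrary choice:
from sklearn.ensemble import IsolationForest

# Event counts per 3 second bucket, reusing the time-based index from earlier
counts = df.resample('3S').size().to_frame('event_count')

# Fit an unsupervised anomaly detector; fit_predict returns -1 for outlier buckets
model = IsolationForest(contamination=0.1, random_state=0)
counts['anomaly'] = model.fit_predict(counts[['event_count']])
print(counts[counts['anomaly'] == -1])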
Once you start playing with data, you realise that there’s a lot that can be done with it, opening new pathways to highlight security incidents, automate some components of incident response, or take new approaches to threat hunting.