03 Apr, 2019

Machine Learning vs. Cryptocoin Miners -- Part 2

by Jonn Callahan

This is part 2 of a post I wrote last year on the topic. After finally getting some free time to revisit this project, I return bearing a solution. However, I've also drastically overhauled my methodology, so it's fairly safe to skip the previous post unless you're curious about the kind of pre-processing I'm doing on the Flow Logs themselves.

Brief Recap

As a brief recap, my goal is to consume AWS Flow Logs and determine if a cryptocoin miner is running on the network. However, the title is a bit misleading: I won’t be using any of the deep learning techniques commonly associated with machine learning, such as neural networks. Instead, I’ll be focusing on an algorithmic approach, identifying and quantifying patterns discovered in the data.

For test data, I'm working with 12-24 hours' worth of mining traffic from 12 different miners of varying sizes. I also have a set of non-mining traffic, derived from roughly 1,000 ENIs. Across these ENIs, there are ~82,000 unique uni-directional streams and, after correlating, ~38,000 unique bi-directional streams.

Previously, after massaging all the Flow Logs into workable data, I immediately reduced my eight dimensions of data to just two using principal component analysis (PCA). I realized after the fact that this was a terrible mistake: I did nothing but reduce the amount of information available to me.

Finally, it's worth reiterating the metrics exposed via Flow Logs. All logs are returned in the space-delimited format of:

version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
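
For illustration, here's a minimal sketch of splitting one of those lines into named fields. The helper and field names are my own, and the "-" placeholders that can appear in real logs (e.g. NODATA records) are not handled:

# Minimal sketch: parse one space-delimited Flow Log line into a dict.
# Field names are my own; "-" placeholders in real logs are not handled.
FLOW_LOG_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log_line(line):
    record = dict(zip(FLOW_LOG_FIELDS, line.split()))
    # The metrics we care about later are numeric
    for field in ("srcport", "dstport", "packets", "bytes", "start", "end"):
        record[field] = int(record[field])
    return record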

Additionally, remember that Flow Logs are not your standard 5-tuple pcaps you get from a traditional network monitoring device. Instead, they are aggregates of traffic seen over a ten-minute period. And with that brief preface, let's dig in.

Let’s Make Some Graphs!

So instead of reducing dimensionality via PCA, let’s play with the four metrics that are most likely to reveal a pattern: the number of bytes sent by both the source and destination and the number of packets sent by the source and destination. Specifically, let’s graph them out and see if there are any discernible patterns we can use to build our model.
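
The graphs throughout this post were generated with matplotlib; a rough sketch of the kind of call involved, assuming the four metrics for a single miner are already pulled out into flat lists, looks like:

import matplotlib.pyplot as plt

# Rough sketch only: x and y are flat lists of per-window metrics
# (e.g. source packets and source bytes) for one set of mining traffic.
def scatter_metrics(x, y, x_label, y_label):
    plt.figure()
    plt.scatter(x, y, s=4, alpha=0.5)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.show()

# e.g. scatter_metrics(src_pkts, src_bytes, "source packets", "source bytes")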

Source Packets x Source Bytes

Destination Packets x Destination Bytes

Destination Bytes x Source Bytes

Destination Packets x Source Packets

Each frame in these gifs represents a different set of mining traffic. Just from a quick glance, we can clearly see a correlation across all our metrics; it's strongest between the source metrics (first gif) and a bit weaker between the destination metrics (second gif).

Additionally, several parallel striations seem to emerge within the source metrics. A "honeycomb" pattern also appears within the destination metrics, though it doesn't become apparent until point density gets quite high.

What I found surprising, though, was the presence of points at (0,0). I’m not entirely sure what would cause a miner to send and receive zero packets and zero bytes. It’s possible that the mining pool went down at one point, though I didn’t investigate at all to determine if that was actually the case. Regardless, we have enough to work with.

Attempt 1

There are many out there whose knee-jerk reaction will be to throw linear regression at the problem and call it a day. But I propose an alternate solution: convex hulls.

Convex Hull

The idea is first to define the convex hull that encompasses all our mining data. Then for each set of non-mining traffic we have, determine how many points lie within the hull. Here are the convex hulls for both the source and destination metrics:

Source Metrics Convex Hull

Destination Metrics Convex Hull
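
To make the approach concrete before getting to the real code, here's a toy two-dimensional version with made-up point clouds standing in for mining and unknown traffic:

import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(0)
mining_pts = rng.normal(loc=10.0, scale=2.0, size=(500, 2))   # stand-in mining cloud
unknown_pts = rng.normal(loc=14.0, scale=4.0, size=(200, 2))  # stand-in unknown traffic

hull = ConvexHull(mining_pts)                # hull around the mining cloud
tess = Delaunay(mining_pts[hull.vertices])   # triangulate the hull for point-in-hull tests
inside = tess.find_simplex(unknown_pts) >= 0
membership_rate = inside.mean()              # fraction of unknown points inside the hull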

While we could require an absolute 100% membership rate, that greatly increases our chances of a false negative, depending on how representative the test data is. I'd be hard-pressed to believe that I have perfectly representative test data, so let's pick an arbitrary 98% membership rate so that one or two stray points don't cause us to miss a miner.

And with that, our ~38,000 bi-directional streams of non-mining traffic are reduced to ~19,000. While not bad, that's certainly not good. Here's a quick example of a set of traffic that was correctly stripped out:

True Negative

But here’s one that was missed:

False Positive

This was a recurring trend among the false positives: points at or near (0,0).

Attempt 2

While our mining data contained (0,0) points, they only made up 1-3% of the overall traffic. Because of this, I decided to simply ignore all (0,0) points, recalculated the convex hull, and again piped all my non-mining traffic through it.

New Convex Hull True Negative

Et voilà! This appears to have done the trick. This one minor change filtered out nearly all of the ~38,000 samples. Here are the only four that were missed:

Final Four False Positives

It doesn’t take more than a glance to realize the last four false positives look nothing like our mining traffic samples; they just happen to have a small number of points lying within our convex hull. Moving forward, any number of techniques that quantify the shape of the data would easily get rid of these final four false positives.

Real Talk: I’ve Been Lying To You

The previous convex hull graphs are not actually representative of what I was doing. While we could define two hulls, one for the source metrics and another for the destination metrics, we don't need to. The convex hull algorithm leverages the distance between points, and thanks to the magic of Euclidean distances, we can find the distance between any two points in n dimensions. This allows us to define a single four-dimensional convex hull built from both the source and the destination metrics. Note: for those considering applying this same trick to hundreds of dimensions instead of four, beware the curse of dimensionality!
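
Nothing about that math cares how many coordinates a point has; a quick numpy illustration with two arbitrary four-dimensional points:

import numpy as np

# Arbitrary example points: (src_bytes, src_pkts, dst_bytes, dst_pkts)
p1 = np.array([4800.0, 12.0, 62000.0, 45.0])
p2 = np.array([5100.0, 14.0, 64500.0, 47.0])

# The Euclidean distance formula works unchanged in any number of dimensions
dist = np.linalg.norm(p1 - p2)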

The Code Itself

The full(ish) code is available here on GitHub. Note that this isn't an actual implementation, but essentially just a lift'n'shift of a few pieces from my sandbox code. I pulled out the superfluous functionality and left in just the stuff that worked, with variable names that are hopefully descriptive enough to follow along.

First, the preprocess.load_and_build_enis() function pulls all the Flow Logs out of a local sqlite database (for easy reference when testing) and organizes them into the following dict structure:

{
    "eni-123" : {
        "172.31.16.139::172.31.16.21" : [
            [src_bytes...], 
            [src_pkts...], 
            [dst_bytes...], 
            [dst_pkts...],
            [src_ports...], 
            [dst_ports...], 
            [src_port_switch_event_intervals...], 
            [dst_port_switch_event_intervals...], 
            [src_port_switch_pkt_intervals...], 
            [dst_port_switch_pkt_intervals...]
        ]
    }
}

The list under the traffic ID key (x.x.x.x::x.x.x.x) has ten items, each of which is a list itself. Each element in those lists corresponds to metrics aggregated across Flow Log entries with matching timestamps. So the first element of the src_bytes list is the total number of bytes sent by the first IP address, from any port to any port of the second IP address, over a given 10-minute window. The "interval" fields can be ignored; they were used by an earlier strategy that didn't pan out and isn't discussed here.
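
The aggregation itself happens inside load_and_build_enis() and isn't shown here, but conceptually it looks something like the following sketch, where rows is a list of parsed Flow Log entries for one direction of a single src/dst IP pair (the helper name is hypothetical):

from collections import defaultdict

def aggregate_stream(rows):
    # Sum bytes and packets across all entries that share a capture-window
    # start timestamp, then emit the per-window totals in time order.
    bytes_by_window = defaultdict(int)
    pkts_by_window = defaultdict(int)
    for row in rows:
        bytes_by_window[row["start"]] += row["bytes"]
        pkts_by_window[row["start"]] += row["packets"]

    windows = sorted(bytes_by_window)
    return (
        [bytes_by_window[w] for w in windows],  # e.g. src_bytes
        [pkts_by_window[w] for w in windows],   # e.g. src_pkts
    )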

The build_model.build_hull() function consumes this dictionary for each ENI that performed cryptocoin mining:

ret_data = preprocess.load_and_build_enis(enis=mining_enis)

src_bytes_all = []
src_pkts_all = []
dst_bytes_all = []
dst_pkts_all = []

for eni, v in ret_data.items():
    for traffic_id, data in v.items():
        src, dst = traffic_id.split("::")
        if src not in pool_ips and dst not in pool_ips:
            continue

        src_bytes, src_pkts, dst_bytes, dst_pkts, src_ports, dst_ports, _, _, _, _ = data

        src_bytes_all.extend(src_bytes)
        dst_bytes_all.extend(dst_bytes)
        src_pkts_all.extend(src_pkts)
        dst_pkts_all.extend(dst_pkts)

src_bytes_all = np.asarray(src_bytes_all)
dst_bytes_all = np.asarray(dst_bytes_all)
src_pkts_all = np.asarray(src_pkts_all)
dst_pkts_all = np.asarray(dst_pkts_all)

The x_x_all lists house each "column" of data before getting converted to ndarrays. These are then glued together into a single (x, 4) shape ndarray object, with x corresponding to the number of rows (entries in the traffic ID lists):

mining_pts = np.column_stack(
    (src_bytes_all, src_pkts_all, dst_bytes_all, dst_pkts_all)
)

Finally, we need to strip out all of our (0, 0, 0, 0) points and find our convex hull:

from scipy.spatial import ConvexHull

# lets just ignore (0,0,0,0)
mining_pts = mining_pts[~np.all(mining_pts == 0, axis=1)]

hull = ConvexHull(mining_pts)
hull_pts = mining_pts[hull.vertices]
return hull_pts

Now with our convex hull shape in hand, we can run the non-mining data through it and see what gets through. The code for this is extremely similar to the model building itself, so let’s skip to the end where we have our (x, 4) shape ndarray:

target_pts = np.column_stack(
    (src_bytes_all, src_pkts_all, dst_bytes_all, dst_pkts_all)
)

# strip out [0,0,0,0] rows
target_pts = target_pts[~np.all(target_pts == 0, axis=1)]
if len(target_pts) == 0:
    continue

Again, we strip out all our (0, 0, 0, 0) points and also do a quick check to ensure we didn’t strip out everything in the process. The final step is just to take my target_pts point cloud and determine how many of them reside within the previously computed convex hull:

membership_rate = get_hull_membership_rate(target_pts, hull_pts)
if membership_rate >= min_membership_rate:
    if eni not in boundary_matches.keys():
        boundary_matches[eni] = []
    boundary_matches[eni].append(traffic_id)

The magic of the get_hull_membership_rate() function relies on the interesting relationship between Delaunay tessellation and convex hulls:

from scipy.spatial import Delaunay

# https://stackoverflow.com/a/16898636
def get_hull_membership_rate(target_pts, hull_pts):
    """
    Determine what fraction of `target_pts` lie within the convex hull
    defined by `hull_pts`.

    `target_pts` should be an `NxK` array of `N` points in `K` dimensions.
    `hull_pts` is an `MxK` array of `M` points in `K` dimensions for which
    the Delaunay triangulation will be computed.
    """
    del_tessellation = Delaunay(hull_pts)
    member_mask = del_tessellation.find_simplex(target_pts) >= 0
    return target_pts[member_mask].size / target_pts.size
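
One nice property of this trick is that only the hull's vertices need to be triangulated: their Delaunay tessellation covers exactly the same convex region as the full mining point cloud. As a quick usage sketch (with hull_pts being the output of build_hull() and target_pts the (x, 4) array for a candidate stream):

rate = get_hull_membership_rate(target_pts, hull_pts)
if rate >= 0.98:  # the 98% membership threshold picked earlier
    print("traffic falls almost entirely within the mining hull")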

Attempt 3 and On

I decided to stop here for the time being, as whatever technique I choose next is all but guaranteed to strip out the final four false positives. However, it is worth at least mentioning the multitude of options I have going forward. So far, with the convex hull, we've defined the region in which mining traffic points will lie. The next step will likely involve quantifying the shape of the data itself. Here are a few options (a rough scipy sketch of the first two follows the list):

  • Area of the convex hull
  • Mean/median nearest-neighbor distance
  • Spatial homogeneity within the convex hull
    • Ripley K/L algorithms (accurate but expensive)
    • Number of points within a set of concentric circles with the origin set to the center of the convex hull (inaccurate but cheap)
  • Quantifying the striation and honeycomb patterns found in the source and destination-only metrics
  • Temporal analysis: bursts of traffic vs. steady, regular traffic
  • Linear regression line: angle or mean/median point distance to the line
    • I actually employed this technique in part 1
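
As a rough sketch of how the first two options might be computed with scipy (illustrative only; pts would be the same (x, 4) point cloud used throughout):

from scipy.spatial import ConvexHull, cKDTree

def shape_stats(pts):
    # ConvexHull.volume is the N-dimensional volume of the hull
    # (the enclosed area when the points are two-dimensional)
    hull_volume = ConvexHull(pts).volume

    # Mean nearest-neighbor distance; k=2 because each point's nearest
    # neighbor at distance zero is itself
    dists, _ = cKDTree(pts).query(pts, k=2)
    mean_nn_dist = dists[:, 1].mean()

    return hull_volume, mean_nn_dist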

Before I get my hands dirty with another filter, I need to generate more traffic. I intentionally limited the variation in the kind of mining traffic I initially captured: if you recall, every miner was mining the same coin against the same pool; only the size of the miner changed. It may also be worth investigating the traffic generated by solo mining, as well as traffic from miners that are unable to reach a pool or wallet due to security group or NACL configurations.

Optimization

While it is certainly an interesting challenge to try and build as accurate a model as possible, production-grade models do not have the luxury of running as long as they want. If a model must process 100 logs/second, but can only handle 50, it has fundamentally failed, regardless of how good the results are. I have yet to do any kind of benchmarking on the convex hull methodology, but it will likely end up being the largest contributing factor to choosing how I go about further quantifying shape and distribution.
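
Even a crude timing harness like the following sketch (with arbitrary stand-in point counts) would give a first-order answer for the hull test itself:

import time
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(1)
hull_src_pts = rng.random((5_000, 4))    # stand-in for the mining point cloud
target_pts = rng.random((100_000, 4))    # stand-in for a batch of candidate points

start = time.perf_counter()
tess = Delaunay(hull_src_pts[ConvexHull(hull_src_pts).vertices])
build_secs = time.perf_counter() - start

start = time.perf_counter()
tess.find_simplex(target_pts)
query_secs = time.perf_counter() - start

print(f"build: {build_secs:.3f}s  query: {query_secs:.3f}s")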

Tooling and Learning

All code was written in Python using the numpy and scipy modules. The matplotlib module was used for graph generation. If this kind of thing interests you, I highly recommend the book Algorithms of the Intelligent Web by Douglas McIlwraith et al. (second edition).