Exercise: why so much fraud?

In order to follow this page, it's advised that you run the demo first.

Having surveyed Streambased's products, let's replicate a little workflow at the end of this demo. Let's look at the Superset fraud dashboard again:

Note how high the number of recent fraudulent transactions are (tallied since you started running the project). What's going on here?

In order to find out, let's query the data viewed as an Iceberg table. Head to this demo's Jupyter notebook.

Remembering that this is the schema for our Iceberg Table:

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|       TransactionID|              string|   NULL|
|           AccountID|              bigint|   NULL|
|     TransactionType|              string|   NULL|
|   TransactionAmount|              double|   NULL|
|            BranchID|              bigint|   NULL|
|CustomerFlaggedFraud|             boolean|   NULL|
|     TransactionTime|       timestamp_ntz|   NULL|
|     kafka_partition|                 int|   NULL|
|        kafka_offset|              bigint|   NULL|
|            kafka_ts|       timestamp_ntz|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|     kafka_partition|       |
|              Part 1|truncate(1000, ka...|       |
+--------------------+--------------------+-------+

Where might the issue lie? Let's see if certain accounts are generating excessive fraudulent transactions. We can also investigate whether any bank branches are responsible for the anomaly.

Accounts

In order to check if fraudulent transactions are clustered in specific accounts, we can aggregate the transactions per account:

spark.sql("""
SELECT AccountID, COUNT(*) AS fraudCount
FROM isk.hotset.transactions
WHERE CustomerFlaggedFraud = true
GROUP BY AccountID
ORDER BY fraudCount DESC
""").show()

You will see in the output that there are no accounts with excessive fraudulent messages:

+---------+----------+
|AccountID|fraudCount|
+---------+----------+
|     9595|         6|
|     3842|         6|
|     3133|         5|
|     1940|         5|
|     2464|         5|
|     5485|         5|
|     3437|         5|
|     9379|         5|
|     4674|         5|
|     9391|         5|
|     5901|         5|
|      587|         5|
|     3140|         5|
|     5973|         5|
|     2225|         5|
|     2252|         5|
|       45|         5|
|     1002|         5|
|     4543|         5|
|     6805|         5|
+---------+----------+
only showing top 20 rows

Your results will likely vary from this, but the result will be similar.

So let's move to investigating the branches.

Branches

In order to investigate whether these transactions are localised to a specific branch we can aggregate per branch:

spark.sql("""
SELECT b.BranchName as branchName, COUNT(*) as fraudCount FROM isk.hotset.transactions t 
JOIN isk.coldset.branches b 
ON t.BranchId = b.BranchId 
GROUP BY branchName
ORDER BY fraudCount DESC
""").show()

We will see an output similar to this:

+--------------------+----------+
|          branchName|fraudCount|
+--------------------+----------+
|Zulauf, Schmidt a...|      6193|
|Boehm, Beier and ...|       339|
|  Bartoletti-Kilback|       327|
|        Goyette-Beer|       324|
|         Lockman Inc|       319|
|   Champlin-Weissnat|       319|
|Parisian, Stanton...|       317|
|       Rempel-Parker|       317|
|    Corkery and Sons|       315|
|       Bechtelar Inc|       314|
|            Mraz Inc|       314|
|Schulist, Luettge...|       314|
|   Schamberger Group|       309|
|      Emmerich Group|       308|
|         Johnson LLC|       307|
|DuBuque, Hansen a...|       305|
|Hackett, Runolfsd...|       301|
|       Kessler Group|       296|
|      Blick-Parisian|       294|
|   Stoltenberg-Lynch|       294|
+--------------------+----------+
only showing top 20 rows

There we find the culprit! Specifically the Zulauf, Schmidt and McCullough branch.

We can now escalate the matter, update our ML models and or rules engines now that the cause has been identified.

Through this exercise we saw how we can navigate our data forgetting what lives in Iceberg and what lives in Kafka. This allows you to spend more time on the data and less time on the data architecture.

Last updated 3 months ago

hashtagAccounts

hashtagBranches

Accounts

Branches