# Exercise: why so much fraud?

In order to follow this page, it's advised that you [run the demo](/get-started/run-the-demo.md) first.

Having surveyed Streambased's products, let's replicate a little workflow at the end of this demo. Let's look at the Superset fraud dashboard again:

<figure><img src="/files/HL5F1vgfJzuwEroTZtx5" alt=""><figcaption></figcaption></figure>

Note how high the number of recent fraudulent transactions are (tallied since you started running the project). What's going on here?

In order to find out, let's query the data viewed as an Iceberg table. Head to this demo's [Jupyter notebook](http://localhost:8888/notebooks/notebooks/demo_script.ipynb).&#x20;

Remembering that this is the schema for our Iceberg Table:

```
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|       TransactionID|              string|   NULL|
|           AccountID|              bigint|   NULL|
|     TransactionType|              string|   NULL|
|   TransactionAmount|              double|   NULL|
|            BranchID|              bigint|   NULL|
|CustomerFlaggedFraud|             boolean|   NULL|
|     TransactionTime|       timestamp_ntz|   NULL|
|     kafka_partition|                 int|   NULL|
|        kafka_offset|              bigint|   NULL|
|            kafka_ts|       timestamp_ntz|   NULL|
|                    |                    |       |
|      # Partitioning|                    |       |
|              Part 0|     kafka_partition|       |
|              Part 1|truncate(1000, ka...|       |
+--------------------+--------------------+-------+
```

Where might the issue lie? Let's see if certain accounts are generating excessive fraudulent transactions. We can also investigate whether any bank branches are responsible for the anomaly.&#x20;

### Accounts

In order to check if fraudulent transactions are clustered in specific accounts, we can aggregate the transactions per account:

```python
spark.sql("""
SELECT AccountID, COUNT(*) AS fraudCount
FROM isk.hotset.transactions
WHERE CustomerFlaggedFraud = true
GROUP BY AccountID
ORDER BY fraudCount DESC
""").show()
```

You will see in the output that there are no accounts with excessive fraudulent messages:

```
+---------+----------+
|AccountID|fraudCount|
+---------+----------+
|     9595|         6|
|     3842|         6|
|     3133|         5|
|     1940|         5|
|     2464|         5|
|     5485|         5|
|     3437|         5|
|     9379|         5|
|     4674|         5|
|     9391|         5|
|     5901|         5|
|      587|         5|
|     3140|         5|
|     5973|         5|
|     2225|         5|
|     2252|         5|
|       45|         5|
|     1002|         5|
|     4543|         5|
|     6805|         5|
+---------+----------+
only showing top 20 rows                                                              
```

{% hint style="info" %}
Your results will likely vary from this, but the result will be similar.
{% endhint %}

So let's move to investigating the branches.

### Branches

In order to investigate whether these transactions are localised to a specific branch we can aggregate per branch:

```python
spark.sql("""
SELECT b.BranchName as branchName, COUNT(*) as fraudCount FROM isk.hotset.transactions t 
JOIN isk.coldset.branches b 
ON t.BranchId = b.BranchId 
GROUP BY branchName
ORDER BY fraudCount DESC
""").show()
```

We will see an output similar to this:

```
+--------------------+----------+
|          branchName|fraudCount|
+--------------------+----------+
|Zulauf, Schmidt a...|      6193|
|Boehm, Beier and ...|       339|
|  Bartoletti-Kilback|       327|
|        Goyette-Beer|       324|
|         Lockman Inc|       319|
|   Champlin-Weissnat|       319|
|Parisian, Stanton...|       317|
|       Rempel-Parker|       317|
|    Corkery and Sons|       315|
|       Bechtelar Inc|       314|
|            Mraz Inc|       314|
|Schulist, Luettge...|       314|
|   Schamberger Group|       309|
|      Emmerich Group|       308|
|         Johnson LLC|       307|
|DuBuque, Hansen a...|       305|
|Hackett, Runolfsd...|       301|
|       Kessler Group|       296|
|      Blick-Parisian|       294|
|   Stoltenberg-Lynch|       294|
+--------------------+----------+
only showing top 20 rows
```

There we find the culprit! Specifically the Zulauf, Schmidt and McCullough branch.&#x20;

We can now escalate the matter, update our ML models and or rules engines now that the cause has been identified.

Through this exercise we saw how we can navigate our data forgetting what lives in Iceberg and what lives in Kafka. This allows you to spend more time on the data and less time on the data architecture. &#x20;


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.streambased.io/get-started/exercise-why-so-much-fraud.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
