Using Conditional Probability in SQL for Fraud Detection

Probability concepts, such as conditional probability and Bayes’ Theorem, are widely used in real-world applications like fraud detection, risk assessment, and recommendation systems. In this post, I’ll explore how to calculate the probability of fraud given a high-value transaction using SQL and apply probability principles directly to a database.

Understanding the Problem

I have a database table fraud_example, which contains transaction data with the following structure (I made it as an example):

Transaction: The amount of the transaction
Fraud: Whether the transaction was fraudulent (1 = Fraud, 0 = Not Fraud)

We want to calculate:

Probability of fraud given that a transaction amount is greater than $500.

We can formulate that as: P (Fraud | Transaction > 500)

First, let me explain briefly the probability:

In a simple word, if we did not have any information about incidence B, we could calculate simply P(A) as:

So, for calculating P(A) given that B is already happened (or P(A|B)) , since we have new information about B (we know it happened), our whole possible outcomes limited to B (instead of S), and P(A) given that B is happened actually means the proportion of (A to n(B).

Now let’s see the real example,

F: The transactions which are fraudulent

G: The transactions which are genuine

O: The transactions which are more than 500 $.

P(A)= The number of fraudulent transactions / Total transactions

P(B)= The number of genuine transactions / Total transactions

P(O) = The number of transactions > 500 / Total transactions

The number of fraudulent transactions and over 500/ total transactions

The number of genuine transactions and over 500/ total transactions

In this case what is being asked is:

P (Fraud | Transaction > 500) = P (F|O)

We also need P(O) which is: the number of transactions over 500 / total transactions

Now let’s calculate it by SQL:

To compute this probability, we break it down into key counts:

Total Transactions (nT)
Total Fraudulent Transactions (nF)
Total High-Value Transactions (amount > 500) (nO)
Fraudulent High-Value Transactions (nF and nO)

We use a Common Table Expression (CTE) to first compute these counts and then calculate the conditional probability.

Example Output

In this example:

There are 3 high-value transactions.
3 of them are fraudulent.
The probability of fraud given a high-value transaction is 0.66 (66%).

Conclusion

Using SQL, I applied conditional probability to a real-world fraud detection problem. This method can be extended to other financial risk analyses, medical diagnoses, and recommendation engines. Integrating probability theory with SQL helps businesses make data-driven decisions directly within their databases.