# Crime and political corruption analysis using data mining, machine learning and complex networks

There has been a remarkable increase in the amount of data stored by private and public companies. On one hand, these huge amounts of data enable a detailed historical review of the processes under investigation; on the other hand, this excess of data makes it harder to extract summarized information and to make good decisions supported by well-established empirical facts. This modern phenomenon has been called big data, and understanding these systems and extracting patterns from the data requires a multidisciplinary approach. Accordingly, this course at the School of Applied Mathematics, held at the Institute of Mathematics and Computer Science of the University of São Paulo, addresses topics from computer science, statistics, and physics. Among these topics, we will focus on the following:

- Introduction to Python;
- Web scraping;
- Data mining;
- Machine learning;
- Complex networks.

Using these tools, we will focus on two issues that are of great relevance in Brazil: predicting homicides in cities and describing the mechanism behind political corruption networks. In the first topic, we will use machine learning techniques to predict the number of crimes in Brazilian cities. In the second topic, we will use complex networks to describe the interaction between politicians investigated in corruption scandals in Brazil from 1987 to 2014.
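As a rough illustration of the second topic, the sketch below builds a small co-occurrence network using only the standard library. It assumes that two people are linked whenever they are named in the same scandal; the scandal labels and the names `A` to `E` are hypothetical placeholders, not real records, and the course notebooks may construct the network differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: scandal label -> people named in it (placeholders only).
scandals = {
    "scandal_1": ["A", "B", "C"],
    "scandal_2": ["C", "D"],
    "scandal_3": ["D", "E", "A"],
}


def build_cooccurrence_network(scandals):
    """Return an adjacency dict linking every pair of people who share a scandal."""
    adjacency = defaultdict(set)
    for people in scandals.values():
        # Every unordered pair within one scandal becomes an undirected edge.
        for u, v in combinations(sorted(set(people)), 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


network = build_cooccurrence_network(scandals)

# Degree = number of distinct people each person is linked to.
degree = {person: len(neighbors) for person, neighbors in network.items()}
```

In this toy example, `A` appears in two scandals and ends up linked to everyone else, which is the kind of structural pattern the network module examines on real data.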

Any comments, questions, or concerns can be directed to:

- Luiz G. A. Alves lgaalves@northwestern.edu

# Course Syllabus

This course is divided into several modules, each with a set of Jupyter notebooks that teach the concepts.

## Basics, Collections and Files (Day 1)

- Jupyter Notebook
- Basic Data Types
- Flow Control
- Errors
- Lists, Tuples, and Sets
- File I/O
- Section Review (Optional)

## Imports, Plots, Functions, Dictionaries, and Web Scraping (Day 2)

- The Python Standard Library
- Data Visualization
- Functions
- Review (Optional)
- Dictionaries
- Review (Optional)
- Mini-Project
- Web Scraping

## Data Mining, Statistics, and Data Analysis (Day 3)

- Statistical analysis with Python
- Bootstrapping MC chains
- More stats with Python
- The Bootstrap
- Structured Data Analysis Pt1
- Structured Data Analysis Pt2

## Machine Learning Part I (Day 4)

- Data Loading
- Introduction to Scikit Learn
- Unsupervised Transforms
- Cross-validation and Grid Search
- Preprocessing

## Machine Learning Part II (Day 5)

- Linear Models for Regression
- Linear Models for Classification
- Trees
- Random Forests
- Gradient Boosting
- Homicides Prediction

## Complex Network and Analysis of Corruption Networks (Day 6)

- Network Basics
- Analysis of Structural Properties
- Network Visualization and Queries on Networks
- Network Analysis from Data
- Corruption Network

## Social Network Analysis Using `igraph` and `leidenalg` (Extra)

## Software Installation

This bootcamp uses the Anaconda Python 3.7 distribution.

**You must have Anaconda Python 3.7 installed before the first day of class.**
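One way to confirm the installation before class is a quick check like the sketch below. The package list is an assumption based on the syllabus topics, not an official requirements list, and `check_environment` is a hypothetical helper name.

```python
import importlib.util
import sys

# Assumed package list for illustration; the course notebooks may need others.
ASSUMED_PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn", "networkx"]


def check_environment(packages=ASSUMED_PACKAGES):
    """Report the Python version and whether each package can be located.

    Packages are looked up with find_spec, so nothing is actually imported.
    """
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

If a package is reported as `False`, it can usually be installed from Anaconda Navigator or with `conda install`.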

## Downloading Course Materials

The course materials can be downloaded from the repository's GitHub page.
Just download the zip file, unzip it onto your Desktop, and rename the directory to `school-of-applied-math`.
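If you prefer to script these steps, a minimal sketch follows. It assumes the downloaded archive contains a single top-level folder, as GitHub zip downloads normally do; `unpack_course_materials` and the archive file name in the usage comment are hypothetical.

```python
import zipfile
from pathlib import Path


def unpack_course_materials(zip_path, destination, new_name="school-of-applied-math"):
    """Extract zip_path into destination and rename its top-level folder to new_name."""
    zip_path, destination = Path(zip_path), Path(destination)
    with zipfile.ZipFile(zip_path) as archive:
        # Collect the top-level entries; we expect exactly one root folder.
        top_level = {name.split("/")[0] for name in archive.namelist() if name}
        if len(top_level) != 1:
            raise ValueError("expected exactly one top-level folder in the archive")
        archive.extractall(destination)
    extracted = destination / top_level.pop()
    target = destination / new_name
    if extracted != target:
        # Fails if the target directory already exists and is not empty.
        extracted.rename(target)
    return target


# Example usage (hypothetical file name; adjust to the file you downloaded):
# unpack_course_materials(Path.home() / "Downloads" / "materials.zip",
#                         Path.home() / "Desktop")
```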

## Usage of Course Materials

This text and the majority of the course will be conducted with Jupyter Notebook (http://jupyter.org). Jupyter Notebook is a 'web-based interactive computational environment', meaning that it lets you write and execute Python code in a web page on your own computer. Jupyter Notebook is a relatively new tool, and we believe it is an excellent way to teach the basics of Python programming and computational data analysis.

Jupyter Notebook is installed by default with the Anaconda Python distribution and can be launched from the Anaconda Navigator program.

## Location and period of the course:

Period: July 1 to July 6, 2019.

Hours: 08:00 to 12:00

Location: Institute of Mathematics and Computer Science, University of São Paulo (rooms of block 3).

Approval Criteria: 85% attendance and completion of the proposed activities.

Target Audience: Senior-year and postgraduate students in applied mathematics, statistics, computer science, and physics who are interested in data science.

Number of vacancies: 20

Enrollment Period: 04/15/2019 to 05/30/2019.

## References

- Downey, A. Think Python. (O’Reilly, 2012).
- Mitchell, R. Web Scraping with Python. (O’Reilly, 2018).
- Janert, P. K. Data Analysis with Open Source Tools. (O’Reilly, 2010).
- Friedman, J., Hastie, T., & Tibshirani, R. The elements of statistical learning. (Springer, 2001).
- Newman, M. Networks: An introduction. (Oxford University Press, 2010).
- Alves, L. G. A., Ribeiro, H. V., Rodrigues, F. A. Crime prediction through urban metrics and statistical learning. Physica A 515, 435 (2018).
- Ribeiro, H. V., Alves, L. G. A., Martins, A. F., Lenzi, E. K., Perc, M. The dynamical structure of political corruption networks. Journal of Complex Networks CNY002 (2018).
- Amaral, Luis A. N., Pah, Adam R., et al., NICO 101 - Introduction to Programming for Big Data
- Mueller, A., Introduction to Machine Learning with Python
- Unpingco, J., Python for Probability, Statistics, and Machine Learning
- Derzsy, N., Network Graph Analysis in Python
- Guimera, R., Mossa, S., Turtschi, A., & Amaral, L. A. N., The worldwide air transportation network: Anomalous centrality, community structure, and cities' global roles. Proceedings of the National Academy of Sciences, 102(22), 7794-7799 (2005).
- Guimera, R., & Amaral, L. A. N., Functional cartography of complex metabolic networks. Nature, 433(7028), 895 (2005).
- Traag, V., Computational Social Science (CSS) Workshop