Takeaway Sheet: Data Pipelines and Why to Use Them
Glossary
Aggregation is the process of grouping data and computing a summary value for each group, which reduces the data's size.
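A minimal sketch of aggregation in pandas, using a hypothetical visit log (the column names and values here are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical raw data: one row per user session.
visits = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3],
    'duration': [10, 20, 5, 15, 25, 30],
})

# Aggregation: group the rows by user and reduce each group
# to summary values (a count and a sum).
stats = (visits.groupby('user_id')
               .agg(sessions=('duration', 'count'),
                    total_duration=('duration', 'sum'))
               .reset_index())
```

Six raw rows collapse into three summary rows, one per user.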
Data pipelines are programs that run on a schedule and automatically collect, merge, transform, and store data.
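The collect-transform-store flow can be sketched as three small functions. This is an illustrative toy, not a production pipeline; the function and column names are assumptions, and in a real pipeline `extract` would query a database and `load` would write back with `to_sql`:

```python
import pandas as pd

def extract():
    # Collect: in practice this would read from a database or API.
    return pd.DataFrame({'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
                         'revenue': [100, 50, 70]})

def transform(raw):
    # Transform: aggregate raw records into daily totals.
    return raw.groupby('date', as_index=False)['revenue'].sum()

def load(df):
    # Store: in practice, df.to_sql(...) to a reporting table.
    return df

result = load(transform(extract()))
```

A scheduler (cron, Airflow, etc.) would run this chain automatically at fixed intervals.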
Practice
-- Creating a table

CREATE TABLE table_name (
    primary_key_field_name serial PRIMARY KEY,
    column_name1 data_type,
    column_name2 data_type,
    …);
-- Deleting a table

DROP TABLE table_name;
-- Providing access to a table

GRANT ALL PRIVILEGES ON TABLE table_name TO username;
-- Giving permissions to work with primary keys

GRANT USAGE, SELECT ON SEQUENCE table_name_id_seq TO anthony;
# Connecting to a database
from sqlalchemy import create_engine

# Setting parameters to connect to the database.
# You can request them from the database administrator.
db_config = {'user': 'my_user',          # username
             'pwd': 'my_user_password',  # password
             'host': 'localhost',        # server address
             'port': 5432,               # connection port
             'db': 'db_name'}            # database name

# Forming the database connection string.
connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],
                                                         db_config['pwd'],
                                                         db_config['host'],
                                                         db_config['port'],
                                                         db_config['db'])
# Connecting to the database.
engine = create_engine(connection_string)
# Running the query and storing the result in a DataFrame.
# read_sql automatically gives the columns the same names as in the
# database table; we only indicate the index column with index_col.
import pandas as pd

query = '''
    SELECT column1, column2, column3
    FROM table_name
'''

data_raw = pd.read_sql(query, con=engine, index_col='column1')
# Adding rows to a table
# if_exists='replace' - the existing table is dropped and recreated with the new rows
# if_exists='append' - the new rows are appended to the table (duplicates are not checked)

df.to_sql(name='table_name', con=engine, if_exists='append', index=False)
-- Deleting rows from a table using a condition

DELETE FROM table_name
WHERE conditions_for_searching_records_to_be_deleted;