Takeaway Sheet: Data Pipelines and Why to Use Them
Glossary
Aggregation is the process of grouping data and computing a summary value for each group, which reduces the data's size.
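A minimal sketch of aggregation in pandas, using a hypothetical visit log (the column names and values here are illustrative, not from the original):

```python
import pandas as pd

# Hypothetical raw data: one row per user session.
visits = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2, 3],
    'duration': [10, 20, 5, 15, 25, 30],
})

# Aggregation: group the rows by user and reduce each group
# to summary values (a count and a sum).
stats = (visits.groupby('user_id')
               .agg(sessions=('duration', 'count'),
                    total_duration=('duration', 'sum'))
               .reset_index())
```

Six raw rows collapse into three summary rows, one per user.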
Data pipelines are programs that run on a schedule and automatically collect, merge, transform, and store data.
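The collect-transform-store flow can be sketched as three small functions. This is an illustrative toy, not a production pipeline; the function and column names are assumptions, and in a real pipeline `extract` would query a database and `load` would write back with `to_sql`:

```python
import pandas as pd

def extract():
    # Collect: in practice this would read from a database or API.
    return pd.DataFrame({'date': ['2024-01-01', '2024-01-01', '2024-01-02'],
                         'revenue': [100, 50, 70]})

def transform(raw):
    # Transform: aggregate raw records into daily totals.
    return raw.groupby('date', as_index=False)['revenue'].sum()

def load(df):
    # Store: in practice, df.to_sql(...) to a reporting table.
    return df

result = load(transform(extract()))
```

A scheduler (cron, Airflow, etc.) would run this chain automatically at fixed intervals.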
Practice
-- Creating a table

CREATE TABLE table_name (
    primary_key_field_name serial PRIMARY KEY,
    column_name1 data_type,
    column_name2 data_type,
    …);
-- Deleting a table

DROP TABLE table_name;
-- Providing access to a table

GRANT ALL PRIVILEGES ON TABLE table_name TO username;
-- Giving permissions to work with primary keys

GRANT USAGE, SELECT ON SEQUENCE table_name_id_seq TO anthony;
# Connecting to a database
from sqlalchemy import create_engine

# Setting parameters to connect to the database.
# You can request them from the database administrator.
db_config = {'user': 'my_user',          # username
             'pwd': 'my_user_password',  # password
             'host': 'localhost',        # server address
             'port': 5432,               # connection port
             'db': 'db_name'}            # database name

# Forming the database connection string.
connection_string = 'postgresql://{}:{}@{}:{}/{}'.format(db_config['user'],
                                                         db_config['pwd'],
                                                         db_config['host'],
                                                         db_config['port'],
                                                         db_config['db'])
# Connecting to the database.
engine = create_engine(connection_string)
# Running the query and storing the result in a DataFrame.
# read_sql automatically gives the columns the same names as in the
# database table; we only indicate the index column with index_col.
import pandas as pd

query = '''
    SELECT column1, column2, column3
    FROM table_name
'''

data_raw = pd.read_sql(query, con=engine, index_col='column1')
# Adding rows to a table
# if_exists='replace' - the existing table is dropped and recreated with the new rows
# if_exists='append' - the new rows are appended to the table (duplicates are not checked)

df.to_sql(name='table_name', con=engine, if_exists='append', index=False)
-- Deleting rows from a table using a condition

DELETE FROM table_name
WHERE conditions_for_searching_records_to_be_deleted;