I’ve been working on an assignment for the past week in which I’ve needed to import datasets and analyze them with pandas. This also coincided with a large task at work. The assignment required a report critically evaluating my results and visualizing them with Matplotlib. The timing could not have been better…
Thankfully I started a week in advance, but even this wasn’t sufficient as there were loads and loads of things to do. I’ve put together a list of things that I find useful to look through when I’m analyzing datasets. It’s a bit of a summary of what Series and DataFrames are and some useful ways of interacting with them.
First off, let's talk about Series structures.
Series
These are structures that hold lists, basically akin to Python lists, except that you can associate an index with them; the index defaults to 0, 1, 2, … if none is provided at initialization.
1. The Series type is a class you instantiate (import it from pandas first):
from pandas import Series, DataFrame
s = Series()
2. You set the series' values/content when you instantiate it; this works like a normal Python list, so it's not very difficult to understand.
s = Series([1,2,3,4,5]) # sets the series' values to those of a numbered list
3. You can set the series' values to any list, that is, it can hold numbers, strings, tuples etc.
s = Series(['string1', 'string2'])
4. A series has an implicit index corresponding to each value in the series, starting from 0, for example 0='string1', 1='string2'.
5. You can access the values of a series by indexing it with an index value, the same way you do with Python's built-in lists.
s[0]
6. However, where it changes a bit is that the index can be more than just positional numbers: it can be strings, and the numbers don't have to be in order. You can thus define which index refers to which value in the series. This starts to make a series look more like a Python dictionary, where keys refer to items.
s = Series(['Stuart', 'Mathews', 12], index=['name', 'Surname', 'age'])
7. You can fetch several values from a series by passing a list of index values; you can't do this with a normal Python list, where only one positional index can be specified at a time. So this is a big win over normal Python lists.
s[['age', 'Surname']] # returns the values [12, 'Mathews']
8. You can get the series' current index:
s.index
9. By using a dictionary instead of a list as the initializer, you can specify each value and its index at instantiation:
s = Series({'index1': value, 'index2': value2})
10. You can filter a series' values; this returns all values in the series greater than 2:
s[s > 2]
11. You can combine series, and these operations align on each other's matching index values.
series2 + series3 # adds values where both series share an index label
12. Where index labels don't match, the result is NaN by default; the add operation can treat the missing side as 0 using fill_value=0:
s.add(s1, fill_value=0)
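The series points above can be pulled together in one runnable sketch; the values and labels here are made up for illustration:

```python
import pandas as pd

# Labelled index: lookup by label, or by several labels at once
s = pd.Series(['Stuart', 'Mathews', 12], index=['name', 'Surname', 'age'])
print(s['name'])              # 'Stuart'
print(s[['age', 'Surname']])  # values 12 and 'Mathews'

# Boolean filtering keeps only matching values
nums = pd.Series([1, 2, 3, 4, 5])
print(nums[nums > 2])         # 3, 4, 5

# Addition aligns on index labels; unmatched labels become NaN
a = pd.Series({'x': 1, 'y': 2})
b = pd.Series({'y': 10, 'z': 20})
print(a + b)                  # 'x' and 'z' are NaN
print(a.add(b, fill_value=0)) # missing side treated as 0 instead
```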
Data frames
These structures are where the big work gets done. They are composed of Series, such that a data frame is akin to a dictionary where each key is a column and the column's values form a vertical series.
So where a Series is an upgraded Python list that does useful things with additional indexing capabilities (among other things), a data frame can be seen as a composition of multiple series. Data frames give you a 2-D view of data: rows and columns.
Most data frame operations return a new data frame rather than modifying the existing one in place. This is one of the tenets of functional programming.
You can look up columns by key, in addition to indexing rows using an index.
There is an implicit row index starting at 0 for each element in a series, while the keys are used to look up and set the data frame's columns.
- You can set the values of a data frame at instantiation; we're basically looking at the initialization of a dictionary of lists, which is turned into series for the data frame.
df = DataFrame({'column1':[value1,value2,value3], 'column2': [val1,val2,val3]})
- You look up a data frame column by its key, which gives a view of the series at that key (if you treat the data frame as a dictionary):
# gives you the series of values for the key /column
df['key']
- You can also initialise a data frame using lists of tuples. Each tuple is a row, and the columns parameter labels the columns positionally according to their position in the provided list:
df = DataFrame([(1,2,3),(4,5,6)], columns=['col1', 'col2', 'col3'])
- You can also refer to a column by attribute access instead of key lookup, such that df['col1'] is the same as df.col1
df.col1 # same as df['col1']
- You can get multiple data frame columns (series) by specifying multiple lookup keys; this is similar to the capability that series have over traditional Python lists:
df[['column1', 'column2']]
- You can reorder the columns (which are series) by creating a new data frame. Note how the operation creates a new data frame instead of operating in place on the existing one:
df = DataFrame(df, columns=['column2', 'column1'])
- You can set a column by referring to its key via scalar assignment, which sets every entry in its sequence to the provided value:
df['column1'] = 'unknown'
- You can set a data frame's column by referring to its key; this is like the above, but instead of a scalar value you assign a series:
df.column1 = Series([1,2,3,4,5,6])
- You can set the index for the data frame and this is where things get interesting... because now you have a cross-sectional way to refer to values in a data frame. This means you can hone in on a value by row and column. See the following key points:
df = df.set_index('key') # returns a new frame whose rows are indexed by 'key', adding a cross-section lookup for a specific value/row/col combo
- The data frame's dictionary-style key lookup is still the same, in as much as you'd index a dictionary. If you now set the index explicitly, it refers to the rows and can be used, along with the data frame's dictionary key, to look up data.
df['key']['index']
- You can filter out rows in a data frame (row selection)
df[df.column1 > 2]
- Get unique values in a series that represents a data frame's column/key
df['column1'].unique()
df['column1'].unique().tolist() # as a plain Python list
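The data frame basics above can be sketched with a small, made-up frame (the column names are illustrative, not from a real dataset):

```python
import pandas as pd

df = pd.DataFrame({'courseCode': ['TM351', 'TM352', 'TM353'],
                   'points': [30, 30, 60]})

print(df['courseCode'])              # key lookup returns the column as a Series
print(df[['courseCode', 'points']])  # multiple keys return a new DataFrame
print(df[df.points > 30])            # row filtering: only TM353 survives

# set_index returns a new, re-indexed frame; the original is untouched
indexed = df.set_index('courseCode')
print(indexed['points']['TM351'])    # column key, then row index -> 30

print(df['points'].unique().tolist())  # unique column values as a plain list
```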
Here are some useful operations when working with data frames and Series:
- Each of the keys/columns can be associated with a datatype like in relational tables.
#shows all types by key
df.dtypes
- Fill a column's NaN values with a scalar (fillna returns a new series, so assign it back to keep the result):
ss_df.num = ss_df.num.fillna(0)
- Remove any rows with nan
ss_df.dropna()
- Remove rows where specified column is nan
ss_df.dropna(subset=['key'])
- Changing the data type of a column in a data frame
#Modifying a Series's datatype
coursedata_df.level = coursedata_df.level.astype(int)
# Modifying a list of series within the data frame to be a specific datatype
coursedata_df[ ['level', 'points'] ] = coursedata_df[ ['level', 'points'] ].astype(float)
- Renaming a series/column
splitaddresses_df.rename(columns = {0:'Country',1:'State',2:'City'}, inplace=True)
- Replacing all values in a column, using regex pattern matching (replace returns a new frame, so assign it back):
messynumbers_df['cleanvals'] = messynumbers_df.messyvals.str.replace(',', '')
messynumbers_df = messynumbers_df.replace({'cleanvals': r'^[^\d]*(\d*)[^\d]*$'}, {'cleanvals': r'\1'}, regex=True)
- Filter rows and then show only two columns
result_df = course_df[course_df['points']==30][['courseCode','level']]
- Select rows where column C equals 'c7' (like SQL's SELECT ... WHERE)
result_df = ABCD_df[ABCD_df['C']=='c7']
- Show first two rows
course_df.head(2)
print(course_df[0:2])
- Sort values by key
result_df = course_df.sort_values(by=['courseCode'])
result_df = course_df.sort_values(by=['courseCode'], ascending=False)
result_df = course_df.sort_values(by=['points', 'level'], ascending=[True, False]) # multiple sorts on key
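The tidy-up operations above chain together naturally; here is a hedged sketch using an invented messyvals/cleanvals column pair:

```python
import pandas as pd

messy = pd.DataFrame({'messyvals': ['1,200 approx', '350', None]})

# Strip thousands separators, then keep only the digits via regex capture
messy['cleanvals'] = messy.messyvals.str.replace(',', '')
messy = messy.replace({'cleanvals': r'^[^\d]*(\d*)[^\d]*$'},
                      {'cleanvals': r'\1'}, regex=True)

# Drop rows where the cleaned column is still NaN, then change its dtype
messy = messy.dropna(subset=['cleanvals'])
messy.cleanvals = messy.cleanvals.astype(int)

# Sort by the cleaned column, largest first
result = messy.sort_values(by=['cleanvals'], ascending=False)
print(result)
```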
Here are a few advanced moves which I'll go into more detail in another article:
- Vertical joins/union (concat aligns columns that share the same name)
concat_diffcolumns_df = pd.concat([sample1_df, sample4_df])
concat_inner_df = pd.concat([sample1_df, sample4_df], join='inner')
- Inner join (horizontal join)
simplemerge_df = pd.merge(bldgSample_df, lettingsSample_df, on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'])
- Left outer join - bring back all left rows
pd.merge(bldgSample_df, lettingsSample_long_df, on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'], how='left')
- Right outer join
pd.merge(bldgSample_df, lettingsSample_long_df, on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'], how='right')
- Full outer join
pd.merge(bldgSample_df, lettingsSample_long_df, on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'], how='outer')
- Expand Directorate to all its possible values in its column, counting each unique Capital or Revenue value for each directorate
pd.crosstab(df['Capital or Revenue'], df['Directorate'])
- Group by the values of Directorate, and sum the values in their groups
df.pivot_table(index=['Directorate'], aggfunc=np.sum)
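The combining operations above can be tried out with two tiny made-up frames (the area/homes/lettings and Directorate columns are invented for the sketch):

```python
import pandas as pd

left = pd.DataFrame({'area': ['A', 'B'], 'homes': [10, 20]})
right = pd.DataFrame({'area': ['B', 'C'], 'lettings': [5, 7]})

stacked = pd.concat([left, left])                        # vertical union: columns aligned by name
inner = pd.merge(left, right, on=['area'])               # inner join: only area 'B' survives
outer = pd.merge(left, right, on=['area'], how='outer')  # full outer join: A, B and C

spend = pd.DataFrame({'Directorate': ['Housing', 'Housing', 'Schools'],
                      'Capital or Revenue': ['Capital', 'Revenue', 'Capital'],
                      'amount': [100, 50, 75]})
print(pd.crosstab(spend['Capital or Revenue'], spend['Directorate']))     # counts per combination
print(spend.pivot_table(index=['Directorate'], values='amount',
                        aggfunc='sum'))                                   # sum amounts per group
```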
So as you can see there is a lot going on in the world of data analysis in Python.