By exploring options for systematically building and deploying automated algorithmic trading strategies, this book will help you level the playing field. Machine learning (ML) is changing virtually every aspect of our lives. Today, ML algorithms accomplish tasks …. Many industries have been revolutionized by the widespread adoption of AI and machine learning. Since all maturities appear multiple times, we need to use a little trick to get to a nonredundant, sorted list of the maturities.
Therefore, we sort the set object (cf. below). The result is shown in the figure. As in stock or foreign exchange markets, you will notice the so-called volatility smile, which is most pronounced for the shortest maturity and which becomes a bit less pronounced for the longer maturities: In [15]: import matplotlib. To conclude this example, we want to show another strength of pandas: namely, working with hierarchically indexed data sets. The operation returns a DataFrameGroupBy object.
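The deduplication trick can be sketched as follows, with hypothetical maturity dates standing in for the actual option data set:

```python
# Sketch of the maturity deduplication trick (illustrative data).
# Each option quote repeats its maturity date; converting the column
# to a set removes duplicates, and sorted() returns an ordered list.
raw_maturities = ['2014-10-17', '2014-09-19', '2014-10-17',
                  '2014-12-19', '2014-09-19', '2014-12-19']

maturities = sorted(set(raw_maturities))  # nonredundant, sorted list
print(maturities)
```

The same pattern works for any column of repeated values, e.g., strikes.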
The resulting DataFrame object has two index levels and two columns. Its importance stems from the fact that it is quite powerful when it comes to option pricing or risk management problems.
The downside of the Monte Carlo method is that it is per se computationally demanding and often needs huge amounts of memory, even for quite simple problems. Therefore, it is necessary to implement Monte Carlo algorithms efficiently. Although not strictly needed here, all approaches store complete simulation paths in memory. For the valuation of standard European options this is not necessary, as the corresponding example in Chapter 1 shows.
However, for the valuation of American options or for certain risk management purposes, whole paths are needed. These Monte Carlo examples and implementation approaches also appear in the article by Hilpisch. The examples are again based on the model economy of Black-Scholes-Merton, where the risky underlying (e.g., a stock index) evolves according to a geometric Brownian motion. The parameters are defined as before, and Z is a Brownian motion. To implement a Monte Carlo valuation of the European call option, the following recipe can be applied:
Determine the time T value of the index level ST(i) by applying the pseudorandom numbers time step by time step to the discretization scheme. Sum up the inner values, average, and discount them back with the riskless short rate according to the respective valuation equation. The code simulates a large number of paths over 50 time steps. Out[20]: European Option Value 7.
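The recipe can be sketched in pure Python. The parameter values below (initial level 100, strike 105, one year to maturity, 5% short rate, 20% volatility, 20,000 paths) are illustrative assumptions, not the chapter's actual data:

```python
# Pure Python Monte Carlo valuation of a European call option in the
# Black-Scholes-Merton model; parameter values are illustrative only.
from math import exp, sqrt
from random import gauss, seed

seed(2000)
S0 = 100.0    # initial index level (assumption)
K = 105.0     # strike price (assumption)
T = 1.0       # time to maturity in years
r = 0.05      # riskless short rate
sigma = 0.2   # volatility
M = 50        # number of time steps
dt = T / M    # length of a time step
I = 20000     # number of simulated paths (assumption)

payoff_sum = 0.0
for i in range(I):
    ST = S0
    for t in range(M):
        z = gauss(0.0, 1.0)
        # apply one pseudorandom number per time step (Euler scheme)
        ST *= exp((r - 0.5 * sigma ** 2) * dt + sigma * sqrt(dt) * z)
    payoff_sum += max(ST - K, 0.0)  # inner value at maturity

# average the inner values and discount back with the short rate
C0 = exp(-r * T) * payoff_sum / I
print('European Option Value %.3f' % C0)
```

With these illustrative parameters, the estimate lands near the analytical Black-Scholes-Merton value of about 8.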
The major part of the code consists of a nested loop that generates step-by-step single values of an index level path in the inner loop and appends completed paths to a list object in the outer loop. Although this loop yields the same result, the list comprehension syntax is more compact and closer to the mathematical notation of the Monte Carlo estimator.
Vectorization with NumPy NumPy provides a powerful multidimensional array class, called ndarray, as well as a comprehensive set of functions and methods to manipulate arrays and implement complex operations on such objects. From a more general point of view, there are two major benefits of using NumPy: Syntax NumPy generally allows implementations that are more compact than pure Python and that are often easier to read and maintain. The generally more compact syntax stems from the fact that NumPy brings powerful vectorization and broadcasting capabilities to Python.
This is similar to having vector notation in mathematics for large vectors or matrices. For example, assume that we have a vector with the first natural numbers, 1, …, stored as a list object. Multiplying this list object by 2 does not scale the elements, as one might expect; it rather returns, in this case, two times the object, i.e., a list with the original elements repeated.
Vectorization brings a speedup of more than 30 times in comparison to pure Python. The estimated Monte Carlo value is again quite close to the benchmark value. The vectorization becomes obvious when the pseudorandom numbers are generated.
In the line in question, all needed pseudorandom numbers are generated in a single step, i.e., with a single function call. Vectorization: Using vectorization with NumPy generally results in code that is more compact, easier to read and maintain, and faster to execute.
All these aspects are in general important for financial applications. To this end, consider the log version of the discretization, which is completely additive, allowing for an implementation of the Monte Carlo algorithm without any loop on the Python level. The example shows the resulting code. Let us run this third simulation script.
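A loop-free implementation along these lines can be sketched as follows. Because the log discretization is additive, a cumulative sum over the increments yields whole paths at once; the parameter values are illustrative assumptions:

```python
import numpy as np

np.random.seed(20000)
S0, K, T, r, sigma = 100.0, 105.0, 1.0, 0.05, 0.2  # illustrative
M, I = 50, 100000   # time steps and number of paths (assumptions)
dt = T / M

# Draw all pseudorandom numbers at once; the log version of the
# discretization is additive, so a cumulative sum along the time
# axis builds complete paths without any Python-level loop.
rn = np.random.standard_normal((M, I))
increments = (r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rn
S = S0 * np.exp(np.cumsum(increments, axis=0))  # simulated paths

# Monte Carlo estimator from the end-of-period index levels
C0 = np.exp(-r * T) * np.mean(np.maximum(S[-1] - K, 0))
print('European Option Value %.3f' % C0)
```

The array S holds all complete paths in memory, which is exactly what the valuation of path-dependent products would require.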
The execution speed is somewhat slower compared to the first NumPy implementation. However, it shows how far one can sometimes go with NumPy vectorization. First, we plot the first 10 simulated paths over all time steps. The figure shows the output: In [29]: import matplotlib.
Second, we want to see the frequency of the simulated index levels at the end of the simulation period. Histogram of all simulated end-of-period index level values. In this case, the majority of the simulated inner values are zero, indicating that the European call option expires worthless in a significant number of cases. On Wikipedia you find the following definition: In what follows, we focus on the study of past market data for backtesting purposes, and not too much on using our insights to predict future price movements.
It also has highly liquid futures and options markets. We will read historical index level information from a web source and will implement a simple backtesting for a trading system based on trend signals.
But first we need the data to get started. To this end, we mainly rely on the pandas library, which simplifies a number of related technical issues. Since it is almost always used, we should also import NumPy by default: In [33]: import numpy as np import pandas as pd import pandas. Scientific and Financial Python Stack In addition to NumPy and SciPy, there are only a couple of important libraries that form the fundamental scientific and financial Python stack. Among them is pandas.
The respective pandas sublibrary contains functions to retrieve data directly from the finance site. It has also automatically generated a time index with Timestamp objects. To get a first impression, we can plot the closing quotes over time. This gives an output like that in the figure: In [35]: sp['Close']. The trend strategy we want to implement is based on both a two-month (i.e., 42 trading days) trend and a longer-term trend. In this example, the first line simultaneously adds a new column to the pandas DataFrame object and puts in the values for the 42-day trend. The second line does the same with respect to the longer trend.
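The trend columns can be sketched as follows. The synthetic price series below stands in for the real index data, the book-era pd.rolling_mean calls are replaced by the modern rolling method, and the one-year window of 252 trading days is an assumption:

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices standing in for the index data
np.random.seed(42)
idx = pd.date_range('2010-01-01', periods=500, freq='B')
sp = pd.DataFrame({'Close': 100 * np.exp(np.cumsum(
    0.0005 + 0.01 * np.random.standard_normal(500)))}, index=idx)

# Trend columns: rolling means over 42 and (assumed) 252 trading days;
# each new column is added to the DataFrame in a single line
sp['42d'] = sp['Close'].rolling(42).mean()
sp['252d'] = sp['Close'].rolling(252).mean()

print(sp[['Close', '42d', '252d']].tail())
```

The first window - 1 entries of each trend column are NaN, since a rolling mean needs a full window of observations.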
Consequently, we now have two new columns. These have fewer entries due to the very nature of the data we have generated for these columns; i.e., a rolling mean is only defined once a full window of observations is available. The resulting plot already provides some insights into what was going on in the past with respect to upward and downward trends: In [38]: sp[['Close', '42d', 'd']]. Our basic data set is mostly complete, such that we can now devise a rule to generate trading signals. The rule says the following: Buy signal (go long): the 42d trend is for the first time SD points above the longer trend.
Sell signal (go short): the 42d trend is for the first time SD points below the longer trend. To this end, we add a new column to the pandas DataFrame object for the difference between the two trends. Although the number of entries in the two trend columns is not equal, pandas takes care of this by putting NaN values at the respective index positions: In [40]: sp[''].
In words, on a large number of trading dates the 42d trend lies more than SD points above the longer trend, while on many others it lies more than SD points below it. This is what we call a regime, illustrated in the respective figure, which is generated by the following two lines of code: In [42]: sp['Regime']. Everything is now available to test the investment strategy based on the signals.
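The regime construction just described can be sketched with a tiny hypothetical data set; the column name '42-252', the threshold value of SD, and the trend values are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

SD = 50  # signal threshold in index points (an assumption)

# Hypothetical trend columns standing in for the real data
sp = pd.DataFrame({'42d': [1000., 1100., 1060., 900., 980.],
                   '252d': [1000., 1000., 1000., 1000., 1000.]})

# Difference between the two trends (column name is an assumption)
sp['42-252'] = sp['42d'] - sp['252d']

# Regime: +1 above +SD (long), -1 below -SD (short), 0 otherwise (cash)
sp['Regime'] = np.where(sp['42-252'] > SD, 1, 0)
sp['Regime'] = np.where(sp['42-252'] < -SD, -1, sp['Regime'])
print(sp['Regime'].value_counts())
```

The two np.where lines are exactly the kind of vectorized conditional logic referred to in the text.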
We assume for simplicity that an investor can directly invest in the index or can directly short the index, which in the real world must be accomplished by using index funds, exchange-traded funds, or futures on the index, for example. Such trades inevitably lead to transaction costs, which we neglect here. This simplified strategy allows us to work with market returns only. The investor makes the market return when he is long (+1), makes the negative market return when he is short (-1), and makes no return (0) when he parks his wealth in cash.
We therefore need the returns first. In Python, we have the following vectorized pandas operation to calculate the log returns. The figure compares the cumulative, continuous returns of the index with the cumulative, continuous returns of our strategy: In [45]: sp[['Market', 'Strategy']]. Although the strategy does not capture the whole upside during bullish periods, the strategy as a whole outperforms the market quite significantly.
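The return calculation can be sketched as follows on a tiny hypothetical data set. Note the shift of the regime column by one day, an assumption added here so that a position only earns the return of the day after the signal:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical data set: closing prices plus a regime column
sp = pd.DataFrame({'Close': [100., 102., 101., 105., 103.],
                   'Regime': [0, 1, 1, -1, -1]})

# Continuous (log) market returns, computed in vectorized fashion
sp['Market'] = np.log(sp['Close'] / sp['Close'].shift(1))

# Long (+1) earns the market return, short (-1) its negative,
# cash (0) earns nothing; the shift avoids foresight bias
sp['Strategy'] = sp['Regime'].shift(1) * sp['Market']

print(sp[['Market', 'Strategy']].cumsum().apply(np.exp))
```

The final line turns cumulative log returns back into gross performance figures, which is what the comparison plot in the text is based on.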
However, we have to keep in mind that we completely neglect operational issues like trade execution and relevant market microstructure elements e.
For example, we are working with daily closing values only. Such considerations certainly have an impact on the performance, but the overall result would probably persist. Also, transaction costs generally diminish returns, but the trading rule does not generate too many signals.
Financial Time Series: Whenever it comes to the analysis of financial time series, consider using pandas. Almost any time series-related problem can be tackled with this powerful library. Conclusions: Without going into too much detail, this chapter illustrates the use of Python by means of concrete and typical financial examples: Calculation of implied volatilities. Using real-world data, in the form of a cross section of option data for a given day, we calculate numerically the implied volatilities of European call options on the underlying index.
This example introduces some custom Python functions e. Monte Carlo simulation Using different implementation approaches, we simulate the evolution of an index level over time and use our simulated end-of-period values to derive Monte Carlo estimators for European call options.
This example illustrates the capabilities and convenience of pandas when it comes to time series analytics. One important topic is not covered: namely, object orientation and classes in Python. For the curious reader, Appendix B contains a class definition for a European call option with methods based on the functions found in the example code of this chapter.
This part of the book represents its core. The sheer number of topics covered in this part makes it necessary to focus mainly on selected, and partly rather specific, examples and use cases. The chapters are organized according to certain topics such that this part can be used as a reference to which the reader can come to look up examples and details related to a topic of interest. This core part of the book consists of the following chapters:.
Bad programmers worry about the code. Good programmers worry about data structures and their relationships. Although the Python interpreter itself already brings a rich variety of data structures with it, NumPy and other libraries add to these in a valuable fashion.
The chapter is organized as follows: Basic data types The first section introduces basic data types such as int, float, and string. Basic data structures The next section introduces the fundamental data structures of Python e. NumPy data structures The following section is devoted to the characteristics and capabilities of the NumPy ndarray class and illustrates some of the benefits of this class for scientific and financial applications.
The spirit of this chapter is to provide a general introduction to Python specifics when it comes to data types and structures.
If you are equipped with a background in another programming language, say C or Matlab, you should be able to easily grasp the material presented here. The topics introduced here are all important and fundamental for the chapters to come. Basic Data Types: Python is a dynamically typed language, which means that the Python interpreter infers the type of an object at runtime.
In comparison, compiled languages like C are generally statically typed. In these cases, the type of an object has to be attached to the object before compile time. The built-in function type provides type information for all objects with standard and built-in types as well as for newly created classes and objects. In the latter case, the information provided depends on the description the programmer has stored with the class. Advanced Python environments, like IPython, provide tab completion capabilities that show all methods attached to an object.
You simply type the object name followed by a dot (e.g., t.) and make use of tab completion. This then provides a collection of methods you can call on the object. Alternatively, the built-in function dir gives a complete list of the attributes and methods of an object. The Cython library brings static typing and compiling features to Python that are comparable to those in C. In fact, Cython is a hybrid language of Python and C. A specialty of Python is that integers can be arbitrarily large.
Consider, for example, the googol number, 10**100. Large Integers: Python integers can be arbitrarily large. It is important to note that mathematical operations on int objects return int objects. Floats: For the last expression to return the generally desired result of 0.5, float objects are needed.
Adding a dot to an integer value, like in 1. or 1.0, causes Python to interpret the object as a float. Here and in the following discussion, terms like float, float object, etc. are used interchangeably. The same holds true for other object types. A float is a bit more involved in that the computerized representation of rational or real numbers is in general not exact and depends on the specific technical approach taken. This becomes evident, for example, when adding 0.1 several times. For certain floating-point numbers the binary representation might involve a large number of elements or might even be an infinite series.
However, given a fixed, finite number of bits used to represent such a number, the stored value is necessarily inexact. Other numbers can be represented perfectly and are therefore stored exactly even with a finite number of bits available. For example, the issue can be of importance when summing over a large set of numbers. The module decimal provides an arbitrary-precision object for floating-point numbers and several options to address precision issues when working with such numbers: In [15]: import decimal from decimal import Decimal In [16]: decimal.
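A minimal sketch of the Decimal object and its precision context follows; the chosen precision values are illustrative:

```python
from decimal import Decimal, getcontext

# Standard floats accumulate binary representation errors
print(0.1 + 0.1 + 0.1)          # not exactly 0.3

# Decimal objects honor the precision set in the context
getcontext().prec = 4           # four significant digits (illustrative)
d = Decimal(1) / Decimal(11)
print(d)                        # 0.09091

getcontext().prec = 50          # much higher precision
e = Decimal(1) / Decimal(11)
print(e)
```

Raising the precision simply lengthens the stored representation; all arithmetic on Decimal objects then respects the new setting.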
Arbitrary-Precision Floats: The module decimal provides an arbitrary-precision floating-point number object. In finance, it might sometimes be necessary to ensure high precision and to go beyond the 64-bit double-precision standard. The basic data type to represent text in Python is the string. The string object has a number of really helpful built-in methods. In fact, Python is generally considered to be a good choice when it comes to working with text files of any kind and any size.
A string object is generally defined by single or double quotation marks or by converting another object using the str function (i.e., using the object's standard string representation). Or you can split it into its single-word components to get a list object of all the words (more on list objects later): In [25]: t.
If the word is not in the string object, the find method returns -1: In [27]: t. A powerful tool when working with string objects is regular expressions. Suppose you are faced with a large text file, such as a comma-separated values (CSV) file, which contains certain time series and respective date-time information.
More often than not, the date-time information is delivered in a format that Python cannot interpret directly. However, the date-time information can generally be described by a regular expression. Consider the following string object, containing three date-time elements, three integers, and three strings.
It is not possible to go into details here, but there is a wealth of information available on the Internet about regular expressions in general and for Python in particular.
Regular Expressions When parsing string objects, consider using regular expressions, which can bring both convenience and performance to such operations. The resulting string objects can then be parsed to generate Python datetime objects cf. Appendix C for an overview of handling date and time data with Python.
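A sketch of this workflow follows; the series string, the date format, and the pattern are all hypothetical stand-ins for the data described in the text:

```python
import re
from datetime import datetime

# Hypothetical string with three date-time elements, three integers,
# and three strings, mimicking a small CSV extract
series = """
'01-06-2020 12:00:00', 100, 'No comment'
'02-06-2020 11:00:00', 110, 'Extraordinary event'
'03-06-2020 12:00:00', 120, 'Nothing special'
"""

# Regular expression matching the quoted dd-mm-YYYY HH:MM:SS pattern
dt_pattern = re.compile(r"'\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}'")
matches = dt_pattern.findall(series)
print(matches)

# Parse the matched strings into Python datetime objects
timestamps = [datetime.strptime(m.strip("'"), '%d-%m-%Y %H:%M:%S')
              for m in matches]
print(timestamps[0])
```

Once compiled, the same pattern object can be applied line by line to files of arbitrary size, which is where the performance benefit shows.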
Basic Data Structures: As a general rule, data structures are objects that contain a possibly large number of other objects. Among those that Python provides as built-in structures are: tuple, a collection of arbitrary objects with only a few methods available; list, a collection of arbitrary objects with many methods available. Like almost all data structures in Python, the tuple has a built-in index, with the help of which you can retrieve single or multiple elements of the tuple. It is important to remember that Python uses zero-based numbering, such that the third element of a tuple is at index position 2: In [39]: t[2] Out[39]: 'data' In [40]: type(t[2]) Out[40]: str
Zero-Based Numbering: In contrast to some other programming languages like Matlab, Python uses zero-based numbering schemes. For example, the first element of a tuple object has index value 0. There are only two special methods that this object type provides: count and index. The first counts the number of occurrences of a certain object, and the second gives the index value of its first appearance:
Lists Objects of type list are much more flexible and powerful in comparison to tuple objects. From a finance point of view, you can achieve a lot working only with list objects, such as storing stock price quotes and appending new data. In addition to the characteristics of tuple objects, list objects are also expandable and reducible via different methods. In other words, whereas string and tuple objects are immutable sequence objects with indexes that cannot be changed once created, list objects are mutable and can be changed via different operations.
You can append list objects to an existing list object, and more: In [46]: l. Here, slicing refers to an operation that breaks down a data set into smaller parts of interest: In [51]: l[2:5] # 3rd to 5th elements Out[51]: [2. The table provides a summary of selected operations and methods of the list object. Excursion: Control Structures. Although a topic in itself, control structures like for loops are maybe best introduced in Python based on list objects.
This is due to the fact that looping in general takes place over list objects, which is quite different from what is often the standard in other languages. Take the following example. The for loop loops over the elements of the list object l with index values 2 to 4 and prints the squares of the respective elements.
Looping over Lists In Python you can loop over arbitrary list objects, no matter what the content of the object is. This often avoids the introduction of a counter. Python also provides the typical conditional control elements if, elif, and else. A specialty of Python is so-called list comprehensions. Excursion: Functional Programming Python provides a number of tools for functional programming support as well—i. Among these tools are filter, map, and reduce.
However, we need a function definition first. The return object is a Boolean. To this end, we can also provide a function definition directly as an argument to map, by using lambda (anonymous) functions. Functions can also be used to filter a list object. In the following example, the filter returns elements of a list object that match the Boolean condition as defined by the even function: In [63]: filter(even, range(15)) Out[63]: [0, 2, 4, 6, 8, 10, 12, 14].
Finally, reduce helps when we want to apply a function to all elements of a list object that returns a single value only.
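The three tools can be sketched together as follows. Note one assumption for modern Python: the book's original examples show Python 2 behavior, where filter and map return lists directly, while Python 3 requires wrapping them in list() and importing reduce from functools:

```python
from functools import reduce  # reduce moved here in Python 3

# A simple function returning a Boolean: is the number even?
def even(x):
    return x % 2 == 0

# map applies a function to every element (here via a lambda)
squares = list(map(lambda x: x ** 2, range(10)))

# filter keeps the elements for which the function returns True
evens = list(filter(even, range(15)))

# reduce collapses the list to a single value, here the sum
total = reduce(lambda x, y: x + y, range(10))

print(squares, evens, total)
```

All three avoid explicit Python-level loops, in the spirit of the practice recommended below.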
List Comprehensions, Functional Programming, Anonymous Functions: It can be considered good practice to avoid loops on the Python level as far as possible. Dicts: dict objects are dictionaries, i.e., mutable objects that allow data retrieval by keys, which can, for example, be string objects.
They are so-called key-value stores. While list objects are ordered and sortable, dict objects are unordered and unsortable. An example best illustrates further differences from list objects.
There are several methods to get iterator objects from the dict object. The objects behave like list objects when iterated over: In [72]: for item in d. Sets The last data structure we will consider is the set object. With set objects, you can implement operations as you are used to in mathematical set theory.
For example, you can generate unions, intersections, and differences: In [76]: s. One application of set objects is to get rid of duplicates in a list object. NumPy Data Structures The previous section shows that Python provides some quite useful and flexible general data structures. In particular, list objects can be considered a real workhorse with many convenient characteristics and application areas.
One of the most important data structures in this regard is the array. Arrays generally structure other fundamental objects in rows and columns. In the simplest case, a one-dimensional array then represents, mathematically speaking, a vector of, in general, real numbers, internally represented by float objects.
It then consists of a single row or column of elements only. Mathematical disciplines like linear algebra and vector space theory illustrate that such mathematical structures are of high importance in a number of fields. Some of these NaN-safe functions were only added in later NumPy versions. The following gives a table of useful aggregation functions available in NumPy. We may also wish to compute quantiles: print("25th percentile: ", np.percentile(x, 25))
We can do this using tools in Matplotlib, which will be discussed further in chapter X.X (see section X.X). Broadcasting is simply a set of rules for applying universal functions like addition, subtraction, and multiplication on arrays of different sizes. We can similarly extend this to arrays of higher dimension. While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. The geometry of the above examples is visualized in the following figure, in which the dotted boxes represent the broadcast values: again, this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
Rules of Broadcasting: Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays: 1. If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
2. If the shape of the two arrays does not match in some dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape. 3. If in any dimension the sizes disagree and neither is equal to 1, an error is raised. How does this affect the calculation? But this is not how the broadcasting rules work! That sort of flexibility might be useful in some cases, but it would lead to potential areas of ambiguity. If right-side padding is what you want, you must reshape the array explicitly, e.g., via a[:, np.newaxis]. Broadcasting extends this ability.
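The three rules can be sketched with small arrays:

```python
import numpy as np

# Rule 1: shape (3,) is padded to (1, 3), then stretched to (3, 3)
M = np.ones((3, 3))
a = np.arange(3)                  # shape (3,)
print(M + a)                      # result has shape (3, 3)

# Rule 2 on both sides: (3, 1) and (3,) both stretch to (3, 3)
b = np.arange(3)[:, np.newaxis]   # shape (3, 1)
print(a + b)

# Rule 3: (3, 2) vs (3,) pads to (1, 3); 2 vs 3 disagree -> error
try:
    np.ones((3, 2)) + np.arange(3)
except ValueError as err:
    print('broadcast error:', err)
```

Walking each example through the rules by hand is a good way to internalize them before relying on them in real code.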
One commonly seen example is centering an array of data. Imagine you have an array of ten observations, each of which consists of three values.
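Centering such an array is a one-line broadcast; the random data below is illustrative:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10, 3)       # ten observations, three features each

# The mean of each feature, computed along the first axis
Xmean = X.mean(axis=0)    # shape (3,)

# Broadcasting subtracts the (3,) mean row from every row of (10, 3)
X_centered = X - Xmean

# The centered array now has (near-)zero mean in every column
print(X_centered.mean(axis=0))
```

The (3,) mean array is padded to (1, 3) and stretched to (10, 3) by the rules above, with no copy of the data ever materialized.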
Utility Routines for Broadcasting np. Comparisons, Masks, and Boolean Logic This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. In NumPy, boolean masking is often the most efficient way to accomplish these types of tasks. Example: Counting Rainy Days Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city.
What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain? Digging Into the Data: One approach would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range. We saw in section X.X that NumPy's ufuncs can replace such loops with fast vectorized operations. Comparison Operators as ufuncs: In section X.X we introduced arithmetic ufuncs; comparison operators are implemented as element-wise ufuncs as well. The result of these comparison operators is always an array with a boolean data type.
Another way to get at this information is to use np.sum on the boolean array. Finally, a quick warning: as mentioned in section X.X, Python has built-in sum, any, and all functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays.
Be sure that you are using the np versions of these functions when working on arrays. Boolean Operators: Above we saw how to count, say, all days with rain less than four inches, or all days with rain greater than two inches.
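The masking-and-counting workflow can be sketched on synthetic rainfall data; the distribution and the 30% chance of rain are assumptions, not real measurements:

```python
import numpy as np

# Synthetic daily rainfall amounts (inches) for one year
rng = np.random.RandomState(42)
rainfall = rng.exponential(0.1, 365) * (rng.rand(365) < 0.3)

# Comparison operators yield boolean arrays; np.count_nonzero
# and np.sum count the True entries
rainy = rainfall > 0
print('rainy days:      ', np.count_nonzero(rainy))
print('days > 0.5 inch: ', np.sum(rainfall > 0.5))

# A boolean mask selects the matching values directly
print('mean on rainy days:', rainfall[rainy].mean())
```

All of these replace an explicit counting loop with a single vectorized expression.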
But what if we want to know about all days with rain less than four inches AND greater than one inch? Like the standard arithmetic operators, NumPy overloads the bitwise logic operators &, |, ^, and ~ as ufuncs which work element-wise on (usually boolean) arrays. For example, we can address this sort of compound question this way: np. Here, we answer the same question in a more convoluted way, using boolean identities: np.
When would you use one versus the other? In Python, all nonzero integers will evaluate as True. For boolean NumPy arrays, the element-wise bitwise operators are nearly always the desired operation. Fancy Indexing: In the previous section we saw how to access and modify portions of arrays using simple indices, slices, and boolean masks. Fancy indexing is like the simple indexing above, but we pass arrays of indices in place of single scalars. The pairing of indices in fancy indexing is even more powerful than this: it follows all the broadcasting rules mentioned in section X.X.
For example: row[:, np. Generating Indices: np. Note here that the computation of these indices is an extra step, and thus using np. So why might you use np. Many use it because they have come from a language like IDL or MatLab where such constructions are familiar. But np. Example: Selecting Random Points One common use of fancy indexing is the selection of subsets of rows from a matrix. More information on plotting with matplotlib is available in chapter X.
Modifying Values with Fancy Indexing: Above we saw how to access parts of an array with fancy indexing. Fancy indexing can also be used to modify parts of an array. The result, of course, is that x[0] contains the value 6. Why is this not the case? With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive result.
So what if you want the other behavior, where the operation is repeated? For this, you can use the at method of ufuncs (available since NumPy 1.8). Another method which is similar in spirit is the reduceat method of ufuncs, which you can read about in the NumPy documentation.
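The difference between plain fancy-index assignment and ufunc.at can be sketched directly:

```python
import numpy as np

x = np.zeros(10)
i = [2, 3, 3, 4, 4, 4]

# Plain fancy-index augmented assignment: x[i] = x[i] + 1 is a single
# vectorized assignment, so repeated indices are applied only once
x[i] += 1
print(x)   # positions 3 and 4 end up at 1, not 2 or 3

# np.add.at performs an unbuffered, repeated in-place operation
y = np.zeros(10)
np.add.at(y, i, 1)
print(y)   # position 3 -> 2, position 4 -> 3
```

np.add.at applies the increment once for every occurrence of an index, which is exactly the repeated behavior discussed above.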
Example: Binning data You can use these ideas to quickly bin data to create a histogram. We could compute it using ufunc. This is why matplotlib provides the plt.
This could mean that an intermediate result is being cached. How can this be? If you dig into the np.histogram source code, you will see where the difference comes from. An algorithm efficient for large datasets will not always be the best choice for small datasets, and vice versa (see big-O notation in section X.X).
All the constructs below come from the module numpy. Consider yourself warned. Some functionality with a similar spirit is provided by the objects numpy. But what if you prefer to use np. This form is much more clear than the rather obscure np. Especially for large arrays, this can save a lot of memory overhead within your calculation. For example, because plt.
For this type of operation, I prefer using np. Functionally, it is very similar to np. Often, however, it can be cleaner to specify the indices directly.
For example, to find the equivalent of the above result, we can alternatively mix slicing and fancy indexing to write M[:, [2, 4, 3]] array [[ 2, 4, 3], [10, 12, 11]] For more complicated operations, we might instead follow the strategy above under np.
For example, imagine that we want to create an array of numbers which counts up to a value and then back down. We might use np. For example, we can turn one-dimensional arrays into two-dimensional arrays by putting another argument within the string: np. The Preferred Alternative: concatenation As you might notice, np. The result is that one-dimensional arguments are stacked horizontally as columns of the two-dimensional result: np.
As with np. Sorting Arrays This section covers algorithms related to sorting NumPy arrays. All are means of accomplishing a similar task: sorting the values in a list or array. For example, a simple selection sort repeatedly finds the minimum value from a list, and makes swaps until the list is sorted.
The reason is that as lists get big, selection sort does not scale well. Sidebar: Big-O Notation. Big-O notation is a means of describing how the number of operations required for an algorithm scales as the size of the input grows.
These distinctions add precision to statements about algorithmic scaling. Big-O notation, in this loose sense, tells you how much time your algorithm will take as you increase the amount of data. For our purposes, the N will usually indicate some aspect of the size of the data set: how many distinct objects we are looking at, how many features each object has, etc. Notice that the big-O notation by itself tells you nothing about the actual wall-clock time of a computation, but only about its scaling as you change N.
But for small datasets in particular the algorithm with better scaling might not be faster! Fast Sorts in Python Python has a list. Fast Sorts in NumPy: np. For most applications, the default quicksort is more than sufficient. To return a sorted version of the array without modifying the input, you can use np. NumPy provides this in the np. Within the two partitions, the elements have arbitrary order. Similarly to sorting, we can partition along an arbitrary axis of a multi-dimensional array: np.
Finally, just as there is a np. Using the broadcasting rules covered in section X. X along with the aggregation routines from section X. With the pairwise square-distances converted, we can now use np. We can do this with the np. At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A.
Searching and Counting Values In Arrays: This section covers ways of finding a particular value or particular values in an array of data. Python Standard Library Tools: Because this is so important, the Python standard library has a couple of solutions you should be aware of. Unsorted Lists: For any arbitrary list of data, the only way to find whether an item is in the list is to scan through it. Rather than making you write this loop yourself, Python provides the list.index method.
Searching for Values in NumPy Arrays: NumPy has some similar tools and patterns which work for quickly locating values in arrays, but their use is a bit different than the Python standard library approaches. The patterns listed below have the benefit of operating much more quickly on NumPy arrays, and of scaling well as the size of the array grows. Unsorted Arrays: For finding values in unsorted data, NumPy has no strict equivalent of list.index. Instead, it is typical to use a pattern based on masking and the np.where function.
To isolate the first index at which the value is found, then, you must use [0] to access the first item of the tuple, and [0] again to access the first item of the array of indices. This gives the equivalent result to list.index.
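The masking pattern, and its faster sorted-array counterpart, can be sketched as follows:

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])

# np.where on a boolean mask returns a tuple of index arrays
result = np.where(x == 15)
print(result)          # (array([2]),)

# [0][0] isolates the first index, mirroring list.index
first = np.where(x == 15)[0][0]
print(first)           # 2

# For sorted arrays, np.searchsorted locates values via binary search
print(np.searchsorted(x, 23))
```

np.searchsorted scales as O(log N) on sorted data, while the masking pattern scans the whole array.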
If the value is not present, np.where simply returns empty index arrays. Counting and Binning: A related set of functionality in NumPy is the built-in tools for counting and binning of values. For example, to count the occurrences of a value or other condition in an array, apply np.count_nonzero to the respective boolean mask. If your data consist of positive integers, a more compact way to get this information is with the np.bincount function.
For this more general case, you can specify bins for your values with the np.histogram function. Since N bin edges define N - 1 intervals, the counts array will have one fewer entry than the bins array. There are related NumPy functions to be aware of as well. Above, we saw the dictionary method of specifying a structured data type; in the shorthand string notation, the character after the optional byte-order character specifies the type of data (characters, bytes, ints, floating points, etc.), and the last character or characters represent the size of the object in bytes.
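The counting and binning routines described above can be sketched on small illustrative arrays:

```python
import numpy as np

# np.bincount counts occurrences of small nonnegative integers
data = np.array([0, 1, 1, 3, 2, 1, 7])
print(np.bincount(data))   # counts for the values 0..7

# np.histogram bins arbitrary values into specified bin edges;
# 5 edges define 4 bins, so counts has one fewer entry than edges
values = np.array([0.5, 1.5, 1.7, 2.5, 3.9])
counts, edges = np.histogram(values, bins=[0, 1, 2, 3, 4])
print(counts, edges)
```

np.bincount is the more compact tool when the data are already small nonnegative integers; np.histogram handles arbitrary values and custom bin edges.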
For example, you can create a compound type where each element contains an array or matrix of values. We will return to uses like this in section X.X, when we discuss Cython, a C-enabled extension of the Python language. For day-to-day use of structured data, though, the Pandas package is a much better choice.
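A sketch of such a compound type (the field names here are hypothetical):

```python
import numpy as np

# A compound dtype where each element holds a scalar id plus a 3x3 matrix.
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(2, dtype=tp)

# Each element exposes its fields by name.
print(X[0]['id'])          # 0
print(X[0]['mat'].shape)   # (3, 3)
```

Because the matrix is stored inline in each element, this layout maps directly onto a C struct, which is what makes it useful when interfacing with lower-level code.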
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs. In this chapter, we will focus on the mechanics of using the Series, DataFrame, and related structures effectively.
We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus. Once Pandas is installed, you can import it and check the version: import pandas; pandas.__version__. Just as NumPy is conventionally imported under the alias np, Pandas is conventionally imported under the alias pd; for example, you can then type pd.<TAB> in IPython to explore the contents of the pandas namespace. Introducing Pandas Objects At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see through the rest of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of these data structures, but nearly everything that follows will require an understanding of what these structures are.
This first section will cover the three fundamental Pandas data structures: the Series, DataFrame, and Index. A Pandas Series is a one-dimensional array of indexed data; it can be created from a list, as with pd.Series([0.25, 0.5, 0.75, 1.0]). The values are simply a familiar NumPy array, accessible as data.values, while the index is an array-like object of type pd.Index. A Series can also be created with an explicitly specified index, via the index keyword argument. This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of any desired type. In the constructor pd.Series(data, index=index), data can be a list or NumPy array, in which case index defaults to an integer sequence: pd.Series([2, 4, 6]) yields a Series with index 0, 1, 2 and values 2, 4, 6 (dtype: int64). Alternatively, data can be a scalar, which is broadcast to fill the specified index. Like the Series above, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. DataFrame as a Generalized NumPy Array If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
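A minimal sketch of these Series construction patterns (the labels and values below are hypothetical):

```python
import pandas as pd

# An explicit, non-integer index.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['b'])      # 0.5

# A scalar is broadcast to fill the specified index.
s = pd.Series(5, index=[100, 200, 300])
print(s[200])         # 5

# A dictionary becomes a Series with the dict keys as the index.
pop = pd.Series({'Alpha': 100, 'Beta': 200})
print(pop['Beta'])    # 200
```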
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. DataFrame as a Specialized Dictionary Similarly, we can think of a DataFrame as a specialization of a dictionary: it maps a column name to a Series of column data. For example, asking for the 'area' key returns the Series object containing the areas we saw above: states['area'], indexed by California, Florida, Illinois, New York, and Texas (Name: area, dtype: int64). Notice the potential point of confusion here: in a two-dimensional NumPy array, data[0] will return the first row.
For a DataFrame, data['col0'] will return the first column. A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series via pd.DataFrame(series, columns=['name']). Any list of dictionaries can be made into a DataFrame.
For example, with data = [{'a': i, 'b': 2 * i} for i in range(3)], pd.DataFrame(data) gives a frame with columns a and b holding 0, 1, 2 and 0, 2, 4 respectively. Even if some keys in the dictionaries are missing, Pandas will fill them in with NaN (i.e., "Not a Number") values. Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.
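The NaN-filling behavior can be sketched as follows, with hypothetical keys:

```python
import pandas as pd

# Keys missing from a row's dictionary are filled with NaN.
df = pd.DataFrame([{'a': 1, 'b': 2},
                   {'b': 3, 'c': 4}])
print(df)
#      a  b    c
# 0  1.0  2  NaN
# 1  NaN  3  4.0
```

Note that column 'a' becomes a float column: NaN is a floating-point value, so integer columns containing it are upcast.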
If left out, an integer index will be used for each: for example, pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c']). A DataFrame can also be constructed from a structured array, which we covered in section X.X. Finally, both the Series and the DataFrame contain an explicit Index object. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set. Those views have some interesting consequences for the operations available on Index objects. Index as Ordered Set Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
Recall that Python has a built-in set object, which we explored in section X.X; the Index object follows many of its conventions, so that unions, intersections, differences, and other combinations can be computed in a familiar way. For more information on the variety of set operations implemented in Python, see section X.X.
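A minimal sketch of set arithmetic on Index objects (the values here are hypothetical):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Set operations are available as methods on the Index.
print(indA.intersection(indB).tolist())          # [3, 5, 7]
print(indA.union(indB).tolist())                 # [1, 2, 3, 5, 7, 9, 11]
print(indA.symmetric_difference(indB).tolist())  # [1, 2, 9, 11]
```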
Looking Forward Above we saw the basics of the Series, DataFrame, and Index objects, which form the foundation of data-oriented computing with Pandas.
We saw how they are similar to and different from other Python data structures, and how they can be created from scratch from these more familiar objects. Just as understanding the effective use of NumPy arrays is fundamental to effective numerical computing in Python, understanding the effective use of Pandas structures is fundamental to the data munging required for data science in Python. Data Indexing and Selection In the previous chapter, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays.
These included indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof. If you have used the NumPy patterns mentioned above, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
Data Selection in Series As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Indexers: loc, iloc, and ix Because the slicing and indexing conventions above can be a source of confusion, Pandas provides special indexer attributes. These are not functional methods, but attributes which expose a particular slicing interface to the data in the Series. First, the loc attribute allows indexing and slicing which always references the explicit index; second, the iloc attribute allows indexing and slicing which always references the implicit, Python-style integer index.
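The distinction is easiest to see with an explicit integer index that differs from the implicit positions (a hypothetical example):

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit index labels...
print(data.loc[1])               # 'a'
print(data.loc[1:3].tolist())    # ['a', 'b'] -- explicit slices include the end

# ...while iloc always uses the implicit, 0-based position.
print(data.iloc[1])              # 'b'
print(data.iloc[1:3].tolist())   # ['b', 'c'] -- implicit slices exclude the end
```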
The purpose of the ix indexer will become more apparent in the context of DataFrame objects, below. One guiding principle of Python code (see the Zen of Python, section X.X) is that "explicit is better than implicit"; the explicit nature of loc and iloc makes them very useful for maintaining clean and readable code. DataFrame as a Dictionary The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Columns can be accessed with dictionary-style indexing, and sometimes with attribute-style access; however, if the column names are not strings, or if the column names conflict with methods of the DataFrame, attribute-style access is not possible. For example, the DataFrame has a pop method, so data.pop will point to this method rather than to a 'pop' column.
DataFrame as a Two-dimensional Array As mentioned, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute: data.values. With this picture in mind, many familiar array-like operations can be done on the DataFrame itself; for example, we can transpose the full DataFrame to swap rows and columns: data.T. In particular, passing a single index to the values array accesses a row: data.values[0]. For array-style indexing of the DataFrame itself, Pandas again uses the loc, iloc, and ix indexers mentioned above. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result.
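A minimal sketch of these array-style views, on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'area': [10, 20], 'pop': [100, 400]},
                  index=['X', 'Y'])

# Raw underlying array: each row of values is one DataFrame row.
print(df.values[0].tolist())    # [10, 100]

# Transpose swaps rows and columns.
print(df.T.columns.tolist())    # ['X', 'Y']

# iloc indexes by implicit position; loc by explicit labels.
print(df.iloc[0, 1])            # 100
print(df.loc['Y', 'pop'])       # 400
```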
For example, in the loc indexer we can combine masking and fancy indexing, as in data.loc[data.density > 100, ['pop', 'density']]. Additional Indexing Conventions There are a couple of extra indexing conventions which might seem a bit inconsistent with the discussion above, but nevertheless can be very useful in practice. First, while direct integer indices are not allowed on DataFrames, direct integer slices are allowed, and are taken on rows rather than on columns as you might expect: a slice like data[1:2] returns a one-row subset of the frame (here, the Florida row). Operations in Pandas One of the essential pieces of NumPy is the ability to perform quick elementwise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations.
Pandas inherits much of this functionality from NumPy, and the universal functions (ufuncs for short) which we introduced in section X.X are key to this. Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc.
We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures. To demonstrate, we can define a simple Series and DataFrame of random integers: ser = pd.Series(rng.randint(0, 10, 4)) and df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D']). Any of the ufuncs discussed in section X.X can be used in a similar manner. UFuncs: Index Alignment For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. For example, calling A.add(B) is equivalent to computing A + B, but allows optional explicit specification of the fill value for elements missing in one object or the other. According to the broadcasting rules of section X.X, subtraction between a two-dimensional array and one of its rows is applied row-wise.
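Index alignment and the fill_value option can be sketched as follows, on hypothetical Series with partially overlapping indices:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The result index is the union of the inputs' indices;
# labels present in only one operand yield NaN.
print((A + B).tolist())                  # [nan, 5.0, 9.0, nan]

# A.add(B) is equivalent to A + B, but fill_value substitutes
# for the missing entries before the operation is applied.
print(A.add(B, fill_value=0).tolist())   # [2.0, 5.0, 9.0, 5.0]
```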
You will learn how to prepare your computer for programming in Python; how to work with various data types, including strings, lists, tuples, dictionaries, and Booleans; how to perform mathematical operations using Python; and more. This course targets anyone interested in Python programming, Python scripting, or computer programming in general; those who want to become a highly paid Python developer; and those who want to open up doors in their IT career by learning one of the world's most popular and in-demand programming languages.