All engineers engage with data as a fundamental part of their work. Analyzing large amounts of data requires a solid grounding in statistics and mathematics, yet working with raw data still brings uncertainty: we may not fully grasp where we currently stand or where we are headed.
During my MSc, I frequently worked with Digital Elevation Models (DEMs), evaluating the elevation of thousands of data points. Initially, I used Excel for this task; we are all familiar with Excel. It is capable, powerful, and versatile, which makes it a natural first choice for making sense of data. However, there are numerous other software options: R, which offers a range of statistical and graphical techniques for analysis and visualization; Stata, known for its robust features and functionalities; MATLAB, recognized for its fast and clean interface; and Python, which provides a wide range of libraries and frameworks for data manipulation, analysis, and machine learning. As I worked with heavier case study datasets, I discovered that Excel was not well suited for computationally demanding tasks. Seeking a faster alternative, I turned to MATLAB, which offers built-in functions and tools for mathematical modeling, simulation, algorithm development, and data visualization, along with a user-friendly interface and robust capabilities for handling large datasets. However, it requires a paid license, although fortunately my university provided students with one for free. There are alternatives such as R and Stata, but I was not familiar with their languages or software at the time.
Another trending alternative is Jupyter Notebook. Compared to the other options, Jupyter Notebook is a great stepping stone, offering a dynamic space with numerous theoretical and practical exercises that teach concepts such as machine learning, neural networks, and data visualization. Jupyter Notebook can be confusing at first, particularly with extensive code, and you might prefer a smoother, more streamlined experience. For this reason, learning shortcuts is a good idea, as they make navigation easier. You can display all shortcuts by entering command mode and pressing 'H'. Some of the most important are 'B', which adds a cell below, and 'A', which adds one above. You can also show the documentation (the docstring) for objects typed in code cells by pressing 'Shift + Tab'. Comments can be added using 'Ctrl + /' or by placing '#' before the code. The general idea is similar to other software, but the great thing about Jupyter Notebook is the ease with which you can edit and modify specific portions of code, and how simple it makes sharing your analysis, which is probably its biggest advantage.
Generally, data science projects begin with learning as much as possible about the project, asking many questions, and reviewing the literature and available resources. There is often not enough time to review all the existing literature and evaluate every resource, so it always helps to ask questions. The more you understand the project and the data available, the better the expected output, making this part of the project crucial for the development of everything that follows. The next step is to list the available databases and interconnect them. As long as these datasets are not highly unstructured, this task is likely to be straightforward; interconnecting datasets is usually a simple process unless they are of very different types (e.g., geo-spatial vs. ambiguous vs. qualitative data).
Generally, when starting a new project in Python, you should create a new environment with Conda to avoid conflicts with other projects, using the "conda create" command.
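For example (the environment name and Python version below are just placeholders; use whatever fits your project):

```bash
# Create an isolated environment for the project (name and version are illustrative)
conda create --name dem-analysis python=3.11
```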
Alternatively, this can also be done through the Anaconda Navigator app.
You can tell which environment you are in by looking at the environment name shown at the start of your terminal prompt, and you can switch environments with the "conda activate" command.
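For example, assuming the placeholder environment name from above:

```bash
# Activate an existing environment (replace the name with your own)
conda activate dem-analysis

# List all environments; the active one is marked with an asterisk
conda env list
```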
Some repositories require users to install project requirements. These are generally listed in a requirements.txt file that you can install with a pip command in your terminal after cloning the repository.
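A typical sequence looks like this (the repository URL is a placeholder):

```bash
# Clone the repository and install its listed dependencies
git clone https://github.com/user/project.git
cd project
pip install -r requirements.txt
```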
I was interested in deep-learning algorithms. The project I tested exposed a number of command-line arguments through which the model could be called, located in a .py file named detect.
Using the command-line arguments defined there, I was able to run the model by executing a command that tells the tool to use my webcam as the input source.
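As a rough sketch, assuming a YOLOv5-style detect.py where source 0 points at the default webcam (check the repository's own documentation for the exact flag names):

```bash
# Run detection live on the default webcam (device index 0)
python detect.py --source 0
```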
The algorithm I was testing had multiple modes listed in its repository that can be called to achieve different processing times and results.
These can be called in the same way as the first run of the algorithm.
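For example, assuming the repository ships pre-trained weights of different sizes (the file names below are illustrative, not the actual ones from the project):

```bash
# Smaller weights trade accuracy for speed; larger weights do the opposite
python detect.py --weights yolov5s.pt --source 0   # small model, faster
python detect.py --weights yolov5x.pt --source 0   # large model, slower but more accurate
```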
In another project, I experimented with the dplyr library for manipulating mammal species data points, including filtering columns and rows, adjusting formatting, and merging different data sources. dplyr was fun and useful, and it has a vast amount of documentation, making it "easy" to navigate. I have exported a simplified GIS web-map here for reference.
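A minimal sketch of the kind of dplyr pipeline I mean; the mammals data frame, the habitat_lookup table, and all column names here are hypothetical:

```r
library(dplyr)

# Filter rows, keep and rename columns, then merge with a second source
cleaned <- mammals %>%
  filter(!is.na(latitude), year >= 2000) %>%      # drop incomplete or old records
  select(species, latitude, longitude, year) %>%  # keep only the columns of interest
  rename(lat = latitude, lon = longitude)         # adjust column-name formatting

merged <- cleaned %>%
  left_join(habitat_lookup, by = "species")       # merge with another data source
```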
Data science is a deep topic that you never stop learning; it is constantly evolving, and numerous packages and functions are being released all the time. The real challenge can be identifying where to look, having the time to work on the project, and having quality data to work with.
Post date: 2024-03-02
Last edit: 2024-08-10