Senior Site Reliability Engineer with AI/ML experience preferred
Company: Donato Tech
Location: San Leandro
Posted on: April 17, 2025
Job Description:
Job Title: Senior Site Reliability Engineer with AI/ML experience
preferred.
Not sure what skills you will need for this opportunity? Simply read
the full description below to get a complete picture of candidate
requirements.
Location: San Leandro, CA
Note: Local candidates only, as an in-person interview is required
for this role.
10+ years of Software Engineering experience, or equivalent
demonstrated through one or a combination of the following: work
experience, training, military experience, education
10+ years of experience in Production support/Site Reliability
Engineering teams with continued focus on improving Platform
health
Familiar with Agile or other rapid application development
practices
Hands-on expertise with Automated testing, Process Automation &
building dashboards using APM tools.
Experience with distributed (multi-tiered) systems and algorithms;
hands-on experience with Oracle and MongoDB databases.
Knowledge of and exposure to caching tools (Redis, Memcached) or
messaging tools such as MQ and Kafka.
Must have working knowledge of APM tools such as Splunk, GCL, ELK,
Grafana, Prometheus, etc.
Able to create dashboards using GCL/Splunk/ELK and set up
alerts.
Working knowledge of CI/CD is a plus: source control such as Git,
continuous integration with Jenkins, UCD Release, etc.
Ability to work with engineering teams across the ecosystem, such as
Security, Networking, and Infrastructure, on challenges that can
impact platform health and resiliency.
Shell scripting and DevOps tools like Ansible, with good knowledge of
YAML for writing playbooks.
Experience with distributed storage technologies like NFS, as well
as dynamic resource management frameworks such as PCF, Kubernetes /
OpenShift, AWS, or Azure.
Tech Stack: Java/J2EE (Spring, Spring Boot), Python, Shell
Scripting, Kafka, Oracle, MongoDB, etc.
Able to work on shift duty in a 12/7 support organization.
A proactive approach to spotting problems, areas for improvement,
and performance bottlenecks.
Bachelor's degree in computer science, computer science
engineering, or related experience required.
Desired Qualifications:
Recognizes opportunities to adopt innovative technologies such as
AI/ML to enable business capabilities
Keeps up to date on current research and technology in the
industry
Recognizes the importance of collaboration to achieve
objectives
Clearly communicates ideas and concepts to others
Leads work effectively and acts on own initiative without being
prompted
Provides feedback to team members in code reviews
Drives creative changes & continuous improvements
Explores new automation techniques to refine the agility, speed and
quality of engineering initiatives and efforts
Gathers and analyzes metrics from both operating systems and
applications to assist in performance tuning and fault finding
Balances feature development speed and reliability with well-defined
service level objectives
Job Expectations:
You will be a core member of an SRE support team, using the latest
technology tools to write code and test cases, work with API specs,
and automate in order to maintain the resiliency, performance, and
availability of Digital Sales & Marketing platforms.
Strong, relevant experience supporting Web/API platforms built on a
Java/JavaScript stack (Spring/Spring Boot, JavaScript with
Angular/React).
Proficiency in dealing with legacy infrastructure along with cloud
infrastructure (on-prem & third-party) such as OCP, PCF, or Azure.
Identify opportunities to adopt new technologies while removing
toil, and continue to drive efficiency & optimization.
Proactive monitoring of app performance through Splunk, app
dashboards, AppDynamics, Dynatrace, etc.
Represent Platform engineering teams during production outages and
collaborate with engineering teams to resolve them.
Collaborate with stakeholders across the engineering function to
own/derive RCA & work towards permanent resolution.
Plan, support, execute and comply with governance
programs/processes in support of a strong control environment in
your functional area. Leverage process documentation to improve
operational controls and identify and remediate process
deficiencies.
Proactively identify, communicate, mitigate and escalate risk
originating from non-compliance of processes, operational errors,
and data integrity issues in all applicable processes.
Ability to influence SRE practices within and outside teams to
enable a strong DevOps culture within the organization
Responsible for working with engineering teams to maintain SLAs
& SLOs, constantly looking for opportunities to improve
platform metrics and communicating them to stakeholders.
Exposure to and proficiency in different API styles such as SOAP,
REST, microservices, etc.
Working knowledge of Unix, Linux and Postman
This is a lead/senior engineer position in which you will act as a
mentor and leader, coaching other engineers to enable the SRE
function.